Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH v2 09/15] Drivers: hv: mshv_vtl: Move hv_vtl_configure_reg_page() to x86
From: Naman Jain @ 2026-04-29  9:57 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157467FDBC0203C67A67042D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:10 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_vtl_configure_reg_page() from drivers/hv/mshv_vtl_main.c to
>> arch/x86/hyperv/hv_vtl.c. The register page overlay is an x86-specific
>> feature that uses HV_X64_REGISTER_REG_PAGE, so its configuration belongs
>> in architecture-specific code.
>>
>> Move struct mshv_vtl_per_cpu and union hv_synic_overlay_page_msr to
>> include/asm-generic/mshyperv.h so they are visible to both arch and
>> driver code.
>>
>> Change the return type from void to bool so the caller can determine
>> whether the register page was successfully configured and set
>> mshv_has_reg_page accordingly.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_vtl.c       | 32 ++++++++++++++++++++++
>>   drivers/hv/mshv_vtl_main.c     | 49 +++-------------------------------
>>   include/asm-generic/mshyperv.h | 17 ++++++++++++
>>   3 files changed, 53 insertions(+), 45 deletions(-)
>>

<snip>

>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>> +/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
> 
> This comment pre-dates your patch, but I don't understand the point
> it is trying to make. The comment is factually true, but I don't know
> why calling that out is relevant. The REG_PAGE MSR seems to be
> conceptually separate and distinct from the SIMP MSR, so the fact
> that the layouts are the same is just a coincidence. Or is there some
> relationship between the two MSRs that I'm not aware of, and the
> comment is trying (and failing?) to point out?

This was added as per suggestion from Nuno in my initial series for 
MSHV_VTL. If the reference in "identical to" is misleading, I should 
remove it.

https://lore.kernel.org/all/68143eb0-e6a7-4579-bedb-4c2ec5aaef6b@linux.microsoft.com/

Quoting:
"""
it is a generic structure that
appears to be used for several overlay page MSRs (SIMP, SIEF, etc).

But, the type doesn't appear in the hv*dk headers explicitly; it's just
used internally by the hypervisor.

I think it should be renamed with a hv_ prefix to indicate it's part of
the hypervisor ABI, and a brief comment with the provenance:

/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
union hv_synic_overlay_page_msr {
	/* <snip> */
};
"""

> 
>> +union hv_synic_overlay_page_msr {
>> +	u64 as_uint64;
>> +	struct {
>> +		u64 enabled: 1;
>> +		u64 reserved: 11;
>> +		u64 pfn: 52;
>> +	} __packed;
>> +};
>> +
>>   u8 __init get_vtl(void);
>>   void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> +bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
>>   #else
>>   static inline u8 get_vtl(void) { return 0; }
>>   static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> +static inline bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu) { return false; }
> 
> As with Patch 8, if CONFIG_HYPERV_VTL_MODE caused mshv_common.o
> to be built, this stub wouldn't be needed.
> 

Acked.


>>   #endif
>>
>>   #endif
>> --
>> 2.43.0
>>

Regards,
Naman

^ permalink raw reply

* Re: [PATCH v2 08/15] Drivers: hv: Move hv_call_(get|set)_vp_registers() declarations
From: Naman Jain @ 2026-04-29  9:57 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157852404B5258EF13A5450D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:09 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
>> declarations from drivers/hv/mshv.h to include/asm-generic/mshyperv.h.
>>
>> These functions are defined in mshv_common.c and are going to be called
>> from both drivers/hv/ and arch/x86/hyperv/hv_vtl.c. The latter never
>> included mshv.h, relying on implicit declaration visibility. Moving the
>> declarations to the arch-generic Hyper-V header makes them properly
>> visible to all architecture-specific callers.
>>
>> Provide static inline stubs returning -EOPNOTSUPP when neither
>> CONFIG_MSHV_ROOT nor CONFIG_MSHV_VTL is enabled.
> 
> Looking at the drivers/hv/Kconfig, it's possible to build with
> CONFIG_HYPERV_VTL_MODE=y, but not CONFIG_MSHV_VTL. In such a
> case, mshv_common.o doesn't get built, which is why the stubs are
> needed. Is such a configuration desirable for some scenarios?
> 
> I wonder if having CONFIG_HYPERV_VTL_MODE force the building of
> mshv_common.o would be a better approach. Then the stubs wouldn't
> be needed. The "ifneq" statement in drivers/hv/Makefile could use
> CONFIG_HYPERV_VTL_MODE instead of CONFIG_MSHV_VTL, and
> everything would be good since CONFIG_MSHV_VTL depends on
> CONFIG_HYPERV_VTL_MODE.
> 

This looks good. I'll try this and make the changes. In case there are 
some challenges with that, I'll revert back.


Regards,
Naman

^ permalink raw reply

* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-29  9:56 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157C147A1B915F9B45D3B74D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:08 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
>> driver on arm64. This function enables the transition between Virtual
>> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/Makefile        |   1 +
>>   arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>>   arch/arm64/include/asm/mshyperv.h |  13 +++
>>   arch/x86/include/asm/mshyperv.h   |   2 -
>>   drivers/hv/mshv_vtl.h             |   3 +
>>   include/asm-generic/mshyperv.h    |   2 +
>>   6 files changed, 177 insertions(+), 2 deletions(-)
>>   create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
> 
> [snip]
> 
>> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
>> index 585b23a26f1b..9eb0e5999f29 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -60,6 +60,18 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>   				ARM_SMCCC_SMC_64,		\
>>   				ARM_SMCCC_OWNER_VENDOR_HYP,	\
>>   				HV_SMCCC_FUNC_NUMBER)
>> +
>> +struct mshv_vtl_cpu_context {
>> +/*
>> + * x18 is managed by the hypervisor. It won't be reloaded from this array.
>> + * It is included here for convenience in array indexing.
>> + * 'rsvd' field serves as alignment padding so q[] starts at offset 32*8=256.
>> + */
>> +	__u64 x[31];
>> +	__u64 rsvd;
>> +	__uint128_t q[32];
>> +};
>> +
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   /*
>>    * Get/Set the register. If the function returns `1`, that must be done via
>> @@ -69,6 +81,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs,
>> bool set, b
>>   {
>>   	return 1;
>>   }
>> +
> 
> This appears to be a spurious blank line being added since there
> are no other changes in the vicinity.

Acked.

> 
>>   #endif
>>
>>   #include <asm-generic/mshyperv.h>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 08278547b84c..b4d80c9a673a 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -286,7 +286,6 @@ struct mshv_vtl_cpu_context {
>>   #ifdef CONFIG_HYPERV_VTL_MODE
>>   void __init hv_vtl_init_platform(void);
>>   int __init hv_vtl_early_init(void);
>> -void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   void mshv_vtl_return_call_init(u64 vtl_return_offset);
>>   void mshv_vtl_return_hypercall(void);
>>   void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>> @@ -294,7 +293,6 @@ int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set,
>> bool shared);
>>   #else
>>   static inline void __init hv_vtl_init_platform(void) {}
>>   static inline int __init hv_vtl_early_init(void) { return 0; }
>> -static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>>   static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
>>   static inline void mshv_vtl_return_hypercall(void) {}
>>   static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
>> diff --git a/drivers/hv/mshv_vtl.h b/drivers/hv/mshv_vtl.h
>> index a6eea52f7aa2..103f07371f3f 100644
>> --- a/drivers/hv/mshv_vtl.h
>> +++ b/drivers/hv/mshv_vtl.h
>> @@ -22,4 +22,7 @@ struct mshv_vtl_run {
>>   	char vtl_ret_actions[MSHV_MAX_RUN_MSG_SIZE];
>>   };
>>
>> +static_assert(sizeof(struct mshv_vtl_cpu_context) <= 1024,
>> +	      "struct mshv_vtl_cpu_context exceeds reserved space in struct
>> mshv_vtl_run");
>> +
>>   #endif /* _MSHV_VTL_H */
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index db183c8cfb95..8cdf2a9fbdfb 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -396,8 +396,10 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>
>>   #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>   u8 __init get_vtl(void);
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>>   #else
>>   static inline u8 get_vtl(void) { return 0; }
>> +static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
> 
> Is this stub needed? Maybe I missed something, but it looks to me like none
> of the code that calls this gets built unless CONFIG_HYPERV_VTL_MODE is set.
> See further comments about stubs in Patch 8 of this series.
> 

Config dependencies would handle such cases, and this is not required. I 
saw similar stubs added in the code, so I thought this is a norm that 
should be followed, and not rely on config dependencies.
I can remove it.

Regards,
Naman


^ permalink raw reply

* Re: [PATCH v2 07/15] arm64: hyperv: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-29  9:56 UTC (permalink / raw)
  To: Mark Rutland, Marc Zyngier
  Cc: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Michael Kelley, Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi,
	Sascha Bischoff, mrigendrachaubey, linux-hyperv, linux-arm-kernel,
	linux-kernel, linux-arch, linux-riscv, vdso, ssengar
In-Reply-To: <aeolHwXHFH4AnX_n@J2N7QTR9R3.cambridge.arm.com>



On 4/23/2026 7:26 PM, Mark Rutland wrote:
> On Thu, Apr 23, 2026 at 12:41:57PM +0000, Naman Jain wrote:
>> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
>> driver on arm64. This function enables the transition between Virtual
>> Trust Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/arm64/hyperv/Makefile        |   1 +
>>   arch/arm64/hyperv/hv_vtl.c        | 158 ++++++++++++++++++++++++++++++
>>   arch/arm64/include/asm/mshyperv.h |  13 +++
>>   arch/x86/include/asm/mshyperv.h   |   2 -
>>   drivers/hv/mshv_vtl.h             |   3 +
>>   include/asm-generic/mshyperv.h    |   2 +
>>   6 files changed, 177 insertions(+), 2 deletions(-)
>>   create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
>> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
>> index 87c31c001da9..9701a837a6e1 100644
>> --- a/arch/arm64/hyperv/Makefile
>> +++ b/arch/arm64/hyperv/Makefile
>> @@ -1,2 +1,3 @@
>>   # SPDX-License-Identifier: GPL-2.0
>>   obj-y		:= hv_core.o mshyperv.o
>> +obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> new file mode 100644
>> index 000000000000..59cbeb74e7b9
>> --- /dev/null
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -0,0 +1,158 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2026, Microsoft, Inc.
>> + *
>> + * Authors:
>> + *     Roman Kisel <romank@linux.microsoft.com>
>> + *     Naman Jain <namjain@linux.microsoft.com>
>> + */
>> +
>> +#include <asm/mshyperv.h>
>> +#include <asm/neon.h>
>> +#include <linux/export.h>
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> +{
>> +	struct user_fpsimd_state fpsimd_state;
>> +	u64 base_ptr = (u64)vtl0->x;
>> +
>> +	/*
>> +	 * Obtain the CPU FPSIMD registers for VTL context switch.
>> +	 * This saves the current task's FP/NEON state and allows us to
>> +	 * safely load VTL0's FP/NEON context for the hypercall.
>> +	 */
>> +	kernel_neon_begin(&fpsimd_state);
>> +
>> +	/*
>> +	 * VTL switch for ARM64 platform - managing VTL0's CPU context.
>> +	 * We explicitly use the stack to save the base pointer, and use x16
>> +	 * as our working register for accessing the context structure.
>> +	 *
>> +	 * Register Handling:
>> +	 * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
>> +	 * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
>> +	 * - X19-X30: Saved/restored (part of VTL0's execution context)
>> +	 * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
>> +	 * - SP: Not in structure, hypervisor-managed per-VTL
>> +	 *
>> +	 * X29 (FP) and X30 (LR) are in the structure and must be saved/restored
>> +	 * as part of VTL0's complete execution state.
>> +	 */
>> +	asm __volatile__ (
>> +		/* Save base pointer to stack explicitly, then load into x16 */
>> +		"str %0, [sp, #-16]!\n\t"     /* Push base pointer onto stack */
>> +		"mov x16, %0\n\t"             /* Load base pointer into x16 */
>> +		/* Volatile registers (Windows ARM64 ABI: x0-x17) */
>> +		"ldp x0, x1, [x16]\n\t"
>> +		"ldp x2, x3, [x16, #(2*8)]\n\t"
>> +		"ldp x4, x5, [x16, #(4*8)]\n\t"
>> +		"ldp x6, x7, [x16, #(6*8)]\n\t"
>> +		"ldp x8, x9, [x16, #(8*8)]\n\t"
>> +		"ldp x10, x11, [x16, #(10*8)]\n\t"
>> +		"ldp x12, x13, [x16, #(12*8)]\n\t"
>> +		"ldp x14, x15, [x16, #(14*8)]\n\t"
>> +		/* x16 will be loaded last, after saving base pointer */
>> +		"ldr x17, [x16, #(17*8)]\n\t"
>> +		/* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
>> +
>> +		/* General-purpose registers: x19-x30 */
>> +		"ldp x19, x20, [x16, #(19*8)]\n\t"
>> +		"ldp x21, x22, [x16, #(21*8)]\n\t"
>> +		"ldp x23, x24, [x16, #(23*8)]\n\t"
>> +		"ldp x25, x26, [x16, #(25*8)]\n\t"
>> +		"ldp x27, x28, [x16, #(27*8)]\n\t"
>> +
>> +		/* Frame pointer and link register */
>> +		"ldp x29, x30, [x16, #(29*8)]\n\t"
>> +
>> +		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> +		"ldp q0, q1, [x16, #(32*8)]\n\t"
>> +		"ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> +		"ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> +		"ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> +		"ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> +		"ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> +		"ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> +		"ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> +		"ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> +		"ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> +		"ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> +		"ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> +		"ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> +		"ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> +		"ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> +		"ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> +		/* Now load x16 itself */
>> +		"ldr x16, [x16, #(16*8)]\n\t"
>> +
>> +		/* Return to the lower VTL */
>> +		"hvc #3\n\t"
> 
> NAK to this.
> 
> * This is a non-SMCCC hypercall, which we have NAK'd in general in the
>    past for various reasons that I am not going to rehash here.
> 
> * It's not clear how this is going to be extended with necessary
>    architecture state in future (e.g. SVE, SME). This is not
>    future-proof, and I don't believe this is maintainable.
> 
> * This breaks general requirements for reliable stacktracing by
>    clobbering state (e.g. x29) that we depend upon being valid AT ALL
>    TIMES outside of entry code.
> 
> * IMO, if this needs to be saved/restored, that should happen in
>    whatever you are calling.
> 
> Mark.


Merging threads for addressing comments from Mark Rutland and Marc 
Zyngier on this patch.

Thanks for reviewing the changes. Please allow me to briefly explain the 
use case here and then address your comments.

Hyper-V's Virtual Trust Levels (VTLs) provide hardware-enforced 
isolation within a single VM, analogous to ARM TrustZone. The kernel 
runs in VTL2 (higher privilege) as a "paravisor", a security monitor 
that handles intercepts for the primary OS in VTL0 (lower privilege). 
The VTL switch (mshv_vtl_return_call) is functionally equivalent to 
KVM's guest enter/exit. It saves VTL2 state, loads VTL0's GPRs other 
registers from a shared context structure, issues hvc #3 to let VTL0 
run, and on return saves VTL0's updated state back.

Coming to the problems with the code, I have identified a few ways to 
address them.

I can put the assembly code in a separate .S file with 
SYM_FUNC_START/SYM_FUNC_END and marked as noinstr, to prevent 
ftrace/kprobes from instrumenting between the GPR load and the hvc, 
which could have corrupted VTL0 register state. This should solve x29 
clobbering, stack tracing problems.

I should use kernel_neon_begin()/kernel_neon_end() to save/restore the 
full extended FP state of the current task in VTL2. VTL0's Q0-Q31 can be 
loaded/saved separately via fpsimd_load_state()/fpsimd_save_state(). 
This way, the assembly touches none of the SIMD registers. This is 
SVE/SME-safe for VTL2's task state. VTL0 still only carries Q0-Q31 in 
the context struct, and extending to SVE, SME is a future context struct 
change, which will need Hyper-V arm64 ABI support.
This way, VTL2's callee-saved regs (x19-x28, x29, x30) are explicitly 
saved to the stack frame at the top and restored at the bottom of 
assembly code. The C caller (in hv_vtl.c) is a clean function call.

Regarding Non-SMCCC "hvc #3" call, I have a limitation here owing to the 
ABI that is defined by the Hyper-V hypervisor. Fixing this requires a 
hypervisor-side change to support SMCCC-style dispatch for VTL return. 
Until then, hvc #3 is the only working interface. Moreover there would 
be backward compatibility issues with this new ABI interface, if at all 
it is added.

Link to TLFS: 
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm#on-arm64-platforms-3

Please correct me if any of the above is incorrect or if I should be 
looking at some other existing examples to solve these problems.

Regards,
Naman

^ permalink raw reply

* Re: [PATCH v2 03/15] Drivers: hv: Move vmbus_handler to common code
From: Naman Jain @ 2026-04-29  9:55 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157E3B0A6F76E4686D8C3E4D4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:08 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move the vmbus_handler global variable and hv_setup_vmbus_handler()/
>> hv_remove_vmbus_handler() from arch/x86 to drivers/hv/hv_common.c.
>>
>> hv_setup_vmbus_handler() is called unconditionally in vmbus_bus_init()
>> and works for both x86 (sysvec handler) and arm64 (vmbus_percpu_isr).
>>
>> This eliminates the need for separate percpu vmbus handler setup
>> functions and __weak stubs, that are needed for adding ARM64 support
>> in MSHV_VTL driver where we need to set a custom per-cpu vmbus handler.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/kernel/cpu/mshyperv.c | 12 ------------
>>   drivers/hv/hv_common.c         |  9 +++++++--
>>   drivers/hv/vmbus_drv.c         | 17 +++++++++--------
>>   include/asm-generic/mshyperv.h |  1 +
>>   4 files changed, 17 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 89a2eb8a0722..68706ff5880e 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -145,7 +145,6 @@ void hv_set_msr(unsigned int reg, u64 value)
>>   EXPORT_SYMBOL_GPL(hv_set_msr);
>>
>>   static void (*mshv_handler)(void);
>> -static void (*vmbus_handler)(void);
>>   static void (*hv_stimer0_handler)(void);
>>   static void (*hv_kexec_handler)(void);
>>   static void (*hv_crash_handler)(struct pt_regs *regs);
>> @@ -172,17 +171,6 @@ void hv_setup_mshv_handler(void (*handler)(void))
>>   	mshv_handler = handler;
>>   }
>>
>> -void hv_setup_vmbus_handler(void (*handler)(void))
>> -{
>> -	vmbus_handler = handler;
>> -}
>> -
>> -void hv_remove_vmbus_handler(void)
>> -{
>> -	/* We have no way to deallocate the interrupt gate */
>> -	vmbus_handler = NULL;
>> -}
>> -
>>   /*
>>    * Routines to do per-architecture handling of stimer0
>>    * interrupts when in Direct Mode
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index e8633bc51d56..eb7b0028b45d 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -758,13 +758,18 @@ bool __weak hv_isolation_type_tdx(void)
>>   }
>>   EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
>>
>> -void __weak hv_setup_vmbus_handler(void (*handler)(void))
>> +void (*vmbus_handler)(void);
>> +EXPORT_SYMBOL_GPL(vmbus_handler);
>> +
>> +void hv_setup_vmbus_handler(void (*handler)(void))
>>   {
>> +	vmbus_handler = handler;
>>   }
>>   EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
>>
>> -void __weak hv_remove_vmbus_handler(void)
>> +void hv_remove_vmbus_handler(void)
>>   {
>> +	vmbus_handler = NULL;
>>   }
>>   EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
> 
> I'd suggest moving hv_setup_vmbus_handler() and
> hv_remove_vmbus_handler() above or below the group
> of __weak stubs in this source code file. There's a comment
> describing the purpose of these __weak functions, and
> intermixing these two functions that are no longer __weak
> produces something of a jumble.
> 

Acked.

>>
>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>> index bc4fc1951ae1..052ca8b11cee 100644
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -1415,7 +1415,8 @@ EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
>>
>>   static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
>>   {
>> -	vmbus_isr();
>> +	if (vmbus_handler)
>> +		vmbus_handler();
> 
> Is it necessary to test vmbus_handler first? From what I can
> see, it is always set before the per-cpu interrupt is setup.

After the shuffle of hv_remove_vmbus_handler() and freeing the irq, it 
can be safely removed. When I was setting the vmbus_handler to NULL 
first, before freeing the IRQ, this was required.

> 
>>   	return IRQ_HANDLED;
>>   }
>>
>> @@ -1517,8 +1518,10 @@ static int vmbus_bus_init(void)
>>   		vmbus_irq_initialized = true;
>>   	}
>>
>> +	hv_setup_vmbus_handler(vmbus_isr);
>> +
>>   	if (vmbus_irq == -1) {
>> -		hv_setup_vmbus_handler(vmbus_isr);
>> +		/* x86: sysvec handler uses vmbus_handler directly */
>>   	} else {
>>   		ret = request_percpu_irq(vmbus_irq, vmbus_percpu_isr,
>>   				"Hyper-V VMbus", &vmbus_evt);
>> @@ -1553,9 +1556,8 @@ static int vmbus_bus_init(void)
>>   	return 0;
>>
>>   err_connect:
>> -	if (vmbus_irq == -1)
>> -		hv_remove_vmbus_handler();
>> -	else
>> +	hv_remove_vmbus_handler();
>> +	if (vmbus_irq != -1)
>>   		free_percpu_irq(vmbus_irq, &vmbus_evt);
> 
> These operations should be reordered so they are the inverse
> of how they are setup.  I.e., free_percpu_irq() first, then remove
> the VMBus handler. That's just good standard practice unless
> there's a specific reason to do the cleanup ordering differently. In
> fact, hv_remove_vmbus_handler() needs to be moved down
> to the err_setup label so it's done if request_percpu_irq()
> fails.


Acked. I will do the same for other hv_remove_vmbus_handler() as well.

> 
>>   err_setup:
>>   	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
>> @@ -3026,9 +3028,8 @@ static void __exit vmbus_exit(void)
>>   	vmbus_connection.conn_state = DISCONNECTED;
>>   	hv_stimer_global_cleanup();
>>   	vmbus_disconnect();
>> -	if (vmbus_irq == -1)
>> -		hv_remove_vmbus_handler();
>> -	else
>> +	hv_remove_vmbus_handler();
>> +	if (vmbus_irq != -1)
>>   		free_percpu_irq(vmbus_irq, &vmbus_evt);
> 
> Ordering should be changed here as well so it is the inverse
> of how things are set up.
> 
>>   	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
>>   		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index 2810aa05dc73..db183c8cfb95 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -179,6 +179,7 @@ static inline u64 hv_generate_guest_id(u64 kernel_version)
>>
>>   int hv_get_hypervisor_version(union hv_hypervisor_version_info *info);
>>
>> +extern void (*vmbus_handler)(void);
>>   void hv_setup_vmbus_handler(void (*handler)(void));
>>   void hv_remove_vmbus_handler(void);
>>   void hv_setup_stimer0_handler(void (*handler)(void));
>> --
>> 2.43.0
>>


Regards,
Naman


^ permalink raw reply

* Re: [PATCH v2 02/15] Drivers: hv: Move hv_vp_assist_page to common files
From: Naman Jain @ 2026-04-29  9:55 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, Sascha Bischoff,
	mrigendrachaubey, linux-hyperv@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-riscv@lists.infradead.org, vdso@mailbox.org,
	ssengar@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157BEAF5480D931C3756B7DD4362@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/27/2026 11:07 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Thursday, April 23, 2026 5:42 AM
>>
>> Move the logic to initialize and export hv_vp_assist_page from x86
>> architecture code to Hyper-V common code to allow it to be used for
>> upcoming arm64 support in MSHV_VTL driver.
>> Note: This change also improves error handling - if VP assist page
>> allocation fails, hyperv_init() now returns early instead of
>> continuing with partial initialization.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Reviewed-by: Roman Kisel <vdso@mailbox.org>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>>   arch/x86/hyperv/hv_init.c       | 88 +-----------------------------
>>   arch/x86/include/asm/mshyperv.h | 14 -----
>>   drivers/hv/hv_common.c          | 94 ++++++++++++++++++++++++++++++++-
>>   include/asm-generic/mshyperv.h  | 16 ++++++
>>   include/hyperv/hvgdk_mini.h     |  6 ++-
>>   5 files changed, 115 insertions(+), 103 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 323adc93f2dc..75a98b5e451b 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -81,9 +81,6 @@ union hv_ghcb * __percpu *hv_ghcb_pg;
>>   /* Storage to save the hypercall page temporarily for hibernation */
>>   static void *hv_hypercall_pg_saved;
>>
>> -struct hv_vp_assist_page **hv_vp_assist_page;
>> -EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>> -
>>   static int hyperv_init_ghcb(void)
>>   {
>>   	u64 ghcb_gpa;
>> @@ -117,59 +114,12 @@ static int hyperv_init_ghcb(void)
>>
>>   static int hv_cpu_init(unsigned int cpu)
>>   {
>> -	union hv_vp_assist_msr_contents msr = { 0 };
>> -	struct hv_vp_assist_page **hvp;
>>   	int ret;
>>
>>   	ret = hv_common_cpu_init(cpu);
>>   	if (ret)
>>   		return ret;
>>
>> -	if (!hv_vp_assist_page)
>> -		return 0;
>> -
>> -	hvp = &hv_vp_assist_page[cpu];
>> -	if (hv_root_partition()) {
>> -		/*
>> -		 * For root partition we get the hypervisor provided VP assist
>> -		 * page, instead of allocating a new page.
>> -		 */
>> -		rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -		*hvp = memremap(msr.pfn << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> -				PAGE_SIZE, MEMREMAP_WB);
>> -	} else {
>> -		/*
>> -		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> -		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> -		 * out to make sure we always write the EOI MSR in
>> -		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> -		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> -		 * case of CPU offlining and the VM will hang.
>> -		 */
>> -		if (!*hvp) {
>> -			*hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
>> -
>> -			/*
>> -			 * Hyper-V should never specify a VM that is a Confidential
>> -			 * VM and also running in the root partition. Root partition
>> -			 * is blocked to run in Confidential VM. So only decrypt assist
>> -			 * page in non-root partition here.
>> -			 */
>> -			if (*hvp && !ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>> -				WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> -				memset(*hvp, 0, PAGE_SIZE);
>> -			}
>> -		}
>> -
>> -		if (*hvp)
>> -			msr.pfn = vmalloc_to_pfn(*hvp);
>> -
>> -	}
>> -	if (!WARN_ON(!(*hvp))) {
>> -		msr.enable = 1;
>> -		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -	}
>> -
>>   	/* Allow Hyper-V stimer vector to be injected from Hypervisor. */
>>   	if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE)
>>   		apic_update_vector(cpu, HYPERV_STIMER0_VECTOR, true);
>> @@ -286,23 +236,6 @@ static int hv_cpu_die(unsigned int cpu)
>>
>>   	hv_common_cpu_die(cpu);
>>
>> -	if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> -		union hv_vp_assist_msr_contents msr = { 0 };
>> -		if (hv_root_partition()) {
>> -			/*
>> -			 * For root partition the VP assist page is mapped to
>> -			 * hypervisor provided page, and thus we unmap the
>> -			 * page here and nullify it, so that in future we have
>> -			 * correct page address mapped in hv_cpu_init.
>> -			 */
>> -			memunmap(hv_vp_assist_page[cpu]);
>> -			hv_vp_assist_page[cpu] = NULL;
>> -			rdmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -			msr.enable = 0;
>> -		}
>> -		wrmsrq(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> -	}
>> -
>>   	if (hv_reenlightenment_cb == NULL)
>>   		return 0;
>>
>> @@ -460,21 +393,6 @@ void __init hyperv_init(void)
>>   	if (hv_common_init())
>>   		return;
>>
>> -	/*
>> -	 * The VP assist page is useless to a TDX guest: the only use we
>> -	 * would have for it is lazy EOI, which can not be used with TDX.
>> -	 */
>> -	if (hv_isolation_type_tdx())
>> -		hv_vp_assist_page = NULL;
>> -	else
>> -		hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
>> -	if (!hv_vp_assist_page) {
>> -		ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> -
>> -		if (!hv_isolation_type_tdx())
>> -			goto common_free;
>> -	}
>> -
>>   	if (ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
>>   		/* Negotiate GHCB Version. */
>>   		if (!hv_ghcb_negotiate_protocol())
>> @@ -483,7 +401,7 @@ void __init hyperv_init(void)
>>
>>   		hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
>>   		if (!hv_ghcb_pg)
>> -			goto free_vp_assist_page;
>> +			goto free_ghcb_page;
> 
> Seems like this should be "goto common_free". The allocation of
> hv_ghcb_pg has failed, so going to a label where hv_ghcb_pg is
> freed seems redundant. It works since free_percpu() checks for
> a NULL argument, but it's a bit unexpected since the common_free
> label is already there.

Thanks for catching this, I'll fix it.


> 
>>   	}
>>
>>   	cpuhp = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "x86/hyperv_init:online",
>> @@ -613,10 +531,6 @@ void __init hyperv_init(void)
>>   	cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
>>   free_ghcb_page:
>>   	free_percpu(hv_ghcb_pg);
>> -free_vp_assist_page:
>> -	kfree(hv_vp_assist_page);
>> -	hv_vp_assist_page = NULL;
>> -common_free:
>>   	hv_common_free();
>>   }
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index f64393e853ee..95b452387969 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -155,16 +155,6 @@ static inline u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
>>   	return _hv_do_fast_hypercall16(control, input1, input2);
>>   }
>>
>> -extern struct hv_vp_assist_page **hv_vp_assist_page;
>> -
>> -static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> -{
>> -	if (!hv_vp_assist_page)
>> -		return NULL;
>> -
>> -	return hv_vp_assist_page[cpu];
>> -}
>> -
>>   void __init hyperv_init(void);
>>   void hyperv_setup_mmu_ops(void);
>>   void set_hv_tscchange_cb(void (*cb)(void));
>> @@ -254,10 +244,6 @@ static inline void hyperv_setup_mmu_ops(void) {}
>>   static inline void set_hv_tscchange_cb(void (*cb)(void)) {}
>>   static inline void clear_hv_tscchange_cb(void) {}
>>   static inline void hyperv_stop_tsc_emulation(void) {};
>> -static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> -{
>> -	return NULL;
>> -}
>>   static inline int hyperv_flush_guest_mapping(u64 as) { return -1; }
>>   static inline int hyperv_flush_guest_mapping_range(u64 as,
>>   		hyperv_fill_flush_list_func fill_func, void *data)
>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 6b67ac616789..e8633bc51d56 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -28,7 +28,11 @@
>>   #include <linux/slab.h>
>>   #include <linux/dma-map-ops.h>
>>   #include <linux/set_memory.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/io.h>
>> +#include <linux/hyperv.h>
>>   #include <hyperv/hvhdk.h>
>> +#include <hyperv/hvgdk.h>
>>   #include <asm/mshyperv.h>
>>
>>   u64 hv_current_partition_id = HV_PARTITION_ID_SELF;
>> @@ -78,6 +82,8 @@ static struct ctl_table_header *hv_ctl_table_hdr;
>>   u8 * __percpu *hv_synic_eventring_tail;
>>   EXPORT_SYMBOL_GPL(hv_synic_eventring_tail);
>>
>> +struct hv_vp_assist_page **hv_vp_assist_page;
>> +EXPORT_SYMBOL_GPL(hv_vp_assist_page);
>>   /*
>>    * Hyper-V specific initialization and shutdown code that is
>>    * common across all architectures.  Called from architecture
>> @@ -92,6 +98,9 @@ void __init hv_common_free(void)
>>   	if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE)
>>   		hv_kmsg_dump_unregister();
>>
>> +	kfree(hv_vp_assist_page);
>> +	hv_vp_assist_page = NULL;
>> +
>>   	kfree(hv_vp_index);
>>   	hv_vp_index = NULL;
>>
>> @@ -394,6 +403,23 @@ int __init hv_common_init(void)
>>   	for (i = 0; i < nr_cpu_ids; i++)
>>   		hv_vp_index[i] = VP_INVAL;
>>
>> +	/*
>> +	 * The VP assist page is useless to a TDX guest: the only use we
>> +	 * would have for it is lazy EOI, which can not be used with TDX.
>> +	 */
>> +	if (hv_isolation_type_tdx()) {
>> +		hv_vp_assist_page = NULL;
>> +#ifdef CONFIG_X86_64
>> +		ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>> +#endif
> 
> I realize that this #ifdef went away for the reason I flagged in v1 of
> this patch set, but it's back again for a different reason.
> 
> Let me suggest another approach. hv_common_init() is called from
> both the x86/64 and arm64 hyperv_init() functions. Immediately after
> the call to hv_common_init() in the x86/64 hyperv_init(), test
> hv_vp_assist_page for NULL and clear
> HV_X64_ENLIGHTENED_VMCS_RECOMMENDED if it is. No #ifdef is
> needed, and x86/64 specific hackery stays under arch/x86 instead of
> being in common code.

Acked. Thanks.

> 
>> +	} else {
>> +		hv_vp_assist_page = kzalloc_objs(*hv_vp_assist_page, nr_cpu_ids);
>> +		if (!hv_vp_assist_page) {
>> +			hv_common_free();
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +
>>   	return 0;
>>   }
>>
>> @@ -471,6 +497,8 @@ void __init ms_hyperv_late_init(void)
>>
>>   int hv_common_cpu_init(unsigned int cpu)
>>   {
>> +	union hv_vp_assist_msr_contents msr = { 0 };
>> +	struct hv_vp_assist_page **hvp;
>>   	void **inputarg, **outputarg;
>>   	u8 **synic_eventring_tail;
>>   	u64 msr_vp_index;
>> @@ -539,7 +567,53 @@ int hv_common_cpu_init(unsigned int cpu)
>>   						sizeof(u8), flags);
>>   		/* No need to unwind any of the above on failure here */
>>   		if (unlikely(!*synic_eventring_tail))
>> -			ret = -ENOMEM;
>> +			return -ENOMEM;
>> +	}
>> +
>> +	if (!hv_vp_assist_page)
>> +		return ret;
>> +
>> +	hvp = &hv_vp_assist_page[cpu];
>> +	if (hv_root_partition()) {
>> +		/*
>> +		 * For root partition we get the hypervisor provided VP assist
>> +		 * page, instead of allocating a new page.
>> +		 */
>> +		msr.as_uint64 = hv_get_msr(HV_MSR_VP_ASSIST_PAGE);
>> +		*hvp = memremap(msr.pfn << HV_VP_ASSIST_PAGE_ADDRESS_SHIFT,
>> +				HV_HYP_PAGE_SIZE, MEMREMAP_WB);
>> +	} else {
>> +		/*
>> +		 * The VP assist page is an "overlay" page (see Hyper-V TLFS's
>> +		 * Section 5.2.1 "GPA Overlay Pages"). Here it must be zeroed
>> +		 * out to make sure that on x86/x64, we always write the EOI MSR in
>> +		 * hv_apic_eoi_write() *after* the EOI optimization is disabled
>> +		 * in hv_cpu_die(), otherwise a CPU may not be stopped in the
>> +		 * case of CPU offlining and the VM will hang.
>> +		 */
>> +		if (!*hvp) {
>> +			*hvp = __vmalloc(HV_HYP_PAGE_SIZE, flags | __GFP_ZERO);
>> +
>> +			/*
>> +			 * Hyper-V should never specify a VM that is a Confidential
>> +			 * VM and also running in the root partition. Root partition
>> +			 * is blocked to run in Confidential VM. So only decrypt assist
>> +			 * page in non-root partition here.
>> +			 */
>> +			if (*hvp &&
>> +			    !ms_hyperv.paravisor_present &&
>> +			    hv_isolation_type_snp()) {
>> +				WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> +				memset(*hvp, 0, HV_HYP_PAGE_SIZE);
>> +			}
>> +		}
>> +
>> +		if (*hvp)
>> +			msr.pfn = page_to_hvpfn(vmalloc_to_page(*hvp));
> 
> Your Patch 0 changelog mentions adding a comment about vmalloc_to_pfn(), which
> I didn't see anywhere. I'm not sure what that comment would say, so maybe it
> became unnecessary.

I think I mixed up two things. Changelog was about your suggestion to 
add "x86/x64" in above comment about GPA Overlay Pages.
I also changed this function to page_to_hvpfn(vmalloc_to_page(*hvp)) as 
per your suggestion.
Apologies for the confusion.

> 
>> +	}
>> +	if (!WARN_ON(!(*hvp))) {
>> +		msr.enable = 1;
>> +		hv_set_msr(HV_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>>   	}
>>
>>   	return ret;
>> @@ -566,6 +640,24 @@ int hv_common_cpu_die(unsigned int cpu)
>>   		*synic_eventring_tail = NULL;
>>   	}
>>
>> +	if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
>> +		union hv_vp_assist_msr_contents msr = { 0 };
>> +
>> +		if (hv_root_partition()) {
>> +			/*
>> +			 * For root partition the VP assist page is mapped to
>> +			 * hypervisor provided page, and thus we unmap the
>> +			 * page here and nullify it, so that in future we have
>> +			 * correct page address mapped in hv_cpu_init.
>> +			 */
>> +			memunmap(hv_vp_assist_page[cpu]);
>> +			hv_vp_assist_page[cpu] = NULL;
>> +			msr.as_uint64 = hv_get_msr(HV_MSR_VP_ASSIST_PAGE);
>> +			msr.enable = 0;
>> +		}
>> +		hv_set_msr(HV_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>> +	}
>> +
>>   	return 0;
>>   }
>>
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index d37b68238c97..2810aa05dc73 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -25,6 +25,7 @@
>>   #include <linux/nmi.h>
>>   #include <asm/ptrace.h>
>>   #include <hyperv/hvhdk.h>
>> +#include <hyperv/hvgdk.h>
>>
>>   #define VTPM_BASE_ADDRESS 0xfed40000
>>
>> @@ -299,6 +300,16 @@ do { \
>>   #define hv_status_debug(status, fmt, ...) \
>>   	hv_status_printk(debug, status, fmt, ##__VA_ARGS__)
>>
>> +extern struct hv_vp_assist_page **hv_vp_assist_page;
>> +
>> +static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> +{
>> +	if (!hv_vp_assist_page)
>> +		return NULL;
>> +
>> +	return hv_vp_assist_page[cpu];
>> +}
>> +
>>   const char *hv_result_to_string(u64 hv_status);
>>   int hv_result_to_errno(u64 status);
>>   void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
>> @@ -327,6 +338,11 @@ static inline enum hv_isolation_type hv_get_isolation_type(void)
>>   {
>>   	return HV_ISOLATION_TYPE_NONE;
>>   }
>> +
>> +static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
>> +{
>> +	return NULL;
>> +}
>>   #endif /* CONFIG_HYPERV */
>>
>>   #if IS_ENABLED(CONFIG_MSHV_ROOT)
>> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
>> index 056ef7b6b360..c72d04cd5ae4 100644
>> --- a/include/hyperv/hvgdk_mini.h
>> +++ b/include/hyperv/hvgdk_mini.h
>> @@ -149,6 +149,7 @@ struct hv_u128 {
>>   #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT	12
> 
> Can this X64 specific definition of the shift be eliminated entirely,
> and a single common definition for x86/64 and arm64 be used?
> As I understand it, the MSR layout is the same on both architectures.
> The one gotcha is that kvm_hv_set_msr() would need to be updated.
> 
> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK defined below isn't
> used anywhere, so it could go away too.  (The KVM selftest usage has
> its own definition.)
> 
> I realize these are changes to a source code file that is derived from
> Windows, and I'm not sure of the guidelines for such changes. So maybe
> these suggestions have to be ignored ....

The VP assist page definition is common to both x86 and arm64, so the 
address mask and shift can be shared. Also, I don't see the shift and 
mask definitions in the Hyper-V header so seem to be specific to in 
kernel usage.

> 
>>   #define HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_MASK	\
>>   		(~((1ull << HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
>> +#define HV_MSR_VP_ASSIST_PAGE              (HV_X64_MSR_VP_ASSIST_PAGE)
> 
> This is the correct file for this #define, but it should be placed down around
> line 1148 or so with the other HV_MSR_* definitions in terms of HV_X64_MSR_*
> 

Acked.

>>
>>   /* Hyper-V Enlightened VMCS version mask in nested features CPUID */
>>   #define HV_X64_ENLIGHTENED_VMCS_VERSION		0xff
>> @@ -410,6 +411,7 @@ union hv_x64_msr_hypercall_contents {
>>   #if defined(CONFIG_ARM64)
>>   #define HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE	BIT(8)
>>   #define HV_STIMER_DIRECT_MODE_AVAILABLE		BIT(13)
>> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT 12
>>   #endif /* CONFIG_ARM64 */
>>
>>   #if defined(CONFIG_X86)
>> @@ -1163,6 +1165,8 @@ enum hv_register_name {
>>   #define HV_MSR_STIMER0_CONFIG	(HV_X64_MSR_STIMER0_CONFIG)
>>   #define HV_MSR_STIMER0_COUNT	(HV_X64_MSR_STIMER0_COUNT)
>>
>> +#define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT
>> +
>>   #elif defined(CONFIG_ARM64) /* CONFIG_X86 */
>>
>>   #define HV_MSR_CRASH_P0		(HV_REGISTER_GUEST_CRASH_P0)
>> @@ -1185,7 +1189,7 @@ enum hv_register_name {
>>
>>   #define HV_MSR_STIMER0_CONFIG	(HV_REGISTER_STIMER0_CONFIG)
>>   #define HV_MSR_STIMER0_COUNT	(HV_REGISTER_STIMER0_COUNT)
>> -
>> +#define HV_MSR_VP_ASSIST_PAGE    (HV_REGISTER_VP_ASSIST_PAGE)
> 
> Nit: This definition is slightly mis-aligned. It has spaces where there
> should be a tab to match the similar definitions above it.
> 

Acked.

>>   #endif /* CONFIG_ARM64 */
>>
>>   union hv_explicit_suspend_register {
>> --
>> 2.43.0
>>

Regards,
Naman


^ permalink raw reply

* [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-29  9:06 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable

In mana driver, the number of IRQs allocated is capped by the
min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
than the vcpu count, we want to utilize all the vCPUs, irrespective of
their NUMA/core bindings.

This is important, especially in the envs where number of vCPUs are so
few that the softIRQ handling overhead on two IRQs on the same vCPU is
much more than their overheads if they were spread across sibling vCPUs.

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU, while some vCPUs have none.

In such cases when many parallel TCP connections are tested, the
throughput drops significantly.

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch
20480		15.65		7.73
10240		15.63		8.93
8192		15.64		9.69
6144		15.64		13.16
4096		15.69		15.75
2048		15.69		15.83
1024		15.71		15.28

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
Changes in v2
 * Removed the unused skip_first_cpu variable
 * fixed exit condition in irq_setup_linear() with len == 0
 * changed return type of irq_setup_linear() as it will always be 0
 * removed the unnecessary rcu_read_lock() in irq_setup_linear()
 * added appropriate comments to indicate expected behaviour when
   IRQs are more than or equal to num_online_cpus()
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 47 ++++++++++++++++---
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..d740d1dc43da 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	} else {
 		/* If dynamic allocation is enabled we have already allocated
 		 * hwc msi
+		 * Also, we make sure in this case the following is always true
+		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
 		 */
 		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
 	}
@@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+/* should be called with cpus_read_lock() held */
+static void irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (len == 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	bool skip_first_cpu = false;
 	int *irqs, irq, err, i;
 
 	irqs = kmalloc_objs(int, nvec);
@@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * first CPU sibling group since they are already affinitized to HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
-		skip_first_cpu = true;
+	if (gc->num_msix_usable <= num_online_cpus()) {
+		err = irq_setup(irqs, nvec, gc->numa_node, true);
+		if (err) {
+			cpus_read_unlock();
+			goto free_irq;
+		}
+	} else {
+		/*
+		 * When num_msix_usable are more than num_online_cpus, we try to
+		 * make sure we are using all vcpus. In such a case NUMA or
+		 * CPU core affinity does not matter.
+		 * Note: in this case the total mana IRQ should always be
+		 * num_online_cpus + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * However, if CPUs went offline since num_msix_usable was
+		 * computed, nvec count will be more than num_online_cpus().
+		 * In such cases remaining extra IRQs will retain their default
+		 * affinity.
+		 */
+		if (nvec > num_online_cpus())
+			dev_dbg(&pdev->dev,
+				"IRQ count %d exceeds online CPU count %d. Some IRQs will share CPU\n",
+				nvec, num_online_cpus());
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
-	if (err) {
-		cpus_read_unlock();
-		goto free_irq;
+		irq_setup_linear(irqs, nvec);
 	}
 
 	cpus_read_unlock();

base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH rc 00/15] Various bug fixes for RDMA drivers in the uapi functions
From: Junxian Huang @ 2026-04-29  7:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Andrew Lunn,
	Broadcom internal kernel review list, Bryan Tan, Eric Dumazet,
	Konstantin Taranov, Jakub Kicinski, Leon Romanovsky, linux-hyperv,
	linux-rdma, netdev, Paolo Abeni, Selvin Xavier, Chengchang Tang,
	Tariq Toukan, Vishnu Dasa, Yishai Hadas
  Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
	Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
	Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
	Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Long Li,
	Lijun Ou, Parav Pandit, patches, Roland Dreier, Roland Dreier,
	Sagi Grimberg, Ajay Sharma, stable, Tariq Toukan, Wei Hu (Xavier),
	Shaobo Xu, Nenglong Zhao
In-Reply-To: <0-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>



On 2026/4/29 0:17, Jason Gunthorpe wrote:
> All were found by Sashiko or Claude AI tools. They vary in severity, but
> are all things that shouldn't be present.
> 
> Jason Gunthorpe (15):
>   RDMA/hns: Fix xarray race in hns_roce_create_srq()
>   RDMA/hns: Fix xarray race in hns_roce_create_qp_common()
>   RDMA/hns: Fix unlocked call to hns_roce_qp_remove()

For hns patches:
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>

Thanks,
Junxian

^ permalink raw reply

* RE: [PATCH] Drivers: hv: vmbus: Improve the logc of reserving fb_mmio on Gen2 VMs
From: Dexuan Cui @ 2026-04-29  3:12 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Long Li, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, matthew.ruffell@canonical.com,
	johansen@templeofstupid.com
  Cc: stable@vger.kernel.org
In-Reply-To: <SN6PR02MB41576A849B6C4967622B4BA8D42A2@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Thursday, April 23, 2026 10:40 AM

Sorry for the late response! I got sidetracked by something else.

> > If vmbus_reserve_fb() in the kdump kernel fails to properly reserve the
> 
> This problem has wider scope than just kdump. Any kexec'ed kernel would see
> the same problem, though kdump is probably the most common case. But the
> discussion here, and the mention of kdump in the code comments, should be
> adjusted accordingly.

Agreed. I'll post v2, which will use "kdump/kexec".

> > framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero [1],
> > there is an MMIO conflict between the drivers hyperv_drm and pci-hyperv.
> 
> You describe an MMIO "conflict" without giving the details. Is that
> intentional to keep the commit message from being too long? It might be

Yes.

> helpful to future readers to say a little more about how PCI devices must not
> use MMIO space that the hypervisor has assigned to the frame buffer.

Will do.

> As you noted in the detailed discussion in the other email thread [2],
> there's a Gen1 VM case that this patch doesn't fix. For completeness,
> perhaps that case should be called out in this commit message.

Will do.
 
> > +	/* Hyper-V CoCo guests do not have a framebuffer device. */
> > +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> > +		return;
> 
> This test is testing feature "A" (mem encryption) in order to determine
> the presence of feature "B" (no framebuffer), because current
> configurations happen to always have "A" and "B" at the same time. But
> the linkage between the features is tenuous, and if configurations should
> change in the future, testing this way could be bogus. It works now, but I'm
> leery of depending on the linkage between "A" and "B".
> 
> You could set up a "can_have_framebuffer" flag in ms_hyperv_init_platform()
> if running in a CVM, and test that flag here. But I'd suggest just dropping
> this optimization. CVMs are always Gen2 (and that's not going to change),
> so they have plenty of low mmio space.

This is not true on a lab host, e.g. I have a TDX VM on a lab host created
by these 2 commands (without the 2nd command, Hyper-V won't allow
the TDX VM to start):

    New-VM -Generation 2 -GuestStateIsolationType Tdx -Name $vmName
    Disable-VMConsoleSupport -VMName $vmName

The low_mmio_base is still 4GB-128MB. In this case, it's not a good idea
to try to reserve the 128MB:

1) the available low MMIO size is smaller than 128MB due to the vTPM
MMIO range.

2) even if we can reserve the 109.25 low mmio range
[0xf8000000-0xfed3ffff], we may not want to do that, just in case
some assigned PCI device has 32-bit BARs.

So, IMO we need to keep the check:
 +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
 +		return;

BTW, I think this may be a slightly better check here:
+        if (hv_is_isolation_supported())
+                return;

A CVM on Hyper-V won't start without the command line
    Disable-VMConsoleSupport -VMName $vmName

IMO this is very unlikely to change in the future, because the Hyper-V
synthetic framebuffer VMBus device is not a trusted device for a CVM,
so there is no reason for Hyper-V to offer such a device to CVMs; even
if the host offers it, currently the guest hv_vmbus driver ignores it.

When we assign a physical PCI GPU device to a CVM, I'm not sure if there
is any framebuffer from the GPU or not. Even if there is, that's a completely
different scenario and not reserving some low MMIO for "framebuffer"
is unrelated: I think hyperv_drm (or the deprecated hyperv_fb) is the only
driver that sets the fb_overlap_ok parameter of vmbus_allocate_mmio().

> And at the moment, CVMs don't
> support PCI devices, 

This is not true: recently I created a "Standard DC16eds v6" TDX CVM
on Azure, and I did see two NVMe local temporary disks in "nvme list"
 (here TDISP is not used). In 2023, we added the commit
2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")
and I believe some users are running CVMs with GPUs.

> so can't encounter a conflict (though conceivably

Correct, since there is no legacy or synthetic framebuffer device for CVMs.

> some new flavor of CVM in the future could support PCI devices).
> 
> > +
> >  	if (efi_enabled(EFI_BOOT)) {
> >  		/* Gen2 VM: get FB base from EFI framebuffer */
> >  		if (IS_ENABLED(CONFIG_SYSFB)) {
> >  			start = sysfb_primary_display.screen.lfb_base;
> >  			size = max_t(__u32,
> sysfb_primary_display.screen.lfb_size, 0x800000);
> > +
> > +			low_mmio_base = hyperv_mmio->start;
> > +			if (!low_mmio_base || low_mmio_base >= SZ_4G ||
> > +			    (start && start < low_mmio_base)) {
> > +				pr_warn("Unexpected low mmio base
> 0x%pa\n", &low_mmio_base);
> > +			} else {
> > +				/*
> > +				 * If the kdump kernel's lfb_base is 0,
> 
> As mentioned earlier, this case isn't just kdump kernels.

Yes, the first kernel also runs here with a non-zero 'start'.

> 
> > +				 * fall back to the low mmio base.
> > +				 */
> > +				if (!start)
> > +					start = low_mmio_base;
> > +				/*
> > +				 * Reserve half of the space below 4GB for
> high
> > +				 * resolutions, but cap the reservation to
> 128MB.
> > +				 */
> > +				size = min((SZ_4G - start) / 2, SZ_128M);
> > +			}
> >  		}
> >  	} else {
> >  		/* Gen1 VM: get FB base from PCI */
> > @@ -2433,6 +2457,8 @@ static void __maybe_unused
> vmbus_reserve_fb(void)
> >  	 */
> >  	for (; !fb_mmio && (size >= 0x100000); size >>= 1)
> >  		fb_mmio = __request_region(hyperv_mmio, start, size,
> fb_mmio_name, 0);
> 
> Just above this "for" loop, "start" is tested for 0. This patch eliminates the main
> reason start might be 0. But I guess it's still possible that the legacy PCI device
> BAR might return 0 for a Gen1 VM?
IMO the legacy PCI BAR's base in a Gen1 VM can't be 0.


> Or you might get 0 if the pr_warn() about low
> mmio base is triggered. But I'm thinking maybe a pr_warn() should be done if
> start is zero.

Ok, will add a pr_warn() here.

> > +
> > +	pr_info("hv_mmio=%pR,%pR fb=%pR\n", hyperv_mmio,
> hyperv_mmio->sibling, fb_mmio);
> 
> Outputting the above info is nice!
> 
> Michael

Thanks for all the good input!  Will post v2 for review.

Thanks,
Dexuan

^ permalink raw reply

* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-29  1:58 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Long Li, lpieralisi@kernel.org, kwilczynski@kernel.org,
	mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
	Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D5BAFAE2134276241FFED42A2@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Thursday, April 23, 2026 10:40 AM
> > ...
> > Another example is: for a Gen2 VM with the below commands:
> >    Set-VM -LowMemoryMappedIoSpace 1GB \
> >           -VMName decui-u2204-gen2-fb
> >    // i.e. the default setting on Azure. Let's ignore CVMs here.

Sorry for the incorrect statement: this is not the default setting
on Azure. The default for regular VMs on Azure should be
"-LowMemoryMappedIoSpace 3GB".  Not sure how I made the
incorrect statement -- I guess I might have confused my local VM
with my Azure VM, and at some moment, I might have mistaken
the meaning of the "-LowMemoryMappedIoSpace" parameter:
for that local VM, I might somehow incorrectly though that the
param means low_mmio_base rather than low_mmio_size.

> FWIW, I'm seeing that in Gen2 VMs in Azure, the low_mmio_size
> is 3 GiB. I'm looking at a D16ds_v5, and a D16lds_v6. The v5 VM
> is newly created, while the v6 has been around for a few months.

This is also my observation, after I double checked my Azure VM.

> In a CVM, the low_mmio_size should be 1 GiB. This overall example
> is still correct -- it's just the comment that I have doubts about. Or
> maybe you are looking at a different VM size that has a different
> default?

For CVMs, yes, the low_mmio_size is 1GB.

> 
> Some years back, I had gotten into a discussion with Azure about
> this size because the swiotlb memory wants to be allocated below
> the 4 GiB line, and reserving 3 GiB for low mmio limited the size
> of the swiotlb. CVMs were changed to have only 1 GiB for low
> mmio because they need a larger swiotlb.

Right, I also remember the story. :-)

> > With the below command:
> >    Set-VM -LowMemoryMappedIoSpace 3GB \
> >           -VMName decui-u2204-gen2-fb
> >    // i.e. the default setting on Azure. Unlike x86-64, an ARM64
> >    // VM on Azure has 3GB of mmio below 4GB.
> 
> See my previous comment on the same topic. I think arm64
> and x86/x64 are the same.

Agreed.

> Question about Gen 1 VMs: If the Linux frame buffer driver moves
> the frame buffer somewhere other than the default location, and
> then the VM does a kexec/kdump, what does the legacy PCI graphic
> device BAR report as the frame buffer location? Does it *always*
> report 4G-128MB, or does it report the new location? I can run

It always reports 4G-128MB. 
BTW,  I suspect a Gen2 VM may have the same issue, i.e. 
currently we only reserve 8MB below 4GB; if hyperv_drm uses
high MMIO, I suspect the UEFI firmware would still report the
same original low MMIO framebuffer base/size to the kdump kernel,
but there is no easy way to verify this for Gen2 VMs...

> an experiment to find out, but maybe you've already done so and
> not reported that detail here.
> 
> Michael

I have a Gen1 Ubuntu 22.04 VM, and I run the below commands:
Set-VM -LowMemoryMappedIoSpace 128MB -VMName decui-u2204-gen1-fb
Set-VMVideo -VMName decui-u2204-gen1-fb -HorizontalResolution 7680 -VerticalResolution 4320 -ResolutionType Single

When the VM boots up, we reserve 64MB at 4G-128MB:
[   11.492075] hv_vmbus: hv_mmio=[mem 0xf8000000-0xfed3ffff],[mem 0xfe0000000-0xfffffffff] fb=[mem 0xf8000000-0xfbffffff]

Since the required mmio size in the hyperv-drm driver is 128MB:
[   28.631923] hyperv_connect_vsp: hyperv_drm: mmio_megabytes=128 MB
the driver has to allocate MMIO from the high MMIO space, because 
we only reserve 64MB below 4GB, and the available low_mmio_size is
smaller than 128MB due to the vTPM MMIO range:

# cat /proc/iomem
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c7fff : Video ROM
000e0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-f7feffff : System RAM
  d7000000-f6ffffff : Crash kernel
f7ff0000-f7ffefff : ACPI Tables
f7fff000-f7ffffff : ACPI Non-volatile Storage
f8000000-fffbffff : PCI Bus 0000:00
  f8000000-fbffffff : 0000:00:08.0
  fec00000-fec003ff : IOAPIC 0
  fee00000-fee00fff : PNP0C02:01
fffc0000-ffffffff : PNP0C01:00
100000000-507ffffff : System RAM
  281600000-28295449f : Kernel code
  282a00000-283746fff : Kernel rodata
  283800000-283c5287f : Kernel data
  28411a000-2845fffff : Kernel bss
fe0000000-fffffffff : PCI Bus 0000:00
  fe0000000-fe7ffffff : 5620e0c7-8062-4dce-aeb7-520c7ef76171

However,  when the kdump kernel starts to run, and I print the
pci_resource_start(pdev, 0) and pci_resource_len(pdev, 0)
from vmbus_reserve_fb(), I still see 4G-128MB:
[   12.506159] Gen1 VM: start=0xf8000000, size=0x4000000

In this case, we can't really fix the MMIO conflict, e.g.
if both hv_pci and hyperv_drm are built as modules, then
the order of loading them can be nondeterministic:if the order
in the first kernel is different from the order in
the kdump kernel, we run into trouble.

If the order is deterministic (e.g. hv_pci is
built-in, and hyperv_drm is built as a module),
we should be good since both allocates MMIO from
the high MMIO range in a deterministic way.

Thanks,
Dexuan

^ permalink raw reply

* Re: [PATCH] hv_sock: fix ARM64 support
From: Jakub Kicinski @ 2026-04-28 23:48 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Hamza Mahfooz, netdev@vger.kernel.org, KY Srinivasan,
	Haiyang Zhang, Wei Liu, Long Li, Stefano Garzarella,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Michael Kelley, Himadri Pandya, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB69211500C7F60FC29F1BAA79BF372@SA1PR21MB6921.namprd21.prod.outlook.com>

On Tue, 28 Apr 2026 21:24:59 +0000 Dexuan Cui wrote:
> > Sent: Tuesday, April 28, 2026 5:54 AM
> > Subject: [PATCH] hv_sock: fix ARM64 support  
> 
> Typically, for a change to net/, you'd want to add a "net" or "net-next"
> after the "PATCH", i.e.
> 
> [PATCH net] 
> or 
> [PATCH net v2]
> 
> See "Documentation/process/maintainer-netdev.rst"

Speaking of Documentation/process/maintainer-netdev.rst:

  Reviewer guidance
  -----------------
  [...]
  Reviewers are highly encouraged to do more in-depth review of submissions
  and not focus exclusively on process issues, trivial or subjective
  matters like code formatting, tags etc.
  
See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#reviewer-guidance

^ permalink raw reply

* Re: [PATCH v4 3/3] mshv: unmap debugfs stats pages on kexec
From: Stanislav Kinsburskii @ 2026-04-28 23:31 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, Anirudh Rayabharam, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-4-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:54PM -0700, Jork Loeser wrote:
> On L1VH, debugfs stats pages are overlay pages: the kernel allocates
> them and registers the GPAs with the hypervisor via
> HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
> hypervisor across kexec. If the kexec'd kernel reuses those physical
> pages, the hypervisor's overlay semantics cause a machine check
> exception.
> 
> Fix this by calling mshv_debugfs_exit() from the reboot notifier,
> which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
> kexec. This releases the overlay bindings so the physical pages can be
> safely reused. Guard mshv_debugfs_exit() against being called when
> init failed.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

> ---
>  drivers/hv/mshv_debugfs.c | 7 ++++++-
>  drivers/hv/mshv_synic.c   | 1 +
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> index 418b6dc8f3c2..3c3e02237ae9 100644
> --- a/drivers/hv/mshv_debugfs.c
> +++ b/drivers/hv/mshv_debugfs.c
> @@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
>  
>  	mshv_debugfs = debugfs_create_dir("mshv", NULL);
>  	if (IS_ERR(mshv_debugfs)) {
> +		err = PTR_ERR(mshv_debugfs);
> +		mshv_debugfs = NULL;
>  		pr_err("%s: failed to create debugfs directory\n", __func__);
> -		return PTR_ERR(mshv_debugfs);
> +		return err;
>  	}
>  
>  	if (hv_root_partition()) {
> @@ -710,6 +712,9 @@ int __init mshv_debugfs_init(void)
>  
>  void mshv_debugfs_exit(void)
>  {
> +	if (!mshv_debugfs)
> +		return;
> +
>  	mshv_debugfs_parent_partition_remove();
>  
>  	if (hv_root_partition()) {
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 978a1cace341..88170ce6b83f 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -723,6 +723,7 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
>  static int mshv_synic_reboot_notify(struct notifier_block *nb,
>  			      unsigned long code, void *unused)
>  {
> +	mshv_debugfs_exit();
>  	cpuhp_remove_state(synic_cpuhp_online);
>  	return 0;
>  }
> -- 
> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH v4 2/3] mshv: clean up SynIC state on kexec for L1VH
From: Stanislav Kinsburskii @ 2026-04-28 23:28 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, Anirudh Rayabharam, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-3-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:53PM -0700, Jork Loeser wrote:
> The reboot notifier that tears down the SynIC cpuhp state guards the
> cleanup with hv_root_partition(), so on L1VH (where
> hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
> cleaned up before kexec. The kexec'd kernel then inherits stale
> unmasked SINTs and an enabled SIRBP pointing to freed memory.
> 
> Remove the hv_root_partition() guard so the cleanup runs for all
> parent partitions.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>


Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

> ---
>  drivers/hv/mshv_synic.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 2db3b0192eac..978a1cace341 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -723,9 +723,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
>  static int mshv_synic_reboot_notify(struct notifier_block *nb,
>  			      unsigned long code, void *unused)
>  {
> -	if (!hv_root_partition())
> -		return 0;
> -
>  	cpuhp_remove_state(synic_cpuhp_online);
>  	return 0;
>  }
> -- 
> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Stanislav Kinsburskii @ 2026-04-28 23:27 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Michael Kelley, Anirudh Rayabharam, linux-kernel, linux-arch
In-Reply-To: <20260427213855.1675044-2-jloeser@linux.microsoft.com>

On Mon, Apr 27, 2026 at 02:38:52PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
> 
> Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
> from under VMBus causes its later cleanup to write SynIC MSRs while
> SynIC is disabled, which the hypervisor does not tolerate.
> 
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
>   and nested root partition); on a non-nested root partition VMBus
>   does not run, so MSHV must enable/disable them
> 
> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
> produces wrong addresses on ARM64 with 64K pages.
> 
> Note that initialization order matters - VMBUS first, MSHV second,
> and the reverse on de-init. Ideally, we would want a dedicated SYNIC
> driver that replaces the cross-dependencies with a clear API and
> dynamic tracking. Such refactor should go into its own dedicated
> series, outside of this kexec fix series.
> 
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
>  drivers/hv/hv.c         |   3 +
>  drivers/hv/mshv_synic.c | 150 ++++++++++++++++++++++++++--------------
>  2 files changed, 103 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index ae60fd542292..ef4b1b03395d 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -272,6 +272,9 @@ void hv_synic_free(void)
>  /*
>   * hv_hyp_synic_enable_regs - Initialize the Synthetic Interrupt Controller
>   * with the hypervisor.
> + *
> + * Note: When MSHV is present, mshv_synic_cpu_init() intializes further
> + * registers later.
>   */
>  void hv_hyp_synic_enable_regs(unsigned int cpu)
>  {
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index e2288a726fec..2db3b0192eac 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -13,6 +13,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/cpuhotplug.h>
> +#include <linux/hyperv.h>
>  #include <linux/reboot.h>
>  #include <asm/mshyperv.h>
>  #include <linux/acpi.h>
> @@ -456,46 +457,75 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
>  	union hv_synic_sint sint;
> -	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  			&spages->synic_event_flags_page;
>  	struct hv_synic_event_ring_page **event_ring_page =
>  			&spages->synic_event_ring_page;
> +	/*
> +	 * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
> +	 * See hv_hyp_synic_enable_regs() for that initialization.
> +	 */
> +	bool vmbus_active = hv_vmbus_exists();
>  
> -	/* Setup the Synic's message page */
> +	/*
> +	 * Map the SYNIC message page. When VMBus is not active the
> +	 * hypervisor pre-provisions the SIMP GPA but may not set
> +	 * simp_enabled — enable it here.
> +	 */
>  	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> -	simp.simp_enabled = true;
> +	if (!vmbus_active) {
> +		simp.simp_enabled = true;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	}
>  	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
>  			     HV_HYP_PAGE_SIZE,
>  			     MEMREMAP_WB);
>  
>  	if (!(*msg_page))
> -		return -EFAULT;
> -
> -	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +		goto cleanup_simp;

It would be cleaner (and simpler to read), if there would be another
goto label to only unset HV_MSR_SIMP instead of checking *msg_page for
NULL again in the cleanup_simp label.

This applies to all the goto labels in this function.

Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

>  
> -	/* Setup the Synic's event flags page */
> +	/*
> +	 * Map the event flags page. Same as SIMP: enable when
> +	 * VMBus is not active, already enabled by VMBus otherwise.
> +	 */
>  	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> -	siefp.siefp_enabled = true;
> -	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
> -				     PAGE_SIZE, MEMREMAP_WB);
> +	if (!vmbus_active) {
> +		siefp.siefp_enabled = true;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	}
> +	*event_flags_page = memremap(siefp.base_siefp_gpa << HV_HYP_PAGE_SHIFT,
> +				     HV_HYP_PAGE_SIZE, MEMREMAP_WB);
>  
>  	if (!(*event_flags_page))
> -		goto cleanup;
> -
> -	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +		goto cleanup_siefp;
>  
>  	/* Setup the Synic's event ring page */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> -	sirbp.sirbp_enabled = true;
> -	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> -				    PAGE_SIZE, MEMREMAP_WB);
>  
> -	if (!(*event_ring_page))
> -		goto cleanup;
> +	if (hv_root_partition()) {
> +		*event_ring_page = memremap(sirbp.base_sirbp_gpa << HV_HYP_PAGE_SHIFT,
> +					    HV_HYP_PAGE_SIZE, MEMREMAP_WB);
>  
> +		if (!(*event_ring_page))
> +			goto cleanup_siefp;
> +	} else {
> +		/*
> +		 * On L1VH the hypervisor does not provide a SIRBP page.
> +		 * Allocate one and program its GPA into the MSR.
> +		 */
> +		*event_ring_page = (struct hv_synic_event_ring_page *)
> +			get_zeroed_page(GFP_KERNEL);
> +
> +		if (!(*event_ring_page))
> +			goto cleanup_siefp;
> +
> +		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
> +				>> HV_HYP_PAGE_SHIFT;
> +	}
> +
> +	sirbp.sirbp_enabled = true;
>  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>  
>  	if (mshv_sint_irq != -1)
> @@ -518,28 +548,30 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
>  
> -	/* Enable global synic bit */
> -	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> -	sctrl.enable = 1;
> -	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	/* When VMBus is active it already enabled SCONTROL. */
> +	if (!vmbus_active) {
> +		union hv_synic_scontrol sctrl;
> +
> +		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +		sctrl.enable = 1;
> +		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	}
>  
>  	return 0;
>  
> -cleanup:
> -	if (*event_ring_page) {
> -		sirbp.sirbp_enabled = false;
> -		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> -		memunmap(*event_ring_page);
> -	}
> -	if (*event_flags_page) {
> +cleanup_siefp:
> +	if (*event_flags_page)
> +		memunmap(*event_flags_page);
> +	if (!vmbus_active) {
>  		siefp.siefp_enabled = false;
>  		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> -		memunmap(*event_flags_page);
>  	}
> -	if (*msg_page) {
> +cleanup_simp:
> +	if (*msg_page)
> +		memunmap(*msg_page);
> +	if (!vmbus_active) {
>  		simp.simp_enabled = false;
>  		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> -		memunmap(*msg_page);
>  	}
>  
>  	return -EFAULT;
> @@ -548,16 +580,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  static int mshv_synic_cpu_exit(unsigned int cpu)
>  {
>  	union hv_synic_sint sint;
> -	union hv_synic_simp simp;
> -	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
> -	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  		&spages->synic_event_flags_page;
>  	struct hv_synic_event_ring_page **event_ring_page =
>  		&spages->synic_event_ring_page;
> +	/* VMBus owns SIMP/SIEFP/SCONTROL when it is active */
> +	bool vmbus_active = hv_vmbus_exists();
>  
>  	/* Disable the interrupt */
>  	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
> @@ -574,28 +605,47 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
>  	if (mshv_sint_irq != -1)
>  		disable_percpu_irq(mshv_sint_irq);
>  
> -	/* Disable Synic's event ring page */
> +	/* Disable SYNIC event ring page owned by MSHV */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
>  	sirbp.sirbp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> -	memunmap(*event_ring_page);
>  
> -	/* Disable Synic's event flags page */
> -	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> -	siefp.siefp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +	if (hv_root_partition()) {
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		memunmap(*event_ring_page);
> +	} else {
> +		sirbp.base_sirbp_gpa = 0;
> +		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> +		free_page((unsigned long)*event_ring_page);
> +	}
> +
> +	/*
> +	 * Release our mappings of the message and event flags pages.
> +	 * When VMBus is not active, we enabled SIMP/SIEFP — disable
> +	 * them. Otherwise VMBus owns the MSRs — leave them.
> +	 */
>  	memunmap(*event_flags_page);
> +	if (!vmbus_active) {
> +		union hv_synic_simp simp;
> +		union hv_synic_siefp siefp;
>  
> -	/* Disable Synic's message page */
> -	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> -	simp.simp_enabled = false;
> -	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +		siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> +		siefp.siefp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +
> +		simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> +		simp.simp_enabled = false;
> +		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> +	}
>  	memunmap(*msg_page);
>  
> -	/* Disable global synic bit */
> -	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> -	sctrl.enable = 0;
> -	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	/* When VMBus is active it owns SCONTROL — leave it. */
> +	if (!vmbus_active) {
> +		union hv_synic_scontrol sctrl;
> +
> +		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> +		sctrl.enable = 0;
> +		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> +	}
>  
>  	return 0;
>  }
> -- 
> 2.43.0
> 

^ permalink raw reply

* [PATCH] mshv: Simplify GPA map/unmap hypercall helpers
From: Stanislav Kinsburskii @ 2026-04-28 23:21 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
preceding bug-fix patches:

Move "done += completed" before the status checks so that pages mapped
by a partially-successful batch are included in the error cleanup unmap.
Previously these mappings were leaked on failure.

While here, improve type safety and readability:
 - Change "int done" to "u64 done" to match the u64 page_count it is
   compared against, avoiding signed/unsigned comparison hazards.
 - Use u64 for loop iteration and batch size variables consistently.
 - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
 - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
 - Simplify the error-path unmap to use "done << large_shift" directly
   instead of mutating done in place.

Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c |   55 +++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 35 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index e5992c324904a..f5f205a397834 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	struct hv_input_map_gpa_pages *input_page;
 	u64 status, *pfnlist;
 	unsigned long irq_flags, large_shift = 0;
-	int ret = 0, done = 0;
-	u64 page_count = page_struct_count;
+	u64 done = 0, page_count = page_struct_count;
+	int ret = 0;
 
 	if (page_count == 0 || (pages && mmio_spa))
 		return -EINVAL;
@@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 	}
 
 	while (done < page_count) {
-		ulong i, completed, remain = page_count - done;
-		int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
+		u64 i, completed, remain = page_count - done;
+		u64 rep_count = min(remain, (u64)HV_MAP_GPA_BATCH_SIZE);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -224,23 +224,13 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		input_page->map_flags = flags;
 		pfnlist = input_page->source_gpa_page_list;
 
-		for (i = 0; i < rep_count; i++)
-			if (flags & HV_MAP_GPA_NO_ACCESS) {
+		for (i = 0; i < rep_count; i++) {
+			if (flags & HV_MAP_GPA_NO_ACCESS)
 				pfnlist[i] = 0;
-			} else if (pages) {
-				u64 index = (done + i) << large_shift;
-
-				if (index >= page_struct_count) {
-					ret = -EINVAL;
-					break;
-				}
-				pfnlist[i] = page_to_pfn(pages[index]);
-			} else {
+			else if (pages)
+				pfnlist[i] = page_to_pfn(pages[(done + i) << large_shift]);
+			else
 				pfnlist[i] = mmio_spa + done + i;
-			}
-		if (ret) {
-			local_irq_restore(irq_flags);
-			break;
 		}
 
 		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
@@ -248,29 +238,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 		local_irq_restore(irq_flags);
 
 		completed = hv_repcomp(status);
+		done += completed;
 
 		if (hv_result_needs_memory(status)) {
 			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
 						    HV_MAP_GPA_DEPOSIT_PAGES);
 			if (ret)
 				break;
-
 		} else if (!hv_result_success(status)) {
 			ret = hv_result_to_errno(status);
 			break;
 		}
-
-		done += completed;
 	}
 
 	if (ret && done) {
 		u32 unmap_flags = 0;
 
-		if (flags & HV_MAP_GPA_LARGE_PAGE) {
+		if (flags & HV_MAP_GPA_LARGE_PAGE)
 			unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
-			done <<= large_shift;
-		}
-		hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
+		hv_call_unmap_gpa_pages(partition_id, gfn,
+					done << large_shift, unmap_flags);
 	}
 
 	return ret;
@@ -305,7 +292,7 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 	struct hv_input_unmap_gpa_pages *input_page;
 	u64 status, page_count = page_count_4k;
 	unsigned long irq_flags, large_shift = 0;
-	int ret = 0, done = 0;
+	u64 done = 0;
 
 	if (page_count == 0)
 		return -EINVAL;
@@ -319,8 +306,8 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 	}
 
 	while (done < page_count) {
-		ulong completed, remain = page_count - done;
-		int rep_count = min(remain, HV_UMAP_GPA_PAGES);
+		u64 completed, remain = page_count - done;
+		u64 rep_count = min(remain, (u64)HV_UMAP_GPA_PAGES);
 
 		local_irq_save(irq_flags);
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -333,15 +320,13 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
 		local_irq_restore(irq_flags);
 
 		completed = hv_repcomp(status);
-		if (!hv_result_success(status)) {
-			ret = hv_result_to_errno(status);
-			break;
-		}
-
 		done += completed;
+
+		if (!hv_result_success(status))
+			return hv_result_to_errno(status);
 	}
 
-	return ret;
+	return 0;
 }
 
 int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,



^ permalink raw reply related

* Re: [PATCH v2] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Stanislav Kinsburskii @ 2026-04-28 23:00 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41578863BCE23D41B6A521B8D4372@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Apr 28, 2026 at 12:20:35AM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, April 27, 2026 7:44 AM
> > 
> > Restore interrupt state before breaking out of the loop on error.
> > 
> > The irq_flags are saved before entering the loop, but the early exit
> > path on error fails to restore them. This leaves interrupts in an
> > inconsistent state and can lead to lockdep warnings or other
> > interrupt-related issues.
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_hv_call.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index ab210a7fcb8c3..61291ec6f3468 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -229,8 +229,10 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> >  			} else {
> >  				pfnlist[i] = mmio_spa + done + i;
> >  			}
> > -		if (ret)
> > +		if (ret) {
> > +			local_irq_restore(irq_flags);
> >  			break;
> > +		}
> > 
> 
> This looks good for fixing the immediate bug.
> 
> But I'd note that this error path occurs solely based on the
> if (index >= page_struct_count) test in the preceding 'for' loop. That test is a
> "can't happen" sanity test that never triggers if hv_do_map_gpa_hcall()
> is coded correctly. At the beginning of the function there are validations of
> the input arguments, which is reasonable. But this sanity test isn't based
> on the input arguments, and it adds non-trivial complexity to the code
> because of the nested loops and the need to figure out where the two
> "break" statements go. I'd argue for dropping the sanity test entirely,
> along with this test of 'ret' and the need to restore the interrupt state.
> 

Fair enough. Let me rework this function (and it's unmap peer).

Thanks,
Stanislav

> Michael

^ permalink raw reply

* Re: [PATCH] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Stanislav Kinsburskii @ 2026-04-28 22:49 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260424-merry-elfish-fossa-885586@anirudhrb>

On Fri, Apr 24, 2026 at 02:35:09PM +0000, Anirudh Rayabharam wrote:
> On Wed, Apr 22, 2026 at 12:15:28AM +0000, Stanislav Kinsburskii wrote:
> > Restore interrupt state before breaking out of the loop on error.
> > 
> > The irq_flags are saved before entering the loop, but the early exit
> > path on error fails to restore them. This leaves interrupts in an
> > inconsistent state and can lead to lockdep warnings or other
> > interrupt-related issues.
> > 
> > Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_hv_call.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > index 7ed623668c8ec..6381f949d9d91 100644
> > --- a/drivers/hv/mshv_root_hv_call.c
> > +++ b/drivers/hv/mshv_root_hv_call.c
> > @@ -237,8 +237,10 @@ static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
> 
> Umm... I don't see this function in the hyperv-next tree at all.
> 

Please see v2.

Thanks,
Stanislav

> Anirudh.
> 
> >  			} else {
> >  				pfnlist[i] = mmio_spa + done + i;
> >  			}
> > -		if (ret)
> > +		if (ret) {
> > +			local_irq_restore(irq_flags);
> >  			break;
> > +		}
> >  
> >  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> >  					     input_page, NULL);
> > 
> > 

^ permalink raw reply

* [PATCH] mshv: Add dedicated ioctl for GVA to GPA translation
From: Stanislav Kinsburskii @ 2026-04-28 22:48 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Add an MSHV_TRANSLATE_GVA ioctl on the VP fd that wraps
HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX with transparent fault-in handling for
movable memory regions. The passthrough path for this hypercall is retained
for backward compatibility.

When guest-backing pages reside in movable memory regions, the mmu_notifier
invalidation path remaps them to NO_ACCESS in the hypervisor's second-level
address translation tables. If the VMM issues a GVA translation (e.g.
during MMIO emulation) while a page-table page is invalidated, the
hypervisor returns HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS. The VMM cannot
resolve this on its own.

The new ioctl detects this transient GPA access failure, faults the page
back in via mshv_region_handle_gfn_fault(), and retries the translation
until it succeeds or an unrecoverable error occurs.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root.h         |    3 ++
 drivers/hv/mshv_root_hv_call.c |   37 +++++++++++++++++++++
 drivers/hv/mshv_root_main.c    |   69 ++++++++++++++++++++++++++++++++++++++++
 include/hyperv/hvgdk_mini.h    |    1 +
 include/hyperv/hvhdk.h         |   41 ++++++++++++++++++++++++
 include/uapi/linux/mshv.h      |   10 ++++++
 6 files changed, 161 insertions(+)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 1f086dcb7aa1a..2e6c4414740cc 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -290,6 +290,9 @@ int hv_call_delete_vp(u64 partition_id, u32 vp_index);
 int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
 				     u64 dest_addr,
 				     union hv_interrupt_control control);
+int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
+					 u64 flags, u64 gva, u64 *gfn,
+					 struct hv_translate_gva_result_ex *result);
 int hv_call_clear_virtual_interrupt(u64 partition_id);
 int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
 				  union hv_gpa_page_access_state_flags state_flags,
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index e5992c324904a..9ff4ba5373f59 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -692,6 +692,43 @@ int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code,
 	return 0;
 }
 
+int hv_call_translate_virtual_address_ex(u32 vp_index, u64 partition_id,
+					 u64 flags, u64 gva, u64 *gfn,
+					 struct hv_translate_gva_result_ex *result)
+{
+	struct hv_input_translate_virtual_address *input;
+	struct hv_output_translate_virtual_address_ex *output;
+	unsigned long irq_flags;
+	u64 status;
+
+	local_irq_save(irq_flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	memset(input, 0, sizeof(*input));
+	input->partition_id = partition_id;
+	input->vp_index = vp_index;
+	input->control_flags = flags;
+	input->gva_page = gva >> HV_HYP_PAGE_SHIFT;
+
+	status = hv_do_hypercall(HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX,
+				 input, output);
+
+	if (!hv_result_success(status)) {
+		local_irq_restore(irq_flags);
+		pr_err("%s: %s\n", __func__, hv_result_to_string(status));
+		return hv_result_to_errno(status);
+	}
+
+	*result = output->translation_result;
+	*gfn = output->gpa_page;
+
+	local_irq_restore(irq_flags);
+
+	return 0;
+}
+
 int
 hv_call_clear_virtual_interrupt(u64 partition_id)
 {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bd1359eb58dd4..2d7b6923415a8 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -898,6 +898,72 @@ mshv_vp_ioctl_get_set_state(struct mshv_vp *vp,
 	return 0;
 }
 
+static bool mshv_gpa_fault_retryable(u32 result_code)
+{
+	/*
+	 * Note: HV_TRANSLATE_GVA_GPA_UNMAPPED is intentionally not handled
+	 * here. The guest page table cannot be unmapped under normal
+	 * operation. It may be mapped with no access during page moves,
+	 * but a truly unmapped state indicates a kernel driver bug.
+	 * Retrying in this case would only mask the underlying problem of
+	 * an unmapped guest page table.
+	 */
+	return result_code == HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS;
+}
+
+static long
+mshv_vp_ioctl_translate_gva(struct mshv_vp *vp, void __user *user_args)
+{
+	struct mshv_partition *partition = vp->vp_partition;
+	struct mshv_translate_gva args;
+	struct hv_translate_gva_result_ex result;
+	u64 gfn, gpa;
+	int ret;
+
+	if (copy_from_user(&args, user_args, sizeof(args)))
+		return -EFAULT;
+
+	do {
+		ret = hv_call_translate_virtual_address_ex(vp->vp_index,
+							   partition->pt_id,
+							   args.flags, args.gva,
+							   &gfn, &result);
+		if (ret)
+			return ret;
+
+		if (mshv_gpa_fault_retryable(result.result_code)) {
+			struct mshv_mem_region *region;
+			bool faulted;
+
+			region = mshv_partition_region_by_gfn_get(partition,
+								  gfn);
+			if (!region)
+				return -EFAULT;
+
+			faulted = false;
+			if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
+				faulted = mshv_region_handle_gfn_fault(region,
+								       gfn);
+			mshv_region_put(region);
+
+			if (!faulted)
+				return -EFAULT;
+
+			cond_resched();
+		}
+	} while (mshv_gpa_fault_retryable(result.result_code));
+
+	gpa = (gfn << PAGE_SHIFT) | (args.gva & ~PAGE_MASK);
+
+        if (copy_to_user(args.result, &result, sizeof(*args.result)))
+                return -EFAULT;
+
+	if (copy_to_user(args.gpa, &gpa, sizeof(*args.gpa)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static long
 mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
@@ -917,6 +983,9 @@ mshv_vp_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 	case MSHV_SET_VP_STATE:
 		r = mshv_vp_ioctl_get_set_state(vp, (void __user *)arg, true);
 		break;
+	case MSHV_TRANSLATE_GVA:
+		r = mshv_vp_ioctl_translate_gva(vp, (void __user *)arg);
+		break;
 	case MSHV_ROOT_HVCALL:
 		r = mshv_ioctl_passthru_hvcall(vp->vp_partition, false,
 					       (void __user *)arg);
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 6a4e8b9d570fd..ac901801fd397 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -484,6 +484,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_CONNECT_PORT				0x0096
 #define HVCALL_START_VP					0x0099
 #define HVCALL_GET_VP_INDEX_FROM_APIC_ID		0x009a
+#define HVCALL_TRANSLATE_VIRTUAL_ADDRESS_EX		0x00ac
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
 #define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 5e83d37149662..08eede666762e 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -952,4 +952,45 @@ struct hv_input_modify_sparse_spa_page_host_access {
 #define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE      0x4
 #define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE       0x8
 
+enum hv_translate_gva_result_code {
+	HV_TRANSLATE_GVA_SUCCESS			= 0,
+
+	/* Translation failures */
+	HV_TRANSLATE_GVA_PAGE_NOT_PRESENT		= 1,
+	HV_TRANSLATE_GVA_PRIVILEGE_VIOLATION		= 2,
+	HV_TRANSLATE_GVA_INVALID_PAGE_TABLE_FLAGS	= 3,
+
+	/* GPA access failures */
+	HV_TRANSLATE_GVA_GPA_UNMAPPED			= 4,
+	HV_TRANSLATE_GVA_GPA_NO_READ_ACCESS		= 5,
+	HV_TRANSLATE_GVA_GPA_NO_WRITE_ACCESS		= 6,
+	HV_TRANSLATE_GVA_GPA_ILLEGAL_OVERLAY_ACCESS	= 7,
+
+	HV_TRANSLATE_GVA_INTERCEPT			= 8,
+	HV_TRANSLATE_GVA_GPA_UNACCEPTED			= 9,
+};
+
+struct hv_input_translate_virtual_address {
+	u64 partition_id;
+	u32 vp_index;
+	u32 padding;
+	u64 control_flags;
+	u64 gva_page;
+} __packed;
+
+struct hv_translate_gva_result_ex {
+	u32 result_code; /* enum hv_translate_gva_result_code */
+	u32 cache_type : 8;
+	u32 overlay_page : 1;
+	u32 reserved : 23;
+#if IS_ENABLED(CONFIG_X86)
+	char event_info[40]; /* HV_X64_PENDING_EVENT */
+#endif
+} __packed;
+
+struct hv_output_translate_virtual_address_ex {
+	struct hv_translate_gva_result_ex translation_result;
+	u64 gpa_page;
+} __packed;
+
 #endif /* _HV_HVHDK_H */
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 32ff92b6342b2..29892013a4752 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -318,6 +318,16 @@ struct mshv_get_set_vp_state {
 #define MSHV_RUN_VP			_IOR(MSHV_IOCTL, 0x00, struct mshv_run_vp)
 #define MSHV_GET_VP_STATE		_IOWR(MSHV_IOCTL, 0x01, struct mshv_get_set_vp_state)
 #define MSHV_SET_VP_STATE		_IOWR(MSHV_IOCTL, 0x02, struct mshv_get_set_vp_state)
+
+struct mshv_translate_gva {
+	__u64 gva;
+	__u64 flags;
+	enum hv_translate_gva_result_code *result;
+	__u64 *gpa;
+};
+
+#define MSHV_TRANSLATE_GVA		_IOWR(MSHV_IOCTL, 0xF2, struct mshv_translate_gva)
+
 /*
  * Generic hypercall
  * Defined above in partition IOCTLs, avoid redefining it here



^ permalink raw reply related

* RE: [PATCH] hv_sock: fix ARM64 support
From: Dexuan Cui @ 2026-04-28 21:24 UTC (permalink / raw)
  To: Hamza Mahfooz, netdev@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Long Li,
	Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Michael Kelley, Himadri Pandya,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260428125339.13963-1-hamzamahfooz@linux.microsoft.com>

> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> Sent: Tuesday, April 28, 2026 5:54 AM
> Subject: [PATCH] hv_sock: fix ARM64 support

Typically, for a change to net/, you'd want to add a "net" or "net-next"
after the "PATCH", i.e.

[PATCH net] 
or 
[PATCH net v2]

See "Documentation/process/maintainer-netdev.rst"

^ permalink raw reply

* RE: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Dexuan Cui @ 2026-04-28 21:07 UTC (permalink / raw)
  To: Hamza Mahfooz, linux-kernel@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Long Li,
	Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Himadri Pandya, Michael Kelley,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, Saurabh Sengar, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Deepak Rawat, dri-devel@lists.freedesktop.org,
	stable@kernel.vger.org
In-Reply-To: <20260425181719.1538483-2-hamzamahfooz@linux.microsoft.com>

> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> Sent: Saturday, April 25, 2026 11:17 AM
> 
> Cc: stable@kernel.vger.org
I think this should be
  Cc: stable@vger.kernel.org

^ permalink raw reply

* RE: [PATCH 2/2] drm/hyperv: use VMBUS_RING_SIZE()
From: Dexuan Cui @ 2026-04-28 21:05 UTC (permalink / raw)
  To: Michael Kelley, Hamza Mahfooz, Saurabh Singh Sengar
  Cc: linux-kernel@vger.kernel.org, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Long Li, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Himadri Pandya, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	Saurabh Sengar, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Deepak Rawat,
	dri-devel@lists.freedesktop.org, stable@kernel.vger.org
In-Reply-To: <SN6PR02MB41571A5B77A5FDDFE17AEF19D4362@SN6PR02MB4157.namprd02.prod.outlook.com>

> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Monday, April 27, 2026 12:06 PM
> > IMO the Fixes tag is unnecessary because the existing
> > VMBUS_RING_BUFSIZE
> > is 256KB, which is already aligned to 4KB, 16KB and 64KB.
> >
> > VMBUS_RING_SIZE(256 * 1024) is still 256KB.
> 
> Not always. If PAGE_SIZE is 64KiB, VMBUS_RING_SIZE(256 * 1024) is
> 320KiB. If PAGE_SIZE is 16KiB or 4KiB, then VMBUS_RING_SIZE(256 * 1024)
> is indeed 256 KiB. See the explanation in the comment for
> VMBUS_RING_SIZE.
> 
> Michael

Thanks for correcting me!

I didn't realize that sizeof(struct hv_ring_buffer) is based on
PAGE_SIZE, not on HV_HYP_PAGE_SIZE.

However,  it looks like the Fixes tag is still not needed:

without the patch, we always pass two arguments of 256KB to
vmbus_open().

with the patch, we still pass 256KB to vmbus_open() in the case of
PAGE_SIZE=4KB or 16KB, and we pass 320KB in the case of
PAGE_SIZE=64KB.

Both 320K and 256KB are multiples of PAGE_SIZE, so
vmbus_open() -> vmbus_alloc_ring() doesn't return -EINVAL.

In the case of PAGE_SIZE=64KB, it's OK to pass 256KB to vmbus_open()
here since the hyperv-drm driver doesn't really have to use a slightly
bigger VMBus ringbuffer size.

Thanks,
Dexuan

^ permalink raw reply

* RE: [EXTERNAL] [PATCH rc 06/15] RDMA/mana: Fix mana_destroy_wq_obj() cleanup in mana_ib_create_qp_rss()
From: Long Li @ 2026-04-28 17:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Andrew Lunn,
	Broadcom internal kernel review list, Bryan Tan, Eric Dumazet,
	Junxian Huang, Konstantin Taranov, Jakub Kicinski,
	Leon Romanovsky, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Paolo Abeni,
	Selvin Xavier, Chengchang Tang, Tariq Toukan, Vishnu Dasa,
	Yishai Hadas
  Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
	Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
	Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
	Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Lijun Ou,
	Parav Pandit, patches@lists.linux.dev, Roland Dreier,
	Roland Dreier, Sagi Grimberg, Ajay Sharma, stable@vger.kernel.org,
	Tariq Toukan, Wei Hu (Xavier), Shaobo Xu, Nenglong Zhao
In-Reply-To: <6-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>

>
> Sashiko points out there are two bugs here in the error unwind flow, both related
> to how the WQ table is unwound.
>
> First there is a double i-- on the first failure path due to the while loop having a i--,
> remove it.
>
> Second if mana_ib_install_cq_cb() fails then mana_create_wq_obj() is not undone
> due to the above i--.
>
> Cc: stable@vger.kernel.org
> Fixes: c15d7802a424 ("RDMA/mana_ib: Add CQ interrupt support for RAW QP")
> Link:
> https://sashiko.d/
> ev%2F%23%2Fpatchset%2F0-v2-1c49eeb88c48%252B91-
> rdma_udata_rep_jgg%2540nvidia.com%3Fpart%3D1&data=05%7C02%7Clongli%
> 40microsoft.com%7Cd4d57c89064d4cc1781e08dea541b72a%7C72f988bf86f141
> af91ab2d7cd011db47%7C1%7C0%7C639129898849523924%7CUnknown%7CT
> WFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4
> zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=hbczVL%2F
> QTqw5zawJJPpSNkjtDrBOJNkV5Qn9vGGYbhE%3D&reserved=0
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Long Li <longli@microsoft.com>


> ---
>  drivers/infiniband/hw/mana/qp.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index f7bb0d1f0f8034..8e1f052d0ec976 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -176,11 +176,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>
>               ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
>                                        &wq_spec, &cq_spec, &wq-
> >rx_object);
> -             if (ret) {
> -                     /* Do cleanup starting with index i-1 */
> -                     i--;
> +             if (ret)
>                       goto fail;
> -             }
>
>               /* The GDMA regions are now owned by the WQ object */
>               wq->queue.gdma_region = GDMA_INVALID_DMA_REGION; @@
> -200,8 +197,10 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct
> ib_pd *pd,
>
>               /* Create CQ table entry */
>               ret = mana_ib_install_cq_cb(mdev, cq);
> -             if (ret)
> +             if (ret) {
> +                     mana_destroy_wq_obj(mpc, GDMA_RQ, wq-
> >rx_object);
>                       goto fail;
> +             }
>       }
>       resp.num_entries = i;
>
> --
> 2.43.0


^ permalink raw reply

* RE: [EXTERNAL] [PATCH rc 07/15] RDMA/mana: Fix error unwind in mana_ib_create_qp_rss()
From: Long Li @ 2026-04-28 17:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Andrew Lunn,
	Broadcom internal kernel review list, Bryan Tan, Eric Dumazet,
	Junxian Huang, Konstantin Taranov, Jakub Kicinski,
	Leon Romanovsky, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Paolo Abeni,
	Selvin Xavier, Chengchang Tang, Tariq Toukan, Vishnu Dasa,
	Yishai Hadas
  Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
	Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
	Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
	Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Lijun Ou,
	Parav Pandit, patches@lists.linux.dev, Roland Dreier,
	Roland Dreier, Sagi Grimberg, Ajay Sharma, stable@vger.kernel.org,
	Tariq Toukan, Wei Hu (Xavier), Shaobo Xu, Nenglong Zhao
In-Reply-To: <7-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>

>
> Sashiko points out that mana_ib_cfg_vport_steering() is leaked, the normal
> destroy path cleans it up.
>
> Cc: stable@vger.kernel.org
> Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure
> Network Adapter")
> Link:
> https://sashiko.d/
> ev%2F%23%2Fpatchset%2F0-v1-e911b76a94d1%252B65d95-
> rdma_udata_rep_jgg%2540nvidia.com%3Fpart%3D4&data=05%7C02%7Clongli%
> 40microsoft.com%7Cb377464abc954481e9b108dea541b646%7C72f988bf86f141
> af91ab2d7cd011db47%7C1%7C0%7C639129898856785811%7CUnknown%7CT
> WFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4
> zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=pqtgE8ULS
> pXgq%2BbpubumadArZO9lTvPki2ATvD9TnGI%3D&reserved=0
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Long Li <longli@microsoft.com>


> ---
>  drivers/infiniband/hw/mana/qp.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 8e1f052d0ec976..0fbcf449c134b5 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -217,13 +217,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>               ibdev_dbg(&mdev->ib_dev,
>                         "Failed to copy to udata create rss-qp, %d\n",
>                         ret);
> -             goto fail;
> +             goto err_disable_vport_rx;
>       }
>
>       kfree(mana_ind_table);
>
>       return 0;
>
> +err_disable_vport_rx:
> +     mana_disable_vport_rx(mpc);
>  fail:
>       while (i-- > 0) {
>               ibwq = ind_tbl->ind_tbl[i];
> --
> 2.43.0


^ permalink raw reply

* RE: [EXTERNAL] [PATCH rc 04/15] RDMA/mana: Validate rx_hash_key_len
From: Long Li @ 2026-04-28 17:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Andrew Lunn,
	Broadcom internal kernel review list, Bryan Tan, Eric Dumazet,
	Junxian Huang, Konstantin Taranov, Jakub Kicinski,
	Leon Romanovsky, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Paolo Abeni,
	Selvin Xavier, Chengchang Tang, Tariq Toukan, Vishnu Dasa,
	Yishai Hadas
  Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
	Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
	Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
	Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Lijun Ou,
	Parav Pandit, patches@lists.linux.dev, Roland Dreier,
	Roland Dreier, Sagi Grimberg, Ajay Sharma, stable@vger.kernel.org,
	Tariq Toukan, Wei Hu (Xavier), Shaobo Xu, Nenglong Zhao
In-Reply-To: <4-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>

>
> Sashiko points out that rx_hash_key_len comes from a uAPI structure and is
> blindly passed to memcpy, allowing the userspace to trash kernel memory.
> Bounds check it so the memcpy cannot overflow.
>
> Cc: stable@vger.kernel.org
> Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure
> Network Adapter")
> Link:
> https://sashiko.d/
> ev%2F%23%2Fpatchset%2F0-v2-1c49eeb88c48%252B91-
> rdma_udata_rep_jgg%2540nvidia.com%3Fpart%3D1&data=05%7C02%7Clongli%
> 40microsoft.com%7C12e76b7833a74fb98a8208dea541b8cd%7C72f988bf86f141
> af91ab2d7cd011db47%7C1%7C0%7C639129898875053924%7CUnknown%7CT
> WFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4
> zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=75tKj32YfU
> uN7KdnsW63AjlwgnSLt2KXz34EUbXp2wI%3D&reserved=0
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Long Li <longli@microsoft.com>

> ---
>  drivers/infiniband/hw/mana/qp.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 645581359cee0b..f7bb0d1f0f8034 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -21,6 +21,9 @@ static int mana_ib_cfg_vport_steering(struct mana_ib_dev
> *dev,
>
>       gc = mdev_to_gc(dev);
>
> +     if (rx_hash_key_len > sizeof(req->hashkey))
> +             return -EINVAL;
> +
>       req_buf_size = struct_size(req, indir_tab,
> MANA_INDIRECT_TABLE_DEF_SIZE);
>       req = kzalloc(req_buf_size, GFP_KERNEL);
>       if (!req)
> --
> 2.43.0


^ permalink raw reply

* RE: [EXTERNAL] [PATCH rc 05/15] RDMA/mana: Remove user triggerable WARN_ON() in mana_ib_create_qp_rss()
From: Long Li @ 2026-04-28 17:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Andrew Lunn,
	Broadcom internal kernel review list, Bryan Tan, Eric Dumazet,
	Junxian Huang, Konstantin Taranov, Jakub Kicinski,
	Leon Romanovsky, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org, Paolo Abeni,
	Selvin Xavier, Chengchang Tang, Tariq Toukan, Vishnu Dasa,
	Yishai Hadas
  Cc: Abhijit Gangurde, Adit Ranadive, Allen Hubbe, Andrew Boyer,
	Aditya Sarwade, Brad Spengler, Bryan Tan, David S. Miller,
	Dexuan Cui, Doug Ledford, George Zhang, Jorgen Hansen, Jianbo Liu,
	Kai Aizen, Leon Romanovsky, Leon Romanovsky, Yixian Liu, Lijun Ou,
	Parav Pandit, patches@lists.linux.dev, Roland Dreier,
	Roland Dreier, Sagi Grimberg, Ajay Sharma, stable@vger.kernel.org,
	Tariq Toukan, Wei Hu (Xavier), Shaobo Xu, Nenglong Zhao
In-Reply-To: <5-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com>

>
> Sashiko points out that the user can specify WQs sharing the same CQ as a part of
> the uAPI and this will trigger the WARN_ON() then go on to corrupt the kernel.
>
> Just reject it outright and fail the QP creation.
>
> Cc: stable@vger.kernel.org
> Fixes: c15d7802a424 ("RDMA/mana_ib: Add CQ interrupt support for RAW QP")
> Link:
> https://sashiko.d/
> ev%2F%23%2Fpatchset%2F0-v2-1c49eeb88c48%252B91-
> rdma_udata_rep_jgg%2540nvidia.com%3Fpart%3D1&data=05%7C02%7Clongli%
> 40microsoft.com%7C05b55740f97741c63c8408dea541ba1a%7C72f988bf86f141
> af91ab2d7cd011db47%7C1%7C0%7C639129898905286592%7CUnknown%7CT
> WFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4
> zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=045NsQ2efi
> m0Mmc8EGeKWQ3pyFbs63%2B2023OvD3%2B8IM%3D&reserved=0
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Long Li <longli@microsoft.com>

> ---
>  drivers/infiniband/hw/mana/cq.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
> index f4cbe21763bf11..2d682428ef202a 100644
> --- a/drivers/infiniband/hw/mana/cq.c
> +++ b/drivers/infiniband/hw/mana/cq.c
> @@ -137,8 +137,9 @@ int mana_ib_install_cq_cb(struct mana_ib_dev *mdev,
> struct mana_ib_cq *cq)
>
>       if (cq->queue.id >= gc->max_num_cqs)
>               return -EINVAL;
> -     /* Create CQ table entry */
> -     WARN_ON(gc->cq_table[cq->queue.id]);
> +     /* Create CQ table entry, sharing a CQ between WQs is not supported */
> +     if (gc->cq_table[cq->queue.id])
> +             return -EINVAL;
>       if (cq->queue.kmem)
>               gdma_cq = cq->queue.kmem;
>       else
> --
> 2.43.0


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox