* Re: [PATCH 05/11] drivers: hv: Export vmbus_interrupt for mshv_vtl module
From: Naman Jain @ 2026-04-13 11:46 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157F1DAEF3BC14A67D59FB8D450A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/1/2026 10:26 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>
> Nit: For the patch Subject, capitalize "Drivers:" in the prefix.
Acked.
Thanks,
Naman
>
>> vmbus_interrupt is used in mshv_vtl_main.c to set the SINT vector.
>> When CONFIG_MSHV_VTL=m and CONFIG_HYPERV_VMBUS=y (built-in), the module
>> cannot access vmbus_interrupt at load time since it is not exported.
>>
>> Export it using EXPORT_SYMBOL_FOR_MODULES consistent with the existing
>> pattern used for vmbus_isr.
>>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> drivers/hv/vmbus_drv.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>> index f99d4f2d3862..de191799a8f6 100644
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -57,6 +57,7 @@ static DEFINE_PER_CPU(long, vmbus_evt);
>> /* Values parsed from ACPI DSDT */
>> int vmbus_irq;
>> int vmbus_interrupt;
>> +EXPORT_SYMBOL_FOR_MODULES(vmbus_interrupt, "mshv_vtl");
>>
>> /*
>> * If the Confidential VMBus is used, the data on the "wire" is not
>> --
>> 2.43.0
>>
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
^ permalink raw reply
* Re: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Naman Jain @ 2026-04-13 11:47 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157521DEF9EA2471B6F3359D450A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/1/2026 10:27 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>> Generalize Synthetic interrupt source vector (sint) to use
>> vmbus_interrupt variable instead, which automatically takes care of
>> architectures where HYPERVISOR_CALLBACK_VECTOR is not present (arm64).
>
> Sashiko AI raised an interesting question about the startup timing --
> whether the vmbus_platform_driver_probe() is guaranteed to have
> set vmbus_interrupt before the VTL functions below run and use it.
> What causes the mshv_vtl.ko module to be loaded, and hence run
> mshv_vtl_init()?
There is no race condition here. The init ordering guarantees that
vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
reads it.
The call chain for setting vmbus_interrupt:
subsys_initcall(hv_acpi_init) [level 4]
-> platform_driver_register(&vmbus_platform_driver) and so on.
The call chain for reading vmbus_interrupt:
module_init(mshv_vtl_init) [level 6]
-> hv_vtl_setup_synic()
-> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ...,
mshv_vtl_alloc_context, ...)
-> mshv_vtl_alloc_context()
-> mshv_vtl_synic_enable_regs()
-> sint.vector = vmbus_interrupt
do_initcalls() processes sections in order 0 through 7, so
hv_acpi_init() (level 4) is guaranteed to complete before
mshv_vtl_init() (level 6) runs.
Regarding memory leak on cpu offline/online or module load/unload- it is
beyond the scope of this series, I will fix it separately.
I may need some more time in addressing comments on rest of the patches.
Please bear with me.
Regards,
Naman
>
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_vtl_main.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>> index b607b6e7e121..91517b45d526 100644
>> --- a/drivers/hv/mshv_vtl_main.c
>> +++ b/drivers/hv/mshv_vtl_main.c
>> @@ -234,7 +234,7 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
>> union hv_synic_sint sint;
>>
>> sint.as_uint64 = 0;
>> - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
>> + sint.vector = vmbus_interrupt;
>> sint.masked = false;
>> sint.auto_eoi = hv_recommend_using_aeoi();
>>
>> @@ -753,7 +753,7 @@ static void mshv_vtl_synic_mask_vmbus_sint(void *info)
>> const u8 *mask = info;
>>
>> sint.as_uint64 = 0;
>> - sint.vector = HYPERVISOR_CALLBACK_VECTOR;
>> + sint.vector = vmbus_interrupt;
>> sint.masked = (*mask != 0);
>> sint.auto_eoi = hv_recommend_using_aeoi();
>>
>> --
>> 2.43.0
>>
>
> Assuming there's no timing problem vs. the VMBus driver,
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Jason Gunthorpe @ 2026-04-13 13:46 UTC (permalink / raw)
To: Long Li
Cc: Leon Romanovsky, Erni Sri Satya Vennela, Konstantin Taranov,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <LV0PR21MB66700DC2FB827B93ED6A5714CE592@LV0PR21MB6670.namprd21.prod.outlook.com>
On Fri, Apr 10, 2026 at 10:29:45PM +0000, Long Li wrote:
> > On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:
> >
> > > How we rephrase this in this way: the driver should not corrupt or
> > > overflow other parts of the kernel if its device is misbehaving (or
> > > has a bug).
> >
> > If we are going to do this CC hardening stuff I think I want to see a more
> > comphrensive approach, like if we detect an attack then the kernel instantly
> > crashes or something. Or at least an approach in general agreed to by the CC and
> > kernel community.
> >
> > Igoring the issue and continuing seems just wrong.
> >
> > This sprinkling of random checks in this series doesn't feel comprehensive or
> > cohesive to me.
> >
> > Jason
>
> Can we follow the virtio BAD_RING()/vq->broken pattern in
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/virtio/virtio_ring.c#n57.
>
> Add a broken flag to mana_ib_dev. When any hardware response
> contains out-of-range values, mark the device broken and fail the
> operation - during probe this prevents device registration entirely,
> at runtime all subsequent operations return -EIO.
If that's the plan I would think it should be struct device based, but
yeah, I'm more comfortable with this sort of direction as a CC
hardening plan.
Jason
^ permalink raw reply
* RE: [PATCH 04/11] Drivers: hv: Refactor mshv_vtl for ARM64 support to be added
From: Michael Kelley @ 2026-04-13 15:19 UTC (permalink / raw)
To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <fe4c0663-4ece-439c-bcb6-cd5780c15ed9@linux.microsoft.com>
From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 4:46 AM
>
> On 4/1/2026 10:26 PM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026
> 5:13 AM
> >>
> >> Refactor MSHV_VTL driver to move some of the x86 specific code to arch
> >> specific files, and add corresponding functions for arm64.
> >>
> >> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
> >> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> >> ---
> >> arch/arm64/include/asm/mshyperv.h | 10 +++
> >> arch/x86/hyperv/hv_vtl.c | 98 ++++++++++++++++++++++++++++
> >> arch/x86/include/asm/mshyperv.h | 1 +
> >> drivers/hv/mshv_vtl_main.c | 102 +-----------------------------
> >> 4 files changed, 111 insertions(+), 100 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/mshyperv.h
> >> b/arch/arm64/include/asm/mshyperv.h
> >> index b721d3134ab6..804068e0941b 100644
> >> --- a/arch/arm64/include/asm/mshyperv.h
> >> +++ b/arch/arm64/include/asm/mshyperv.h
> >> @@ -60,6 +60,16 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
> >> ARM_SMCCC_SMC_64, \
> >> ARM_SMCCC_OWNER_VENDOR_HYP, \
> >> HV_SMCCC_FUNC_NUMBER)
> >> +#ifdef CONFIG_HYPERV_VTL_MODE
> >> +/*
> >> + * Get/Set the register. If the function returns `1`, that must be done via
> >> + * a hypercall. Returning `0` means success.
> >> + */
> >> +static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared)
> >> +{
> >> + return 1;
> >> +}
> >> +#endif
> >>
> >> #include <asm-generic/mshyperv.h>
> >>
> >> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
> >> index 9b6a9bc4ab76..72a0bb4ae0c7 100644
> >> --- a/arch/x86/hyperv/hv_vtl.c
> >> +++ b/arch/x86/hyperv/hv_vtl.c
> >> @@ -17,6 +17,8 @@
> >> #include <asm/realmode.h>
> >> #include <asm/reboot.h>
> >> #include <asm/smap.h>
> >> +#include <uapi/asm/mtrr.h>
> >> +#include <asm/debugreg.h>
> >> #include <linux/export.h>
> >> #include <../kernel/smpboot.h>
> >> #include "../../kernel/fpu/legacy.h"
> >> @@ -281,3 +283,99 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
> >> kernel_fpu_end();
> >> }
> >> EXPORT_SYMBOL(mshv_vtl_return_call);
> >> +
> >> +/* Static table mapping register names to their corresponding actions */
> >> +static const struct {
> >> + enum hv_register_name reg_name;
> >> + int debug_reg_num; /* -1 if not a debug register */
> >> + u32 msr_addr; /* 0 if not an MSR */
> >> +} reg_table[] = {
> >> + /* Debug registers */
> >> + {HV_X64_REGISTER_DR0, 0, 0},
> >> + {HV_X64_REGISTER_DR1, 1, 0},
> >> + {HV_X64_REGISTER_DR2, 2, 0},
> >> + {HV_X64_REGISTER_DR3, 3, 0},
> >> + {HV_X64_REGISTER_DR6, 6, 0},
> >> + /* MTRR MSRs */
> >> + {HV_X64_REGISTER_MSR_MTRR_CAP, -1, MSR_MTRRcap},
> >> + {HV_X64_REGISTER_MSR_MTRR_DEF_TYPE, -1, MSR_MTRRdefType},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE0, -1, MTRRphysBase_MSR(0)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE1, -1, MTRRphysBase_MSR(1)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE2, -1, MTRRphysBase_MSR(2)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE3, -1, MTRRphysBase_MSR(3)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE4, -1, MTRRphysBase_MSR(4)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE5, -1, MTRRphysBase_MSR(5)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE6, -1, MTRRphysBase_MSR(6)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE7, -1, MTRRphysBase_MSR(7)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE8, -1, MTRRphysBase_MSR(8)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASE9, -1, MTRRphysBase_MSR(9)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASEA, -1, MTRRphysBase_MSR(0xa)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASEB, -1, MTRRphysBase_MSR(0xb)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASEC, -1, MTRRphysBase_MSR(0xc)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASED, -1, MTRRphysBase_MSR(0xd)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASEE, -1, MTRRphysBase_MSR(0xe)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_BASEF, -1, MTRRphysBase_MSR(0xf)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK0, -1, MTRRphysMask_MSR(0)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK1, -1, MTRRphysMask_MSR(1)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK2, -1, MTRRphysMask_MSR(2)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK3, -1, MTRRphysMask_MSR(3)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK4, -1, MTRRphysMask_MSR(4)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK5, -1, MTRRphysMask_MSR(5)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK6, -1, MTRRphysMask_MSR(6)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK7, -1, MTRRphysMask_MSR(7)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK8, -1, MTRRphysMask_MSR(8)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASK9, -1, MTRRphysMask_MSR(9)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASKA, -1, MTRRphysMask_MSR(0xa)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASKB, -1, MTRRphysMask_MSR(0xb)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASKC, -1, MTRRphysMask_MSR(0xc)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASKD, -1, MTRRphysMask_MSR(0xd)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASKE, -1, MTRRphysMask_MSR(0xe)},
> >> + {HV_X64_REGISTER_MSR_MTRR_PHYS_MASKF, -1, MTRRphysMask_MSR(0xf)},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX64K00000, -1, MSR_MTRRfix64K_00000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX16K80000, -1, MSR_MTRRfix16K_80000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX16KA0000, -1, MSR_MTRRfix16K_A0000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KC0000, -1, MSR_MTRRfix4K_C0000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KC8000, -1, MSR_MTRRfix4K_C8000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KD0000, -1, MSR_MTRRfix4K_D0000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KD8000, -1, MSR_MTRRfix4K_D8000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KE0000, -1, MSR_MTRRfix4K_E0000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KE8000, -1, MSR_MTRRfix4K_E8000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KF0000, -1, MSR_MTRRfix4K_F0000},
> >> + {HV_X64_REGISTER_MSR_MTRR_FIX4KF8000, -1, MSR_MTRRfix4K_F8000},
> >> +};
> >> +
> >> +int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared)
> >> +{
> >> + u64 *reg64;
> >> + enum hv_register_name gpr_name;
> >> + int i;
> >> +
> >> + gpr_name = regs->name;
> >> + reg64 = ®s->value.reg64;
> >> +
> >> + /* Search for the register in the table */
> >> + for (i = 0; i < ARRAY_SIZE(reg_table); i++) {
> >> + if (reg_table[i].reg_name != gpr_name)
> >> + continue;
> >> + if (reg_table[i].debug_reg_num != -1) {
> >> + /* Handle debug registers */
> >> + if (gpr_name == HV_X64_REGISTER_DR6 && !shared)
> >> + goto hypercall;
> >> + if (set)
> >> + native_set_debugreg(reg_table[i].debug_reg_num, *reg64);
> >> + else
> >> + *reg64 = native_get_debugreg(reg_table[i].debug_reg_num);
> >> + } else {
> >> + /* Handle MSRs */
> >> + if (set)
> >> + wrmsrl(reg_table[i].msr_addr, *reg64);
> >> + else
> >> + rdmsrl(reg_table[i].msr_addr, *reg64);
> >> + }
> >> + return 0;
> >> + }
> >> +
> >> +hypercall:
> >> + return 1;
> >> +}
> >> +EXPORT_SYMBOL(hv_vtl_get_set_reg);
> >> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> >> index f64393e853ee..d5355a5b7517 100644
> >> --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -304,6 +304,7 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context
> *vtl0);
> >> void mshv_vtl_return_call_init(u64 vtl_return_offset);
> >> void mshv_vtl_return_hypercall(void);
> >> void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
> >> +int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared);
> >
> > Can this move to asm-generic/mshyperv.h? The function is no longer specific
> > to x86/x64, so one would want to not declare it in the arch/x86 version
> > of mshyperv.h. But maybe moving it to asm-generic/mshyperv.h breaks
> > compilation on arm64 because there's also the static inline stub there.
>
> This is still arch specific (x86 to be precise). For ARM64, we always
> want to return 1, which is to tell the client to use hypercall as a
> fallback option. Moving this x86 specific implementation (X64 registers,
> MTRR etc) to a common code, would not be right. One thing that could be
> done here was to move the "return 1" stub code from arm64 to asm-generic
> mshyperv.h, but that would not provide much benefit.
>
> I am currently not planning to make any changes here. If I misunderstood
> your comment, please let me know.
>
Yes, the implementation of hv_vtl_get_set_reg() is arch-specific. But the
one-line declaration is not. If there was a full implementation on arm64
like with x86, then the declaration could move to asm-generic/mshyperv.h.
But the arm64 side is a "static inline" function, and I'm pretty sure the
above declaration would cause a compiler conflict. So not making any
changes is appropriate. If the arm64 side should ever change to a full
implementation that isn't static inline, then the one-line declaration
should move to asm-generic/mshyperv.h.
I probably should have omitted my comment entirely as it was just
musing about a situation that doesn't actually exist at the moment. :-(
Michael
^ permalink raw reply
* RE: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Michael Kelley @ 2026-04-13 15:49 UTC (permalink / raw)
To: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <b5125f61-173f-45d0-a6dc-d795ba0f8693@linux.microsoft.com>
From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 4:48 AM
>
> On 4/1/2026 10:27 PM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
> >>
> >> Generalize Synthetic interrupt source vector (sint) to use
> >> vmbus_interrupt variable instead, which automatically takes care of
> >> architectures where HYPERVISOR_CALLBACK_VECTOR is not present (arm64).
> >
> > Sashiko AI raised an interesting question about the startup timing --
> > whether the vmbus_platform_driver_probe() is guaranteed to have
> > set vmbus_interrupt before the VTL functions below run and use it.
> > What causes the mshv_vtl.ko module to be loaded, and hence run
> > mshv_vtl_init()?
>
> There is no race condition here. The init ordering guarantees that
> vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
> reads it.
>
> The call chain for setting vmbus_interrupt:
>
> subsys_initcall(hv_acpi_init) [level 4]
> -> platform_driver_register(&vmbus_platform_driver) and so on.
>
>
> The call chain for reading vmbus_interrupt:
>
> module_init(mshv_vtl_init) [level 6]
> -> hv_vtl_setup_synic()
> -> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ..., mshv_vtl_alloc_context, ...)
> -> mshv_vtl_alloc_context()
> -> mshv_vtl_synic_enable_regs()
> -> sint.vector = vmbus_interrupt
>
> do_initcalls() processes sections in order 0 through 7, so
> hv_acpi_init() (level 4) is guaranteed to complete before
> mshv_vtl_init() (level 6) runs.
>
I think the situation is more complex than what you describe, depending
on whether the VMBus driver and/or MSHV_VTL are built as modules vs.
being built-in to the kernel image. In include/linux/module.h, see the
comment for module_init() and how subsys_initcall() is mapped
to module_init() when built as a module.
If both are built-in, then what you describe is correct. But if either or
both are modules, then the respective init functions (hv_acpi_init
and mshv_vtl_init) get called at the time the module is loaded, and
not by do_initcalls(). I think hv_vmbus.ko gets loaded when an attempt
is first made to access a disk, but I would need to look more closely to
be sure. I don't have any understanding of what causes mshv_vtl.ko
to be loaded. And what is the ordering if MSHV_VTL is built-in while
VMBus is built as a module, or vice versa?
Michael
^ permalink raw reply
* Re: [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Naman Jain @ 2026-04-13 16:51 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157DA00A31F0BA8585B9B69D4242@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/13/2026 9:19 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 13, 2026 4:48 AM
>>
>> On 4/1/2026 10:27 PM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>>>
>>>> Generalize Synthetic interrupt source vector (sint) to use
>>>> vmbus_interrupt variable instead, which automatically takes care of
>>>> architectures where HYPERVISOR_CALLBACK_VECTOR is not present (arm64).
>>>
>>> Sashiko AI raised an interesting question about the startup timing --
>>> whether the vmbus_platform_driver_probe() is guaranteed to have
>>> set vmbus_interrupt before the VTL functions below run and use it.
>>> What causes the mshv_vtl.ko module to be loaded, and hence run
>>> mshv_vtl_init()?
>>
>> There is no race condition here. The init ordering guarantees that
>> vmbus_interrupt is always set before mshv_vtl_synic_enable_regs()
>> reads it.
>>
>> The call chain for setting vmbus_interrupt:
>>
>> subsys_initcall(hv_acpi_init) [level 4]
>> -> platform_driver_register(&vmbus_platform_driver) and so on.
>>
>>
>> The call chain for reading vmbus_interrupt:
>>
>> module_init(mshv_vtl_init) [level 6]
>> -> hv_vtl_setup_synic()
>> -> cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ..., mshv_vtl_alloc_context, ...)
>> -> mshv_vtl_alloc_context()
>> -> mshv_vtl_synic_enable_regs()
>> -> sint.vector = vmbus_interrupt
>>
>> do_initcalls() processes sections in order 0 through 7, so
>> hv_acpi_init() (level 4) is guaranteed to complete before
>> mshv_vtl_init() (level 6) runs.
>>
>
> I think the situation is more complex than what you describe, depending
> on whether the VMBus driver and/or MSHV_VTL are built as modules vs.
> being built-in to the kernel image. In include/linux/module.h, see the
> comment for module_init() and how subsys_initcall() is mapped
> to module_init() when built as a module.
>
> If both are built-in, then what you describe is correct. But if either or
> both are modules, then the respective init functions (hv_acpi_init
> and mshv_vtl_init) get called at the time the module is loaded, and
> not by do_initcalls(). I think hv_vmbus.ko gets loaded when an attempt
> is first made to access a disk, but I would need to look more closely to
> be sure. I don't have any understanding of what causes mshv_vtl.ko
> to be loaded. And what is the ordering if MSHV_VTL is built-in while
> VMBus is built as a module, or vice versa?
>
> Michael
>
Based on this, I still feel that this race is not possible.
hv_vmbus mshv_vtl
y y -> different initcall levels, no issues
y m -> use without initialization is not possible
m y -> config dependencies take care of this, and mshv_vtl
is forced to compile as a module in this case.
m m -> config and symbol dependencies should take care of
it. mshv_vtl has symbol and config dependencies on hv_vmbus, and it
won't allow loading mshv_vtl if hv_vmbus module is not loaded.
Relevant code here: kernel/module/main.c
Regards,
Naman
^ permalink raw reply
* Re: [PATCH 07/11] arch: arm64: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-04-13 16:52 UTC (permalink / raw)
To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Catalin Marinas, Will Deacon,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, H . Peter Anvin, Arnd Bergmann, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti
Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
linux-riscv@lists.infradead.org
In-Reply-To: <SN6PR02MB4157D3C4F6F376C8D6C3D234D450A@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/1/2026 10:27 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, March 16, 2026 5:13 AM
>>
>
> Nit: For historical consistency, use "arm64: hyperv:" as the prefix in the patch Subject.
Acked.
>
>> Add support for arm64 specific variant of mshv_vtl_return_call function
>> to be able to add support for arm64 in MSHV_VTL driver. This would
>> help enable the transition between Virtual Trust Levels (VTL) in
>> MSHV_VTL when the kernel acts as a paravisor.
>
> This commit message has a fair number of "filler" words. Suggest simplifying to:
>
> Add the arm64 variant of mshv_vtl_return_call() to support the MSHV_VTL
> driver on arm64. This function enables the transition between Virtual Trust
> Levels (VTLs) in MSHV_VTL when the kernel acts as a paravisor.
>
I can see the difference clearly :-) Will use this in commit message.
>>
>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>> ---
>> arch/arm64/hyperv/Makefile | 1 +
>> arch/arm64/hyperv/hv_vtl.c | 144 ++++++++++++++++++++++++++++++
>> arch/arm64/include/asm/mshyperv.h | 13 +++
>> 3 files changed, 158 insertions(+)
>> create mode 100644 arch/arm64/hyperv/hv_vtl.c
>>
>> diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
>> index 87c31c001da9..9701a837a6e1 100644
>> --- a/arch/arm64/hyperv/Makefile
>> +++ b/arch/arm64/hyperv/Makefile
>> @@ -1,2 +1,3 @@
>> # SPDX-License-Identifier: GPL-2.0
>> obj-y := hv_core.o mshyperv.o
>> +obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o
>> diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
>> new file mode 100644
>> index 000000000000..66318672c242
>> --- /dev/null
>> +++ b/arch/arm64/hyperv/hv_vtl.c
>> @@ -0,0 +1,144 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2026, Microsoft, Inc.
>> + *
>> + * Authors:
>> + * Roman Kisel <romank@linux.microsoft.com>
>> + * Naman Jain <namjain@linux.microsoft.com>
>> + */
>> +
>> +#include <asm/boot.h>
>> +#include <asm/mshyperv.h>
>> +#include <asm/cpu_ops.h>
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
>> +{
>> + u64 base_ptr = (u64)vtl0->x;
>> +
>> + /*
>> + * VTL switch for ARM64 platform - managing VTL0's CPU context.
>> + * We explicitly use the stack to save the base pointer, and use x16
>> + * as our working register for accessing the context structure.
>> + *
>> + * Register Handling:
>> + * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
>> + * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
>> + * - X19-X30: Saved/restored (part of VTL0's execution context)
>> + * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
>> + * - SP: Not in structure, hypervisor-managed per-VTL
>> + *
>> + * Note: X29 (FP) and X30 (LR) are in the structure and must be saved/restored
>> + * as part of VTL0's complete execution state.
>
> Could drop "Note:". That's what comments are. :-)
Acked.
>
>
>> + */
>> + asm __volatile__ (
>> + /* Save base pointer to stack explicitly, then load into x16 */
>> + "str %0, [sp, #-16]!\n\t" /* Push base pointer onto stack */
>> + "mov x16, %0\n\t" /* Load base pointer into x16 */
>> + /* Volatile registers (Windows ARM64 ABI: x0-x15) */
>> + "ldp x0, x1, [x16]\n\t"
>> + "ldp x2, x3, [x16, #(2*8)]\n\t"
>
> On the x86 side, there's machinery to generate a series of constants that are
> the offsets of the individual fields in struct mshv_vtl_cpu_context. The x86
> asm code then uses these constants. Here on the arm64 side, you've calculated
> the offsets directly in the asm code. Any reason for the difference? I can see
> it's fairly easy to eyeball the offsets here and compare against the registers
> that are being loaded, where it's not so easy the way registers are named
> on x86. So maybe the additional machinery that's helpful on the x86 side
> is less necessary here. Just wondering ....
There were complexities around static call etc. in x86 which led to that
redesign. Here things are much simpler and the offsets we see are 1-1
mapped to registers. But I still tried to prototype it, and it looked
more complex than it has to be. I think we can keep it in current form.
>
>> + "ldp x4, x5, [x16, #(4*8)]\n\t"
>> + "ldp x6, x7, [x16, #(6*8)]\n\t"
>> + "ldp x8, x9, [x16, #(8*8)]\n\t"
>> + "ldp x10, x11, [x16, #(10*8)]\n\t"
>> + "ldp x12, x13, [x16, #(12*8)]\n\t"
>> + "ldp x14, x15, [x16, #(14*8)]\n\t"
>> + /* x16 will be loaded last, after saving base pointer */
>> + "ldr x17, [x16, #(17*8)]\n\t"
>> + /* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
>> +
>> + /* General-purpose registers: x19-x30 */
>> + "ldp x19, x20, [x16, #(19*8)]\n\t"
>> + "ldp x21, x22, [x16, #(21*8)]\n\t"
>> + "ldp x23, x24, [x16, #(23*8)]\n\t"
>> + "ldp x25, x26, [x16, #(25*8)]\n\t"
>> + "ldp x27, x28, [x16, #(27*8)]\n\t"
>> +
>> + /* Frame pointer and link register */
>> + "ldp x29, x30, [x16, #(29*8)]\n\t"
>> +
>> + /* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> + "ldp q0, q1, [x16, #(32*8)]\n\t"
>> + "ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> + "ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> + "ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> + "ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> + "ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> + "ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> + "ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> + "ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> + "ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> + "ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> + "ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> + "ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> + "ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> + "ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> + "ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> + /* Now load x16 itself */
>> + "ldr x16, [x16, #(16*8)]\n\t"
>> +
>> + /* Return to the lower VTL */
>> + "hvc #3\n\t"
>> +
>> + /* Save context after return - reload base pointer from stack */
>> + "stp x16, x17, [sp, #-16]!\n\t" /* Save x16, x17 temporarily */
>> + "ldr x16, [sp, #16]\n\t" /* Reload base pointer (skip saved x16,x17) */
>> +
>> + /* Volatile registers */
>> + "stp x0, x1, [x16]\n\t"
>> + "stp x2, x3, [x16, #(2*8)]\n\t"
>> + "stp x4, x5, [x16, #(4*8)]\n\t"
>> + "stp x6, x7, [x16, #(6*8)]\n\t"
>> + "stp x8, x9, [x16, #(8*8)]\n\t"
>> + "stp x10, x11, [x16, #(10*8)]\n\t"
>> + "stp x12, x13, [x16, #(12*8)]\n\t"
>> + "stp x14, x15, [x16, #(14*8)]\n\t"
>> + "ldp x0, x1, [sp], #16\n\t" /* Recover saved x16, x17 */
>> + "stp x0, x1, [x16, #(16*8)]\n\t"
>> + /* x18 is hypervisor-managed - DO NOT SAVE */
>> +
>> + /* General-purpose registers: x19-x30 */
>> + "stp x19, x20, [x16, #(19*8)]\n\t"
>> + "stp x21, x22, [x16, #(21*8)]\n\t"
>> + "stp x23, x24, [x16, #(23*8)]\n\t"
>> + "stp x25, x26, [x16, #(25*8)]\n\t"
>> + "stp x27, x28, [x16, #(27*8)]\n\t"
>> + "stp x29, x30, [x16, #(29*8)]\n\t" /* Frame pointer and link register */
>> +
>> + /* Shared NEON/FP registers: Q0-Q31 (128-bit) */
>> + "stp q0, q1, [x16, #(32*8)]\n\t"
>> + "stp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
>> + "stp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
>> + "stp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
>> + "stp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
>> + "stp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
>> + "stp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
>> + "stp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
>> + "stp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
>> + "stp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
>> + "stp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
>> + "stp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
>> + "stp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
>> + "stp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
>> + "stp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
>> + "stp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
>> +
>> + /* Clean up stack - pop base pointer */
>> + "add sp, sp, #16\n\t"
>> +
>> + : /* No outputs */
>> + : /* Input */ "r"(base_ptr)
>> + : /* Clobber list - x16 used as base, x18 is hypervisor-managed (not touched) */
>> + "memory", "cc",
>> + "x0", "x1", "x2", "x3", "x4", "x5",
>> + "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13",
>> + "x14", "x15", "x16", "x17", "x19", "x20", "x21",
>> + "x22", "x23", "x24", "x25", "x26", "x27", "x28",
>> + "x29", "x30",
>> + "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
>> + "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
>> + "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23",
>> + "v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
>> +}
>> +EXPORT_SYMBOL(mshv_vtl_return_call);
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index 804068e0941b..de7f3a41a8ea 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -60,6 +60,17 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>> ARM_SMCCC_SMC_64, \
>> ARM_SMCCC_OWNER_VENDOR_HYP, \
>> HV_SMCCC_FUNC_NUMBER)
>> +
>> +struct mshv_vtl_cpu_context {
>> +/*
>> + * NOTE: x18 is managed by the hypervisor. It won't be reloaded from this array.
>> + * It is included here for convenience in the common case.
>
> I'm not getting your point in this last sentence. What is the "common case"?
>
This was really odd :-) I should have spotted it. I'll change it to:
It is included here for convenience in array indexing.
> You could also drop the "NOTE: " prefix.
Acked.
>
>> + */
>> + __u64 x[31];
>> + __u64 rsvd;
>> + __uint128_t q[32];
>> +};
>
> struct mshv_vtl_run reserves 1024 bytes for cpu_context. It would be nice to
> have a compile-time check that the size of struct mshv_vtl_cpu_context fits in
> that 1024 bytes. That check might be better added where struct mshv_vtl_run
> is defined so that it works for both x86 and arm64.
Acked, will add it.
>
>> +
>> #ifdef CONFIG_HYPERV_VTL_MODE
>> /*
>> * Get/Set the register. If the function returns `1`, that must be done via
>> @@ -69,6 +80,8 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
>> {
>> return 1;
>> }
>> +
>> +void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
>
> This declaration now duplicated in mshyperv.h under arch/arm64 and under
> arch/x86. Instead, it should be added to asm-generic/mshyperv.h, and
> removed from the arch/x86 mshyperv.h, so that there's only a single
> instance of the declaration.
Acked.
>
>> #endif
>>
>> #include <asm-generic/mshyperv.h>
>> --
>> 2.43.0
>>
Regards,
Naman
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Long Li @ 2026-04-13 18:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Erni Sri Satya Vennela, Konstantin Taranov,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260413134602.GL3694781@ziepe.ca>
> On Fri, Apr 10, 2026 at 10:29:45PM +0000, Long Li wrote:
> > > On Sat, Mar 21, 2026 at 12:56:39AM +0000, Long Li wrote:
> > >
> > > > How we rephrase this in this way: the driver should not corrupt or
> > > > overflow other parts of the kernel if its device is misbehaving
> > > > (or has a bug).
> > >
> > > If we are going to do this CC hardening stuff I think I want to see
> > > a more comphrensive approach, like if we detect an attack then the
> > > kernel instantly crashes or something. Or at least an approach in
> > > general agreed to by the CC and kernel community.
> > >
> > > Igoring the issue and continuing seems just wrong.
> > >
> > > This sprinkling of random checks in this series doesn't feel
> > > comprehensive or cohesive to me.
> > >
> > > Jason
> >
> > Can we follow the virtio BAD_RING()/vq->broken pattern in
> >
> https://git.kernel/
> .org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2
> Fdrivers%2Fvirtio%2Fvirtio_ring.c%23n57&data=05%7C02%7Clongli%40microsoft
> .com%7C698adb98daa64e20184708de996302b5%7C72f988bf86f141af91ab2d7c
> d011db47%7C1%7C0%7C639116847704528406%7CUnknown%7CTWFpbGZsb3d
> 8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoi
> TWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=HkjUfCDysbQKTuiCsiQai
> gySStd%2BI3VrHUnfMC%2FBORc%3D&reserved=0.
> >
> > Add a broken flag to mana_ib_dev. When any hardware response contains
> > out-of-range values, mark the device broken and fail the operation -
> > during probe this prevents device registration entirely, at runtime
> > all subsequent operations return -EIO.
>
> If that's the plan I would think it should be struct device based, but yeah, I'm
> more comfortable with this sort of direction as a CC hardening plan.
>
> Jason
Will do, thank you.
Long
^ permalink raw reply
* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: vdso @ 2026-04-13 18:10 UTC (permalink / raw)
To: Junrui Luo, Stanislav Kinsburskii
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
Praveen K Paladugu, Jinank Jain, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, Yuhao Jiang, stable@vger.kernel.org
In-Reply-To: <19EDB8B0-A6F4-460F-8ABA-E9D3E239511B@outlook.com>
> On 04/13/2026 1:43 AM PDT Junrui Luo <moonafterrain@outlook.com> wrote:
>
>
> On Fri, Apr 10, 2026 at 09:05:35PM -0800, vdso@mailbox.org wrote:
> > All in all, from the three options of (generic check for overflow, simple check
> > for arch bad PFNs/GFNs, an elaborated check with all specifics) I suggested the simple check.
> > Fast and still more useful than checking for overflow in my opinion.
>
> Thanks Roman for the thorough write-up. Since the original patch mixes
> host and hypervisor-side constants with an unclear unit, IMO we should
> do the bounds check in bytes instead.
>
> For instance:
>
> u64 start_gpa, end_gpa;
>
> if (check_mul_overflow(mem->guest_pfn, HV_HYP_PAGE_SIZE,
> &start_gpa) ||
> check_add_overflow(start_gpa, mem->size, &end_gpa) ||
> end_gpa > (1ULL << MAX_PHYSMEM_BITS))
> return -EINVAL;
>
> Both sides of the final comparison are bytes, so no host-vs-hv page
> unit conversion is needed.
I like that better indeed!
>
> In addition, it changes return value from -EOVERFLOW to -EINVAL.
I think that good, too: -EOVERFLOW originated iiuc and is more used
in VFS from my cursory glance.
>
> Does this approach look reasonable? Happy to iterate if either of you
> would prefer a different choice.
I agree with all your points, feels like a better place now :)
I'd defer the final smell check to Stanislav. Stanislav maintains this code
as the daily job, and might have a better feel and perspective for it. I've
been happy to add my 2c!
>
> Thanks,
> Junrui Luo
^ permalink raw reply
* RE: [PATCH 0/7] mshv: Refactor memory region management and map pages at creation
From: Michael Kelley @ 2026-04-13 21:07 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490099488.81669.3758562641675983608.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
>
> This series refactors the mshv memory region subsystem in preparation
> for mapping populated pages into the hypervisor at movable region
> creation time, rather than relying solely on demand faulting.
>
> The primary motivation is to ensure that when userspace passes a
> pre-populated mapping for a movable memory region, those pages are
> immediately visible to the hypervisor. Previously, all movable regions
> were created with HV_MAP_GPA_NO_ACCESS on every page regardless of
> whether the backing pages were already present, deferring all mapping
> to the fault handler. This added unnecessary fault overhead and
> complicated the initial setup of child partitions with pre-populated
> memory.
>
This is a nice set of changes. Independent of the new functionality
for pre-populating, it improves the code organization and makes
it more regular.
See a few comments on individual patches. I noticed that Sashiko
wasn't able to review the series because it wouldn't apply. Hopefully
your v2 will apply. From what I've seen so far of Sashiko, it finds some
good issues. I did run the patch set through Co-Pilot, but that didn't
have the benefit of the AI prompts that Sashiko provides.
Michael
^ permalink raw reply
* RE: [PATCH 1/7] mshv: Convert from page pointers to PFNs
From: Michael Kelley @ 2026-04-13 21:08 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490105758.81669.969284388846280218.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
>
> The HMM interface returns PFNs from hmm_range_fault(), and the
> hypervisor hypercalls operate on PFNs. Storing page pointers in
> between these interfaces requires unnecessary conversions and
> temporary allocations.
>
> Store PFNs directly in memory regions to match the natural data flow.
> This eliminates the temporary PFN array allocation in the HMM fault
> path and reduces page_to_pfn() conversions throughout the driver.
> Convert to page structs via pfn_to_page() only when operations like
> unpin_user_page() require them.
General comment for this series: PFN fields are typed as "unsigned long".
But pfn_offset and pfn_count are "u64". GFNs are also "u64". Any
reason not to make PFNs also "u64"? I know that pfn_valid() takes
an "unsigned long" input, but see comment below about pfn_valid().
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 297 ++++++++++++++++++++++------------------
> drivers/hv/mshv_root.h | 20 +--
> drivers/hv/mshv_root_hv_call.c | 50 +++----
> drivers/hv/mshv_root_main.c | 30 ++--
> 4 files changed, 212 insertions(+), 185 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index fdffd4f002f6..b1a707d16c07 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -18,12 +18,13 @@
> #include "mshv_root.h"
>
> #define MSHV_MAP_FAULT_IN_PAGES PTRS_PER_PMD
> +#define MSHV_INVALID_PFN ULONG_MAX
>
> /**
> * mshv_chunk_stride - Compute stride for mapping guest memory
> * @page : The page to check for huge page backing
> * @gfn : Guest frame number for the mapping
> - * @page_count: Total number of pages in the mapping
> + * @pfn_count: Total number of pages in the mapping
Nit: The colons are misaligned after this change.
> *
> * Determines the appropriate stride (in pages) for mapping guest memory.
> * Uses huge page stride if the backing page is huge and the guest mapping
> @@ -32,18 +33,18 @@
> * Return: Stride in pages, or -EINVAL if page order is unsupported.
> */
> static int mshv_chunk_stride(struct page *page,
> - u64 gfn, u64 page_count)
> + u64 gfn, u64 pfn_count)
> {
> unsigned int page_order;
>
> /*
> * Use single page stride by default. For huge page stride, the
> * page must be compound and point to the head of the compound
> - * page, and both gfn and page_count must be huge-page aligned.
> + * page, and both gfn and pfn_count must be huge-page aligned.
> */
> if (!PageCompound(page) || !PageHead(page) ||
> !IS_ALIGNED(gfn, PTRS_PER_PMD) ||
> - !IS_ALIGNED(page_count, PTRS_PER_PMD))
> + !IS_ALIGNED(pfn_count, PTRS_PER_PMD))
> return 1;
>
> page_order = folio_order(page_folio(page));
> @@ -57,60 +58,61 @@ static int mshv_chunk_stride(struct page *page,
> /**
> * mshv_region_process_chunk - Processes a contiguous chunk of memory pages
> * in a region.
> - * @region : Pointer to the memory region structure.
> - * @flags : Flags to pass to the handler.
> - * @page_offset: Offset into the region's pages array to start processing.
> - * @page_count : Number of pages to process.
> - * @handler : Callback function to handle the chunk.
> + * @region : Pointer to the memory region structure.
> + * @flags : Flags to pass to the handler.
> + * @pfn_offset: Offset into the region's PFNs array to start processing.
> + * @pfn_count : Number of PFNs to process.
> + * @handler : Callback function to handle the chunk.
> *
> - * This function scans the region's pages starting from @page_offset,
> - * checking for contiguous present pages of the same size (normal or huge).
> - * It invokes @handler for the chunk of contiguous pages found. Returns the
> - * number of pages handled, or a negative error code if the first page is
> - * not present or the handler fails.
> + * This function scans the region's PFNs starting from @pfn_offset,
> + * checking for contiguous valid PFNs backed by pages of the same size
> + * (normal or huge). It invokes @handler for the chunk of contiguous valid
> + * PFNs found. Returns the number of PFNs handled, or a negative error code
> + * if the first PFN is invalid or the handler fails.
> *
> - * Note: The @handler callback must be able to handle both normal and huge
> - * pages.
> + * Note: The @handler callback must be able to handle valid PFNs backed by
> + * both normal and huge pages.
> *
> * Return: Number of pages handled, or negative error code.
> */
> -static long mshv_region_process_chunk(struct mshv_mem_region *region,
> - u32 flags,
> - u64 page_offset, u64 page_count,
> - int (*handler)(struct mshv_mem_region *region,
> - u32 flags,
> - u64 page_offset,
> - u64 page_count,
> - bool huge_page))
> +static long mshv_region_process_pfns(struct mshv_mem_region *region,
> + u32 flags,
> + u64 pfn_offset, u64 pfn_count,
> + int (*handler)(struct mshv_mem_region *region,
> + u32 flags,
> + u64 pfn_offset,
> + u64 pfn_count,
> + bool huge_page))
> {
> - u64 gfn = region->start_gfn + page_offset;
> + u64 gfn = region->start_gfn + pfn_offset;
> u64 count;
> - struct page *page;
> + unsigned long pfn;
> int stride, ret;
>
> - page = region->mreg_pages[page_offset];
> - if (!page)
> + pfn = region->mreg_pfns[pfn_offset];
> + if (!pfn_valid(pfn))
> return -EINVAL;
>
> - stride = mshv_chunk_stride(page, gfn, page_count);
> + stride = mshv_chunk_stride(pfn_to_page(pfn), gfn, pfn_count);
> if (stride < 0)
> return stride;
>
> /* Start at stride since the first stride is validated */
> - for (count = stride; count < page_count; count += stride) {
> - page = region->mreg_pages[page_offset + count];
> + for (count = stride; count < pfn_count ; count += stride) {
> + pfn = region->mreg_pfns[pfn_offset + count];
>
> - /* Break if current page is not present */
> - if (!page)
> + /* Break if current pfn is invalid */
> + if (!pfn_valid(pfn))
pfn_valid() is a relatively expensive test to be doing in a loop
on what may be every single page. It does an RCU lock/unlock
and make other checks that aren't necessary here. Since
mreg_pfns[] is populated from mm calls, the only invalid PFNs
would be MSHV_INVALID_PFN that code in this module has
explicitly put there. Just testing against MSHV_INVALID_PFN
would be a lot faster here and elsewhere in this module. It's
really a "pfn set/not set" test. Defining a pfn_set() macro
here in this module that tests against MSHV_INVALID_PFN
would accomplish the same thing more efficiently.
> break;
>
> /* Break if stride size changes */
> - if (stride != mshv_chunk_stride(page, gfn + count,
> - page_count - count))
> + if (stride != mshv_chunk_stride(pfn_to_page(pfn),
> + gfn + count,
> + pfn_count - count))
> break;
> }
>
> - ret = handler(region, flags, page_offset, count, stride > 1);
> + ret = handler(region, flags, pfn_offset, count, stride > 1);
> if (ret)
> return ret;
>
> @@ -118,70 +120,73 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> }
>
> /**
> - * mshv_region_process_range - Processes a range of memory pages in a
> - * region.
> - * @region : Pointer to the memory region structure.
> - * @flags : Flags to pass to the handler.
> - * @page_offset: Offset into the region's pages array to start processing.
> - * @page_count : Number of pages to process.
> - * @handler : Callback function to handle each chunk of contiguous
> - * pages.
> + * mshv_region_process_range - Processes a range of PFNs in a region.
> + * @region : Pointer to the memory region structure.
> + * @flags : Flags to pass to the handler.
> + * @pfn_offset: Offset into the region's PFNs array to start processing.
> + * @pfn_count : Number of PFNs to process.
> + * @handler : Callback function to handle each chunk of contiguous
> + * valid PFNs.
> *
> - * Iterates over the specified range of pages in @region, skipping
> - * non-present pages. For each contiguous chunk of present pages, invokes
> - * @handler via mshv_region_process_chunk.
> + * Iterates over the specified range of PFNs in @region, skipping
> + * invalid PFNs. For each contiguous chunk of valid PFNS, invokes
> + * @handler via mshv_region_process_pfns.
> *
> - * Note: The @handler callback must be able to handle both normal and huge
> - * pages.
> + * Note: The @handler callback must be able to handle PFNs backed by both
> + * normal and huge pages.
> *
> * Returns 0 on success, or a negative error code on failure.
> */
> static int mshv_region_process_range(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count,
> + u64 pfn_offset, u64 pfn_count,
> int (*handler)(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset,
> - u64 page_count,
> + u64 pfn_offset,
> + u64 pfn_count,
> bool huge_page))
> {
> + u64 pfn_end;
In Patch 2 of this series, "pfn_end" is changed to just "end", and
the references are adjusted. Patch 2 could be a few lines smaller if it
was named "end" here and Patch 2 didn't have to change it.
> long ret;
>
> - if (page_offset + page_count > region->nr_pages)
> + if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> + return -EOVERFLOW;
> +
> + if (pfn_end > region->nr_pfns)
> return -EINVAL;
>
> - while (page_count) {
> + while (pfn_count) {
> /* Skip non-present pages */
> - if (!region->mreg_pages[page_offset]) {
> - page_offset++;
> - page_count--;
> + if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> + pfn_offset++;
> + pfn_count--;
> continue;
> }
>
> - ret = mshv_region_process_chunk(region, flags,
> - page_offset,
> - page_count,
> - handler);
> + ret = mshv_region_process_pfns(region, flags,
> + pfn_offset, pfn_count,
> + handler);
> if (ret < 0)
> return ret;
>
> - page_offset += ret;
> - page_count -= ret;
> + pfn_offset += ret;
> + pfn_count -= ret;
> }
>
> return 0;
> }
>
> -struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> +struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pfns,
> u64 uaddr, u32 flags)
> {
> struct mshv_mem_region *region;
> + u64 i;
>
> - region = vzalloc(sizeof(*region) + sizeof(struct page *) * nr_pages);
> + region = vzalloc(sizeof(*region) + sizeof(unsigned long) * nr_pfns);
Use struct_size(region, mreg_pfns, nr_pfns) instead of open coding the arithmetic?
> if (!region)
> return ERR_PTR(-ENOMEM);
>
> - region->nr_pages = nr_pages;
> + region->nr_pfns = nr_pfns;
> region->start_gfn = guest_pfn;
> region->start_uaddr = uaddr;
> region->hv_map_flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_ADJUSTABLE;
> @@ -190,6 +195,9 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
> region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
>
> + for (i = 0; i < nr_pfns; i++)
> + region->mreg_pfns[i] = MSHV_INVALID_PFN;
> +
> kref_init(®ion->mreg_refcount);
>
> return region;
> @@ -197,15 +205,15 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
>
> static int mshv_region_chunk_share(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count,
> + u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> - region->mreg_pages + page_offset,
> - page_count,
> + region->mreg_pfns + pfn_offset,
> + pfn_count,
> HV_MAP_GPA_READABLE |
> HV_MAP_GPA_WRITABLE,
> flags, true);
> @@ -216,21 +224,21 @@ int mshv_region_share(struct mshv_mem_region *region)
> u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_SHARED;
>
> return mshv_region_process_range(region, flags,
> - 0, region->nr_pages,
> + 0, region->nr_pfns,
> mshv_region_chunk_share);
> }
>
> static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count,
> + u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> return hv_call_modify_spa_host_access(region->partition->pt_id,
> - region->mreg_pages + page_offset,
> - page_count, 0,
> + region->mreg_pfns + pfn_offset,
> + pfn_count, 0,
> flags, false);
> }
>
> @@ -239,30 +247,30 @@ int mshv_region_unshare(struct mshv_mem_region *region)
> u32 flags = HV_MODIFY_SPA_PAGE_HOST_ACCESS_MAKE_EXCLUSIVE;
>
> return mshv_region_process_range(region, flags,
> - 0, region->nr_pages,
> + 0, region->nr_pfns,
> mshv_region_chunk_unshare);
> }
>
> static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count,
> + u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> if (huge_page)
> flags |= HV_MAP_GPA_LARGE_PAGE;
>
> - return hv_call_map_gpa_pages(region->partition->pt_id,
> - region->start_gfn + page_offset,
> - page_count, flags,
> - region->mreg_pages + page_offset);
> + return hv_call_map_ram_pfns(region->partition->pt_id,
> + region->start_gfn + pfn_offset,
> + pfn_count, flags,
> + region->mreg_pfns + pfn_offset);
> }
>
> -static int mshv_region_remap_pages(struct mshv_mem_region *region,
> - u32 map_flags,
> - u64 page_offset, u64 page_count)
> +static int mshv_region_remap_pfns(struct mshv_mem_region *region,
> + u32 map_flags,
> + u64 pfn_offset, u64 pfn_count)
> {
> return mshv_region_process_range(region, map_flags,
> - page_offset, page_count,
> + pfn_offset, pfn_count,
> mshv_region_chunk_remap);
> }
>
> @@ -270,38 +278,50 @@ int mshv_region_map(struct mshv_mem_region *region)
> {
> u32 map_flags = region->hv_map_flags;
>
> - return mshv_region_remap_pages(region, map_flags,
> - 0, region->nr_pages);
> + return mshv_region_remap_pfns(region, map_flags,
> + 0, region->nr_pfns);
> }
>
> -static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> - u64 page_offset, u64 page_count)
> +static void mshv_region_invalidate_pfns(struct mshv_mem_region *region,
> + u64 pfn_offset, u64 pfn_count)
> {
> - if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> - unpin_user_pages(region->mreg_pages + page_offset, page_count);
> + u64 i;
> +
> + for (i = pfn_offset; i < pfn_offset + pfn_count; i++) {
> + if (!pfn_valid(region->mreg_pfns[i]))
> + continue;
> +
> + if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> + unpin_user_page(pfn_to_page(region->mreg_pfns[i]));
>
> - memset(region->mreg_pages + page_offset, 0,
> - page_count * sizeof(struct page *));
> + region->mreg_pfns[i] = MSHV_INVALID_PFN;
> + }
> }
>
> void mshv_region_invalidate(struct mshv_mem_region *region)
> {
> - mshv_region_invalidate_pages(region, 0, region->nr_pages);
> + mshv_region_invalidate_pfns(region, 0, region->nr_pfns);
> }
>
> int mshv_region_pin(struct mshv_mem_region *region)
> {
> - u64 done_count, nr_pages;
> + u64 done_count, nr_pfns, i;
> + unsigned long *pfns;
> struct page **pages;
> __u64 userspace_addr;
> int ret;
>
> - for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> - pages = region->mreg_pages + done_count;
> + pages = kmalloc_array(MSHV_PIN_PAGES_BATCH_SIZE,
> + sizeof(struct page *), GFP_KERNEL);
> + if (!pages)
> + return -ENOMEM;
> +
> + for (done_count = 0; done_count < region->nr_pfns; done_count += ret) {
> + pfns = region->mreg_pfns + done_count;
> userspace_addr = region->start_uaddr +
> done_count * HV_HYP_PAGE_SIZE;
> - nr_pages = min(region->nr_pages - done_count,
> - MSHV_PIN_PAGES_BATCH_SIZE);
> + nr_pfns = min(region->nr_pfns - done_count,
> + MSHV_PIN_PAGES_BATCH_SIZE);
>
> /*
> * Pinning assuming 4k pages works for large pages too.
> @@ -311,39 +331,44 @@ int mshv_region_pin(struct mshv_mem_region *region)
> * with the FOLL_LONGTERM flag does a large temporary
> * allocation of contiguous memory.
> */
> - ret = pin_user_pages_fast(userspace_addr, nr_pages,
> + ret = pin_user_pages_fast(userspace_addr, nr_pfns,
> FOLL_WRITE | FOLL_LONGTERM,
> pages);
> - if (ret != nr_pages)
> + if (ret != nr_pfns)
> goto release_pages;
> +
> + for (i = 0; i < ret; i++)
> + pfns[i] = page_to_pfn(pages[i]);
> }
>
> + kfree(pages);
> return 0;
>
> release_pages:
> if (ret > 0)
> done_count += ret;
> - mshv_region_invalidate_pages(region, 0, done_count);
> + mshv_region_invalidate_pfns(region, 0, done_count);
> + kfree(pages);
> return ret < 0 ? ret : -ENOMEM;
> }
>
> static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> u32 flags,
> - u64 page_offset, u64 page_count,
> + u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> if (huge_page)
> flags |= HV_UNMAP_GPA_LARGE_PAGE;
>
> - return hv_call_unmap_gpa_pages(region->partition->pt_id,
> - region->start_gfn + page_offset,
> - page_count, flags);
> + return hv_call_unmap_pfns(region->partition->pt_id,
> + region->start_gfn + pfn_offset,
> + pfn_count, flags);
> }
>
> static int mshv_region_unmap(struct mshv_mem_region *region)
> {
> return mshv_region_process_range(region, 0,
> - 0, region->nr_pages,
> + 0, region->nr_pfns,
> mshv_region_chunk_unmap);
> }
>
> @@ -427,8 +452,8 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> /**
> * mshv_region_range_fault - Handle memory range faults for a given region.
> * @region: Pointer to the memory region structure.
> - * @page_offset: Offset of the page within the region.
> - * @page_count: Number of pages to handle.
> + * @pfn_offset: Offset of the page within the region.
> + * @pfn_count: Number of pages to handle.
> *
> * This function resolves memory faults for a specified range of pages
> * within a memory region. It uses HMM (Heterogeneous Memory Management)
> @@ -437,7 +462,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> * Return: 0 on success, negative error code on failure.
> */
> static int mshv_region_range_fault(struct mshv_mem_region *region,
> - u64 page_offset, u64 page_count)
> + u64 pfn_offset, u64 pfn_count)
> {
> struct hmm_range range = {
> .notifier = ®ion->mreg_mni,
> @@ -447,13 +472,13 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> int ret;
> u64 i;
>
> - pfns = kmalloc_array(page_count, sizeof(*pfns), GFP_KERNEL);
> + pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> if (!pfns)
> return -ENOMEM;
>
> range.hmm_pfns = pfns;
> - range.start = region->start_uaddr + page_offset * HV_HYP_PAGE_SIZE;
> - range.end = range.start + page_count * HV_HYP_PAGE_SIZE;
> + range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> + range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
>
> do {
> ret = mshv_region_hmm_fault_and_lock(region, &range);
> @@ -462,11 +487,15 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> if (ret)
> goto out;
>
> - for (i = 0; i < page_count; i++)
> - region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> + for (i = 0; i < pfn_count; i++) {
> + if (!(pfns[i] & HMM_PFN_VALID))
> + continue;
> + /* Drop HMM_PFN_* flags to ensure PFNs are valid. */
> + region->mreg_pfns[pfn_offset + i] = pfns[i] & ~HMM_PFN_FLAGS;
> + }
>
> - ret = mshv_region_remap_pages(region, region->hv_map_flags,
> - page_offset, page_count);
> + ret = mshv_region_remap_pfns(region, region->hv_map_flags,
> + pfn_offset, pfn_count);
>
> mutex_unlock(®ion->mreg_mutex);
> out:
> @@ -476,24 +505,24 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>
> bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> {
> - u64 page_offset, page_count;
> + u64 pfn_offset, pfn_count;
> int ret;
>
> /* Align the page offset to the nearest MSHV_MAP_FAULT_IN_PAGES. */
> - page_offset = ALIGN_DOWN(gfn - region->start_gfn,
> - MSHV_MAP_FAULT_IN_PAGES);
> + pfn_offset = ALIGN_DOWN(gfn - region->start_gfn,
> + MSHV_MAP_FAULT_IN_PAGES);
>
> /* Map more pages than requested to reduce the number of faults. */
> - page_count = min(region->nr_pages - page_offset,
> - MSHV_MAP_FAULT_IN_PAGES);
> + pfn_count = min(region->nr_pfns - pfn_offset,
> + MSHV_MAP_FAULT_IN_PAGES);
>
> - ret = mshv_region_range_fault(region, page_offset, page_count);
> + ret = mshv_region_range_fault(region, pfn_offset, pfn_count);
>
> WARN_ONCE(ret,
> - "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, page_offset %llu, page_count %llu\n",
> + "p%llu: GPA intercept failed: region %#llx-%#llx, gfn %#llx, pfn_offset %llu, pfn_count %llu\n",
> region->partition->pt_id, region->start_uaddr,
> - region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> - gfn, page_offset, page_count);
> + region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> + gfn, pfn_offset, pfn_count);
>
> return !ret;
> }
> @@ -523,16 +552,16 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> struct mshv_mem_region *region = container_of(mni,
> struct mshv_mem_region,
> mreg_mni);
> - u64 page_offset, page_count;
> + u64 pfn_offset, pfn_count;
> unsigned long mstart, mend;
> int ret = -EPERM;
>
> mstart = max(range->start, region->start_uaddr);
> mend = min(range->end, region->start_uaddr +
> - (region->nr_pages << HV_HYP_PAGE_SHIFT));
> + (region->nr_pfns << HV_HYP_PAGE_SHIFT));
>
> - page_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> - page_count = HVPFN_DOWN(mend - mstart);
> + pfn_offset = HVPFN_DOWN(mstart - region->start_uaddr);
> + pfn_count = HVPFN_DOWN(mend - mstart);
>
> if (mmu_notifier_range_blockable(range))
> mutex_lock(®ion->mreg_mutex);
> @@ -541,12 +570,12 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
>
> mmu_interval_set_seq(mni, cur_seq);
>
> - ret = mshv_region_remap_pages(region, HV_MAP_GPA_NO_ACCESS,
> - page_offset, page_count);
> + ret = mshv_region_remap_pfns(region, HV_MAP_GPA_NO_ACCESS,
> + pfn_offset, pfn_count);
> if (ret)
> goto out_unlock;
>
> - mshv_region_invalidate_pages(region, page_offset, page_count);
> + mshv_region_invalidate_pfns(region, pfn_offset, pfn_count);
>
> mutex_unlock(®ion->mreg_mutex);
>
> @@ -558,9 +587,9 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> WARN_ONCE(ret,
> "Failed to invalidate region %#llx-%#llx (range %#lx-%#lx, event: %u, pages %#llx-%#llx, mm: %#llx): %d\n",
> region->start_uaddr,
> - region->start_uaddr + (region->nr_pages << HV_HYP_PAGE_SHIFT),
> + region->start_uaddr + (region->nr_pfns << HV_HYP_PAGE_SHIFT),
> range->start, range->end, range->event,
> - page_offset, page_offset + page_count - 1, (u64)range->mm, ret);
> + pfn_offset, pfn_offset + pfn_count - 1, (u64)range->mm, ret);
> return false;
> }
>
> @@ -579,7 +608,7 @@ bool mshv_region_movable_init(struct mshv_mem_region *region)
>
> ret = mmu_interval_notifier_insert(®ion->mreg_mni, current->mm,
> region->start_uaddr,
> - region->nr_pages << HV_HYP_PAGE_SHIFT,
> + region->nr_pfns << HV_HYP_PAGE_SHIFT,
> &mshv_region_mni_ops);
> if (ret)
> return false;
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 947dfb76bb19..f1d4bee97a3f 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -84,15 +84,15 @@ enum mshv_region_type {
> struct mshv_mem_region {
> struct hlist_node hnode;
> struct kref mreg_refcount;
> - u64 nr_pages;
> + u64 nr_pfns;
> u64 start_gfn;
> u64 start_uaddr;
> u32 hv_map_flags;
> struct mshv_partition *partition;
> enum mshv_region_type mreg_type;
> struct mmu_interval_notifier mreg_mni;
> - struct mutex mreg_mutex; /* protects region pages remapping */
> - struct page *mreg_pages[];
> + struct mutex mreg_mutex; /* protects region PFNs remapping */
> + unsigned long mreg_pfns[];
> };
>
> struct mshv_irq_ack_notifier {
> @@ -282,11 +282,11 @@ int hv_call_create_partition(u64 flags,
> int hv_call_initialize_partition(u64 partition_id);
> int hv_call_finalize_partition(u64 partition_id);
> int hv_call_delete_partition(u64 partition_id);
> -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64
> numpgs);
> -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> - u32 flags, struct page **pages);
> -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> - u32 flags);
> +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs);
> +int hv_call_map_ram_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> + u32 flags, unsigned long *pfns);
> +int hv_call_unmap_pfns(u64 partition_id, u64 gpa_target, u64 pfn_count,
> + u32 flags);
> int hv_call_delete_vp(u64 partition_id, u32 vp_index);
> int hv_call_assert_virtual_interrupt(u64 partition_id, u32 vector,
> u64 dest_addr,
> @@ -329,8 +329,8 @@ int hv_map_stats_page(enum hv_stats_object_type type,
> int hv_unmap_stats_page(enum hv_stats_object_type type,
> struct hv_stats_page *page_addr,
> const union hv_stats_object_identity *identity);
> -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> - u64 page_struct_count, u32 host_access,
> +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> + u64 pfns_count, u32 host_access,
> u32 flags, u8 acquire);
> int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
> void *property_value, size_t property_value_sz);
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index cb55d4d4be2e..a95f2cfc5da5 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -188,17 +188,16 @@ int hv_call_delete_partition(u64 partition_id)
> return hv_result_to_errno(status);
> }
>
> -/* Ask the hypervisor to map guest ram pages or the guest mmio space */
> -static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> - u32 flags, struct page **pages, u64 mmio_spa)
> +static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,
> + u32 flags, unsigned long *pfns, u64 mmio_spa)
> {
> struct hv_input_map_gpa_pages *input_page;
> u64 status, *pfnlist;
> unsigned long irq_flags, large_shift = 0;
> int ret = 0, done = 0;
> - u64 page_count = page_struct_count;
> + u64 page_count = pfns_count;
>
> - if (page_count == 0 || (pages && mmio_spa))
> + if (page_count == 0 || (pfns && mmio_spa))
> return -EINVAL;
>
> if (flags & HV_MAP_GPA_LARGE_PAGE) {
> @@ -227,14 +226,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> for (i = 0; i < rep_count; i++)
> if (flags & HV_MAP_GPA_NO_ACCESS) {
> pfnlist[i] = 0;
> - } else if (pages) {
> + } else if (pfns) {
> u64 index = (done + i) << large_shift;
>
> - if (index >= page_struct_count) {
> + if (index >= pfns_count) {
> ret = -EINVAL;
> break;
> }
> - pfnlist[i] = page_to_pfn(pages[index]);
> + pfnlist[i] = pfns[index];
> } else {
> pfnlist[i] = mmio_spa + done + i;
> }
> @@ -266,37 +265,37 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
>
> if (flags & HV_MAP_GPA_LARGE_PAGE)
> unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> - hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> + hv_call_unmap_pfns(partition_id, gfn, done, unmap_flags);
> }
>
> return ret;
> }
>
> /* Ask the hypervisor to map guest ram pages */
> -int hv_call_map_gpa_pages(u64 partition_id, u64 gpa_target, u64 page_count,
> - u32 flags, struct page **pages)
> +int hv_call_map_ram_pfns(u64 partition_id, u64 gfn, u64 pfn_count,
> + u32 flags, unsigned long *pfns)
> {
> - return hv_do_map_gpa_hcall(partition_id, gpa_target, page_count,
> - flags, pages, 0);
> + return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> + pfns, 0);
> }
>
> -/* Ask the hypervisor to map guest mmio space */
> -int hv_call_map_mmio_pages(u64 partition_id, u64 gfn, u64 mmio_spa, u64 numpgs)
> +int hv_call_map_mmio_pfns(u64 partition_id, u64 gfn, u64 mmio_spa,
> + u64 pfn_count)
> {
> int i;
> u32 flags = HV_MAP_GPA_READABLE | HV_MAP_GPA_WRITABLE |
> HV_MAP_GPA_NOT_CACHED;
>
> - for (i = 0; i < numpgs; i++)
> + for (i = 0; i < pfn_count; i++)
> if (page_is_ram(mmio_spa + i))
> return -EINVAL;
>
> - return hv_do_map_gpa_hcall(partition_id, gfn, numpgs, flags, NULL,
> - mmio_spa);
> + return hv_do_map_pfns(partition_id, gfn, pfn_count, flags,
> + NULL, mmio_spa);
> }
>
> -int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> - u32 flags)
> +int hv_call_unmap_pfns(u64 partition_id, u64 gfn, u64 page_count_4k,
> + u32 flags)
> {
> struct hv_input_unmap_gpa_pages *input_page;
> u64 status, page_count = page_count_4k;
> @@ -1009,15 +1008,15 @@ int hv_unmap_stats_page(enum hv_stats_object_type type,
> return ret;
> }
>
> -int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> - u64 page_struct_count, u32 host_access,
> +int hv_call_modify_spa_host_access(u64 partition_id, unsigned long *pfns,
> + u64 pfns_count, u32 host_access,
> u32 flags, u8 acquire)
> {
> struct hv_input_modify_sparse_spa_page_host_access *input_page;
> u64 status;
> int done = 0;
> unsigned long irq_flags, large_shift = 0;
> - u64 page_count = page_struct_count;
> + u64 page_count = pfns_count;
> u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
> HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS;
>
> @@ -1051,11 +1050,10 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
> for (i = 0; i < rep_count; i++) {
> u64 index = (done + i) << large_shift;
>
> - if (index >= page_struct_count)
> + if (index >= pfns_count)
> return -EINVAL;
>
> - input_page->spa_page_list[i] =
> - page_to_pfn(pages[index]);
> + input_page->spa_page_list[i] = pfns[index];
> }
>
> status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index f2d83d6c8c4f..685e4b562186 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -619,7 +619,7 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>
> hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
> if (gfn >= region->start_gfn &&
> - gfn < region->start_gfn + region->nr_pages)
> + gfn < region->start_gfn + region->nr_pfns)
> return region;
> }
>
> @@ -1221,20 +1221,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> bool is_mmio)
> {
> struct mshv_mem_region *rg;
> - u64 nr_pages = HVPFN_DOWN(mem->size);
> + u64 nr_pfns = HVPFN_DOWN(mem->size);
>
> /* Reject overlapping regions */
> spin_lock(&partition->pt_mem_regions_lock);
> hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> - if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> - rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
> + if (mem->guest_pfn + nr_pfns <= rg->start_gfn ||
> + rg->start_gfn + rg->nr_pfns <= mem->guest_pfn)
> continue;
> spin_unlock(&partition->pt_mem_regions_lock);
> return -EEXIST;
> }
> spin_unlock(&partition->pt_mem_regions_lock);
>
> - rg = mshv_region_create(mem->guest_pfn, nr_pages,
> + rg = mshv_region_create(mem->guest_pfn, nr_pfns,
> mem->userspace_addr, mem->flags);
> if (IS_ERR(rg))
> return PTR_ERR(rg);
> @@ -1372,21 +1372,21 @@ mshv_map_user_memory(struct mshv_partition *partition,
> * the hypervisor track dirty pages, enabling pre-copy live
> * migration.
> */
> - ret = hv_call_map_gpa_pages(partition->pt_id,
> - region->start_gfn,
> - region->nr_pages,
> - HV_MAP_GPA_NO_ACCESS, NULL);
> + ret = hv_call_map_ram_pfns(partition->pt_id,
> + region->start_gfn,
> + region->nr_pfns,
> + HV_MAP_GPA_NO_ACCESS, NULL);
> break;
> case MSHV_REGION_TYPE_MMIO:
> - ret = hv_call_map_mmio_pages(partition->pt_id,
> - region->start_gfn,
> - mmio_pfn,
> - region->nr_pages);
> + ret = hv_call_map_mmio_pfns(partition->pt_id,
> + region->start_gfn,
> + mmio_pfn,
> + region->nr_pfns);
> break;
> }
>
> trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
> - region->start_gfn, region->nr_pages,
> + region->start_gfn, region->nr_pfns,
> region->hv_map_flags, ret);
>
> if (ret)
> @@ -1424,7 +1424,7 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
> /* Paranoia check */
> if (region->start_uaddr != mem.userspace_addr ||
> region->start_gfn != mem.guest_pfn ||
> - region->nr_pages != HVPFN_DOWN(mem.size)) {
> + region->nr_pfns != HVPFN_DOWN(mem.size)) {
> spin_unlock(&partition->pt_mem_regions_lock);
> return -EINVAL;
> }
>
>
^ permalink raw reply
* RE: [PATCH 2/7] mshv: Add support to address range holes remapping
From: Michael Kelley @ 2026-04-13 21:08 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490106397.81669.13650500489059864399.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
>
> Consolidate memory region processing to handle both valid and invalid PFNs
> uniformly. This eliminates code duplication across remap, unmap, share, and
> unshare operations by using a common range processing interface.
>
> Holes are now remapped with no-access permissions to enable
> hypervisor dirty page tracking for precopy live migration.
>
> This refactoring is a precursor to an upcoming change that will map
> present pages in movable regions upon region creation, requiring
> consistent handling of both mapped and unmapped ranges.
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 108
> ++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 95 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index b1a707d16c07..ed9c55841140 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -119,6 +119,57 @@ static long mshv_region_process_pfns(struct mshv_mem_region *region,
> return count;
> }
>
> +/**
> + * mshv_region_process_hole - Handle a hole (invalid PFNs) in a memory
> + * region
> + * @region : Memory region containing the hole
> + * @flags : Flags to pass to the handler function
> + * @pfn_offset: Starting PFN offset within the region
> + * @pfn_count : Number of PFNs in the hole
> + * @handler : Callback function to invoke for the hole
> + *
> + * Invokes the handler function for a contiguous hole with the specified
> + * parameters.
> + *
> + * Return: Number of PFNs handled, or negative error code.
> + */
> +static long mshv_region_process_hole(struct mshv_mem_region *region,
> + u32 flags,
> + u64 pfn_offset, u64 pfn_count,
> + int (*handler)(struct mshv_mem_region *region,
> + u32 flags,
> + u64 pfn_offset,
> + u64 pfn_count,
> + bool huge_page))
> +{
> + long ret;
> +
> + ret = handler(region, flags, pfn_offset, pfn_count, 0);
> + if (ret)
> + return ret;
> +
> + return pfn_count;
> +}
> +
> +static long mshv_region_process_chunk(struct mshv_mem_region *region,
> + u32 flags,
> + u64 pfn_offset, u64 pfn_count,
> + int (*handler)(struct mshv_mem_region *region,
> + u32 flags,
> + u64 pfn_offset,
> + u64 pfn_count,
> + bool huge_page))
> +{
> + if (pfn_valid(region->mreg_pfns[pfn_offset]))
> + return mshv_region_process_pfns(region, flags,
> + pfn_offset, pfn_count,
> + handler);
> + else
> + return mshv_region_process_hole(region, flags,
> + pfn_offset, pfn_count,
> + handler);
> +}
> +
> /**
> * mshv_region_process_range - Processes a range of PFNs in a region.
> * @region : Pointer to the memory region structure.
> @@ -146,33 +197,47 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> u64 pfn_count,
> bool huge_page))
> {
> - u64 pfn_end;
> + u64 start, end;
> long ret;
>
> - if (check_add_overflow(pfn_offset, pfn_count, &pfn_end))
> + if (!pfn_count)
> + return 0;
> +
> + if (check_add_overflow(pfn_offset, pfn_count, &end))
> return -EOVERFLOW;
>
> - if (pfn_end > region->nr_pfns)
> + if (end > region->nr_pfns)
> return -EINVAL;
>
> - while (pfn_count) {
> - /* Skip non-present pages */
> - if (!pfn_valid(region->mreg_pfns[pfn_offset])) {
> - pfn_offset++;
> - pfn_count--;
> + start = pfn_offset;
> + end = pfn_offset + 1;
> +
> + while (end < pfn_offset + pfn_count) {
> + /*
> + * Accumulate contiguous pfns with the same validity
> + * (valid or not).
> + */
> + if (pfn_valid(region->mreg_pfns[start]) ==
> + pfn_valid(region->mreg_pfns[end])) {
> + end++;
> continue;
> }
>
> - ret = mshv_region_process_pfns(region, flags,
> - pfn_offset, pfn_count,
> - handler);
> + ret = mshv_region_process_chunk(region, flags,
> + start, end - start,
> + handler);
> if (ret < 0)
> return ret;
>
> - pfn_offset += ret;
> - pfn_count -= ret;
> + start += ret;
> }
>
> + ret = mshv_region_process_chunk(region, flags,
> + start, end - start,
> + handler);
> + if (ret < 0)
> + return ret;
> +
> return 0;
> }
>
> @@ -208,6 +273,9 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
> u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> + return -EINVAL;
> +
> if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> @@ -233,6 +301,9 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> + return -EINVAL;
> +
> if (huge_page)
> flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
>
> @@ -256,6 +327,14 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> + /*
> + * Remap missing pages with no access to let the
> + * hypervisor track dirty pages, enabling precopy live
> + * migration.
> + */
> + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> + flags = HV_MAP_GPA_NO_ACCESS;
Is it OK to wipe out any other flags that might be set? Certainly, any previous
flags in PERMISSIONS_MASK should be removed, but what about ADJUSTABLE
and NOT_CACHED?
> +
> if (huge_page)
> flags |= HV_MAP_GPA_LARGE_PAGE;
>
> @@ -357,6 +436,9 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> u64 pfn_offset, u64 pfn_count,
> bool huge_page)
> {
> + if (!pfn_valid(region->mreg_pfns[pfn_offset]))
> + return 0;
> +
> if (huge_page)
> flags |= HV_UNMAP_GPA_LARGE_PAGE;
>
>
>
^ permalink raw reply
* RE: [PATCH 3/7] mshv: Support regions with different VMAs
From: Michael Kelley @ 2026-04-13 21:08 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490106973.81669.17734971204992890360.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:04 PM
>
> Allow HMM fault handling across memory regions that span multiple VMAs
> with different protection flags. The previous implementation assumed a
> single VMA per region, which would fail when guest memory crosses VMA
> boundaries.
>
> Iterate through VMAs within the range and handle each separately with
> appropriate protection flags, enabling more flexible memory region
> configurations for partitions.
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 72 +++++++++++++++++++++++++++++++++------------
> 1 file changed, 52 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index ed9c55841140..1bb1bfe177e2 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -492,37 +492,72 @@ int mshv_region_get(struct mshv_mem_region *region)
> }
>
> /**
> - * mshv_region_hmm_fault_and_lock - Handle HMM faults and lock the memory region
> + * mshv_region_hmm_fault_and_lock - Handle HMM faults across VMAs and lock
> + * the memory region
> * @region: Pointer to the memory region structure
> - * @range: Pointer to the HMM range structure
> + * @start : Starting virtual address of the range to fault
> + * @end : Ending virtual address of the range to fault (exclusive)
> + * @pfns : Output array for page frame numbers with HMM flags
> *
> * This function performs the following steps:
> * 1. Reads the notifier sequence for the HMM range.
> * 2. Acquires a read lock on the memory map.
> - * 3. Handles HMM faults for the specified range.
> - * 4. Releases the read lock on the memory map.
> - * 5. If successful, locks the memory region mutex.
> - * 6. Verifies if the notifier sequence has changed during the operation.
> - * If it has, releases the mutex and returns -EBUSY to match with
> - * hmm_range_fault() return code for repeating.
> + * 3. Iterates through VMAs in the specified range, handling each
> + * separately with appropriate protection flags (HMM_PFN_REQ_WRITE set
> + * based on VMA flags).
> + * 4. Handles HMM faults for each VMA segment.
> + * 5. Releases the read lock on the memory map.
> + * 6. If successful, locks the memory region mutex.
> + * 7. Verifies if the notifier sequence has changed during the operation.
> + * If it has, releases the mutex and returns -EBUSY to signal retry.
> + *
> + * The function expects the range [start, end] is backed by valid VMAs.
Use "[start, end)" to describe the range since end is exclusive.
> + * Returns -EFAULT if any address in the range is not covered by a VMA.
> *
> * Return: 0 on success, a negative error code otherwise.
> */
> static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> - struct hmm_range *range)
> + unsigned long start,
> + unsigned long end,
> + unsigned long *pfns)
> {
> + struct hmm_range range = {
> + .notifier = ®ion->mreg_mni,
> + };
> int ret;
>
> - range->notifier_seq = mmu_interval_read_begin(range->notifier);
> + range.notifier_seq = mmu_interval_read_begin(range.notifier);
> mmap_read_lock(region->mreg_mni.mm);
> - ret = hmm_range_fault(range);
> + while (start < end) {
> + struct vm_area_struct *vma;
> +
> + vma = vma_lookup(current->mm, start);
The mmap_read_lock() was obtained on region->mreg_mni.mm, but the
lookup is done against current->mm. Maybe these are the same, but
it looks wrong. (Pointed out by a Co-Pilot AI review.)
> + if (!vma) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + range.hmm_pfns = pfns;
> + range.start = start;
> + range.end = min(vma->vm_end, end);
> + range.default_flags = HMM_PFN_REQ_FAULT;
> + if (vma->vm_flags & VM_WRITE)
> + range.default_flags |= HMM_PFN_REQ_WRITE;
> +
> + ret = hmm_range_fault(&range);
> + if (ret)
> + break;
> +
> + start = range.end + 1;
Since range.end is exclusive, the +1 should not be done.
> + pfns += DIV_ROUND_UP(range.end - range.start, PAGE_SIZE);
Just to confirm, range.end and range.start should always be page aligned,
right? So the ROUND_UP should never kick in.
> + }
> mmap_read_unlock(region->mreg_mni.mm);
> if (ret)
> return ret;
>
> mutex_lock(®ion->mreg_mutex);
>
> - if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
> + if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
> mutex_unlock(®ion->mreg_mutex);
> cond_resched();
> return -EBUSY;
> @@ -546,10 +581,7 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> static int mshv_region_range_fault(struct mshv_mem_region *region,
> u64 pfn_offset, u64 pfn_count)
> {
> - struct hmm_range range = {
> - .notifier = ®ion->mreg_mni,
> - .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> - };
> + unsigned long start, end;
> unsigned long *pfns;
> int ret;
> u64 i;
> @@ -558,12 +590,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> if (!pfns)
> return -ENOMEM;
>
> - range.hmm_pfns = pfns;
> - range.start = region->start_uaddr + pfn_offset * HV_HYP_PAGE_SIZE;
> - range.end = range.start + pfn_count * HV_HYP_PAGE_SIZE;
> + start = region->start_uaddr + pfn_offset * PAGE_SIZE;
> + end = start + pfn_count * PAGE_SIZE;
>
> do {
> - ret = mshv_region_hmm_fault_and_lock(region, &range);
> + ret = mshv_region_hmm_fault_and_lock(region, start, end,
> + pfns);
> } while (ret == -EBUSY);
>
> if (ret)
>
>
^ permalink raw reply
* RE: [PATCH 5/7] mshv: Map populated pages on movable region creation
From: Michael Kelley @ 2026-04-13 21:09 UTC (permalink / raw)
To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177490108104.81669.2129535961068627665.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, March 30, 2026 1:05 PM
>
> Map any populated pages into the hypervisor upfront when creating a
> movable region, rather than waiting for faults. Previously, movable
> regions were created with all pages marked as HV_MAP_GPA_NO_ACCESS
> regardless of whether the userspace mapping contained populated pages.
>
> This guarantees that if the caller passes a populated mapping, those
> present pages will be mapped into the hypervisor immediately during
> region creation instead of being faulted in later.
>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_regions.c | 65 ++++++++++++++++++++++++++++++++-----------
> drivers/hv/mshv_root.h | 1 +
> drivers/hv/mshv_root_main.c | 10 +------
> 3 files changed, 50 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> index 133ec7771812..28d3f488d89f 100644
> --- a/drivers/hv/mshv_regions.c
> +++ b/drivers/hv/mshv_regions.c
> @@ -519,7 +519,8 @@ int mshv_region_get(struct mshv_mem_region *region)
> static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> unsigned long start,
> unsigned long end,
> - unsigned long *pfns)
> + unsigned long *pfns,
> + bool do_fault)
> {
> struct hmm_range range = {
> .notifier = ®ion->mreg_mni,
> @@ -540,9 +541,12 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> range.hmm_pfns = pfns;
> range.start = start;
> range.end = min(vma->vm_end, end);
> - range.default_flags = HMM_PFN_REQ_FAULT;
> - if (vma->vm_flags & VM_WRITE)
> - range.default_flags |= HMM_PFN_REQ_WRITE;
> + range.default_flags = 0;
> + if (do_fault) {
> + range.default_flags = HMM_PFN_REQ_FAULT;
> + if (vma->vm_flags & VM_WRITE)
> + range.default_flags |= HMM_PFN_REQ_WRITE;
> + }
>
> ret = hmm_range_fault(&range);
> if (ret)
> @@ -567,26 +571,40 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> }
>
> /**
> - * mshv_region_range_fault - Handle memory range faults for a given region.
> - * @region: Pointer to the memory region structure.
> - * @pfn_offset: Offset of the page within the region.
> - * @pfn_count: Number of pages to handle.
> + * mshv_region_collect_and_map - Collect PFNs for a user range and map them
> + * @region : memory region being processed
> + * @pfn_offset: PFNs offset within the region
> + * @pfn_count : number of PFNs to process
> + * @do_fault : if true, fault in missing pages;
> + * if false, collect only present pages
> *
> - * This function resolves memory faults for a specified range of pages
> - * within a memory region. It uses HMM (Heterogeneous Memory Management)
> - * to fault in the required pages and updates the region's page array.
> + * Collects PFNs for the specified portion of @region from the
> + * corresponding userspace VMA and maps them into the hypervisor. The
Actually, this should be "userspace VMAs" (i.e., plural)
> + * behavior depends on @do_fault:
> *
> - * Return: 0 on success, negative error code on failure.
> + * - true: Fault in missing pages from userspace, ensuring all pages in the
> + * range are present. Used for on-demand page population.
> + * - false: Collect PFNs only for pages already present in userspace,
> + * leaving missing pages as invalid PFN markers.
> + * Used for initial region setup.
> + *
> + * Collected PFNs are stored in region->mreg_pfns[] with HMM bookkeeping
> + * flags cleared, then the range is mapped into the hypervisor. Present
> + * PFNs get mapped with region access permissions; missing PFNs (zero
> + * entries) get mapped with no-access permissions.
Hmmm. The missing PFNs are just skipped and the mreg_pfns[] array
is not updated. Is the corresponding entry in mreg_pfns[] known to
already be set to MSHV_INVALID_PFN? When mapping a new movable
region, that appears to be so. I'm less sure about the
mshv_region_range_fault() case, though mshv_region_invalidate_pfns()
does such initialization of any entries that are invalidated. At that point
in the code, I'd add a comment about that assumption, as it took me a
bit to figure it out.
So does the comment about "zero entries" refer to what is returned
by hmm_range_fault() via mshv_region_hmm_fault_and_lock()?
The mention of "zero entries" here is a bit confusing.
> + *
> + * Return: 0 on success, negative errno on failure.
> */
> -static int mshv_region_range_fault(struct mshv_mem_region *region,
> - u64 pfn_offset, u64 pfn_count)
> +static int mshv_region_collect_and_map(struct mshv_mem_region *region,
> + u64 pfn_offset, u64 pfn_count,
> + bool do_fault)
> {
> unsigned long start, end;
> unsigned long *pfns;
> int ret;
> u64 i;
>
> - pfns = kmalloc_array(pfn_count, sizeof(*pfns), GFP_KERNEL);
> + pfns = vmalloc_array(pfn_count, sizeof(unsigned long));
> if (!pfns)
> return -ENOMEM;
>
> @@ -595,7 +613,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>
> do {
> ret = mshv_region_hmm_fault_and_lock(region, start, end,
> - pfns);
> + pfns, do_fault);
> } while (ret == -EBUSY);
>
> if (ret)
> @@ -613,10 +631,17 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
>
> mutex_unlock(®ion->mreg_mutex);
> out:
> - kfree(pfns);
> + vfree(pfns);
> return ret;
> }
>
> +static int mshv_region_range_fault(struct mshv_mem_region *region,
> + u64 pfn_offset, u64 pfn_count)
> +{
> + return mshv_region_collect_and_map(region, pfn_offset, pfn_count,
> + true);
> +}
> +
> bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn)
> {
> u64 pfn_offset, pfn_count;
> @@ -800,3 +825,9 @@ int mshv_map_pinned_region(struct mshv_mem_region
> *region)
> err_out:
> return ret;
> }
> +
> +int mshv_map_movable_region(struct mshv_mem_region *region)
> +{
> + return mshv_region_collect_and_map(region, 0, region->nr_pfns,
> + false);
> +}
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index d2e65a137bf4..02c1c11f701c 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -374,5 +374,6 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region
> *region, u64 gfn);
> void mshv_region_movable_fini(struct mshv_mem_region *region);
> bool mshv_region_movable_init(struct mshv_mem_region *region);
> int mshv_map_pinned_region(struct mshv_mem_region *region);
> +int mshv_map_movable_region(struct mshv_mem_region *region);
>
> #endif /* _MSHV_ROOT_H_ */
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index c393b5144e0b..91dab2a3bc92 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1299,15 +1299,7 @@ mshv_map_user_memory(struct mshv_partition
> *partition,
> ret = mshv_map_pinned_region(region);
> break;
> case MSHV_REGION_TYPE_MEM_MOVABLE:
> - /*
> - * For movable memory regions, remap with no access to let
> - * the hypervisor track dirty pages, enabling pre-copy live
> - * migration.
> - */
> - ret = hv_call_map_ram_pfns(partition->pt_id,
> - region->start_gfn,
> - region->nr_pfns,
> - HV_MAP_GPA_NO_ACCESS, NULL);
> + ret = mshv_map_movable_region(region);
> break;
> case MSHV_REGION_TYPE_MMIO:
> ret = hv_call_map_mmio_pfns(partition->pt_id,
>
>
^ permalink raw reply
* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id()
From: Easwar Hariharan @ 2026-04-13 21:29 UTC (permalink / raw)
To: Michael Kelley
Cc: easwar.hariharan, Yu Zhang, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
linux-pci@vger.kernel.org, kys@microsoft.com,
haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org,
robh@kernel.org, bhelgaas@google.com, arnd@arndb.de,
joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
mrathor@linux.microsoft.com, peterz@infradead.org,
linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157DC555A9EC7A73E2CB8CBD4582@SN6PR02MB4157.namprd02.prod.outlook.com>
On 4/9/2026 12:01 PM, Michael Kelley wrote:
> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Wednesday, April 8, 2026 1:21 PM
>>
>> On 1/11/2026 9:36 AM, Michael Kelley wrote:
>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM
>>>>
>>>> On 1/8/2026 10:46 AM, Michael Kelley wrote:
>>>>> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
>>>>>>
>>>>>> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>>>
>>>>>> Hyper-V uses a logical device ID to identify a PCI endpoint device for
>>>>>> child partitions. This ID will also be required for future hypercalls
>>>>>> used by the Hyper-V IOMMU driver.
>>>>>>
>>>>>> Refactor the logic for building this logical device ID into a standalone
>>>>>> helper function and export the interface for wider use.
>>>>>>
>>>>>> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
>>>>>> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
>>>>>> ---
>>>>>> drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++--------
>>>>>> include/asm-generic/mshyperv.h | 2 ++
>>>>>> 2 files changed, 22 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>>>>>> index 146b43981b27..4b82e06b5d93 100644
>>>>>> --- a/drivers/pci/controller/pci-hyperv.c
>>>>>> +++ b/drivers/pci/controller/pci-hyperv.c
>>>>>> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>>>>>
>>>>>> #define hv_msi_prepare pci_msi_prepare
>>>>>>
>>>>>> +/**
>>>>>> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the
>>>>>> + * function number of the device.
>>>>>> + */
>>>>>> +u64 hv_build_logical_dev_id(struct pci_dev *pdev)
>>>>>> +{
>>>>>> + struct pci_bus *pbus = pdev->bus;
>>>>>> + struct hv_pcibus_device *hbus = container_of(pbus->sysdata,
>>>>>> + struct hv_pcibus_device, sysdata);
>>>>>> +
>>>>>> + return (u64)((hbus->hdev->dev_instance.b[5] << 24) |
>>>>>> + (hbus->hdev->dev_instance.b[4] << 16) |
>>>>>> + (hbus->hdev->dev_instance.b[7] << 8) |
>>>>>> + (hbus->hdev->dev_instance.b[6] & 0xf8) |
>>>>>> + PCI_FUNC(pdev->devfn));
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id);
>>>>>
>>>>> This change is fine for hv_irq_retarget_interrupt(), it doesn't help for the
>>>>> new IOMMU driver because pci-hyperv.c can (and often is) built as a module.
>>>>> The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't
>>>>> use this symbol in that case -- you'll get a link error on vmlinux when building
>>>>> the kernel. Requiring pci-hyperv.c to *not* be built as a module would also
>>>>> require that the VMBus driver not be built as a module, so I don't think that's
>>>>> the right solution.
>>>>>
>>>>> This is a messy problem. The new IOMMU driver needs to start with a generic
>>>>> "struct device" for the PCI device, and somehow find the corresponding VMBus
>>>>> PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking
>>>>> about ways to do this that don't depend on code and data structures that are
>>>>> private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion.
>>>>
>>>> Thank you, Michael. FWIW, I did try to pull out the device ID components out of
>>>> pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h
>>>> but it was just too messy as you say.
>>>
>>> Yes, the current approach for getting the device ID wanders through struct
>>> hv_pcibus_device (which is private to the pci-hyperv driver), and through
>>> struct hv_device (which is a VMBus data structure). That makes the linkage
>>> between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather
>>> substantial, which is not good.
>>
>> Hi Michael,
>>
>> I missed this, or made a mental note to follow up but forgot. Either way, Yu reminded
>> me about this email chain and I started looking at it this week.
>>
>>>
>>> But here's an idea for an alternate approach. The PV IOMMU driver doesn't
>>> have to generate the logical device ID on-the-fly by going to the dev_instance
>>> field of struct hv_device. Instead, the pci-hyperv driver can generate the logical
>>> device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU
>>> driver to access. The logical device ID doesn't change while Linux is running, so
>>> stashing another copy somewhere isn't a problem.
>>
>> In my exploration and consulting with Dexuan, I realized that one of the components of
>> the logical device ID, the PCI function number is set only in pci_scan_device(), well into
>> pci_scan_root_bus_bridge() that you call out as the point by which the communication
>> must have occurred.
>>
>> But then, Dexuan also pointed me to hv_pci_assign_slots() with its call to wslot_to_devfn() and I'm
>> honestly confused how these two interact. With the current approach, it looks like whatever
>> devfn pci_scan_device() set is the correct function number to use for the logical device
>> ID, in which case, the best I can do with your suggested approach below is to inform the
>> pvIOMMU driver of the GUID, rather than the logical device ID itself.
>>
>> Perhaps with your history, you can clarify the interaction, and/or share your thoughts
>> on the above?
>
> During hv_pci_probe(), hv_pci_query_relations() is called to ask the Hyper-V
> host about what PCI devices are present. hv_pci_query_relations() sends a
> PCI_QUERY_BUS_RELATIONS message to the host, and the host send back a
> PCI_BUS_RELATIONS or PCI_BUS_RELATIONS2 message. The response message
> is handled in hv_pci_onchannelcallback(), which calls hv_pci_devices_present()
> or hv_pci_devices_present2(). The latter two functions both call
> hv_pci_start_relations_work() to add a request to a workqueue that runs
> pci_devices_present_work(). Finally, pci_devices_present_work() calls
> pc_scan_child_bus(), followed by hv_pci_assign_slots().
>
> In hv_pci_assign_slots, you can see that the PCI_BUS_RELATIONS[2]
> info from the Hyper-V host contains a function number encoded in the
> win_slot field. So the Hyper-V host *does* tell the guest the function number.
> However, the generic Linux PCI subsystem doesn't use this function number.
> It still scans the PCI device, trying successive function numbers to see which
> ones work. The scan should find the same function number that the Hyper-V
> host originally reported.
>
> As you noted, there's a sequencing problem in waiting for
> pci_scan_single_device() to find the function number. In the hv_pci_probe()
> path, after hv_pci_query_relations() runs and before create_root_hv_pci_bus()
> is called, it seems feasible to use the function number provided by the
> Hyper-V host to construct the logical device ID. That should work. But there's
> another path, in that the Hyper-V host can generate a PCI_BUS_RELATIONS[2]
> message without a request from Linux when something on the host side changes
> the PCI device setup. There's a code path where pci_devices_present_work()
> finds the state is "hv_pcibus_installed", and directly calls pci_scan_child_bus().
> This path would presumably also need to construct (or re-construct) the
> logical device ID using the information from the Hyper-V host before calling
> pci_scan_child_bus(). I'm vague on the scenario for this latter case, but the
> code is obviously there to handle it.
>
> The other approach is as you suggest. The Hyper-V PCI driver can tell
> the IOMMU driver the almost complete logical device ID, using just the
> GUID bits. Then the IOMMU driver can then construct the full logical
> device ID by adding the function number from the struct pci_dev. I don't
> see a problem with this approach -- other IOMMU drivers are referencing
> the struct pci_dev, and pulling out the function number doesn't seem like
> a violation of layering.
>
Thanks for that explanation, that makes sense. I didn't see any serialization
that would ensure that the VMBus path to communicate the child devices on the bus
would complete before pci_scan_device() finds and finalizes the pci_dev. I think it's
safest to take the approach to communicate the GUID, and find the function number from
the pci_dev. This does mean that there will be an essentially identical copy of
hv_build_logical_dev_id() in the IOMMU code, but a comment can explain that.
>>
>>>
>>> So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept
>>> a PCI domain ID and the related logical device ID. The PV IOMMU driver is
>>> responsible for storing this data in a form that it can later search. hv_pci_probe()
>>> calls this new function when it instantiates a new PCI pass-thru device. Then when
>>> the IOMMU driver needs to attach a new device, it can get the PCI domain ID
>>> from the struct pci_dev (or struct pci_bus), search for the related logical device
>>> ID in its own data structure, and use it. The pci-hyperv driver has a dependency
>>> on the IOMMU driver, but that's a dependency in the desired direction. The
>>> PCI domain ID and logical device ID are just integers, so no data structures are
>>> shared.
>>
>> In a previous reply on this thread, you raised the uniqueness issue of bytes 4 and 5
>> of the GUID being used to create the domain number. I thought this approach could
>> help with that too, but as I coded it up, I realized that using the domain number
>> (not guaranteed to be unique) to search for the bus instance GUID (guaranteed to be unique)
>> is the wrong way around. It is unfortunately the only available key in the pci_dev
>> handed to the pvIOMMU driver in this approach though...
>>
>> Do you think that's a fatal flaw?
>
> There are two uniqueness problems, which I didn't fully separate conceptually
> until writing this. One problem is constructing a PCI domain ID that Linux can use
> to identify the virtual PCI bus that the Hyper-V PCI driver creates for each vPCI
> device. The Hyper-V virtual PCI driver uses GUID bytes 4 and 5, and recognizes
> that they might not be unique. So there's code in hv_pci_probe() to pick another
> number if there's a duplicate. Hyper-V doesn't really care how Linux picks the
> domain ID for the virtual PCI bus as it's purely a Linux construct.
This part matters for the IOMMU driver as it is the key we will use to search the data
structure to find the right GUID to construct the logical dev ID that Hyper-V recognizes.
>
> The second problem is the logical device ID that Hyper-V interprets to
> identify a vPCI device in hypercalls such a HVCALL_RETARGET_INTERRUPT
> and the new pvIOMMU related hypercalls. This logical device ID uses
> GUID bytes 4 thru 7 (minus 1 bit). I don’t think Linux uses the
> logical device ID for anything. Since only Hyper-V interprets it, Hyper-V
> must somehow be ensuring uniqueness of bytes 4 thru 7 (minus 1 bit).
> That's something to confirm with the Hyper-V team. If they are just hoping
> for the best, I don't know how Linux can solve the problem.
I checked with the Hyper-V vPCI team on this aspect and the only guarantee that
they provide is that, at any given time, there will only be 1 device with a given
logical ID attached to a VM. Once a device has been removed, everything about it is
forgotten from the Hyper-V stack's perspective, and nothing in the Hyper-V stack would
prevent a scenario where, for example, a data movement accelerator is attached with
logical ID X, then revoked, and let's say a NIC is attached with the same logical ID X.
Also, FWIW, they also stated that the GUID is not unique and cannot be guaranteed to be
unique because it's the GUID for the bus, not the individual devices.
> My original comment about uniqueness somewhat conflated the two problems,
> and that's misleading. The use of the logical device ID has been around for years
> in hv_irq_retarget_interrupt(). Extending its use to the new pvIOMMU
> hypercalls doesn't make things any worse. But I'm still curious about
> what the Hyper-V team says about the uniqueness of bytes 4 thru 7.
>
> Michael
> c
>>
>>>
>>> Note that the pci-hyperv must inform the PV IOMMU driver of the logical
>>> device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge().
>>> The latter function eventually invokes hv_iommu_attach_dev(), which will
>>> need the logical device ID. See example stack trace. [1]
>>>
>>> I don't think the pci-hyperv driver even needs to tell the IOMMU driver to
>>> remove the information if a PCI pass-thru device is unbound or removed, as
>>> the logical device ID will be the same if the device ever comes back. At worst,
>>> the IOMMU driver can simply replace an existing logical device ID if a new one
>>> is provided for the same PCI domain ID.
>>
>> As above, replacing a unique GUID when a result is found for a non-unique
>> key value may be prone to failure if it happens that the device that came "back"
>> is not in fact the same device (or class of device) that went away and just happens
>> to, either due to bytes 4 and 5 being identical, or due to collision in the
>> pci_domain_nr_dynamic_ida, have the same domain number.
Given the vPCI team's statements (above), I think we will need to handle unbind or
removal and ensure the pvIOMMU drivers data structure is invalidated when either happens.
<snip>
Thanks,
Easwar (he/him)
^ permalink raw reply
* Re: [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Martin K. Petersen @ 2026-04-14 2:19 UTC (permalink / raw)
To: linux-scsi, Li Tian
Cc: Martin K . Petersen, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, James E.J. Bottomley, linux-hyperv,
linux-kernel
In-Reply-To: <20260406015344.12566-1-litian@redhat.com>
On Mon, 06 Apr 2026 09:53:44 +0800, Li Tian wrote:
> The storvsc driver has become stricter in handling
> SRB status codes returned by the Hyper-V host. When using Virtual
> Fibre Channel (vFC) passthrough, the host may return
> SRB_STATUS_DATA_OVERRUN for PERSISTENT_RESERVE_IN commands if the
> allocation length in the CDB does not match the host's expected
> response size.
>
> [...]
Applied to 7.1/scsi-queue, thanks!
[1/1] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
https://git.kernel.org/mkp/scsi/c/9cf351b289fb
--
Martin K. Petersen
^ permalink raw reply
* Re: [PATCH] Drivers: hv: vmbus: Export hv_vmbus_exists() and use it in pci-hyperv
From: Wei Liu @ 2026-04-14 4:43 UTC (permalink / raw)
To: Dexuan Cui
Cc: kys, haiyangz, wei.liu, longli, lpieralisi, kwilczynski, mani,
robh, bhelgaas, linux-hyperv, linux-pci, linux-kernel,
Mukesh Rathor
In-Reply-To: <20260409215232.399350-1-decui@microsoft.com>
On Thu, Apr 09, 2026 at 02:52:32PM -0700, Dexuan Cui wrote:
> With commit f84b21da3624 ("PCI: hv: Don't load the driver for baremetal root partition"),
> the bare metal Linux root partition won't use the pci-hyperv driver, but
> when a Linux VM runs on the Linux root partition, pci-hyperv's module_init
> function init_hv_pci_drv() can still run, e.g. in the case of
> CONFIG_PCI_HYPERV=y, even if the VMBus driver is not used in such a VM
> (i.e. the hv_vmbus driver's init function returns -ENODEV due to
> vmbus_root_device being NULL).
>
> In such a Linux VM, init_hv_pci_drv() runs with a side effect: the 3
> hvpci_block_ops callbacks are set to functions that depend on hv_vmbus.
>
> Later, when the MLX driver in such a VM invokes the callbacks, e.g. in
> drivers/net/ethernet/mellanox/mlx5/core/lib/hv.c:
> mlx5_hv_register_invalidate(), hvpci_block_ops.reg_blk_invalidate() is
> hv_register_block_invalidate() rather than a NULL function pointer, and
> hv_register_block_invalidate() assumes that it can find a struct
> hv_pcibus_device from pdev->bus->sysdata, which is false in such a VM.
>
> Consequently, hv_register_block_invalidate() -> get_pcichild_wslot() ->
> spin_lock_irqsave() may hang since it can be accessing an invalid
> spinlock pointer.
>
> Fix the issue by exporting hv_vmbus_exists() and using it in pci-hyperv:
>
> hv_root_partition() is true and hv_nested is false ==>
> hv_vmbus_exists() is false.
>
> hv_root_partition() is true and hv_nested is true ==>
> hv_vmbus_exists() is true.
>
> hv_root_partition() is false ==> hv_vmbus_exists() is true.
>
> While at it, rename vmbus_exists() to hv_vmbus_exists() to follow the
> convention that all public functions have the hv_ prefix; also change
> the return value's type from int to bool to make the code more readable;
> also move the two pr_info() calls.
>
> Reported-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
Applied. Thanks.
^ permalink raw reply
* Re: [PATCH v3] tools: hv: Fix cross-compilation
From: Wei Liu @ 2026-04-14 4:43 UTC (permalink / raw)
To: Aditya Garg
Cc: kys, haiyangz, wei.liu, decui, longli, gregkh, ssengar,
linux-hyperv, linux-kernel, avladu, vdso, gargaditya, Roman Kisel
In-Reply-To: <20260409103218.367589-1-gargaditya@linux.microsoft.com>
On Thu, Apr 09, 2026 at 03:32:18AM -0700, Aditya Garg wrote:
> Use the native ARCH only in case it is not set, this will allow the
> cross-compilation where ARCH is explicitly set.
>
> Additionally, simplify the ARCH check to build the fcopy daemon only
> for x86 and x86_64.
>
> Fixes: 82b0945ce2c2 ("tools: hv: Add new fcopy application based on uio driver")
> Reported-by: Adrian Vladu <avladu@cloudbasesolutions.com>
> Closes: https://lore.kernel.org/linux-hyperv/PR3PR09MB54119DB2FD76977C62D8DD6AB04D2@PR3PR09MB5411.eurprd09.prod.outlook.com/
> Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
> Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Applied. Thanks.
^ permalink raw reply
* Re: [PATCH v2] mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER
From: Wei Liu @ 2026-04-14 4:44 UTC (permalink / raw)
To: Michael Kelley
Cc: Naman Jain, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Saurabh Sengar, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157AADD1038801A5FA14A5ED45DA@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, Apr 06, 2026 at 02:08:07PM +0000, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Monday, April 6, 2026 2:25 AM
> >
> > When registering VTL0 memory via MSHV_ADD_VTL0_MEMORY, the kernel
> > computes pgmap->vmemmap_shift as the number of trailing zeros in the
> > OR of start_pfn and last_pfn, intending to use the largest compound
> > page order both endpoints are aligned to.
> >
> > However, this value is not clamped to MAX_FOLIO_ORDER, so a
> > sufficiently aligned range (e.g. physical range
> > [0x800000000000, 0x800080000000), corresponding to start_pfn=0x800000000
> > with 35 trailing zeros) can produce a shift larger than what
> > memremap_pages() accepts, triggering a WARN and returning -EINVAL:
> >
> > WARNING: ... memremap_pages+0x512/0x650
> > requested folio size unsupported
> >
> > The MAX_FOLIO_ORDER check was added by
> > commit 646b67d57589 ("mm/memremap: reject unreasonable folio/compound
> > page sizes in memremap_pages()").
> >
> > Fix this by clamping vmemmap_shift to MAX_FOLIO_ORDER so we always
> > request the largest order the kernel supports, in those cases, rather
> > than an out-of-range value.
> >
> > Also fix the error path to propagate the actual error code from
> > devm_memremap_pages() instead of hard-coding -EFAULT, which was
> > masking the real -EINVAL return.
> >
> > Fixes: 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Applied. Thanks.
^ permalink raw reply
* Re: [PATCH] mshv: Add tracepoint for GPA intercept handling
From: Wei Liu @ 2026-04-14 4:50 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli,
linux-hyperv, linux-kernel
In-Reply-To: <20260404064527.GF60588@liuwe-devbox-debian-v2.local>
On Sat, Apr 04, 2026 at 06:45:27AM +0000, Wei Liu wrote:
> On Thu, Apr 02, 2026 at 03:45:58PM +0000, Anirudh Rayabharam wrote:
> > On Tue, Mar 24, 2026 at 11:59:59PM +0000, Stanislav Kinsburskii wrote:
> > > Provide visibility into GPA intercept operations for debugging and
> > > performance analysis of Microsoft Hypervisor guest memory management.
> > >
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > > drivers/hv/mshv_root_main.c | 6 ++++--
> > > drivers/hv/mshv_trace.h | 29 +++++++++++++++++++++++++++++
> > > 2 files changed, 33 insertions(+), 2 deletions(-)
> >
> > Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> >
>
> This patch has a dependency on another bug fix. That's not listed
> anywhere. In the future, please list the dependency, or post this as
> part of a series.
>
> The tracing support is in hyeprv-next. This patch needs to wait.
>
Now applied.
^ permalink raw reply
* [PATCH] hv: utils: handle and propagate errors in kvp_register
From: Thorsten Blum @ 2026-04-14 11:10 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Greg Kroah-Hartman
Cc: Thorsten Blum, stable, Ky Srinivasan, linux-hyperv, linux-kernel
Make kvp_register() return an error code instead of silently ignoring
failures, and propagate the error from kvp_handle_handshake() instead of
returning success.
This propagates both kzalloc_obj() and hvutil_transport_send() failures
to kvp_handle_handshake() and thus to kvp_on_msg().
Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)")
Cc: stable@vger.kernel.org
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
---
drivers/hv/hv_kvp.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
index 0d73daf745a7..6180ebe040ff 100644
--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c
@@ -93,7 +93,7 @@ static void kvp_send_key(struct work_struct *dummy);
static void kvp_respond_to_host(struct hv_kvp_msg *msg, int error);
static void kvp_timeout_func(struct work_struct *dummy);
static void kvp_host_handshake_func(struct work_struct *dummy);
-static void kvp_register(int);
+static int kvp_register(int);
static DECLARE_DELAYED_WORK(kvp_timeout_work, kvp_timeout_func);
static DECLARE_DELAYED_WORK(kvp_host_handshake_work, kvp_host_handshake_func);
@@ -127,24 +127,26 @@ static void kvp_register_done(void)
hv_poll_channel(kvp_transaction.recv_channel, kvp_poll_wrapper);
}
-static void
+static int
kvp_register(int reg_value)
{
struct hv_kvp_msg *kvp_msg;
char *version;
+ int ret;
kvp_msg = kzalloc_obj(*kvp_msg);
+ if (!kvp_msg)
+ return -ENOMEM;
- if (kvp_msg) {
- version = kvp_msg->body.kvp_register.version;
- kvp_msg->kvp_hdr.operation = reg_value;
- strcpy(version, HV_DRV_VERSION);
+ version = kvp_msg->body.kvp_register.version;
+ kvp_msg->kvp_hdr.operation = reg_value;
+ strcpy(version, HV_DRV_VERSION);
- hvutil_transport_send(hvt, kvp_msg, sizeof(*kvp_msg),
- kvp_register_done);
- kfree(kvp_msg);
- }
+ ret = hvutil_transport_send(hvt, kvp_msg, sizeof(*kvp_msg),
+ kvp_register_done);
+ kfree(kvp_msg);
+ return ret;
}
static void kvp_timeout_func(struct work_struct *dummy)
@@ -186,9 +188,8 @@ static int kvp_handle_handshake(struct hv_kvp_msg *msg)
*/
pr_debug("KVP: userspace daemon ver. %d connected\n",
msg->kvp_hdr.operation);
- kvp_register(dm_reg_value);
- return 0;
+ return kvp_register(dm_reg_value);
}
^ permalink raw reply related
* RE: [EXTERNAL] [PATCH net] netvsc: transfer lower device max tso size during VF transition
From: Haiyang Zhang @ 2026-04-14 14:08 UTC (permalink / raw)
To: Li Tian, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Wei Liu, Dexuan Cui, Long Li,
Andrew Lunn, Eric Dumazet, Vitaly Kuznetsov, Paolo Abeni,
Jakub Kicinski, Jason Wang
In-Reply-To: <20260325045006.18607-1-litian@redhat.com>
> -----Original Message-----
> From: Li Tian <litian@redhat.com>
> Sent: Wednesday, March 25, 2026 12:50 AM
> To: netdev@vger.kernel.org; linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; Haiyang Zhang <haiyangz@microsoft.com>;
> Wei Liu <wei.liu@kernel.org>; Dexuan Cui <DECUI@microsoft.com>; Long Li
> <longli@microsoft.com>; Andrew Lunn <andrew+netdev@lunn.ch>; Eric Dumazet
> <edumazet@google.com>; Vitaly Kuznetsov <vkuznets@redhat.com>; Paolo Abeni
> <pabeni@redhat.com>; Jakub Kicinski <kuba@kernel.org>; Jason Wang
> <jasowang@redhat.com>; Li Tian <litian@redhat.com>
> Subject: [EXTERNAL] [PATCH net] netvsc: transfer lower device max tso size
> during VF transition
>
> When netvsc is accelerated by the lower device, we can advertise the
> lower device max tso size in order to get better performance.
> While a long-term migration to user-space bonding is planned, current
> users on RHEL 10 / Azure are experiencing significant performance
> regressions in 802.3ad environments. This patch provides a localized,
> safe fix within netvsc without introducing new core networking helpers.
>
> Signed-off-by: Li Tian <litian@redhat.com>
> ---
> drivers/net/hyperv/netvsc_drv.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/hyperv/netvsc_drv.c
> b/drivers/net/hyperv/netvsc_drv.c
> index ee5ab5ceb2be..971607c7406f 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -2428,10 +2428,14 @@ static int netvsc_vf_changed(struct net_device
> *vf_netdev, unsigned long event)
> * This value is only increased for netvsc NIC when datapath
> is
> * switched over to the VF
> */
> - if (vf_is_up)
> + if (vf_is_up) {
> netif_set_tso_max_size(ndev, vf_netdev->tso_max_size);
> - else
> + WRITE_ONCE(ndev->gso_max_size, READ_ONCE(vf_netdev-
> >gso_max_size));
> + WRITE_ONCE(ndev->gso_ipv4_max_size,
> + READ_ONCE(vf_netdev->gso_ipv4_max_size));
> + } else {
> netif_set_tso_max_size(ndev, netvsc_dev-
> >netvsc_gso_max_size);
> + }
> }
>
> return NOTIFY_OK;
Thanks.
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
^ permalink raw reply
* [PATCH net-next 1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>
Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.
Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
.../net/ethernet/microsoft/mana/mana_bpf.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 49 ++++++++++++-------
.../ethernet/microsoft/mana/mana_ethtool.c | 2 +-
include/net/mana/mana.h | 2 +-
4 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 7697c9b52ed3..b5e9bb184a1d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -68,7 +68,7 @@ int mana_xdp_xmit(struct net_device *ndev, int n, struct xdp_frame **frames,
count++;
}
- tx_stats = &apc->tx_qp[q_idx].txq.stats;
+ tx_stats = &apc->tx_qp[q_idx]->txq.stats;
u64_stats_update_begin(&tx_stats->syncp);
tx_stats->xdp_xmit += count;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 09a53c977545..49ee77b0939a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -355,9 +355,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
if (skb_cow_head(skb, MANA_HEADROOM))
goto tx_drop_count;
- txq = &apc->tx_qp[txq_idx].txq;
+ txq = &apc->tx_qp[txq_idx]->txq;
gdma_sq = txq->gdma_sq;
- cq = &apc->tx_qp[txq_idx].tx_cq;
+ cq = &apc->tx_qp[txq_idx]->tx_cq;
tx_stats = &txq->stats;
BUILD_BUG_ON(MAX_TX_WQE_SGL_ENTRIES != MANA_MAX_TX_WQE_SGL_ENTRIES);
@@ -614,7 +614,7 @@ static void mana_get_stats64(struct net_device *ndev,
}
for (q = 0; q < num_queues; q++) {
- tx_stats = &apc->tx_qp[q].txq.stats;
+ tx_stats = &apc->tx_qp[q]->txq.stats;
do {
start = u64_stats_fetch_begin(&tx_stats->syncp);
@@ -2284,21 +2284,26 @@ static void mana_destroy_txq(struct mana_port_context *apc)
return;
for (i = 0; i < apc->num_queues; i++) {
- debugfs_remove_recursive(apc->tx_qp[i].mana_tx_debugfs);
- apc->tx_qp[i].mana_tx_debugfs = NULL;
+ if (!apc->tx_qp[i])
+ continue;
+
+ debugfs_remove_recursive(apc->tx_qp[i]->mana_tx_debugfs);
+ apc->tx_qp[i]->mana_tx_debugfs = NULL;
- napi = &apc->tx_qp[i].tx_cq.napi;
- if (apc->tx_qp[i].txq.napi_initialized) {
+ napi = &apc->tx_qp[i]->tx_cq.napi;
+ if (apc->tx_qp[i]->txq.napi_initialized) {
napi_synchronize(napi);
napi_disable_locked(napi);
netif_napi_del_locked(napi);
- apc->tx_qp[i].txq.napi_initialized = false;
+ apc->tx_qp[i]->txq.napi_initialized = false;
}
- mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+ mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
- mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
+ mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
- mana_deinit_txq(apc, &apc->tx_qp[i].txq);
+ mana_deinit_txq(apc, &apc->tx_qp[i]->txq);
+
+ kvfree(apc->tx_qp[i]);
}
kfree(apc->tx_qp);
@@ -2307,7 +2312,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
static void mana_create_txq_debugfs(struct mana_port_context *apc, int idx)
{
- struct mana_tx_qp *tx_qp = &apc->tx_qp[idx];
+ struct mana_tx_qp *tx_qp = apc->tx_qp[idx];
char qnum[32];
sprintf(qnum, "TX-%d", idx);
@@ -2346,7 +2351,7 @@ static int mana_create_txq(struct mana_port_context *apc,
int err;
int i;
- apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
+ apc->tx_qp = kzalloc_objs(struct mana_tx_qp *, apc->num_queues);
if (!apc->tx_qp)
return -ENOMEM;
@@ -2366,10 +2371,16 @@ static int mana_create_txq(struct mana_port_context *apc,
gc = gd->gdma_context;
for (i = 0; i < apc->num_queues; i++) {
- apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
+ apc->tx_qp[i] = kvzalloc_obj(*apc->tx_qp[i]);
+ if (!apc->tx_qp[i]) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ apc->tx_qp[i]->tx_object = INVALID_MANA_HANDLE;
/* Create SQ */
- txq = &apc->tx_qp[i].txq;
+ txq = &apc->tx_qp[i]->txq;
u64_stats_init(&txq->stats.syncp);
txq->ndev = net;
@@ -2387,7 +2398,7 @@ static int mana_create_txq(struct mana_port_context *apc,
goto out;
/* Create SQ's CQ */
- cq = &apc->tx_qp[i].tx_cq;
+ cq = &apc->tx_qp[i]->tx_cq;
cq->type = MANA_CQ_TYPE_TX;
cq->txq = txq;
@@ -2416,7 +2427,7 @@ static int mana_create_txq(struct mana_port_context *apc,
err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
&wq_spec, &cq_spec,
- &apc->tx_qp[i].tx_object);
+ &apc->tx_qp[i]->tx_object);
if (err)
goto out;
@@ -3242,7 +3253,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
*/
for (i = 0; i < apc->num_queues; i++) {
- txq = &apc->tx_qp[i].txq;
+ txq = &apc->tx_qp[i]->txq;
tsleep = 1000;
while (atomic_read(&txq->pending_sends) > 0 &&
time_before(jiffies, timeout)) {
@@ -3261,7 +3272,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
}
for (i = 0; i < apc->num_queues; i++) {
- txq = &apc->tx_qp[i].txq;
+ txq = &apc->tx_qp[i]->txq;
while ((skb = skb_dequeue(&txq->pending_skbs))) {
mana_unmap_skb(skb, apc);
dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..f5901e4c9816 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -251,7 +251,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
}
for (q = 0; q < num_queues; q++) {
- tx_stats = &apc->tx_qp[q].txq.stats;
+ tx_stats = &apc->tx_qp[q]->txq.stats;
do {
start = u64_stats_fetch_begin(&tx_stats->syncp);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..60b4a4146ea2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -505,7 +505,7 @@ struct mana_port_context {
bool tx_shortform_allowed;
u16 tx_vp_offset;
- struct mana_tx_qp *tx_qp;
+ struct mana_tx_qp **tx_qp;
/* Indirection Table for RX & TX. The values are queue indexes */
u32 *indir_table;
--
2.43.0
^ permalink raw reply related
* [PATCH net-next 2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>
The RX path allocations for rxbufs_pre, das_pre, and rxq scale with
queue count and queue depth. With high queue counts and depth, these can
exceed what kmalloc can reliably provide from physically contiguous
memory under fragmentation.
Switch these from kmalloc to kvmalloc variants so the allocator
transparently falls back to vmalloc when contiguous memory is scarce,
and update the corresponding frees to kvfree.
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49ee77b0939a..585d891bbbac 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -685,11 +685,11 @@ void mana_pre_dealloc_rxbufs(struct mana_port_context *mpc)
put_page(virt_to_head_page(mpc->rxbufs_pre[i]));
}
- kfree(mpc->das_pre);
+ kvfree(mpc->das_pre);
mpc->das_pre = NULL;
out2:
- kfree(mpc->rxbufs_pre);
+ kvfree(mpc->rxbufs_pre);
mpc->rxbufs_pre = NULL;
out1:
@@ -806,11 +806,11 @@ int mana_pre_alloc_rxbufs(struct mana_port_context *mpc, int new_mtu, int num_qu
num_rxb = num_queues * mpc->rx_queue_size;
WARN(mpc->rxbufs_pre, "mana rxbufs_pre exists\n");
- mpc->rxbufs_pre = kmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
+ mpc->rxbufs_pre = kvmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
if (!mpc->rxbufs_pre)
goto error;
- mpc->das_pre = kmalloc_objs(dma_addr_t, num_rxb);
+ mpc->das_pre = kvmalloc_objs(dma_addr_t, num_rxb);
if (!mpc->das_pre)
goto error;
@@ -2527,7 +2527,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
if (rxq->gdma_rq)
mana_gd_destroy_queue(gc, rxq->gdma_rq);
- kfree(rxq);
+ kvfree(rxq);
}
static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
@@ -2667,7 +2667,7 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
gc = gd->gdma_context;
- rxq = kzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
+ rxq = kvzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
if (!rxq)
return NULL;
--
2.43.0
^ permalink raw reply related
* [PATCH net-next 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
gargaditya
The MANA driver can fail to load on systems with high memory
utilization because several allocations in the queue setup paths
require large physically contiguous blocks via kmalloc. Under memory
fragmentation these high-order allocations may fail, preventing the
driver from creating queues at probe time or when reconfiguring
channels, ring parameters or MTU at runtime.
Allocation sizes that are problematic:
mana_create_txq -> tx_qp flat array (sizeof(mana_tx_qp) = 35528):
16 queues (default): 35528 * 16 = ~555 KB contiguous
64 queues (max): 35528 * 64 = ~2220 KB contiguous
mana_create_rxq -> rxq struct with flex array
(sizeof(mana_rxq) = 35712, rx_oobs=296 per entry):
depth 1024 (default): 35712 + 296 * 1024 = ~331 KB per queue
depth 8192 (max): 35712 + 296 * 8192 = ~2403 KB per queue
mana_pre_alloc_rxbufs -> rxbufs_pre and das_pre arrays:
16 queues, depth 1024 (default): 16 * 1024 * 8 = 128 KB each
64 queues, depth 8192 (max): 64 * 8192 * 8 = 4096 KB each
This series addresses the issue by:
1. Converting the tx_qp flat array into an array of pointers with
per-queue kvzalloc (~35 KB each), replacing a single contiguous
allocation that can reach ~2.2 MB at 64 queues.
2. Switching rxbufs_pre, das_pre, and rxq allocations to
kvmalloc/kvzalloc so the allocator can fall back to vmalloc
when contiguous memory is unavailable.
Throughput testing confirms no regression. Since kvmalloc falls
back to vmalloc under memory fragmentation, all kvmalloc calls
were temporarily replaced with vmalloc to simulate the fallback
path (iperf3, GBits/sec):
Physically contiguous vmalloc region
Connections TX RX TX RX
--------------------------------------------------------------
1 47.2 46.9 46.8 46.6
16 181 181 181 181
32 181 181 181 181
64 181 181 181 181
Aditya Garg (2):
net: mana: Use per-queue allocation for tx_qp to reduce allocation
size
net: mana: Use kvmalloc for large RX queue and buffer allocations
.../net/ethernet/microsoft/mana/mana_bpf.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++--------
.../ethernet/microsoft/mana/mana_ethtool.c | 2 +-
include/net/mana/mana.h | 2 +-
4 files changed, 39 insertions(+), 28 deletions(-)
--
2.43.0
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox