Linux-HyperV List


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Ard Biesheuvel @ 2026-02-26 10:48 UTC
  To: Uros Bizjak, Ard Biesheuvel
  Cc: linux-kernel, Mukesh Rathor, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, linux-hyperv
In-Reply-To: <CAFULd4aSAdKV7XtASr_uQz5hA4qBbWeO-nfgKb979HkwZDbQ_w@mail.gmail.com>

Hi Uros,

On Thu, 26 Feb 2026, at 11:35, Uros Bizjak wrote:
> On Thu, Feb 26, 2026 at 10:51 AM Ard Biesheuvel <ardb+git@google.com> wrote:
>>
>> From: Ard Biesheuvel <ardb@kernel.org>
>>
>> hv_crash_c_entry() is a C function that is entered without a stack,
>> and this is only allowed for functions that have the __naked attribute,
>> which informs the compiler that it must not emit the usual prologue and
>> epilogue or emit any other kind of instrumentation that relies on a
>> stack frame.
>>
>> So split up the function, and set the __naked attribute on the initial
>> part that sets up the stack, GDT, IDT and other pieces that are needed
>> for ordinary C execution. Given that function calls are not permitted
>> either, use the existing long return coded in an asm() block to call the
>> second part of the function, which is an ordinary function that is
>> permitted to call other functions as usual.
>>
>> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
>> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>> ---
>> Build tested only.
>>
>> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
>> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
>> Cc: Haiyang Zhang <haiyangz@microsoft.com>
>> Cc: Wei Liu <wei.liu@kernel.org>
>> Cc: Dexuan Cui <decui@microsoft.com>
>> Cc: Long Li <longli@microsoft.com>
>> Cc: Thomas Gleixner <tglx@kernel.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Uros Bizjak <ubizjak@gmail.com>
>> Cc: linux-hyperv@vger.kernel.org
>>
>>  arch/x86/hyperv/hv_crash.c | 80 ++++++++++----------
>>  1 file changed, 42 insertions(+), 38 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>> index a78e4fed5720..d77766e8d37e 100644
>> --- a/arch/x86/hyperv/hv_crash.c
>> +++ b/arch/x86/hyperv/hv_crash.c
>> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
>>                 cpu_relax();
>>  }
>>
>> -/* This cannot be inlined as it needs stack */
>> -static noinline __noclone void hv_crash_restore_tss(void)
>> +static void hv_crash_restore_tss(void)
>>  {
>>         load_TR_desc();
>>  }
>>
>> -/* This cannot be inlined as it needs stack */
>> -static noinline void hv_crash_clear_kernpt(void)
>> +static void hv_crash_clear_kernpt(void)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
>>         native_p4d_clear(p4d);
>>  }
>>
>> +
>> +static void __noreturn hv_crash_handle(void)
>> +{
>> +       hv_crash_restore_tss();
>> +       hv_crash_clear_kernpt();
>> +
>> +       /* we are now fully in devirtualized normal kernel mode */
>> +       __crash_kexec(NULL);
>> +
>> +       hv_panic_timeout_reboot();
>> +}
>> +
>> +/*
>> + * __naked functions do not permit function calls, not even to __always_inline
>> + * functions that only contain asm() blocks themselves. So use a macro instead.
>> + */
>> +#define hv_wrmsr(msr, val) \
>> +       asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")
>> +
>>  /*
>>   * This is the C entry point from the asm glue code after the disable hypercall.
>>   * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
>>   * available. We restore kernel GDT, and rest of the context, and continue
>>   * to kexec.
>>   */
>> -static asmlinkage void __noreturn hv_crash_c_entry(void)
>> +static void __naked hv_crash_c_entry(void)
>>  {
>> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> -
>>         /* first thing, restore kernel gdt */
>> -       native_load_gdt(&ctxt->gdtr);
>> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>>
>> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
>> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>>
>> -       asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>> -       asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>> -       asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>> -       asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>> +       asm volatile("movw %%ax, %%ds" : : "a"(hv_crash_ctxt.ds));
>> +       asm volatile("movw %%ax, %%es" : : "a"(hv_crash_ctxt.es));
>> +       asm volatile("movw %%ax, %%fs" : : "a"(hv_crash_ctxt.fs));
>> +       asm volatile("movw %%ax, %%gs" : : "a"(hv_crash_ctxt.gs));
>>
>> -       native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>> -       asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>> +       hv_wrmsr(MSR_IA32_CR_PAT, hv_crash_ctxt.pat);
>> +       asm volatile("movq %0, %%cr0" : : "r"(hv_crash_ctxt.cr0));
>>
>> -       asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>> -       asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>> -       asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
>> +       asm volatile("movq %0, %%cr8" : : "r"(hv_crash_ctxt.cr8));
>> +       asm volatile("movq %0, %%cr4" : : "r"(hv_crash_ctxt.cr4));
>> +       asm volatile("movq %0, %%cr2" : : "r"(hv_crash_ctxt.cr4));
>>
>> -       native_load_idt(&ctxt->idtr);
>> -       native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>> -       native_wrmsrq(MSR_EFER, ctxt->efer);
>> +       asm volatile("lidt %0" : : "m" (hv_crash_ctxt.idtr));
>> +       hv_wrmsr(MSR_GS_BASE, hv_crash_ctxt.gsbase);
>> +       hv_wrmsr(MSR_EFER, hv_crash_ctxt.efer);
>>
>>         /* restore the original kernel CS now via far return */
>> -       asm volatile("movzwq %0, %%rax\n\t"
>> -                    "pushq %%rax\n\t"
>> -                    "pushq $1f\n\t"
>> -                    "lretq\n\t"
>> -                    "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>> -
>> -       /* We are in asmlinkage without stack frame, hence make C function
>> -        * calls which will buy stack frames.
>> -        */
>> -       hv_crash_restore_tss();
>> -       hv_crash_clear_kernpt();
>> -
>> -       /* we are now fully in devirtualized normal kernel mode */
>> -       __crash_kexec(NULL);
>> -
>> -       hv_panic_timeout_reboot();
>> +       asm volatile("pushq     %q0             \n\t"
>> +                    "leaq      %c1(%%rip), %q0 \n\t"
>
> You can use %a1 instead of %c1(%%rip).
>

Nice.

>> +                    "pushq     %q0             \n\t"
>> +                    "lretq                     \n\t"
>
> No need for terminating \n\t after the last insn in the asm template.
>
>> +                    :: "a"(hv_crash_ctxt.cs), "i"(hv_crash_handle));
>
> Pedantically, you need ': "+a"(...) : "i"(...)' here.
>

Right, so the compiler knows that the register will be updated by the asm() block. But what is preventing it from writing back this value to hv_crash_ctxt.cs? The generated code doesn't seem to do so, but the semantics of "+r" suggest otherwise AIUI.

The code following the asm() block is unreachable anyway, so it doesn't really matter either way in practice. Just curious ...


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Uros Bizjak @ 2026-02-26 10:51 UTC
  To: Ard Biesheuvel
  Cc: Ard Biesheuvel, linux-kernel, Mukesh Rathor, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H . Peter Anvin,
	linux-hyperv
In-Reply-To: <ac6778bd-a701-47e6-8521-768726246ce9@app.fastmail.com>

On Thu, Feb 26, 2026 at 11:48 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Hi Uros,
>
> On Thu, 26 Feb 2026, at 11:35, Uros Bizjak wrote:
> > On Thu, Feb 26, 2026 at 10:51 AM Ard Biesheuvel <ardb+git@google.com> wrote:
> >>
> >> From: Ard Biesheuvel <ardb@kernel.org>
> >>
> >> hv_crash_c_entry() is a C function that is entered without a stack,
> >> and this is only allowed for functions that have the __naked attribute,
> >> which informs the compiler that it must not emit the usual prologue and
> >> epilogue or emit any other kind of instrumentation that relies on a
> >> stack frame.
> >>
> >> So split up the function, and set the __naked attribute on the initial
> >> part that sets up the stack, GDT, IDT and other pieces that are needed
> >> for ordinary C execution. Given that function calls are not permitted
> >> either, use the existing long return coded in an asm() block to call the
> >> second part of the function, which is an ordinary function that is
> >> permitted to call other functions as usual.
> >>
> >> Fixes: 94212d34618c ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
> >> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >> ---
> >> Build tested only.
> >>
> >> Cc: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> >> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> >> Cc: Wei Liu <wei.liu@kernel.org>
> >> Cc: Dexuan Cui <decui@microsoft.com>
> >> Cc: Long Li <longli@microsoft.com>
> >> Cc: Thomas Gleixner <tglx@kernel.org>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Borislav Petkov <bp@alien8.de>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: "H. Peter Anvin" <hpa@zytor.com>
> >> Cc: Uros Bizjak <ubizjak@gmail.com>
> >> Cc: linux-hyperv@vger.kernel.org
> >>
> >>  arch/x86/hyperv/hv_crash.c | 80 ++++++++++----------
> >>  1 file changed, 42 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> >> index a78e4fed5720..d77766e8d37e 100644
> >> --- a/arch/x86/hyperv/hv_crash.c
> >> +++ b/arch/x86/hyperv/hv_crash.c
> >> @@ -107,14 +107,12 @@ static void __noreturn hv_panic_timeout_reboot(void)
> >>                 cpu_relax();
> >>  }
> >>
> >> -/* This cannot be inlined as it needs stack */
> >> -static noinline __noclone void hv_crash_restore_tss(void)
> >> +static void hv_crash_restore_tss(void)
> >>  {
> >>         load_TR_desc();
> >>  }
> >>
> >> -/* This cannot be inlined as it needs stack */
> >> -static noinline void hv_crash_clear_kernpt(void)
> >> +static void hv_crash_clear_kernpt(void)
> >>  {
> >>         pgd_t *pgd;
> >>         p4d_t *p4d;
> >> @@ -125,6 +123,25 @@ static noinline void hv_crash_clear_kernpt(void)
> >>         native_p4d_clear(p4d);
> >>  }
> >>
> >> +
> >> +static void __noreturn hv_crash_handle(void)
> >> +{
> >> +       hv_crash_restore_tss();
> >> +       hv_crash_clear_kernpt();
> >> +
> >> +       /* we are now fully in devirtualized normal kernel mode */
> >> +       __crash_kexec(NULL);
> >> +
> >> +       hv_panic_timeout_reboot();
> >> +}
> >> +
> >> +/*
> >> + * __naked functions do not permit function calls, not even to __always_inline
> >> + * functions that only contain asm() blocks themselves. So use a macro instead.
> >> + */
> >> +#define hv_wrmsr(msr, val) \
> >> +       asm("wrmsr" :: "c"(msr), "a"((u32)val), "d"((u32)(val >> 32)) : "memory")
> >> +
> >>  /*
> >>   * This is the C entry point from the asm glue code after the disable hypercall.
> >>   * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
> >> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
> >>   * available. We restore kernel GDT, and rest of the context, and continue
> >>   * to kexec.
> >>   */
> >> -static asmlinkage void __noreturn hv_crash_c_entry(void)
> >> +static void __naked hv_crash_c_entry(void)
> >>  {
> >> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> >> -
> >>         /* first thing, restore kernel gdt */
> >> -       native_load_gdt(&ctxt->gdtr);
> >> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
> >>
> >> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> >> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> >> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
> >> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
> >>
> >> -       asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
> >> -       asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
> >> -       asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
> >> -       asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
> >> +       asm volatile("movw %%ax, %%ds" : : "a"(hv_crash_ctxt.ds));
> >> +       asm volatile("movw %%ax, %%es" : : "a"(hv_crash_ctxt.es));
> >> +       asm volatile("movw %%ax, %%fs" : : "a"(hv_crash_ctxt.fs));
> >> +       asm volatile("movw %%ax, %%gs" : : "a"(hv_crash_ctxt.gs));
> >>
> >> -       native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
> >> -       asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
> >> +       hv_wrmsr(MSR_IA32_CR_PAT, hv_crash_ctxt.pat);
> >> +       asm volatile("movq %0, %%cr0" : : "r"(hv_crash_ctxt.cr0));
> >>
> >> -       asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
> >> -       asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
> >> -       asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
> >> +       asm volatile("movq %0, %%cr8" : : "r"(hv_crash_ctxt.cr8));
> >> +       asm volatile("movq %0, %%cr4" : : "r"(hv_crash_ctxt.cr4));
> >> +       asm volatile("movq %0, %%cr2" : : "r"(hv_crash_ctxt.cr4));
> >>
> >> -       native_load_idt(&ctxt->idtr);
> >> -       native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
> >> -       native_wrmsrq(MSR_EFER, ctxt->efer);
> >> +       asm volatile("lidt %0" : : "m" (hv_crash_ctxt.idtr));
> >> +       hv_wrmsr(MSR_GS_BASE, hv_crash_ctxt.gsbase);
> >> +       hv_wrmsr(MSR_EFER, hv_crash_ctxt.efer);
> >>
> >>         /* restore the original kernel CS now via far return */
> >> -       asm volatile("movzwq %0, %%rax\n\t"
> >> -                    "pushq %%rax\n\t"
> >> -                    "pushq $1f\n\t"
> >> -                    "lretq\n\t"
> >> -                    "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
> >> -
> >> -       /* We are in asmlinkage without stack frame, hence make C function
> >> -        * calls which will buy stack frames.
> >> -        */
> >> -       hv_crash_restore_tss();
> >> -       hv_crash_clear_kernpt();
> >> -
> >> -       /* we are now fully in devirtualized normal kernel mode */
> >> -       __crash_kexec(NULL);
> >> -
> >> -       hv_panic_timeout_reboot();
> >> +       asm volatile("pushq     %q0             \n\t"
> >> +                    "leaq      %c1(%%rip), %q0 \n\t"
> >
> > You can use %a1 instead of %c1(%%rip).
> >
>
> Nice.
>
> >> +                    "pushq     %q0             \n\t"
> >> +                    "lretq                     \n\t"
> >
> > No need for terminating \n\t after the last insn in the asm template.
> >
> >> +                    :: "a"(hv_crash_ctxt.cs), "i"(hv_crash_handle));
> >
> > Pedantically, you need ': "+a"(...) : "i"(...)' here.
> >
>
> Right, so the compiler knows that the register will be updated by the asm() block. But what is preventing it from writing back this value to hv_crash_ctxt.cs? The generated code doesn't seem to do so, but the semantics of "+r" suggest otherwise AIUI.
>
> The code following the asm() block is unreachable anyway, so it doesn't really matter either way in practice. Just curious ...

Oh, you just need a temporary here... the original is OK. Indeed, "+r"
will write back the value to the memory location, and this is not what
we want here.

Uros.


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Andrew Cooper @ 2026-02-26 12:01 UTC
  To: Ard Biesheuvel
  Cc: Andrew Cooper, ardb, bp, dave.hansen, decui, haiyangz, hpa, kys,
	linux-hyperv, linux-kernel, longli, mingo, mrathor, tglx, ubizjak,
	wei.liu
In-Reply-To: <20260226095056.46410-2-ardb+git@google.com>

> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
>   * available. We restore kernel GDT, and rest of the context, and continue
>   * to kexec.
>   */
> -static asmlinkage void __noreturn hv_crash_c_entry(void)
> +static void __naked hv_crash_c_entry(void)
>  {
> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> -
>         /* first thing, restore kernel gdt */
> -       native_load_gdt(&ctxt->gdtr);
> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>
> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));

I know this is pre-existing, but the asm here is poor.

All segment register loads can take a memory operand rather than being
forced through %eax, which in turn reduces the setup logic the compiler
needs to emit.

Something like this:

    "movl %0, %%ss" : : "m"(hv_crash_ctxt.ss)

ought to do.

>
> -       asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
> -       asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
> -       asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
> -       asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
> +       asm volatile("movw %%ax, %%ds" : : "a"(hv_crash_ctxt.ds));
> +       asm volatile("movw %%ax, %%es" : : "a"(hv_crash_ctxt.es));
> +       asm volatile("movw %%ax, %%fs" : : "a"(hv_crash_ctxt.fs));
> +       asm volatile("movw %%ax, %%gs" : : "a"(hv_crash_ctxt.gs));
>
> -       native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
> -       asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
> +       hv_wrmsr(MSR_IA32_CR_PAT, hv_crash_ctxt.pat);
> +       asm volatile("movq %0, %%cr0" : : "r"(hv_crash_ctxt.cr0));
>
> -       asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
> -       asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
> -       asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
> +       asm volatile("movq %0, %%cr8" : : "r"(hv_crash_ctxt.cr8));
> +       asm volatile("movq %0, %%cr4" : : "r"(hv_crash_ctxt.cr4));
> +       asm volatile("movq %0, %%cr2" : : "r"(hv_crash_ctxt.cr4));
>
> -       native_load_idt(&ctxt->idtr);
> -       native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
> -       native_wrmsrq(MSR_EFER, ctxt->efer);
> +       asm volatile("lidt %0" : : "m" (hv_crash_ctxt.idtr));
> +       hv_wrmsr(MSR_GS_BASE, hv_crash_ctxt.gsbase);
> +       hv_wrmsr(MSR_EFER, hv_crash_ctxt.efer);
>
>         /* restore the original kernel CS now via far return */
> -       asm volatile("movzwq %0, %%rax\n\t"
> -                    "pushq %%rax\n\t"
> -                    "pushq $1f\n\t"
> -                    "lretq\n\t"
> -                    "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
> -
> -       /* We are in asmlinkage without stack frame, hence make C function
> -        * calls which will buy stack frames.
> -        */
> -       hv_crash_restore_tss();
> -       hv_crash_clear_kernpt();
> -
> -       /* we are now fully in devirtualized normal kernel mode */
> -       __crash_kexec(NULL);
> -
> -       hv_panic_timeout_reboot();
> +       asm volatile("pushq     %q0             \n\t"
> +                    "leaq      %c1(%%rip), %q0 \n\t"
> +                    "pushq     %q0             \n\t"
> +                    "lretq                     \n\t"
> +                    :: "a"(hv_crash_ctxt.cs), "i"(hv_crash_handle));

As Uros notes, "a" is clobbered here but the compiler is not informed.
But it's not necessary.

Since this is a naked function, you could even use three separate asm()
statements, but you can get the compiler to sort out the function
reference automatically with:

    asm volatile ("push %q0\n\t"
                  "push %q1\n\t"
                  "lretq"
                  :: "r"(hv_crash_ctxt.cs), "r"(hv_crash_handle));


(Only tested in godbolt)

~Andrew


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Ard Biesheuvel @ 2026-02-26 13:07 UTC
  To: Andrew Cooper
  Cc: Borislav Petkov, dave.hansen, decui, haiyangz, H . Peter Anvin,
	kys, linux-hyperv, linux-kernel, Long Li, Ingo Molnar,
	Mukesh Rathor, Thomas Gleixner, Uros Bizjak, wei.liu
In-Reply-To: <5a2f3ffd-1692-4c32-b6f7-b94e5066dd95@citrix.com>



On Thu, 26 Feb 2026, at 13:01, Andrew Cooper wrote:
>> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
>>   * available. We restore kernel GDT, and rest of the context, and continue
>>   * to kexec.
>>   */
>> -static asmlinkage void __noreturn hv_crash_c_entry(void)
>> +static void __naked hv_crash_c_entry(void)
>>  {
>> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> -
>>         /* first thing, restore kernel gdt */
>> -       native_load_gdt(&ctxt->gdtr);
>> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>>
>> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
>> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>
> I know this is pre-existing, but the asm here is poor.
>
> All segment register loads can take a memory operand rather than being
> forced through %eax, which in turn reduces the setup logic the compiler
> needs to emit.
>
> Something like this:
>
>     "movl %0, %%ss" : : "m"(hv_crash_ctxt.ss)
>
> ought to do.
>

'movw' seems to work, yes.
...
>
> As Uros notes, "a" is clobbered here but the compiler is not informed. 
> But, it's not necessary.
>
> As a naked function you could even use 3x asm() statements, but you can
> get the compiler to sort out the function reference automatically with:
>
>     asm volatile ("push %q0\n\t"
>                   "push %q1\n\t"
>                   "lretq"
>                   :: "r"(hv_crash_ctxt.cs), "r"(hv_crash_handle));
>
>

Yeah much better - thanks.


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Andrew Cooper @ 2026-02-26 13:24 UTC
  To: Ard Biesheuvel
  Cc: Andrew Cooper, Borislav Petkov, dave.hansen, decui, haiyangz,
	H . Peter Anvin, kys, linux-hyperv, linux-kernel, Long Li,
	Ingo Molnar, Mukesh Rathor, Thomas Gleixner, Uros Bizjak, wei.liu
In-Reply-To: <a7e1b5c1-f933-44e5-99ec-a83b27fcf81e@app.fastmail.com>

On 26/02/2026 1:07 pm, Ard Biesheuvel wrote:
>
> On Thu, 26 Feb 2026, at 13:01, Andrew Cooper wrote:
>>> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
>>>   * available. We restore kernel GDT, and rest of the context, and continue
>>>   * to kexec.
>>>   */
>>> -static asmlinkage void __noreturn hv_crash_c_entry(void)
>>> +static void __naked hv_crash_c_entry(void)
>>>  {
>>> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>> -
>>>         /* first thing, restore kernel gdt */
>>> -       native_load_gdt(&ctxt->gdtr);
>>> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>>>
>>> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>>> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>>> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
>>> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>> I know this is pre-existing, but the asm here is poor.
>>
>> All segment register loads can take a memory operand rather than being
>> forced through %eax, which in turn reduces the setup logic the compiler
>> needs to emit.
>>
>> Something like this:
>>
>>     "movl %0, %%ss" : : "m"(hv_crash_ctxt.ss)
>>
>> ought to do.
>>
> 'movw' seems to work, yes.

movw works, but is sub-optimal.

The segment register instructions are somewhat weird even by x86 standards.

They should always be written as 32-bit operations (movl, and %eax),
removing the operand size prefix which is not necessary for these
instructions to function correctly.

It's absolutely marginal, but it does always pain me to read asm like
this and see the myth of how to access segment selectors being repeated
time and time again.

~Andrew


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Ard Biesheuvel @ 2026-02-26 13:29 UTC
  To: Andrew Cooper
  Cc: Borislav Petkov, dave.hansen, decui, haiyangz, H . Peter Anvin,
	kys, linux-hyperv, linux-kernel, Long Li, Ingo Molnar,
	Mukesh Rathor, Thomas Gleixner, Uros Bizjak, wei.liu
In-Reply-To: <ccc4f915-3623-406e-8df6-f468427264f4@citrix.com>



On Thu, 26 Feb 2026, at 14:24, Andrew Cooper wrote:
> On 26/02/2026 1:07 pm, Ard Biesheuvel wrote:
>>
>> On Thu, 26 Feb 2026, at 13:01, Andrew Cooper wrote:
>>>> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
>>>>   * available. We restore kernel GDT, and rest of the context, and continue
>>>>   * to kexec.
>>>>   */
>>>> -static asmlinkage void __noreturn hv_crash_c_entry(void)
>>>> +static void __naked hv_crash_c_entry(void)
>>>>  {
>>>> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>> -
>>>>         /* first thing, restore kernel gdt */
>>>> -       native_load_gdt(&ctxt->gdtr);
>>>> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>>>>
>>>> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>>>> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>>>> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
>>>> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>>> I know this is pre-existing, but the asm here is poor.
>>>
>>> All segment register loads can take a memory operand rather than being
>>> forced through %eax, which in turn reduces the setup logic the compiler
>>> needs to emit.
>>>
>>> Something like this:
>>>
>>>     "movl %0, %%ss" : : "m"(hv_crash_ctxt.ss)
>>>
>>> ought to do.
>>>
>> 'movw' seems to work, yes.
>
> movw works, but is sub-optimal.
>

Can you give an asm example where movl with a segment register is accepted by the assembler? I only managed that with movw, hence my comment.


* Re: [RFT PATCH] x86/hyperv: Use __naked attribute to fix stackless C function
From: Andrew Cooper @ 2026-02-26 13:52 UTC
  To: Ard Biesheuvel
  Cc: Andrew Cooper, Borislav Petkov, dave.hansen, decui, haiyangz,
	H . Peter Anvin, kys, linux-hyperv, linux-kernel, Long Li,
	Ingo Molnar, Mukesh Rathor, Thomas Gleixner, Uros Bizjak, wei.liu
In-Reply-To: <2ee05c7f-60cb-445b-b761-562385c4e6ba@app.fastmail.com>

On 26/02/2026 1:29 pm, Ard Biesheuvel wrote:
>
> On Thu, 26 Feb 2026, at 14:24, Andrew Cooper wrote:
>> On 26/02/2026 1:07 pm, Ard Biesheuvel wrote:
>>> On Thu, 26 Feb 2026, at 13:01, Andrew Cooper wrote:
>>>>> @@ -133,49 +150,36 @@ static noinline void hv_crash_clear_kernpt(void)
>>>>>   * available. We restore kernel GDT, and rest of the context, and continue
>>>>>   * to kexec.
>>>>>   */
>>>>> -static asmlinkage void __noreturn hv_crash_c_entry(void)
>>>>> +static void __naked hv_crash_c_entry(void)
>>>>>  {
>>>>> -       struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>>> -
>>>>>         /* first thing, restore kernel gdt */
>>>>> -       native_load_gdt(&ctxt->gdtr);
>>>>> +       asm volatile("lgdt %0" : : "m" (hv_crash_ctxt.gdtr));
>>>>>
>>>>> -       asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>>>>> -       asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>>>>> +       asm volatile("movw %%ax, %%ss" : : "a"(hv_crash_ctxt.ss));
>>>>> +       asm volatile("movq %0, %%rsp" : : "m"(hv_crash_ctxt.rsp));
>>>> I know this is pre-existing, but the asm here is poor.
>>>>
>>>> All segment register loads can take a memory operand rather than being
>>>> forced through %eax, which in turn reduces the setup logic the compiler
>>>> needs to emit.
>>>>
>>>> Something like this:
>>>>
>>>>     "movl %0, %%ss" : : "m"(hv_crash_ctxt.ss)
>>>>
>>>> ought to do.
>>>>
>>> 'movw' seems to work, yes.
>> movw works, but is sub-optimal.
>>
> Can you give an asm example where movl with a segment register is accepted by the assembler? I only managed that with movw, hence my comment.

Oh lovely, that looks like a binutils bug, but I bet it comes from not
realising that `mov sreg` is different to the more general mov forms.

Using no suffix will emit the optimal instruction without a warning.

https://godbolt.org/z/GYKs31Gqn

~Andrew


* Re: [PATCH net] net: mana: Ring doorbell at 4 CQ wraparounds
From: Vadim Fedorenko @ 2026-02-26 14:28 UTC
  To: Long Li, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Shradha Gupta, Erni Sri Satya Vennela, linux-hyperv, netdev,
	linux-kernel, stable
In-Reply-To: <20260225184948.941599-1-longli@microsoft.com>

On 25/02/2026 18:49, Long Li wrote:
> MANA hardware requires at least one doorbell ring every 8 wraparounds
> of the CQ. The driver rings the doorbell as a form of flow control to
> inform hardware that CQEs have been consumed.
> 
> The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can
> poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ
> has fewer than 512 entries, a single poll call can process more than
> 4 wraparounds without ringing the doorbell. The doorbell threshold
> check also uses ">" instead of ">=", delaying the ring by one extra
> CQE beyond 4 wraparounds. Combined, these issues can cause the driver
> to exceed the 8-wraparound hardware limit, leading to missed
> completions and stalled queues.
> 
> Fix this by capping the number of CQEs polled per call to 4 wraparounds
> of the CQ in both TX and RX paths. Also change the doorbell threshold
> from ">" to ">=" so the doorbell is rung as soon as 4 wraparounds are
> reached.
> 
> Cc: stable@vger.kernel.org
> Fixes: 58a63729c957 ("net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings")
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>   drivers/net/ethernet/microsoft/mana/mana_en.c | 23 +++++++++++++++----
>   1 file changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9919183ad39e..fe667e0d930d 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1770,8 +1770,14 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
>   	ndev = txq->ndev;
>   	apc = netdev_priv(ndev);
>   
> +	/* Limit CQEs polled to 4 wraparounds of the CQ to ensure the
> +	 * doorbell can be rung in time for the hardware's requirement
> +	 * of at least one doorbell ring every 8 wraparounds.
> +	 */
>   	comp_read = mana_gd_poll_cq(cq->gdma_cq, completions,
> -				    CQE_POLLING_BUFFER);
> +				    min_t(u32, (cq->gdma_cq->queue_size /

No need for min_t() here; plain min() can be used, since queue_size is already u32.

> +					   COMP_ENTRY_SIZE) * 4,
> +					  CQE_POLLING_BUFFER));
>   
>   	if (comp_read < 1)
>   		return;
> @@ -2156,7 +2162,14 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
>   	struct mana_rxq *rxq = cq->rxq;
>   	int comp_read, i;
>   
> -	comp_read = mana_gd_poll_cq(cq->gdma_cq, comp, CQE_POLLING_BUFFER);
> +	/* Limit CQEs polled to 4 wraparounds of the CQ to ensure the
> +	 * doorbell can be rung in time for the hardware's requirement
> +	 * of at least one doorbell ring every 8 wraparounds.
> +	 */
> +	comp_read = mana_gd_poll_cq(cq->gdma_cq, comp,
> +				    min_t(u32, (cq->gdma_cq->queue_size /

same here

> +					   COMP_ENTRY_SIZE) * 4,
> +					  CQE_POLLING_BUFFER));
>   	WARN_ON_ONCE(comp_read > CQE_POLLING_BUFFER);
>   
>   	rxq->xdp_flush = false;
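
The cap computed in both hunks can be sketched in plain C as follows. This is an illustration, not the driver code: the DEMO_ names and demo_ functions are hypothetical, the 512-entry polling buffer comes from the commit message, and the 64-byte completion entry size is an assumption made for the example.

```c
#include <stdint.h>

/* Assumed values for illustration; the real definitions live in the driver. */
#define DEMO_COMP_ENTRY_SIZE	64u	/* bytes per CQE (assumption) */
#define DEMO_CQE_POLLING_BUFFER	512u	/* per the commit message */

/* Both operands are already 32-bit unsigned, so no cast is involved. */
static inline uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Cap one poll call at four wraparounds of the CQ or the buffer size. */
uint32_t demo_poll_budget(uint32_t queue_size_bytes)
{
	uint32_t cqes_per_wrap = queue_size_bytes / DEMO_COMP_ENTRY_SIZE;

	return demo_min_u32(cqes_per_wrap * 4, DEMO_CQE_POLLING_BUFFER);
}
```

Under these assumptions a 4096-byte CQ holds 64 entries, so the budget is 256 CQEs per poll, while larger queues stay limited by the 512-entry buffer.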


* Re: [PATCH 2/3] hv_balloon: Change default page reporting order
From: David Hildenbrand (Arm) @ 2026-02-26 17:34 UTC
  To: Yuvraj Sakshith, akpm, mst
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm, jasowang,
	xuanzhuo, eperezma, virtualization, kys, haiyangz, wei.liu, decui,
	longli, linux-hyperv, linux-kernel
In-Reply-To: <20260226070125.3732265-3-yuvraj.sakshith@oss.qualcomm.com>

On 2/26/26 08:01, Yuvraj Sakshith wrote:
> page_reporting_order used to fall back to default
> value (passed as parameter or MAX_PAGE_ORDER) if
> the driver wishes to not provide it.
> 
> The way the driver used to do this was by passing
> the order as zero.
> 
> Now that zero is a valid order that can be passed by
> a driver to page reporting, we use -1 to signal
> default value to be used.
> 
> Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com>
> ---
>  drivers/hv/hv_balloon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 2b4080e51..e33d6e3b2 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -1663,7 +1663,7 @@ static void enable_page_reporting(void)
>  	 * We let the page_reporting_order parameter decide the order
>  	 * in the page_reporting code
>  	 */
> -	dm_device.pr_dev_info.order = 0;
> +	dm_device.pr_dev_info.order = -1;
>  	ret = page_reporting_register(&dm_device.pr_dev_info);
>  	if (ret < 0) {
>  		dm_device.pr_dev_info.report = NULL;

Logically, that patch must come before #1. And the patch description
should be rephrased to clarify that we want to change that behavior.

Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


* Re: [PATCH 3/3] virtio_balloon: Set pr_dev.order to new default
From: David Hildenbrand (Arm) @ 2026-02-26 17:43 UTC (permalink / raw)
  To: Yuvraj Sakshith, akpm, mst
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm, jasowang,
	xuanzhuo, eperezma, virtualization, kys, haiyangz, wei.liu, decui,
	longli, linux-hyperv, linux-kernel
In-Reply-To: <20260226070125.3732265-4-yuvraj.sakshith@oss.qualcomm.com>

On 2/26/26 08:01, Yuvraj Sakshith wrote:
> Drivers registering with page reporting used zero
> as a way to signal page_reporting_order to be set
> as a default value (either passed as a param or
> MAX_PAGE_ORDER).
> 
> Since page_reporting_order can now have zero as a
> valid order, the default fallback value sent by drivers
> to page reporting is now -1.
> 
> Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com>
> ---
>  drivers/virtio/virtio_balloon.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 74fe59f5a..3cc3dc28a 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1044,6 +1044,20 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  			goto out_unregister_oom;
>  		}
>  
> +		/*
> +		 * page_reporting_register() takes the order either
> +		 * from the driver or the commandline. If neither
> +		 * are provided, it falls back to MAX_PAGE_ORDER.
> +		 *
> +		 * Order given by the driver is required to be in the
> +		 * range [0, MAX_PAGE_ORDER].
> +		 *
> +		 * One way for the driver to not provide any order
> +		 * is by setting it to -1.
> +		 */
> +
> +		vb->pr_dev_info.order = -1;

That overly-long comment indicates that we can do better.


What about the following:

Patch #1: Introduce PAGE_REPORTING_DEFAULT_ORDER

diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index fe648dfa3a7c..3e21bfbb49a4 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -8,6 +8,9 @@
 /* This value should always be a power of 2, see page_reporting_cycle() */
 #define PAGE_REPORTING_CAPACITY                32
 
+/* Specifying this value as minimal reporting order selects the default. */
+#define PAGE_REPORTING_DEFAULT_ORDER   0
+
 struct page_reporting_dev_info {
        /* function that alters pages to make them "reported" */
        int (*report)(struct page_reporting_dev_info *prdev,
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index f0042d5743af..d5191cb6b31c 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -370,7 +370,8 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
         */
 
        if (page_reporting_order == -1) {
-               if (prdev->order > 0 && prdev->order <= MAX_PAGE_ORDER)
+               if (prdev->order != PAGE_REPORTING_DEFAULT_ORDER &&
+                   prdev->order <= MAX_PAGE_ORDER)
                        page_reporting_order = prdev->order;
                else
                        page_reporting_order = pageblock_order;


Patch #2: Use the define in hyperv

Patch #3: Use the define in virtio-balloon

Patch #4: Change PAGE_REPORTING_DEFAULT_ORDER to "MAX_PAGE_ORDER + 1" or something like that.


Then I think you can just drop the comment in virtballoon_probe() completely.
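
To illustrate where the series could end up after patch #4, here is a minimal
userspace sketch; it is not the actual mm/page_reporting.c code. It assumes
MAX_PAGE_ORDER is 10 and uses PAGEBLOCK_ORDER = 9 as a stand-in for
pageblock_order (both are configuration dependent). pick_reporting_order() is
a hypothetical helper that mirrors the check in page_reporting_register().

```c
#include <assert.h>

#define MAX_PAGE_ORDER                10  /* assumption, config dependent */
#define PAGEBLOCK_ORDER               9   /* stand-in for pageblock_order (assumption) */
#define PAGE_REPORTING_DEFAULT_ORDER  (MAX_PAGE_ORDER + 1)

/* The sentinel must lie outside the valid range [0, MAX_PAGE_ORDER]. */
static_assert(PAGE_REPORTING_DEFAULT_ORDER > MAX_PAGE_ORDER,
	      "default-order sentinel must not be a valid order");

/*
 * Hypothetical helper: return the order requested by the driver, or fall
 * back to the default when the driver passed the sentinel or an
 * out-of-range value.
 */
static unsigned int pick_reporting_order(unsigned int drv_order)
{
	if (drv_order != PAGE_REPORTING_DEFAULT_ORDER &&
	    drv_order <= MAX_PAGE_ORDER)
		return drv_order;

	return PAGEBLOCK_ORDER;
}
```

In this sketch, order 0 is honoured as a real request, while a driver that
passes PAGE_REPORTING_DEFAULT_ORDER gets the fallback without needing an
explanatory comment at the call site.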

-- 
Cheers,

David


* [PATCH] mshv: Introduce tracing support
From: Stanislav Kinsburskii @ 2026-02-26 19:18 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Introduce various trace events and use them in the corresponding places
in the driver.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Makefile            |    1 
 drivers/hv/mshv_eventfd.c      |   14 +
 drivers/hv/mshv_irq.c          |    4 
 drivers/hv/mshv_root.h         |    1 
 drivers/hv/mshv_root_hv_call.c |   22 +-
 drivers/hv/mshv_root_main.c    |   78 +++++-
 drivers/hv/mshv_trace.c        |    9 +
 drivers/hv/mshv_trace.h        |  515 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 629 insertions(+), 15 deletions(-)
 create mode 100644 drivers/hv/mshv_trace.c
 create mode 100644 drivers/hv/mshv_trace.h

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 2593711c3628..888a748cc7cb 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -16,6 +16,7 @@ hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
 	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
 mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
+mshv_root-$(CONFIG_TRACEPOINTS) += mshv_trace.o
 mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 492c6258045c..d2efe248ca9b 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -733,6 +733,14 @@ static int mshv_assign_ioeventfd(struct mshv_partition *pt,
 	ret = mshv_register_doorbell(pt->pt_id, ioeventfd_mmio_write,
 				     (void *)pt, p->iovntfd_addr,
 				     p->iovntfd_datamatch, doorbell_flags);
+
+	trace_mshv_assign_ioeventfd(pt->pt_id, p->iovntfd_addr,
+				    p->iovntfd_length,
+				    p->iovntfd_datamatch,
+				    p->iovntfd_wildcard,
+				    p->iovntfd_eventfd,
+				    ret);
+
 	if (ret < 0)
 		goto unlock_fail;
 
@@ -780,6 +788,12 @@ static int mshv_deassign_ioeventfd(struct mshv_partition *pt,
 		    p->iovntfd_datamatch != args->datamatch)
 			continue;
 
+		trace_mshv_deassign_ioeventfd(pt->pt_id, p->iovntfd_addr,
+					      p->iovntfd_length,
+					      p->iovntfd_datamatch,
+					      p->iovntfd_wildcard,
+					      p->iovntfd_eventfd);
+
 		hlist_del_rcu(&p->iovntfd_hnode);
 		synchronize_rcu();
 		ioeventfd_release(p, pt->pt_id);
diff --git a/drivers/hv/mshv_irq.c b/drivers/hv/mshv_irq.c
index 798e7e1ab06e..aba7d3c431b8 100644
--- a/drivers/hv/mshv_irq.c
+++ b/drivers/hv/mshv_irq.c
@@ -71,6 +71,10 @@ int mshv_update_routing_table(struct mshv_partition *partition,
 	mutex_unlock(&partition->pt_irq_lock);
 
 	synchronize_srcu_expedited(&partition->pt_irq_srcu);
+
+	trace_mshv_update_routing_table(partition->pt_id,
+					old, new, numents);
+
 	new = old;
 
 out:
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 04c2a1910a8a..947dfb76bb19 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -17,6 +17,7 @@
 #include <linux/build_bug.h>
 #include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
+#include "mshv_trace.h"
 
 /*
  * Hypervisor must be between these version numbers (inclusive)
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 317191462b63..bdcb8de7fb47 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -44,8 +44,7 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 	struct hv_output_withdraw_memory *output_page;
 	struct page *page;
 	u16 completed;
-	unsigned long remaining = count;
-	u64 status;
+	u64 status, withdrawn = 0;
 	int i;
 	unsigned long flags;
 
@@ -54,7 +53,7 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 		return -ENOMEM;
 	output_page = page_address(page);
 
-	while (remaining) {
+	while (withdrawn < count) {
 		local_irq_save(flags);
 
 		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -62,7 +61,7 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 		memset(input_page, 0, sizeof(*input_page));
 		input_page->partition_id = partition_id;
 		status = hv_do_rep_hypercall(HVCALL_WITHDRAW_MEMORY,
-					     min(remaining, HV_WITHDRAW_BATCH_SIZE),
+					     min(count - withdrawn, HV_WITHDRAW_BATCH_SIZE),
 					     0, input_page, output_page);
 
 		local_irq_restore(flags);
@@ -78,10 +77,12 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 			break;
 		}
 
-		remaining -= completed;
+		withdrawn += completed;
 	}
 	free_page((unsigned long)output_page);
 
+	trace_mshv_hvcall_withdraw_memory(partition_id, withdrawn, status);
+
 	return hv_result_to_errno(status);
 }
 
@@ -125,6 +126,8 @@ int hv_call_create_partition(u64 flags,
 		ret = hv_deposit_memory(hv_current_partition_id, status);
 	} while (!ret);
 
+	trace_mshv_hvcall_create_partition(flags, ret ? ret : *partition_id);
+
 	return ret;
 }
 
@@ -152,6 +155,8 @@ int hv_call_initialize_partition(u64 partition_id)
 		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
+	trace_mshv_hvcall_initialize_partition(partition_id, status);
+
 	return ret;
 }
 
@@ -164,6 +169,8 @@ int hv_call_finalize_partition(u64 partition_id)
 	status = hv_do_fast_hypercall8(HVCALL_FINALIZE_PARTITION,
 				       *(u64 *)&input);
 
+	trace_mshv_hvcall_finalize_partition(partition_id, status);
+
 	return hv_result_to_errno(status);
 }
 
@@ -175,6 +182,8 @@ int hv_call_delete_partition(u64 partition_id)
 	input.partition_id = partition_id;
 	status = hv_do_fast_hypercall8(HVCALL_DELETE_PARTITION, *(u64 *)&input);
 
+	trace_mshv_hvcall_delete_partition(partition_id, status);
+
 	return hv_result_to_errno(status);
 }
 
@@ -571,6 +580,9 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
 		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
+	trace_mshv_hvcall_map_vp_state_page(partition_id, vp_index,
+					    type, status);
+
 	return ret;
 }
 
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e6509c980763..53dbe151de7b 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -430,6 +430,17 @@ mshv_vp_dispatch(struct mshv_vp *vp, u32 flags,
 	status = hv_do_hypercall(HVCALL_DISPATCH_VP, input, output);
 	vp->run.flags.root_sched_dispatched = 0;
 
+	trace_mshv_hvcall_dispatch_vp(vp->vp_partition->pt_id,
+				      vp->vp_index, flags,
+				      output->dispatch_state,
+				      output->dispatch_event,
+#if defined(CONFIG_X86_64)
+				      vp->vp_register_page->interrupt_vectors.as_uint64,
+#else
+				      0,
+#endif
+				      status);
+
 	*res = *output;
 	preempt_enable();
 
@@ -452,6 +463,9 @@ mshv_vp_clear_explicit_suspend(struct mshv_vp *vp)
 	ret = mshv_set_vp_registers(vp->vp_index, vp->vp_partition->pt_id,
 				    1, &explicit_suspend);
 
+	trace_mshv_vp_clear_explicit_suspend(vp->vp_partition->pt_id,
+					     vp->vp_index, ret);
+
 	if (ret)
 		vp_err(vp, "Failed to unsuspend\n");
 
@@ -494,6 +508,12 @@ mshv_vp_wait_for_hv_kick(struct mshv_vp *vp)
 	if (ret)
 		return -EINTR;
 
+	trace_mshv_vp_wait_for_hv_kick(vp->vp_partition->pt_id,
+				       vp->vp_index,
+				       vp->run.kicked_by_hv,
+				       mshv_vp_dispatch_thread_blocked(vp),
+				       mshv_vp_interrupt_pending(vp));
+
 	vp->run.flags.root_sched_blocked = 0;
 	vp->run.kicked_by_hv = 0;
 
@@ -522,6 +542,12 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
 
 		if (__xfer_to_guest_mode_work_pending()) {
 			ret = xfer_to_guest_mode_handle_work();
+
+			trace_mshv_xfer_to_guest_mode_work(vp->vp_partition->pt_id,
+							   vp->vp_index,
+							   read_thread_flags(),
+							   ret);
+
 			if (ret)
 				break;
 		}
@@ -673,6 +699,8 @@ static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 {
 	long rc;
 
+	trace_mshv_run_vp_entry(vp->vp_partition->pt_id, vp->vp_index);
+
 	do {
 		if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
 			rc = mshv_run_vp_with_root_scheduler(vp);
@@ -680,6 +708,10 @@ static long mshv_vp_ioctl_run_vp(struct mshv_vp *vp, void __user *ret_msg)
 			rc = mshv_run_vp_with_hyp_scheduler(vp);
 	} while (rc == 0 && mshv_vp_handle_intercept(vp));
 
+	trace_mshv_run_vp_exit(vp->vp_partition->pt_id, vp->vp_index,
+			       vp->vp_intercept_msg_page->header.message_type,
+			       rc);
+
 	if (rc)
 		return rc;
 
@@ -941,6 +973,8 @@ mshv_vp_release(struct inode *inode, struct file *filp)
 {
 	struct mshv_vp *vp = filp->private_data;
 
+	trace_mshv_vp_release(vp->vp_partition->pt_id, vp->vp_index);
+
 	/* Rest of VP cleanup happens in destroy_partition() */
 	mshv_partition_put(vp->vp_partition);
 	return 0;
@@ -1113,7 +1147,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	partition->pt_vp_count++;
 	partition->pt_vp_array[args.vp_index] = vp;
 
-	return ret;
+	goto out;
 
 remove_debugfs_vp:
 	mshv_debugfs_vp_remove(vp);
@@ -1139,6 +1173,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 			       intercept_msg_page, input_vtl_zero);
 destroy_vp:
 	hv_call_delete_vp(partition->pt_id, args.vp_index);
+out:
+	trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
 	return ret;
 }
 
@@ -1338,6 +1374,10 @@ mshv_map_user_memory(struct mshv_partition *partition,
 		break;
 	}
 
+	trace_mshv_map_user_memory(partition->pt_id, region->start_uaddr,
+				   region->start_gfn, region->nr_pages,
+				   region->hv_map_flags, ret);
+
 	if (ret)
 		goto errout;
 
@@ -1633,6 +1673,9 @@ disable_vp_dispatch(struct mshv_vp *vp)
 	if (ret)
 		vp_err(vp, "failed to suspend\n");
 
+	trace_mshv_disable_vp_dispatch(vp->vp_partition->pt_id,
+				       vp->vp_index, ret);
+
 	return ret;
 }
 
@@ -1681,6 +1724,8 @@ drain_vp_signals(struct mshv_vp *vp)
 		vp->run.kicked_by_hv = 0;
 		vp_signal_count = atomic64_read(&vp->run.vp_signaled_count);
 	}
+
+	trace_mshv_drain_vp_signals(vp->vp_partition->pt_id, vp->vp_index);
 }
 
 static void drain_all_vps(const struct mshv_partition *partition)
@@ -1734,6 +1779,8 @@ static void destroy_partition(struct mshv_partition *partition)
 		return;
 	}
 
+	trace_mshv_destroy_partition(partition->pt_id);
+
 	if (partition->pt_initialized) {
 		/*
 		 * We only need to drain signals for root scheduler. This should be
@@ -1840,6 +1887,8 @@ mshv_partition_release(struct inode *inode, struct file *filp)
 {
 	struct mshv_partition *partition = filp->private_data;
 
+	trace_mshv_partition_release(partition->pt_id);
+
 	mshv_eventfd_release(partition);
 
 	cleanup_srcu_struct(&partition->pt_irq_srcu);
@@ -1969,6 +2018,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 	struct hv_partition_creation_properties creation_properties;
 	union hv_partition_isolation_properties isolation_properties;
 	struct mshv_partition *partition;
+	u64 pt_id = -1;
 	long ret;
 
 	ret = mshv_ioctl_process_pt_flags(user_arg, &creation_flags,
@@ -2008,22 +2058,29 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 	ret = hv_call_create_partition(creation_flags,
 				       creation_properties,
 				       isolation_properties,
-				       &partition->pt_id);
+				       &pt_id);
 	if (ret)
 		goto cleanup_irq_srcu;
 
+	partition->pt_id = pt_id;
+
 	ret = add_partition(partition);
 	if (ret)
 		goto delete_partition;
 
 	ret = mshv_init_async_handler(partition);
-	if (!ret) {
-		ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
-							   &mshv_partition_fops,
-							   partition, O_RDWR));
-		if (ret >= 0)
-			return ret;
-	}
+	if (ret)
+		goto remove_partition;
+
+	ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
+						   &mshv_partition_fops,
+						   partition, O_RDWR));
+	if (ret < 0)
+		goto remove_partition;
+
+	goto out;
+
+remove_partition:
 	remove_partition(partition);
 delete_partition:
 	hv_call_delete_partition(partition->pt_id);
@@ -2031,7 +2088,8 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
 	cleanup_srcu_struct(&partition->pt_irq_srcu);
 free_partition:
 	kfree(partition);
-
+out:
+	trace_mshv_create_partition(pt_id, ret);
 	return ret;
 }
 
diff --git a/drivers/hv/mshv_trace.c b/drivers/hv/mshv_trace.c
new file mode 100644
index 000000000000..0936b2f95edd
--- /dev/null
+++ b/drivers/hv/mshv_trace.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Tracepoint definitions for mshv driver.
+ */
+
+#define CREATE_TRACE_POINTS
+#include "mshv_trace.h"
diff --git a/drivers/hv/mshv_trace.h b/drivers/hv/mshv_trace.h
new file mode 100644
index 000000000000..ba3b3f575983
--- /dev/null
+++ b/drivers/hv/mshv_trace.h
@@ -0,0 +1,515 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Tracepoint declarations for mshv driver.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mshv
+
+#if !defined(_MSHV_TRACE_H_) || defined(TRACE_HEADER_MULTI_READ)
+#define _MSHV_TRACE_H_
+
+#include <linux/tracepoint.h>
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/hv
+
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE mshv_trace
+
+TRACE_EVENT(mshv_create_partition,
+	    TP_PROTO(u64 partition_id, int vm_fd),
+	    TP_ARGS(partition_id, vm_fd),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(int, vm_fd)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vm_fd = vm_fd;
+	    ),
+	    TP_printk("partition_id=%llu vm_fd=%d",
+		    __entry->partition_id,
+		    __entry->vm_fd
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_create_partition,
+	    TP_PROTO(u64 flags, s64 partition_id),
+	    TP_ARGS(flags, partition_id),
+	    TP_STRUCT__entry(
+		    __field(u64, flags)
+		    __field(s64, partition_id)
+	    ),
+	    TP_fast_assign(
+		    __entry->flags = flags;
+		    __entry->partition_id = partition_id;
+	    ),
+	    TP_printk("flags=%#llx partition_id=%lld",
+		    __entry->flags,
+		    __entry->partition_id
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_initialize_partition,
+	    TP_PROTO(u64 partition_id, u64 status),
+	    TP_ARGS(partition_id, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_partition_release,
+	    TP_PROTO(u64 partition_id),
+	    TP_ARGS(partition_id),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+	    ),
+	    TP_printk("partition_id=%llu",
+		    __entry->partition_id
+	    )
+);
+
+TRACE_EVENT(mshv_destroy_partition,
+	    TP_PROTO(u64 partition_id),
+	    TP_ARGS(partition_id),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+	    ),
+	    TP_printk("partition_id=%llu",
+		    __entry->partition_id
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_finalize_partition,
+	    TP_PROTO(u64 partition_id, u64 status),
+	    TP_ARGS(partition_id, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_withdraw_memory,
+	    TP_PROTO(u64 partition_id, u64 withdrawn, u64 status),
+	    TP_ARGS(partition_id, withdrawn, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, withdrawn)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->withdrawn = withdrawn;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu withdrawn=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->withdrawn,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_delete_partition,
+	    TP_PROTO(u64 partition_id, u64 status),
+	    TP_ARGS(partition_id, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu status=%#llx",
+		    __entry->partition_id,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_create_vp,
+	    TP_PROTO(u64 partition_id, u32 vp_index, long vp_fd),
+	    TP_ARGS(partition_id, vp_index, vp_fd),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(long, vp_fd)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->vp_fd = vp_fd;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u vp_fd=%ld",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->vp_fd
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_map_vp_state_page,
+	    TP_PROTO(u64 partition_id, u32 vp_index, u32 page_type, u64 status),
+	    TP_ARGS(partition_id, vp_index, page_type, status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(u32, page_type)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->page_type = page_type;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u page_type=%u status=%#llx",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->page_type,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_drain_vp_signals,
+	    TP_PROTO(u64 partition_id, u32 vp_index),
+	    TP_ARGS(partition_id, vp_index),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u",
+		    __entry->partition_id,
+		    __entry->vp_index
+	    )
+);
+
+TRACE_EVENT(mshv_disable_vp_dispatch,
+	    TP_PROTO(u64 partition_id, u32 vp_index, int ret),
+	    TP_ARGS(partition_id, vp_index, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u ret=%d",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_vp_release,
+	    TP_PROTO(u64 partition_id, u32 vp_index),
+	    TP_ARGS(partition_id, vp_index),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u",
+		    __entry->partition_id,
+		    __entry->vp_index
+	    )
+);
+
+TRACE_EVENT(mshv_run_vp_entry,
+	    TP_PROTO(u64 partition_id, u32 vp_index),
+	    TP_ARGS(partition_id, vp_index),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u",
+		    __entry->partition_id,
+		    __entry->vp_index
+	    )
+);
+
+TRACE_EVENT(mshv_run_vp_exit,
+	    TP_PROTO(u64 partition_id, u32 vp_index, u64 hv_message_type, long ret),
+	    TP_ARGS(partition_id, vp_index, hv_message_type, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(u64, hv_message_type)
+		    __field(long, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->hv_message_type = hv_message_type;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u hv_message_type=%#llx ret=%ld",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->hv_message_type,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_vp_clear_explicit_suspend,
+	    TP_PROTO(u64 partition_id, u32 vp_index, int ret),
+	    TP_ARGS(partition_id, vp_index, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u ret=%d",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_xfer_to_guest_mode_work,
+	    TP_PROTO(u64 partition_id, u32 vp_index, unsigned long thread_info_flag, long ret),
+	    TP_ARGS(partition_id, vp_index, thread_info_flag, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(unsigned long, thread_info_flag)
+		    __field(long, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->thread_info_flag = thread_info_flag;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u thread_info_flag=%#lx ret=%ld",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->thread_info_flag,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_hvcall_dispatch_vp,
+	    TP_PROTO(u64 partition_id, u32 vp_index, u32 flags,
+		     u32 dispatch_state, u32 dispatch_event, u64 irq_vectors, u64 status),
+	    TP_ARGS(partition_id, vp_index, flags, dispatch_state, dispatch_event, irq_vectors,
+		    status),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(u32, flags)
+		    __field(u32, dispatch_state)
+		    __field(u32, dispatch_event)
+		    __field(u64, irq_vectors)
+		    __field(u64, status)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->flags = flags;
+		    __entry->dispatch_state = dispatch_state;
+		    __entry->dispatch_event = dispatch_event;
+		    __entry->irq_vectors = irq_vectors;
+		    __entry->status = status;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u flags=%#x dispatch_state=%#x dispatch_event=%#x irq_vectors=%#016llx status=%#llx",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->flags,
+		    __entry->dispatch_state,
+		    __entry->dispatch_event,
+		    __entry->irq_vectors,
+		    __entry->status
+	    )
+);
+
+TRACE_EVENT(mshv_update_routing_table,
+	    TP_PROTO(u64 partition_id, void *old, void *new, u32 numents),
+	    TP_ARGS(partition_id, old, new, numents),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(struct mshv_girq_routing_table *, old)
+		    __field(struct mshv_girq_routing_table *, new)
+		    __field(u32, numents)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->old = old;
+		    __entry->new = new;
+		    __entry->numents = numents;
+	    ),
+	    TP_printk("partition_id=%llu old=%p new=%p numents=%u",
+		    __entry->partition_id,
+		    __entry->old,
+		    __entry->new,
+		    __entry->numents
+	    )
+);
+
+TRACE_EVENT(mshv_map_user_memory,
+	    TP_PROTO(u64 partition_id, u64 start_uaddr, u64 start_gfn, u64 nr_pages, u32 map_flags,
+		     long ret),
+	    TP_ARGS(partition_id, start_uaddr, start_gfn, nr_pages, map_flags, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, start_uaddr)
+		    __field(u64, start_gfn)
+		    __field(u64, nr_pages)
+		    __field(u32, map_flags)
+		    __field(long, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->start_uaddr = start_uaddr;
+		    __entry->start_gfn = start_gfn;
+		    __entry->nr_pages = nr_pages;
+		    __entry->map_flags = map_flags;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu start_uaddr=%#llx start_gfn=%#llx nr_pages=%llu map_flags=%#x ret=%ld",
+		    __entry->partition_id,
+		    __entry->start_uaddr,
+		    __entry->start_gfn,
+		    __entry->nr_pages,
+		    __entry->map_flags,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_assign_ioeventfd,
+	    TP_PROTO(u64 partition_id, u64 addr, u64 length, u64 datamatch, bool wildcard,
+		     void *eventfd, int ret),
+	    TP_ARGS(partition_id, addr, length, datamatch, wildcard, eventfd, ret),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, addr)
+		    __field(u64, length)
+		    __field(u64, datamatch)
+		    __field(bool, wildcard)
+		    __field(struct eventfd_ctx *, eventfd)
+		    __field(int, ret)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->addr = addr;
+		    __entry->length = length;
+		    __entry->datamatch = datamatch;
+		    __entry->wildcard = wildcard;
+		    __entry->eventfd = eventfd;
+		    __entry->ret = ret;
+	    ),
+	    TP_printk("partition_id=%llu addr=%#016llx length=%#llx datamatch=%#llx wildcard=%d eventfd=%p ret=%d",
+		    __entry->partition_id,
+		    __entry->addr,
+		    __entry->length,
+		    __entry->datamatch,
+		    __entry->wildcard,
+		    __entry->eventfd,
+		    __entry->ret
+	    )
+);
+
+TRACE_EVENT(mshv_deassign_ioeventfd,
+	    TP_PROTO(u64 partition_id, u64 addr, u64 length, u64 datamatch, bool wildcard,
+		     void *eventfd),
+	    TP_ARGS(partition_id, addr, length, datamatch, wildcard, eventfd),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u64, addr)
+		    __field(u64, length)
+		    __field(u64, datamatch)
+		    __field(bool, wildcard)
+		    __field(struct eventfd_ctx *, eventfd)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->addr = addr;
+		    __entry->length = length;
+		    __entry->datamatch = datamatch;
+		    __entry->wildcard = wildcard;
+		    __entry->eventfd = eventfd;
+	    ),
+	    TP_printk("partition_id=%llu addr=%#016llx length=%#llx datamatch=%#llx wildcard=%d eventfd=%p",
+		    __entry->partition_id,
+		    __entry->addr,
+		    __entry->length,
+		    __entry->datamatch,
+		    __entry->wildcard,
+		    __entry->eventfd
+	    )
+);
+
+TRACE_EVENT(mshv_vp_wait_for_hv_kick,
+	    TP_PROTO(u64 partition_id, u32 vp_index, bool kicked_by_hv, bool blocked,
+		     bool irq_pending),
+	    TP_ARGS(partition_id, vp_index, kicked_by_hv, blocked, irq_pending),
+	    TP_STRUCT__entry(
+		    __field(u64, partition_id)
+		    __field(u32, vp_index)
+		    __field(bool, kicked_by_hv)
+		    __field(bool, blocked)
+		    __field(bool, irq_pending)
+	    ),
+	    TP_fast_assign(
+		    __entry->partition_id = partition_id;
+		    __entry->vp_index = vp_index;
+		    __entry->kicked_by_hv = kicked_by_hv;
+		    __entry->blocked = blocked;
+		    __entry->irq_pending = irq_pending;
+	    ),
+	    TP_printk("partition_id=%llu vp_index=%u kicked_by_hv=%d blocked=%d irq_pending=%d",
+		    __entry->partition_id,
+		    __entry->vp_index,
+		    __entry->kicked_by_hv,
+		    __entry->blocked,
+		    __entry->irq_pending
+	    )
+);
+
+#endif /* _MSHV_TRACE_H_ */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>




* RE: [EXTERNAL] Re: [PATCH net] net: mana: Ring doorbell at 4 CQ wraparounds
From: Long Li @ 2026-02-26 19:22 UTC (permalink / raw)
  To: Vadim Fedorenko, KY Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Shradha Gupta, Erni Sri Satya Vennela,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <46896339-b3a3-4109-a2e2-324446be5aeb@linux.dev>

> Subject: [EXTERNAL] Re: [PATCH net] net: mana: Ring doorbell at 4 CQ
> wraparounds
> 
> On 25/02/2026 18:49, Long Li wrote:
> > MANA hardware requires at least one doorbell ring every 8 wraparounds
> > of the CQ. The driver rings the doorbell as a form of flow control to
> > inform hardware that CQEs have been consumed.
> >
> > The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can
> > poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ
> > has fewer than 512 entries, a single poll call can process more than
> > 4 wraparounds without ringing the doorbell. The doorbell threshold
> > check also uses ">" instead of ">=", delaying the ring by one extra
> > CQE beyond 4 wraparounds. Combined, these issues can cause the driver
> > to exceed the 8-wraparound hardware limit, leading to missed
> > completions and stalled queues.
> >
> > Fix this by capping the number of CQEs polled per call to 4
> > wraparounds of the CQ in both TX and RX paths. Also change the
> > doorbell threshold from ">" to ">=" so the doorbell is rung as soon as
> > 4 wraparounds are reached.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 58a63729c957 ("net: mana: Fix doorbell out of order violation
> > and avoid unnecessary doorbell rings")
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >   drivers/net/ethernet/microsoft/mana/mana_en.c | 23 +++++++++++++++----
> >   1 file changed, 18 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9919183ad39e..fe667e0d930d 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1770,8 +1770,14 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
> >   	ndev = txq->ndev;
> >   	apc = netdev_priv(ndev);
> >
> > +	/* Limit CQEs polled to 4 wraparounds of the CQ to ensure the
> > +	 * doorbell can be rung in time for the hardware's requirement
> > +	 * of at least one doorbell ring every 8 wraparounds.
> > +	 */
> >   	comp_read = mana_gd_poll_cq(cq->gdma_cq, completions,
> > -				    CQE_POLLING_BUFFER);
> > +				    min_t(u32, (cq->gdma_cq->queue_size /
> 
> no need for min_t, simple min() can be used, queue_size is already u32

Thank you, I'm sending v2.

Long

> 
> > +					   COMP_ENTRY_SIZE) * 4,
> > +					  CQE_POLLING_BUFFER));
> >
> >   	if (comp_read < 1)
> >   		return;
> > @@ -2156,7 +2162,14 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
> >   	struct mana_rxq *rxq = cq->rxq;
> >   	int comp_read, i;
> >
> > -	comp_read = mana_gd_poll_cq(cq->gdma_cq, comp, CQE_POLLING_BUFFER);
> > +	/* Limit CQEs polled to 4 wraparounds of the CQ to ensure the
> > +	 * doorbell can be rung in time for the hardware's requirement
> > +	 * of at least one doorbell ring every 8 wraparounds.
> > +	 */
> > +	comp_read = mana_gd_poll_cq(cq->gdma_cq, comp,
> > +				    min_t(u32, (cq->gdma_cq->queue_size /
> 
> same here
> 
> > +					   COMP_ENTRY_SIZE) * 4,
> > +					  CQE_POLLING_BUFFER));
> >   	WARN_ON_ONCE(comp_read > CQE_POLLING_BUFFER);
> >
> >   	rxq->xdp_flush = false;


* [PATCH net v2] net: mana: Ring doorbell at 4 CQ wraparounds
From: Long Li @ 2026-02-26 19:28 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Shradha Gupta, Erni Sri Satya Vennela, linux-hyperv, netdev,
	linux-kernel, stable

MANA hardware requires at least one doorbell ring every 8 wraparounds
of the CQ. The driver rings the doorbell as a form of flow control to
inform hardware that CQEs have been consumed.

The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can
poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ
has fewer than 512 entries, a single poll call can process more than
4 wraparounds without ringing the doorbell. The doorbell threshold
check also uses ">" instead of ">=", delaying the ring by one extra
CQE beyond 4 wraparounds. Combined, these issues can cause the driver
to exceed the 8-wraparound hardware limit, leading to missed
completions and stalled queues.

Fix this by capping the number of CQEs polled per call to 4 wraparounds
of the CQ in both TX and RX paths. Also change the doorbell threshold
from ">" to ">=" so the doorbell is rung as soon as 4 wraparounds are
reached.

Cc: stable@vger.kernel.org
Fixes: 58a63729c957 ("net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings")
Signed-off-by: Long Li <longli@microsoft.com>
---
v2: Use min() instead of min_t(u32, ...) since queue_size is already u32
 drivers/net/ethernet/microsoft/mana/mana_en.c | 23 +++++++++++++++----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9919183ad39e..7fed4ae07071 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1770,8 +1770,14 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
 	ndev = txq->ndev;
 	apc = netdev_priv(ndev);
 
+	/* Limit CQEs polled to 4 wraparounds of the CQ to ensure the
+	 * doorbell can be rung in time for the hardware's requirement
+	 * of at least one doorbell ring every 8 wraparounds.
+	 */
 	comp_read = mana_gd_poll_cq(cq->gdma_cq, completions,
-				    CQE_POLLING_BUFFER);
+				    min((cq->gdma_cq->queue_size /
+					  COMP_ENTRY_SIZE) * 4,
+					 CQE_POLLING_BUFFER));
 
 	if (comp_read < 1)
 		return;
@@ -2156,7 +2162,14 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
 	struct mana_rxq *rxq = cq->rxq;
 	int comp_read, i;
 
-	comp_read = mana_gd_poll_cq(cq->gdma_cq, comp, CQE_POLLING_BUFFER);
+	/* Limit CQEs polled to 4 wraparounds of the CQ to ensure the
+	 * doorbell can be rung in time for the hardware's requirement
+	 * of at least one doorbell ring every 8 wraparounds.
+	 */
+	comp_read = mana_gd_poll_cq(cq->gdma_cq, comp,
+				    min((cq->gdma_cq->queue_size /
+					  COMP_ENTRY_SIZE) * 4,
+					 CQE_POLLING_BUFFER));
 	WARN_ON_ONCE(comp_read > CQE_POLLING_BUFFER);
 
 	rxq->xdp_flush = false;
@@ -2201,11 +2214,11 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
 		mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
 		cq->work_done_since_doorbell = 0;
 		napi_complete_done(&cq->napi, w);
-	} else if (cq->work_done_since_doorbell >
-		   cq->gdma_cq->queue_size / COMP_ENTRY_SIZE * 4) {
+	} else if (cq->work_done_since_doorbell >=
+		   (cq->gdma_cq->queue_size / COMP_ENTRY_SIZE) * 4) {
 		/* MANA hardware requires at least one doorbell ring every 8
 		 * wraparounds of CQ even if there is no need to arm the CQ.
-		 * This driver rings the doorbell as soon as we have exceeded
+		 * This driver rings the doorbell as soon as it has processed
 		 * 4 wraparounds.
 		 */
 		mana_gd_ring_cq(gdma_queue, 0);
-- 
2.43.0
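
For illustration only, the arithmetic behind the polling cap and the
">=" doorbell check can be sketched in user-space C. This is a minimal
sketch, not the driver code: cqe_poll_budget() and doorbell_due() are
hypothetical helper names, and COMP_ENTRY_SIZE is assumed to be 64
bytes here. CQE_POLLING_BUFFER is 512, as stated in the commit message.

```c
#include <stdint.h>

/* Assumed values for this sketch; only the 512 comes from the commit message. */
#define COMP_ENTRY_SIZE		64u
#define CQE_POLLING_BUFFER	512u

/* Hypothetical helper: number of CQEs one poll call may consume. */
static uint32_t cqe_poll_budget(uint32_t queue_size_bytes)
{
	/* Entries in the CQ times 4 gives four full wraparounds. */
	uint32_t four_wraps = (queue_size_bytes / COMP_ENTRY_SIZE) * 4;

	/* Never exceed the size of the on-stack completion buffer. */
	return four_wraps < CQE_POLLING_BUFFER ? four_wraps : CQE_POLLING_BUFFER;
}

/* Hypothetical helper: true once 4 wraparounds have been processed. */
static int doorbell_due(uint32_t work_done_since_doorbell,
			uint32_t queue_size_bytes)
{
	return work_done_since_doorbell >=
	       (queue_size_bytes / COMP_ENTRY_SIZE) * 4;
}
```

Under these assumptions a 4096-byte CQ holds 64 entries, so one poll is
capped at 256 CQEs rather than 512, and the doorbell check fires at
exactly 256 processed CQEs instead of 257.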



* RE: [PATCH, net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
From: Long Li @ 2026-02-26 19:48 UTC (permalink / raw)
  To: Dipayaan Roy, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, andrew+netdev@lunn.ch, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	Konstantin Taranov, horms@kernel.org,
	shradhagupta@linux.microsoft.com, ssengar@linux.microsoft.com,
	ernis@linux.microsoft.com, Shiraz Saleem,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	Dipayaan Roy
In-Reply-To: <aZwUDlTkb5xunIkH@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

> The GF stats periodic query is used as a mechanism to monitor HWC health.
> If this HWC command times out, it is a strong indication that the device/SoC is in a
> faulty state and requires recovery.
> 
> Today, when a timeout is detected, the driver marks hwc_timeout_occurred,
> clears cached stats, and stops rescheduling the periodic work. However, the
> device itself is left in the same failing state.
> 
> Extend the timeout handling path to trigger the existing MANA VF recovery
> service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
> This is expected to initiate the appropriate recovery flow: suspend/resume
> first and, if that fails, a bus rescan.
> 
> This change is intentionally limited to HWC command timeouts and does not
> trigger recovery for errors reported by the SoC as a normal command response.
> 
> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 14 +++-------
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 28 ++++++++++++++++++-
>  include/net/mana/gdma.h                       | 16 +++++++++--
>  3 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 0055c231acf6..16c438d2aaa3 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -490,15 +490,9 @@ static void mana_serv_reset(struct pci_dev *pdev)
>  		dev_info(&pdev->dev, "MANA reset cycle completed\n");
> 
>  out:
> -	gc->in_service = false;
> +	clear_bit(GC_IN_SERVICE, &gc->flags);
>  }
> 
> -struct mana_serv_work {
> -	struct work_struct serv_work;
> -	struct pci_dev *pdev;
> -	enum gdma_eqe_type type;
> -};
> -
>  static void mana_do_service(enum gdma_eqe_type type, struct pci_dev *pdev)
> {
>  	switch (type) {
> @@ -542,7 +536,7 @@ static void mana_recovery_delayed_func(struct work_struct *w)
>  	spin_unlock_irqrestore(&work->lock, flags);
>  }
> 
> -static void mana_serv_func(struct work_struct *w)
> +void mana_serv_func(struct work_struct *w)
>  {
>  	struct mana_serv_work *mns_wk;
>  	struct pci_dev *pdev;
> @@ -624,7 +618,7 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
>  			break;
>  		}
> 
> -		if (gc->in_service) {
> +		if (test_bit(GC_IN_SERVICE, &gc->flags)) {
>  			dev_info(gc->dev, "Already in service\n");
>  			break;
>  		}
> @@ -641,7 +635,7 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
>  		}
> 
>  		dev_info(gc->dev, "Start MANA service type:%d\n", type);
> -		gc->in_service = true;
> +		set_bit(GC_IN_SERVICE, &gc->flags);
>  		mns_wk->pdev = to_pci_dev(gc->dev);
>  		mns_wk->type = type;
>  		pci_dev_get(mns_wk->pdev);
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 91c418097284..8da574cf06f2 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -879,7 +879,7 @@ static void mana_tx_timeout(struct net_device *netdev, unsigned int txqueue)
>  	struct gdma_context *gc = ac->gdma_dev->gdma_context;
> 
>  	/* Already in service, hence tx queue reset is not required.*/
> -	if (gc->in_service)
> +	if (test_bit(GC_IN_SERVICE, &gc->flags))
>  		return;
> 
>  	/* Note: If there are pending queue reset work for this port(apc),
> @@ -3533,6 +3533,8 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
>  {
>  	struct mana_context *ac =
>  		container_of(to_delayed_work(work), struct mana_context, gf_stats_work);
> +	struct gdma_context *gc = ac->gdma_dev->gdma_context;
> +	struct mana_serv_work *mns_wk;
>  	int err;
> 
>  	err = mana_query_gf_stats(ac);
> @@ -3540,6 +3542,30 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
>  		/* HWC timeout detected - reset stats and stop rescheduling */
>  		ac->hwc_timeout_occurred = true;
>  		memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
> +		dev_warn(gc->dev,
> +			 "Gf stats wk handler: gf stats query timed out.\n");
> +
> +		/* As HWC timed out, indicating a faulty HW state and needs a
> +		 * reset.
> +		 */
> +		if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
> +			if (!try_module_get(THIS_MODULE)) {
> +				dev_info(gc->dev, "Module is unloading\n");
> +				return;
> +			}
> +
> +			mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
> +			if (!mns_wk) {
> +				module_put(THIS_MODULE);

This may not be necessary, but do you want to call clear_bit(GC_IN_SERVICE, &gc->flags) here? The bit was set by test_and_set_bit() above, and no work is queued on this path.

> +				return;
> +			}
> +
> +			mns_wk->pdev = to_pci_dev(gc->dev);
> +			mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST;
> +			pci_dev_get(mns_wk->pdev);
> +			INIT_WORK(&mns_wk->serv_work, mana_serv_func);
> +			schedule_work(&mns_wk->serv_work);
> +		}
>  		return;
>  	}
>  	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
> diff --git a/include/net/mana/gdma.h
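
If the bit should indeed be released on the failure path, the pattern
would look roughly like the user-space sketch below. It is only an
illustration under stated assumptions: a C11 atomic_flag stands in for
the GC_IN_SERVICE bit in gc->flags, calloc() stands in for kzalloc(),
and try_start_service() and struct serv_work are hypothetical names that
do not exist in the driver.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the GC_IN_SERVICE bit in gc->flags. */
static atomic_flag in_service = ATOMIC_FLAG_INIT;

/* Hypothetical stand-in for struct mana_serv_work. */
struct serv_work {
	int type;
};

/*
 * Claim the "in service" state and allocate a work item.
 * Returns NULL if service is already in progress or allocation failed.
 */
static struct serv_work *try_start_service(bool simulate_alloc_failure)
{
	struct serv_work *wk;

	/* Like !test_and_set_bit(): only the first caller gets past here. */
	if (atomic_flag_test_and_set(&in_service))
		return NULL;

	wk = simulate_alloc_failure ? NULL : calloc(1, sizeof(*wk));
	if (!wk) {
		/* Nothing was queued, so give the claim back. */
		atomic_flag_clear(&in_service);
		return NULL;
	}

	return wk;
}
```

Without the atomic_flag_clear() call, a single allocation failure would
leave the flag set and every later attempt would bail out as "already in
service" even though no service work exists.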



* RE: [PATCH net v2] net: mana: Ring doorbell at 4 CQ wraparounds
From: Haiyang Zhang @ 2026-02-26 20:40 UTC (permalink / raw)
  To: Long Li, KY Srinivasan, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Shradha Gupta, Erni Sri Satya Vennela,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260226192833.1050807-1-longli@microsoft.com>



> -----Original Message-----
> From: Long Li <longli@microsoft.com>
> Sent: Thursday, February 26, 2026 2:29 PM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S . Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>
> Cc: Shradha Gupta <shradhagupta@linux.microsoft.com>; Erni Sri Satya
> Vennela <ernis@linux.microsoft.com>; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> stable@vger.kernel.org
> Subject: [PATCH net v2] net: mana: Ring doorbell at 4 CQ wraparounds
> 
> MANA hardware requires at least one doorbell ring every 8 wraparounds
> of the CQ. The driver rings the doorbell as a form of flow control to
> inform hardware that CQEs have been consumed.
> 
> The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can
> poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ
> has fewer than 512 entries, a single poll call can process more than
> 4 wraparounds without ringing the doorbell. The doorbell threshold
> check also uses ">" instead of ">=", delaying the ring by one extra
> CQE beyond 4 wraparounds. Combined, these issues can cause the driver
> to exceed the 8-wraparound hardware limit, leading to missed
> completions and stalled queues.
> 
> Fix this by capping the number of CQEs polled per call to 4 wraparounds
> of the CQ in both TX and RX paths. Also change the doorbell threshold
> from ">" to ">=" so the doorbell is rung as soon as 4 wraparounds are
> reached.
> 
> Cc: stable@vger.kernel.org
> Fixes: 58a63729c957 ("net: mana: Fix doorbell out of order violation and
> avoid unnecessary doorbell rings")
> Signed-off-by: Long Li <longli@microsoft.com>
> ---

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>



* Re: [PATCH] mshv: Introduce tracing support
From: kernel test robot @ 2026-02-27  3:45 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: llvm, oe-kbuild-all, linux-hyperv, linux-kernel
In-Reply-To: <177213348504.92223.5330421592610811972.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Hi Stanislav,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v7.0-rc1 next-20260226]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/mshv-Introduce-tracing-support/20260227-031942
base:   linus/master
patch link:    https://lore.kernel.org/r/177213348504.92223.5330421592610811972.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH] mshv: Introduce tracing support
config: x86_64-randconfig-072-20260227 (https://download.01.org/0day-ci/archive/20260227/202602271123.ilt6wmeA-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260227/202602271123.ilt6wmeA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602271123.ilt6wmeA-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/hv/mshv_root_main.c:1106:6: warning: variable 'vp' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    1106 |         if (ret)
         |             ^~~
   drivers/hv/mshv_root_main.c:1177:41: note: uninitialized use occurs here
    1177 |         trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
         |                                                ^~
   drivers/hv/mshv_root_main.c:1106:2: note: remove the 'if' if its condition is always false
    1106 |         if (ret)
         |         ^~~~~~~~
    1107 |                 goto unmap_ghcb_page;
         |                 ~~~~~~~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:1100:7: warning: variable 'vp' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    1100 |                 if (ret)
         |                     ^~~
   drivers/hv/mshv_root_main.c:1177:41: note: uninitialized use occurs here
    1177 |         trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
         |                                                ^~
   drivers/hv/mshv_root_main.c:1100:3: note: remove the 'if' if its condition is always false
    1100 |                 if (ret)
         |                 ^~~~~~~~
    1101 |                         goto unmap_register_page;
         |                         ~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:1091:7: warning: variable 'vp' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    1091 |                 if (ret)
         |                     ^~~
   drivers/hv/mshv_root_main.c:1177:41: note: uninitialized use occurs here
    1177 |         trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
         |                                                ^~
   drivers/hv/mshv_root_main.c:1091:3: note: remove the 'if' if its condition is always false
    1091 |                 if (ret)
         |                 ^~~~~~~~
    1092 |                         goto unmap_intercept_message_page;
         |                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:1084:6: warning: variable 'vp' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    1084 |         if (ret)
         |             ^~~
   drivers/hv/mshv_root_main.c:1177:41: note: uninitialized use occurs here
    1177 |         trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
         |                                                ^~
   drivers/hv/mshv_root_main.c:1084:2: note: remove the 'if' if its condition is always false
    1084 |         if (ret)
         |         ^~~~~~~~
    1085 |                 goto destroy_vp;
         |                 ~~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:1062:20: note: initialize the variable 'vp' to silence this warning
    1062 |         struct mshv_vp *vp;
         |                           ^
         |                            = NULL
   4 warnings generated.


vim +1106 drivers/hv/mshv_root_main.c

621191d709b1488 Nuno Das Neves        2025-03-14  1056  
621191d709b1488 Nuno Das Neves        2025-03-14  1057  static long
621191d709b1488 Nuno Das Neves        2025-03-14  1058  mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
621191d709b1488 Nuno Das Neves        2025-03-14  1059  			       void __user *arg)
621191d709b1488 Nuno Das Neves        2025-03-14  1060  {
621191d709b1488 Nuno Das Neves        2025-03-14  1061  	struct mshv_create_vp args;
621191d709b1488 Nuno Das Neves        2025-03-14  1062  	struct mshv_vp *vp;
19c515c27cee3bb Jinank Jain           2025-10-10  1063  	struct page *intercept_msg_page, *register_page, *ghcb_page;
2de4516aa8f7269 Stanislav Kinsburskii 2026-01-28  1064  	struct hv_stats_page *stats_pages[2];
621191d709b1488 Nuno Das Neves        2025-03-14  1065  	long ret;
621191d709b1488 Nuno Das Neves        2025-03-14  1066  
621191d709b1488 Nuno Das Neves        2025-03-14  1067  	if (copy_from_user(&args, arg, sizeof(args)))
621191d709b1488 Nuno Das Neves        2025-03-14  1068  		return -EFAULT;
621191d709b1488 Nuno Das Neves        2025-03-14  1069  
621191d709b1488 Nuno Das Neves        2025-03-14  1070  	if (args.vp_index >= MSHV_MAX_VPS)
621191d709b1488 Nuno Das Neves        2025-03-14  1071  		return -EINVAL;
621191d709b1488 Nuno Das Neves        2025-03-14  1072  
621191d709b1488 Nuno Das Neves        2025-03-14  1073  	if (partition->pt_vp_array[args.vp_index])
621191d709b1488 Nuno Das Neves        2025-03-14  1074  		return -EEXIST;
621191d709b1488 Nuno Das Neves        2025-03-14  1075  
621191d709b1488 Nuno Das Neves        2025-03-14  1076  	ret = hv_call_create_vp(NUMA_NO_NODE, partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1077  				0 /* Only valid for root partition VPs */);
621191d709b1488 Nuno Das Neves        2025-03-14  1078  	if (ret)
621191d709b1488 Nuno Das Neves        2025-03-14  1079  		return ret;
621191d709b1488 Nuno Das Neves        2025-03-14  1080  
19c515c27cee3bb Jinank Jain           2025-10-10  1081  	ret = hv_map_vp_state_page(partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1082  				   HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
19c515c27cee3bb Jinank Jain           2025-10-10  1083  				   input_vtl_zero, &intercept_msg_page);
621191d709b1488 Nuno Das Neves        2025-03-14  1084  	if (ret)
621191d709b1488 Nuno Das Neves        2025-03-14  1085  		goto destroy_vp;
621191d709b1488 Nuno Das Neves        2025-03-14  1086  
621191d709b1488 Nuno Das Neves        2025-03-14  1087  	if (!mshv_partition_encrypted(partition)) {
19c515c27cee3bb Jinank Jain           2025-10-10  1088  		ret = hv_map_vp_state_page(partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1089  					   HV_VP_STATE_PAGE_REGISTERS,
19c515c27cee3bb Jinank Jain           2025-10-10  1090  					   input_vtl_zero, &register_page);
621191d709b1488 Nuno Das Neves        2025-03-14  1091  		if (ret)
621191d709b1488 Nuno Das Neves        2025-03-14  1092  			goto unmap_intercept_message_page;
621191d709b1488 Nuno Das Neves        2025-03-14  1093  	}
621191d709b1488 Nuno Das Neves        2025-03-14  1094  
621191d709b1488 Nuno Das Neves        2025-03-14  1095  	if (mshv_partition_encrypted(partition) &&
621191d709b1488 Nuno Das Neves        2025-03-14  1096  	    is_ghcb_mapping_available()) {
19c515c27cee3bb Jinank Jain           2025-10-10  1097  		ret = hv_map_vp_state_page(partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1098  					   HV_VP_STATE_PAGE_GHCB,
19c515c27cee3bb Jinank Jain           2025-10-10  1099  					   input_vtl_normal, &ghcb_page);
621191d709b1488 Nuno Das Neves        2025-03-14  1100  		if (ret)
621191d709b1488 Nuno Das Neves        2025-03-14  1101  			goto unmap_register_page;
621191d709b1488 Nuno Das Neves        2025-03-14  1102  	}
621191d709b1488 Nuno Das Neves        2025-03-14  1103  
621191d709b1488 Nuno Das Neves        2025-03-14  1104  	ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1105  				stats_pages);
621191d709b1488 Nuno Das Neves        2025-03-14 @1106  	if (ret)
621191d709b1488 Nuno Das Neves        2025-03-14  1107  		goto unmap_ghcb_page;
621191d709b1488 Nuno Das Neves        2025-03-14  1108  
bf4afc53b77aeaa Linus Torvalds        2026-02-21  1109  	vp = kzalloc_obj(*vp);
621191d709b1488 Nuno Das Neves        2025-03-14  1110  	if (!vp)
621191d709b1488 Nuno Das Neves        2025-03-14  1111  		goto unmap_stats_pages;
621191d709b1488 Nuno Das Neves        2025-03-14  1112  
621191d709b1488 Nuno Das Neves        2025-03-14  1113  	vp->vp_partition = mshv_partition_get(partition);
621191d709b1488 Nuno Das Neves        2025-03-14  1114  	if (!vp->vp_partition) {
621191d709b1488 Nuno Das Neves        2025-03-14  1115  		ret = -EBADF;
621191d709b1488 Nuno Das Neves        2025-03-14  1116  		goto free_vp;
621191d709b1488 Nuno Das Neves        2025-03-14  1117  	}
621191d709b1488 Nuno Das Neves        2025-03-14  1118  
621191d709b1488 Nuno Das Neves        2025-03-14  1119  	mutex_init(&vp->vp_mutex);
621191d709b1488 Nuno Das Neves        2025-03-14  1120  	init_waitqueue_head(&vp->run.vp_suspend_queue);
621191d709b1488 Nuno Das Neves        2025-03-14  1121  	atomic64_set(&vp->run.vp_signaled_count, 0);
621191d709b1488 Nuno Das Neves        2025-03-14  1122  
621191d709b1488 Nuno Das Neves        2025-03-14  1123  	vp->vp_index = args.vp_index;
19c515c27cee3bb Jinank Jain           2025-10-10  1124  	vp->vp_intercept_msg_page = page_to_virt(intercept_msg_page);
621191d709b1488 Nuno Das Neves        2025-03-14  1125  	if (!mshv_partition_encrypted(partition))
621191d709b1488 Nuno Das Neves        2025-03-14  1126  		vp->vp_register_page = page_to_virt(register_page);
621191d709b1488 Nuno Das Neves        2025-03-14  1127  
621191d709b1488 Nuno Das Neves        2025-03-14  1128  	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
621191d709b1488 Nuno Das Neves        2025-03-14  1129  		vp->vp_ghcb_page = page_to_virt(ghcb_page);
621191d709b1488 Nuno Das Neves        2025-03-14  1130  
621191d709b1488 Nuno Das Neves        2025-03-14  1131  	memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
621191d709b1488 Nuno Das Neves        2025-03-14  1132  
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1133  	ret = mshv_debugfs_vp_create(vp);
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1134  	if (ret)
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1135  		goto put_partition;
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1136  
621191d709b1488 Nuno Das Neves        2025-03-14  1137  	/*
621191d709b1488 Nuno Das Neves        2025-03-14  1138  	 * Keep anon_inode_getfd last: it installs fd in the file struct and
621191d709b1488 Nuno Das Neves        2025-03-14  1139  	 * thus makes the state accessible in user space.
621191d709b1488 Nuno Das Neves        2025-03-14  1140  	 */
621191d709b1488 Nuno Das Neves        2025-03-14  1141  	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
621191d709b1488 Nuno Das Neves        2025-03-14  1142  			       O_RDWR | O_CLOEXEC);
621191d709b1488 Nuno Das Neves        2025-03-14  1143  	if (ret < 0)
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1144  		goto remove_debugfs_vp;
621191d709b1488 Nuno Das Neves        2025-03-14  1145  
621191d709b1488 Nuno Das Neves        2025-03-14  1146  	/* already exclusive with the partition mutex for all ioctls */
621191d709b1488 Nuno Das Neves        2025-03-14  1147  	partition->pt_vp_count++;
621191d709b1488 Nuno Das Neves        2025-03-14  1148  	partition->pt_vp_array[args.vp_index] = vp;
621191d709b1488 Nuno Das Neves        2025-03-14  1149  
33c08ba966cf231 Stanislav Kinsburskii 2026-02-26  1150  	goto out;
621191d709b1488 Nuno Das Neves        2025-03-14  1151  
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1152  remove_debugfs_vp:
ff225ba9ad71c4c Nuno Das Neves        2026-01-28  1153  	mshv_debugfs_vp_remove(vp);
621191d709b1488 Nuno Das Neves        2025-03-14  1154  put_partition:
621191d709b1488 Nuno Das Neves        2025-03-14  1155  	mshv_partition_put(partition);
621191d709b1488 Nuno Das Neves        2025-03-14  1156  free_vp:
621191d709b1488 Nuno Das Neves        2025-03-14  1157  	kfree(vp);
621191d709b1488 Nuno Das Neves        2025-03-14  1158  unmap_stats_pages:
d62313bdf5961b5 Jinank Jain           2025-10-10  1159  	mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages);
621191d709b1488 Nuno Das Neves        2025-03-14  1160  unmap_ghcb_page:
19c515c27cee3bb Jinank Jain           2025-10-10  1161  	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
19c515c27cee3bb Jinank Jain           2025-10-10  1162  		hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
19c515c27cee3bb Jinank Jain           2025-10-10  1163  				       HV_VP_STATE_PAGE_GHCB, ghcb_page,
621191d709b1488 Nuno Das Neves        2025-03-14  1164  				       input_vtl_normal);
621191d709b1488 Nuno Das Neves        2025-03-14  1165  unmap_register_page:
19c515c27cee3bb Jinank Jain           2025-10-10  1166  	if (!mshv_partition_encrypted(partition))
19c515c27cee3bb Jinank Jain           2025-10-10  1167  		hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1168  				       HV_VP_STATE_PAGE_REGISTERS,
19c515c27cee3bb Jinank Jain           2025-10-10  1169  				       register_page, input_vtl_zero);
621191d709b1488 Nuno Das Neves        2025-03-14  1170  unmap_intercept_message_page:
19c515c27cee3bb Jinank Jain           2025-10-10  1171  	hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
621191d709b1488 Nuno Das Neves        2025-03-14  1172  			       HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
19c515c27cee3bb Jinank Jain           2025-10-10  1173  			       intercept_msg_page, input_vtl_zero);
621191d709b1488 Nuno Das Neves        2025-03-14  1174  destroy_vp:
621191d709b1488 Nuno Das Neves        2025-03-14  1175  	hv_call_delete_vp(partition->pt_id, args.vp_index);
33c08ba966cf231 Stanislav Kinsburskii 2026-02-26  1176  out:
33c08ba966cf231 Stanislav Kinsburskii 2026-02-26  1177  	trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
621191d709b1488 Nuno Das Neves        2025-03-14  1178  	return ret;
621191d709b1488 Nuno Das Neves        2025-03-14  1179  }
621191d709b1488 Nuno Das Neves        2025-03-14  1180  
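
As the listing shows, every early "goto" reaches the "out" label before
vp has been assigned, so vp->vp_index is read from an uninitialized
pointer. Initializing vp to NULL would silence the warning but would
then dereference NULL on those same paths. One possible way out, shown
here as a user-space sketch rather than a proposed patch, is to trace
the index that was requested (args.vp_index in the listing) instead of
going through vp. All names below are hypothetical stand-ins, and the
error values are arbitrary.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct mshv_vp. */
struct vp {
	unsigned int vp_index;
};

/* Records what the (stand-in) tracepoint last saw. */
static unsigned int last_traced_index;
static long last_traced_ret;

/* Hypothetical stand-in for trace_mshv_create_vp(). */
static void trace_create_vp(unsigned int vp_index, long ret)
{
	last_traced_index = vp_index;
	last_traced_ret = ret;
}

static long create_vp(unsigned int requested_index, int fail_early)
{
	struct vp *vp;
	long ret = 0;

	if (fail_early) {
		ret = -5;	/* arbitrary error; vp is not assigned yet */
		goto out;
	}

	vp = calloc(1, sizeof(*vp));
	if (!vp) {
		ret = -12;	/* arbitrary error; vp is NULL here */
		goto out;
	}
	vp->vp_index = requested_index;

	/* The real code keeps vp; the sketch frees it to avoid a leak. */
	free(vp);
out:
	/* Uses the requested index, which is valid on every path. */
	trace_create_vp(requested_index, ret);
	return ret;
}
```

Because the tracepoint argument no longer depends on vp, the compiler
has no uninitialized use to report, and the trace still carries the
index on failure.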

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
From: Dipayaan Roy @ 2026-02-27  8:07 UTC (permalink / raw)
  To: Simon Horman
  Cc: kuba, decui, pabeni, dipayanroy, kotaranov, linux-kernel,
	edumazet, davem, linux-rdma, shradhagupta, kys, ssengar, haiyangz,
	linux-hyperv, wei.liu, netdev, longli, ernis, shirazsaleem,
	andrew+netdev
In-Reply-To: <20260224082641.130868-1-horms@kernel.org>

On Tue, Feb 24, 2026 at 08:26:41AM +0000, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
> 
> The GF stats periodic query is used as a mechanism to monitor HWC
> health. If this HWC command times out, it is a strong indication that
> the device/SoC is in a faulty state and requires recovery.
> 
> Today, when a timeout is detected, the driver marks
> hwc_timeout_occurred, clears cached stats, and stops rescheduling the
> periodic work. However, the device itself is left in the same failing
> state.
> 
> Extend the timeout handling path to trigger the existing MANA VF
> recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
> This is expected to initiate the appropriate recovery flow by suspende
> resume first and if it fails then trigger a bus rescan.
> 
> This change is intentionally limited to HWC command timeouts and does
> not trigger recovery for errors reported by the SoC as a normal command
> response.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 0055c231acf6..16c438d2aaa3 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> 
> [ ... ]
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9b5a72ada5c4..e6a4034b40f9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> 
> [ ... ]
> 
> > @@ -3530,6 +3530,8 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
> >  {
> >  	struct mana_context *ac =
> >  		container_of(to_delayed_work(work), struct mana_context, gf_stats_work);
> > +	struct gdma_context *gc = ac->gdma_dev->gdma_context;
> > +	struct mana_serv_work *mns_wk;
> >  	int err;
> >
> >  	err = mana_query_gf_stats(ac);
> > @@ -3537,6 +3539,30 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
> >  		/* HWC timeout detected - reset stats and stop rescheduling */
> >  		ac->hwc_timeout_occurred = true;
> >  		memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
> > +		dev_warn(gc->dev,
> > +			 "Gf stats wk handler: gf stats query timed out.\n");
> > +
> > +		/* As HWC timed out, indicating a faulty HW state and needs a
> > +		 * reset.
> > +		 */
> > +		if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
> > +			if (!try_module_get(THIS_MODULE)) {
> > +				dev_info(gc->dev, "Module is unloading\n");
> > +				return;
> > +			}
> > +
> > +			mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
> > +			if (!mns_wk) {
> > +				module_put(THIS_MODULE);
> > +				return;
> > +			}
> 
> Does this leave the GC_IN_SERVICE flag permanently set? When
> test_and_set_bit() atomically sets the bit and then try_module_get()
> fails or kzalloc() fails, the function returns without clearing
> GC_IN_SERVICE.
> 
> This would permanently block all future recovery attempts. Looking at
> mana_gd_process_eqe() in gdma_main.c:
>
ACK. 
> 	if (test_bit(GC_IN_SERVICE, &gc->flags)) {
> 		dev_info(gc->dev, "Already in service\n");
> 		break;
> 	}
> 
> It would detect the set bit and skip the reset. Similarly, mana_tx_timeout()
> in mana_en.c would also short-circuit:
> 
> 	if (test_bit(GC_IN_SERVICE, &gc->flags))
> 		return;
> 
> The device cannot recover without manual intervention since the flag is only
> cleared by mana_serv_reset() at the end of the service work:
> 
> drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_serv_reset()
> {
> 	...
> out:
> 	clear_bit(GC_IN_SERVICE, &gc->flags);
> }
> 
> But since the work is never scheduled in these error paths, clear_bit() is
> never called.
> 
> Note that the existing mana_gd_process_eqe() path avoids this problem by
> calling set_bit(GC_IN_SERVICE) only AFTER all allocations succeed:
> 
> drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_gd_process_eqe()
> {
> 	...
> 	if (test_bit(GC_IN_SERVICE, &gc->flags)) {
> 		dev_info(gc->dev, "Already in service\n");
> 		break;
> 	}
> 
> 	if (!try_module_get(THIS_MODULE)) {
> 		...
> 		break;
> 	}
> 
> 	mns_wk = kzalloc(sizeof(*mns_wk), GFP_KERNEL);
> 	if (!mns_wk) {
> 		module_put(THIS_MODULE);
> 		break;
> 	}
> 
> 	dev_info(gc->dev, "Start MANA service type:%d\n", type);
> 	set_bit(GC_IN_SERVICE, &gc->flags);  // Set only after success
> 	...
> }
> 
> Should clear_bit(GC_IN_SERVICE, &gc->flags) be added before each error
> return?
> 
ACK.
> > +
> > +			mns_wk->pdev = to_pci_dev(gc->dev);
> > +			mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST;
> > +			pci_dev_get(mns_wk->pdev);
> > +			INIT_WORK(&mns_wk->serv_work, mana_serv_func);
> > +			schedule_work(&mns_wk->serv_work);
> > +		}
> 
> This isn't a bug, but the recovery initiation sequence here is a
> near-duplicate of the pattern in mana_gd_process_eqe() (gdma_main.c).
> Both perform the identical 7-step sequence: try_module_get, kzalloc
> mana_serv_work, set pdev, set type, pci_dev_get, INIT_WORK, schedule_work.
> 
> The duplication already contributed to the GC_IN_SERVICE error path issue
> above, where this new copy diverged from the original in its error handling.
> 
> Consider extracting a common helper to avoid divergence.
> 
ACK.
> >  		return;
> >  	}
> >  	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
> 
> [ ... ]

Thanks Simon, I am addressing these in v2.
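
The error-path concern raised above comes down to one rule: whoever wins
the atomic claim on the flag must either hand it over to the service work
or release it again. Below is a minimal userspace sketch of that rule. It
assumes C11 atomics as a stand-in for the kernel's test_and_set_bit() and
clear_bit(); struct demo_ctx, demo_schedule_serv_work() and
demo_serv_done() are hypothetical names used for illustration only and are
not part of the MANA driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for gdma_context; only the service flag is modelled. */
struct demo_ctx {
	atomic_bool in_service;	/* plays the role of GC_IN_SERVICE */
	bool module_alive;	/* simulates the result of try_module_get() */
	bool alloc_ok;		/* simulates whether the allocation succeeds */
	int scheduled;		/* number of service work items queued */
};

/* Claim the flag atomically and release it on every failure path. */
int demo_schedule_serv_work(struct demo_ctx *ctx)
{
	assert(ctx);

	/* atomic_exchange() returns the old value: true means already claimed. */
	if (atomic_exchange(&ctx->in_service, true))
		return -EBUSY;

	if (!ctx->module_alive) {
		atomic_store(&ctx->in_service, false);	/* roll back the claim */
		return -ENODEV;
	}

	if (!ctx->alloc_ok) {
		atomic_store(&ctx->in_service, false);	/* roll back the claim */
		return -ENOMEM;
	}

	/* Success: ownership of the flag passes to the queued work. */
	ctx->scheduled++;
	return 0;
}

/* Called at the end of the service work, mirroring the clear in mana_serv_reset(). */
void demo_serv_done(struct demo_ctx *ctx)
{
	atomic_store(&ctx->in_service, false);
}
```

In this sketch a failed allocation returns -ENOMEM and leaves the flag
clear, so a later call can still claim it, while a second call made during
a pending service returns -EBUSY.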

Regards

^ permalink raw reply

* Re: [PATCH, net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
From: Dipayaan Roy @ 2026-02-27  8:10 UTC (permalink / raw)
  To: Long Li
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
	horms@kernel.org, shradhagupta@linux.microsoft.com,
	ssengar@linux.microsoft.com, ernis@linux.microsoft.com,
	Shiraz Saleem, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, Dipayaan Roy
In-Reply-To: <DS3PR21MB5735F00E300CB4B7E54DA710CE72A@DS3PR21MB5735.namprd21.prod.outlook.com>

On Thu, Feb 26, 2026 at 07:48:31PM +0000, Long Li wrote:
> > The GF stats periodic query is used as mechanism to monitor HWC health check.
> > If this HWC command times out, it is a strong indication that the device/SoC is in a
> > faulty state and requires recovery.
> > 
> > Today, when a timeout is detected, the driver marks hwc_timeout_occurred,
> > clears cached stats, and stops rescheduling the periodic work. However, the
> > device itself is left in the same failing state.
> > 
> > Extend the timeout handling path to trigger the existing MANA VF recovery
> > service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
> > This is expected to initiate the appropriate recovery flow by suspende resume
> > first and if it fails then trigger a bus rescan.
> > 
> > This change is intentionally limited to HWC command timeouts and does not
> > trigger recovery for errors reported by the SoC as a normal command response.
> > 
> > Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 14 +++-------
> >  drivers/net/ethernet/microsoft/mana/mana_en.c | 28 ++++++++++++++++++-
> >  include/net/mana/gdma.h                       | 16 +++++++++--
> >  3 files changed, 45 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 0055c231acf6..16c438d2aaa3 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -490,15 +490,9 @@ static void mana_serv_reset(struct pci_dev *pdev)
> >  		dev_info(&pdev->dev, "MANA reset cycle completed\n");
> > 
> >  out:
> > -	gc->in_service = false;
> > +	clear_bit(GC_IN_SERVICE, &gc->flags);
> >  }
> > 
> > -struct mana_serv_work {
> > -	struct work_struct serv_work;
> > -	struct pci_dev *pdev;
> > -	enum gdma_eqe_type type;
> > -};
> > -
> >  static void mana_do_service(enum gdma_eqe_type type, struct pci_dev *pdev)
> > {
> >  	switch (type) {
> > @@ -542,7 +536,7 @@ static void mana_recovery_delayed_func(struct
> > work_struct *w)
> >  	spin_unlock_irqrestore(&work->lock, flags);  }
> > 
> > -static void mana_serv_func(struct work_struct *w)
> > +void mana_serv_func(struct work_struct *w)
> >  {
> >  	struct mana_serv_work *mns_wk;
> >  	struct pci_dev *pdev;
> > @@ -624,7 +618,7 @@ static void mana_gd_process_eqe(struct gdma_queue
> > *eq)
> >  			break;
> >  		}
> > 
> > -		if (gc->in_service) {
> > +		if (test_bit(GC_IN_SERVICE, &gc->flags)) {
> >  			dev_info(gc->dev, "Already in service\n");
> >  			break;
> >  		}
> > @@ -641,7 +635,7 @@ static void mana_gd_process_eqe(struct gdma_queue
> > *eq)
> >  		}
> > 
> >  		dev_info(gc->dev, "Start MANA service type:%d\n", type);
> > -		gc->in_service = true;
> > +		set_bit(GC_IN_SERVICE, &gc->flags);
> >  		mns_wk->pdev = to_pci_dev(gc->dev);
> >  		mns_wk->type = type;
> >  		pci_dev_get(mns_wk->pdev);
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 91c418097284..8da574cf06f2 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -879,7 +879,7 @@ static void mana_tx_timeout(struct net_device *netdev,
> > unsigned int txqueue)
> >  	struct gdma_context *gc = ac->gdma_dev->gdma_context;
> > 
> >  	/* Already in service, hence tx queue reset is not required.*/
> > -	if (gc->in_service)
> > +	if (test_bit(GC_IN_SERVICE, &gc->flags))
> >  		return;
> > 
> >  	/* Note: If there are pending queue reset work for this port(apc), @@ -
> > 3533,6 +3533,8 @@ static void mana_gf_stats_work_handler(struct work_struct
> > *work)  {
> >  	struct mana_context *ac =
> >  		container_of(to_delayed_work(work), struct mana_context,
> > gf_stats_work);
> > +	struct gdma_context *gc = ac->gdma_dev->gdma_context;
> > +	struct mana_serv_work *mns_wk;
> >  	int err;
> > 
> >  	err = mana_query_gf_stats(ac);
> > @@ -3540,6 +3542,30 @@ static void mana_gf_stats_work_handler(struct
> > work_struct *work)
> >  		/* HWC timeout detected - reset stats and stop rescheduling */
> >  		ac->hwc_timeout_occurred = true;
> >  		memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
> > +		dev_warn(gc->dev,
> > +			 "Gf stats wk handler: gf stats query timed out.\n");
> > +
> > +		/* As HWC timed out, indicating a faulty HW state and needs a
> > +		 * reset.
> > +		 */
> > +		if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
> > +			if (!try_module_get(THIS_MODULE)) {
> > +				dev_info(gc->dev, "Module is unloading\n");
> > +				return;
> > +			}
> > +
> > +			mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
> > +			if (!mns_wk) {
> > +				module_put(THIS_MODULE);
> 
> Maybe it's not necessary: check if you want to call  clear_bit(GC_IN_SERVICE, &gc->flags) here?
>
Yes, it makes sense to clear it here.
> > +				return;
> > +			}
> > +
> > +			mns_wk->pdev = to_pci_dev(gc->dev);
> > +			mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST;
> > +			pci_dev_get(mns_wk->pdev);
> > +			INIT_WORK(&mns_wk->serv_work, mana_serv_func);
> > +			schedule_work(&mns_wk->serv_work);
> > +		}
> >  		return;
> >  	}
> >  	schedule_delayed_work(&ac->gf_stats_work,
> > MANA_GF_STATS_PERIOD); diff --git a/include/net/mana/gdma.h
> 

Regards


^ permalink raw reply

* Re: [PATCH] mshv: Introduce tracing support
From: Dan Carpenter @ 2026-02-27  8:11 UTC (permalink / raw)
  To: oe-kbuild, Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui,
	longli
  Cc: lkp, oe-kbuild-all, linux-hyperv, linux-kernel
In-Reply-To: <177213348504.92223.5330421592610811972.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Hi Stanislav,

kernel test robot noticed the following build warnings:

https://git-scm.com/docs/git-format-patch#_base_tree_information

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/mshv-Introduce-tracing-support/20260227-031942
base:   linus/master
patch link:    https://lore.kernel.org/r/177213348504.92223.5330421592610811972.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH] mshv: Introduce tracing support
config: x86_64-randconfig-161-20260227 (https://download.01.org/0day-ci/archive/20260227/202602271528.jLhA59mn-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
smatch version: v0.5.0-8994-gd50c5a4c

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202602271528.jLhA59mn-lkp@intel.com/

New smatch warnings:
drivers/hv/mshv_root_main.c:1177 mshv_partition_ioctl_create_vp() error: we previously assumed 'vp' could be null (see line 1110)
drivers/hv/mshv_root_main.c:1177 mshv_partition_ioctl_create_vp() error: dereferencing freed memory 'vp' (line 1157)

vim +/vp +1177 drivers/hv/mshv_root_main.c

621191d709b148 Nuno Das Neves        2025-03-14  1057  static long
621191d709b148 Nuno Das Neves        2025-03-14  1058  mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
621191d709b148 Nuno Das Neves        2025-03-14  1059  			       void __user *arg)
621191d709b148 Nuno Das Neves        2025-03-14  1060  {
621191d709b148 Nuno Das Neves        2025-03-14  1061  	struct mshv_create_vp args;
621191d709b148 Nuno Das Neves        2025-03-14  1062  	struct mshv_vp *vp;
19c515c27cee3b Jinank Jain           2025-10-10  1063  	struct page *intercept_msg_page, *register_page, *ghcb_page;
2de4516aa8f726 Stanislav Kinsburskii 2026-01-28  1064  	struct hv_stats_page *stats_pages[2];
621191d709b148 Nuno Das Neves        2025-03-14  1065  	long ret;
621191d709b148 Nuno Das Neves        2025-03-14  1066  
621191d709b148 Nuno Das Neves        2025-03-14  1067  	if (copy_from_user(&args, arg, sizeof(args)))
621191d709b148 Nuno Das Neves        2025-03-14  1068  		return -EFAULT;
621191d709b148 Nuno Das Neves        2025-03-14  1069  
621191d709b148 Nuno Das Neves        2025-03-14  1070  	if (args.vp_index >= MSHV_MAX_VPS)
621191d709b148 Nuno Das Neves        2025-03-14  1071  		return -EINVAL;
621191d709b148 Nuno Das Neves        2025-03-14  1072  
621191d709b148 Nuno Das Neves        2025-03-14  1073  	if (partition->pt_vp_array[args.vp_index])
621191d709b148 Nuno Das Neves        2025-03-14  1074  		return -EEXIST;
621191d709b148 Nuno Das Neves        2025-03-14  1075  
621191d709b148 Nuno Das Neves        2025-03-14  1076  	ret = hv_call_create_vp(NUMA_NO_NODE, partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1077  				0 /* Only valid for root partition VPs */);
621191d709b148 Nuno Das Neves        2025-03-14  1078  	if (ret)
621191d709b148 Nuno Das Neves        2025-03-14  1079  		return ret;
621191d709b148 Nuno Das Neves        2025-03-14  1080  
19c515c27cee3b Jinank Jain           2025-10-10  1081  	ret = hv_map_vp_state_page(partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1082  				   HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
19c515c27cee3b Jinank Jain           2025-10-10  1083  				   input_vtl_zero, &intercept_msg_page);
621191d709b148 Nuno Das Neves        2025-03-14  1084  	if (ret)
621191d709b148 Nuno Das Neves        2025-03-14  1085  		goto destroy_vp;
621191d709b148 Nuno Das Neves        2025-03-14  1086  
621191d709b148 Nuno Das Neves        2025-03-14  1087  	if (!mshv_partition_encrypted(partition)) {
19c515c27cee3b Jinank Jain           2025-10-10  1088  		ret = hv_map_vp_state_page(partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1089  					   HV_VP_STATE_PAGE_REGISTERS,
19c515c27cee3b Jinank Jain           2025-10-10  1090  					   input_vtl_zero, &register_page);
621191d709b148 Nuno Das Neves        2025-03-14  1091  		if (ret)
621191d709b148 Nuno Das Neves        2025-03-14  1092  			goto unmap_intercept_message_page;
621191d709b148 Nuno Das Neves        2025-03-14  1093  	}
621191d709b148 Nuno Das Neves        2025-03-14  1094  
621191d709b148 Nuno Das Neves        2025-03-14  1095  	if (mshv_partition_encrypted(partition) &&
621191d709b148 Nuno Das Neves        2025-03-14  1096  	    is_ghcb_mapping_available()) {
19c515c27cee3b Jinank Jain           2025-10-10  1097  		ret = hv_map_vp_state_page(partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1098  					   HV_VP_STATE_PAGE_GHCB,
19c515c27cee3b Jinank Jain           2025-10-10  1099  					   input_vtl_normal, &ghcb_page);
621191d709b148 Nuno Das Neves        2025-03-14  1100  		if (ret)
621191d709b148 Nuno Das Neves        2025-03-14  1101  			goto unmap_register_page;
621191d709b148 Nuno Das Neves        2025-03-14  1102  	}
621191d709b148 Nuno Das Neves        2025-03-14  1103  
621191d709b148 Nuno Das Neves        2025-03-14  1104  	ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1105  				stats_pages);
621191d709b148 Nuno Das Neves        2025-03-14  1106  	if (ret)
621191d709b148 Nuno Das Neves        2025-03-14  1107  		goto unmap_ghcb_page;
621191d709b148 Nuno Das Neves        2025-03-14  1108  
bf4afc53b77aea Linus Torvalds        2026-02-21  1109  	vp = kzalloc_obj(*vp);
621191d709b148 Nuno Das Neves        2025-03-14 @1110  	if (!vp)
621191d709b148 Nuno Das Neves        2025-03-14  1111  		goto unmap_stats_pages;

vp is NULL

621191d709b148 Nuno Das Neves        2025-03-14  1112  
621191d709b148 Nuno Das Neves        2025-03-14  1113  	vp->vp_partition = mshv_partition_get(partition);
621191d709b148 Nuno Das Neves        2025-03-14  1114  	if (!vp->vp_partition) {
621191d709b148 Nuno Das Neves        2025-03-14  1115  		ret = -EBADF;
621191d709b148 Nuno Das Neves        2025-03-14  1116  		goto free_vp;
621191d709b148 Nuno Das Neves        2025-03-14  1117  	}
621191d709b148 Nuno Das Neves        2025-03-14  1118  
621191d709b148 Nuno Das Neves        2025-03-14  1119  	mutex_init(&vp->vp_mutex);
621191d709b148 Nuno Das Neves        2025-03-14  1120  	init_waitqueue_head(&vp->run.vp_suspend_queue);
621191d709b148 Nuno Das Neves        2025-03-14  1121  	atomic64_set(&vp->run.vp_signaled_count, 0);
621191d709b148 Nuno Das Neves        2025-03-14  1122  
621191d709b148 Nuno Das Neves        2025-03-14  1123  	vp->vp_index = args.vp_index;
19c515c27cee3b Jinank Jain           2025-10-10  1124  	vp->vp_intercept_msg_page = page_to_virt(intercept_msg_page);
621191d709b148 Nuno Das Neves        2025-03-14  1125  	if (!mshv_partition_encrypted(partition))
621191d709b148 Nuno Das Neves        2025-03-14  1126  		vp->vp_register_page = page_to_virt(register_page);
621191d709b148 Nuno Das Neves        2025-03-14  1127  
621191d709b148 Nuno Das Neves        2025-03-14  1128  	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
621191d709b148 Nuno Das Neves        2025-03-14  1129  		vp->vp_ghcb_page = page_to_virt(ghcb_page);
621191d709b148 Nuno Das Neves        2025-03-14  1130  
621191d709b148 Nuno Das Neves        2025-03-14  1131  	memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
621191d709b148 Nuno Das Neves        2025-03-14  1132  
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1133  	ret = mshv_debugfs_vp_create(vp);
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1134  	if (ret)
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1135  		goto put_partition;
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1136  
621191d709b148 Nuno Das Neves        2025-03-14  1137  	/*
621191d709b148 Nuno Das Neves        2025-03-14  1138  	 * Keep anon_inode_getfd last: it installs fd in the file struct and
621191d709b148 Nuno Das Neves        2025-03-14  1139  	 * thus makes the state accessible in user space.
621191d709b148 Nuno Das Neves        2025-03-14  1140  	 */
621191d709b148 Nuno Das Neves        2025-03-14  1141  	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
621191d709b148 Nuno Das Neves        2025-03-14  1142  			       O_RDWR | O_CLOEXEC);
621191d709b148 Nuno Das Neves        2025-03-14  1143  	if (ret < 0)
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1144  		goto remove_debugfs_vp;
621191d709b148 Nuno Das Neves        2025-03-14  1145  
621191d709b148 Nuno Das Neves        2025-03-14  1146  	/* already exclusive with the partition mutex for all ioctls */
621191d709b148 Nuno Das Neves        2025-03-14  1147  	partition->pt_vp_count++;
621191d709b148 Nuno Das Neves        2025-03-14  1148  	partition->pt_vp_array[args.vp_index] = vp;
621191d709b148 Nuno Das Neves        2025-03-14  1149  
33c08ba966cf23 Stanislav Kinsburskii 2026-02-26  1150  	goto out;
621191d709b148 Nuno Das Neves        2025-03-14  1151  
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1152  remove_debugfs_vp:
ff225ba9ad71c4 Nuno Das Neves        2026-01-28  1153  	mshv_debugfs_vp_remove(vp);
621191d709b148 Nuno Das Neves        2025-03-14  1154  put_partition:
621191d709b148 Nuno Das Neves        2025-03-14  1155  	mshv_partition_put(partition);
621191d709b148 Nuno Das Neves        2025-03-14  1156  free_vp:
621191d709b148 Nuno Das Neves        2025-03-14 @1157  	kfree(vp);
                                                              ^^
freed.

621191d709b148 Nuno Das Neves        2025-03-14  1158  unmap_stats_pages:
d62313bdf5961b Jinank Jain           2025-10-10  1159  	mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages);
621191d709b148 Nuno Das Neves        2025-03-14  1160  unmap_ghcb_page:
19c515c27cee3b Jinank Jain           2025-10-10  1161  	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
19c515c27cee3b Jinank Jain           2025-10-10  1162  		hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
19c515c27cee3b Jinank Jain           2025-10-10  1163  				       HV_VP_STATE_PAGE_GHCB, ghcb_page,
621191d709b148 Nuno Das Neves        2025-03-14  1164  				       input_vtl_normal);
621191d709b148 Nuno Das Neves        2025-03-14  1165  unmap_register_page:
19c515c27cee3b Jinank Jain           2025-10-10  1166  	if (!mshv_partition_encrypted(partition))
19c515c27cee3b Jinank Jain           2025-10-10  1167  		hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1168  				       HV_VP_STATE_PAGE_REGISTERS,
19c515c27cee3b Jinank Jain           2025-10-10  1169  				       register_page, input_vtl_zero);
621191d709b148 Nuno Das Neves        2025-03-14  1170  unmap_intercept_message_page:
19c515c27cee3b Jinank Jain           2025-10-10  1171  	hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
621191d709b148 Nuno Das Neves        2025-03-14  1172  			       HV_VP_STATE_PAGE_INTERCEPT_MESSAGE,
19c515c27cee3b Jinank Jain           2025-10-10  1173  			       intercept_msg_page, input_vtl_zero);
621191d709b148 Nuno Das Neves        2025-03-14  1174  destroy_vp:
621191d709b148 Nuno Das Neves        2025-03-14  1175  	hv_call_delete_vp(partition->pt_id, args.vp_index);
33c08ba966cf23 Stanislav Kinsburskii 2026-02-26  1176  out:
33c08ba966cf23 Stanislav Kinsburskii 2026-02-26 @1177  	trace_mshv_create_vp(partition->pt_id, vp->vp_index, ret);
                                                                                               ^^^^^^^^^^^^
vp dereferenced.

621191d709b148 Nuno Das Neves        2025-03-14  1178  	return ret;
621191d709b148 Nuno Das Neves        2025-03-14  1179  }
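
Both warnings have the same root cause: the shared "out:" label reads
vp->vp_index on paths where vp is NULL or has already been freed. One
possible way to avoid this, shown here as a minimal userspace sketch, is to
trace the index taken from the request arguments (the equivalent of
args.vp_index) rather than from the object. All demo_* names and the
fail_alloc/fail_late knobs are hypothetical and exist only for this
illustration; the error codes are the sketch's own choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_vp {
	unsigned int vp_index;
};

/* Hypothetical trace sink: remembers the last traced values. */
unsigned int last_traced_index;
long last_traced_ret;

void demo_trace_create_vp(unsigned int index, long ret)
{
	last_traced_index = index;
	last_traced_ret = ret;
}

/*
 * Create an object with goto-based unwinding and a single exit label.
 * fail_alloc and fail_late simulate an allocation failure and a failure
 * after the allocation, respectively.
 */
long demo_create_vp(unsigned int requested_index, int fail_alloc,
		    int fail_late, struct demo_vp **out)
{
	struct demo_vp *vp;
	long ret = 0;

	assert(out);
	*out = NULL;

	vp = fail_alloc ? NULL : calloc(1, sizeof(*vp));
	if (!vp) {
		ret = -ENOMEM;
		goto out;
	}

	vp->vp_index = requested_index;

	if (fail_late) {
		ret = -EBADF;
		goto free_vp;
	}

	*out = vp;
	goto out;

free_vp:
	free(vp);
out:
	/* Only values that are valid on every path are used here. */
	demo_trace_create_vp(requested_index, ret);
	return ret;
}
```

Because the exit label never touches vp, the same trace call is safe on the
success path, the allocation-failure path, and the path that frees vp.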

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply

* [PATCH net-next, v2] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
From: Dipayaan Roy @ 2026-02-27  8:15 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, dipayanroy

The GF stats periodic query is used as a mechanism to monitor HWC
health. If this HWC command times out, it is a strong indication that
the device/SoC is in a faulty state and requires recovery.

Today, when a timeout is detected, the driver marks
hwc_timeout_occurred, clears cached stats, and stops rescheduling the
periodic work. However, the device itself is left in the same failing
state.

Extend the timeout handling path to trigger the existing MANA VF
recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
This is expected to initiate the appropriate recovery flow by attempting
a suspend/resume first and, if that fails, triggering a bus rescan.

This change is intentionally limited to HWC command timeouts and does
not trigger recovery for errors reported by the SoC as a normal command
response.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
Changes in v2:
  - Added a common helper, mana_schedule_serv_work(), which clears
    GC_IN_SERVICE on its error paths.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 65 ++++++++++---------
 drivers/net/ethernet/microsoft/mana/mana_en.c |  9 ++-
 include/net/mana/gdma.h                       | 16 ++++-
 3 files changed, 55 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 37d2f108a839..aef8612b73cb 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -490,15 +490,9 @@ static void mana_serv_reset(struct pci_dev *pdev)
 		dev_info(&pdev->dev, "MANA reset cycle completed\n");
 
 out:
-	gc->in_service = false;
+	clear_bit(GC_IN_SERVICE, &gc->flags);
 }
 
-struct mana_serv_work {
-	struct work_struct serv_work;
-	struct pci_dev *pdev;
-	enum gdma_eqe_type type;
-};
-
 static void mana_do_service(enum gdma_eqe_type type, struct pci_dev *pdev)
 {
 	switch (type) {
@@ -558,12 +552,42 @@ static void mana_serv_func(struct work_struct *w)
 	module_put(THIS_MODULE);
 }
 
+int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type)
+{
+	struct mana_serv_work *mns_wk;
+
+	if (test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
+		dev_info(gc->dev, "Already in service\n");
+		return -EBUSY;
+	}
+
+	if (!try_module_get(THIS_MODULE)) {
+		dev_info(gc->dev, "Module is unloading\n");
+		clear_bit(GC_IN_SERVICE, &gc->flags);
+		return -ENODEV;
+	}
+
+	mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
+	if (!mns_wk) {
+		module_put(THIS_MODULE);
+		clear_bit(GC_IN_SERVICE, &gc->flags);
+		return -ENOMEM;
+	}
+
+	dev_info(gc->dev, "Start MANA service type:%d\n", type);
+	mns_wk->pdev = to_pci_dev(gc->dev);
+	mns_wk->type = type;
+	pci_dev_get(mns_wk->pdev);
+	INIT_WORK(&mns_wk->serv_work, mana_serv_func);
+	schedule_work(&mns_wk->serv_work);
+	return 0;
+}
+
 static void mana_gd_process_eqe(struct gdma_queue *eq)
 {
 	u32 head = eq->head % (eq->queue_size / GDMA_EQE_SIZE);
 	struct gdma_context *gc = eq->gdma_dev->gdma_context;
 	struct gdma_eqe *eq_eqe_ptr = eq->queue_mem_ptr;
-	struct mana_serv_work *mns_wk;
 	union gdma_eqe_info eqe_info;
 	enum gdma_eqe_type type;
 	struct gdma_event event;
@@ -623,30 +647,7 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
 				 "Service is to be processed in probe\n");
 			break;
 		}
-
-		if (gc->in_service) {
-			dev_info(gc->dev, "Already in service\n");
-			break;
-		}
-
-		if (!try_module_get(THIS_MODULE)) {
-			dev_info(gc->dev, "Module is unloading\n");
-			break;
-		}
-
-		mns_wk = kzalloc_obj(*mns_wk, GFP_ATOMIC);
-		if (!mns_wk) {
-			module_put(THIS_MODULE);
-			break;
-		}
-
-		dev_info(gc->dev, "Start MANA service type:%d\n", type);
-		gc->in_service = true;
-		mns_wk->pdev = to_pci_dev(gc->dev);
-		mns_wk->type = type;
-		pci_dev_get(mns_wk->pdev);
-		INIT_WORK(&mns_wk->serv_work, mana_serv_func);
-		schedule_work(&mns_wk->serv_work);
+		mana_schedule_serv_work(gc, type);
 		break;
 
 	default:
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 933e9d681ded..56ee993e3a43 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -875,7 +875,7 @@ static void mana_tx_timeout(struct net_device *netdev, unsigned int txqueue)
 	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 
 	/* Already in service, hence tx queue reset is not required.*/
-	if (gc->in_service)
+	if (test_bit(GC_IN_SERVICE, &gc->flags))
 		return;
 
 	/* Note: If there are pending queue reset work for this port(apc),
@@ -3525,6 +3525,7 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
 {
 	struct mana_context *ac =
 		container_of(to_delayed_work(work), struct mana_context, gf_stats_work);
+	struct gdma_context *gc = ac->gdma_dev->gdma_context;
 	int err;
 
 	err = mana_query_gf_stats(ac);
@@ -3532,6 +3533,12 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
 		/* HWC timeout detected - reset stats and stop rescheduling */
 		ac->hwc_timeout_occurred = true;
 		memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
+		dev_warn(gc->dev,
+			 "Gf stats wk handler: gf stats query timed out.\n");
+		/* The HWC timed out, which indicates a faulty HW state that
+		 * needs a reset.
+		 */
+		mana_schedule_serv_work(gc, GDMA_EQE_HWC_RESET_REQUEST);
 		return;
 	}
 	schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 766f4fb25e26..ec17004b10c0 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -215,6 +215,12 @@ enum gdma_page_type {
 
 #define GDMA_INVALID_DMA_REGION 0
 
+struct mana_serv_work {
+	struct work_struct serv_work;
+	struct pci_dev *pdev;
+	enum gdma_eqe_type type;
+};
+
 struct gdma_mem_info {
 	struct device *dev;
 
@@ -386,6 +392,7 @@ struct gdma_irq_context {
 
 enum gdma_context_flags {
 	GC_PROBE_SUCCEEDED	= 0,
+	GC_IN_SERVICE		= 1,
 };
 
 struct gdma_context {
@@ -411,7 +418,6 @@ struct gdma_context {
 	u32			test_event_eq_id;
 
 	bool			is_pf;
-	bool			in_service;
 
 	phys_addr_t		bar0_pa;
 	void __iomem		*bar0_va;
@@ -473,6 +479,8 @@ int mana_gd_poll_cq(struct gdma_queue *cq, struct gdma_comp *comp, int num_cqe);
 
 void mana_gd_ring_cq(struct gdma_queue *cq, u8 arm_bit);
 
+int mana_schedule_serv_work(struct gdma_context *gc, enum gdma_eqe_type type);
+
 struct gdma_wqe {
 	u32 reserved	:24;
 	u32 last_vbytes	:8;
@@ -615,6 +623,9 @@ enum {
 /* Driver can handle hardware recovery events during probe */
 #define GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY BIT(22)
 
+/* Driver supports self recovery on Hardware Channel timeouts */
+#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY BIT(25)
+
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
 	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
@@ -628,7 +639,8 @@ enum {
 	 GDMA_DRV_CAP_FLAG_1_PERIODIC_STATS_QUERY | \
 	 GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
 	 GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
-	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY)
+	 GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
+	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH net-next] net: mana: Expose page_pool stats via ethtool
From: Dipayaan Roy @ 2026-02-27  9:39 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, dipayanroy

MANA relies on page_pool for RX buffers, and the buffer refill paths
can behave quite differently across architectures and configurations (e.g.
base page size, fragment vs full-page usage). This makes it harder to
understand and compare RX buffer behavior when investigating performance
and memory differences across platforms.

Wire up the generic page_pool ethtool stats helpers and report
page_pool allocation/recycle statistics via ethtool -S when
CONFIG_PAGE_POOL_STATS is enabled. The counters are exposed with the
standard "rx_pp_*" names, for example:

  rx_pp_alloc_fast
  rx_pp_alloc_slow
  rx_pp_alloc_slow_ho
  rx_pp_alloc_empty
  rx_pp_alloc_refill
  rx_pp_alloc_waive
  rx_pp_recycle_cached
  rx_pp_recycle_cache_full
  rx_pp_recycle_ring
  rx_pp_recycle_ring_full
  rx_pp_recycle_released_ref
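
The rx_pp_* values are reported once per device rather than per queue,
so the per-queue pool counters have to be summed before they are handed
to ethtool. Below is a minimal userspace sketch of that accumulation.
The struct and function names are hypothetical stand-ins, not the
kernel's struct page_pool_stats or page_pool_get_stats(); the sketch
only assumes that each pool's counters are added into a zero-initialized
total and that queues without a pool are skipped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the kernel's page_pool stats. */
struct pp_stats_sketch {
	uint64_t alloc_fast;
	uint64_t alloc_slow;
	uint64_t recycle_ring;
};

/* Hypothetical per-queue state; has_pool == 0 models a queue with no pool. */
struct rxq_sketch {
	int has_pool;
	struct pp_stats_sketch pool;
};

/* Add one pool's counters into the running totals (accumulate, not overwrite). */
static void pp_stats_accumulate(const struct pp_stats_sketch *pool,
				struct pp_stats_sketch *total)
{
	total->alloc_fast += pool->alloc_fast;
	total->alloc_slow += pool->alloc_slow;
	total->recycle_ring += pool->recycle_ring;
}

/* Sum the counters of every queue that owns a pool; entries may be NULL. */
static struct pp_stats_sketch pp_stats_sum(struct rxq_sketch *const *rxqs,
					   size_t num_queues)
{
	struct pp_stats_sketch total = {0};
	size_t q;

	for (q = 0; q < num_queues; q++) {
		if (!rxqs[q] || !rxqs[q]->has_pool)
			continue;
		pp_stats_accumulate(&rxqs[q]->pool, &total);
	}

	return total;
}
```

Starting from a zeroed total matters here: because every call adds to
the totals, stale values would otherwise leak into the reported
counters.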

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 30 +++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..8fec74cdd3c3 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -6,6 +6,7 @@
 #include <linux/ethtool.h>
 
 #include <net/mana/mana.h>
+#include <net/page_pool/helpers.h>
 
 struct mana_stats_desc {
 	char name[ETH_GSTRING_LEN];
@@ -143,8 +144,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	if (stringset != ETH_SS_STATS)
 		return -EINVAL;
 
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	return  ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) +
+		ARRAY_SIZE(mana_hc_stats) +
+		num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT) +
+		page_pool_ethtool_stats_get_count();
 }
 
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
@@ -185,6 +188,27 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
 		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
 	}
+
+	page_pool_ethtool_stats_get_strings(data);
+}
+
+static void mana_get_page_pool_stats(struct net_device *ndev, u64 *data)
+{
+#ifdef CONFIG_PAGE_POOL_STATS
+	struct mana_port_context *apc = netdev_priv(ndev);
+	unsigned int num_queues = apc->num_queues;
+	struct page_pool_stats pp_stats = {};
+	int q;
+
+	for (q = 0; q < num_queues; q++) {
+		if (!apc->rxqs[q] || !apc->rxqs[q]->page_pool)
+			continue;
+
+		page_pool_get_stats(apc->rxqs[q]->page_pool, &pp_stats);
+	}
+
+	page_pool_ethtool_stats_get(data, &pp_stats);
+#endif /* CONFIG_PAGE_POOL_STATS */
 }
 
 static void mana_get_ethtool_stats(struct net_device *ndev,
@@ -280,6 +304,8 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 		data[i++] = csum_partial;
 		data[i++] = mana_map_err;
 	}
+
+	mana_get_page_pool_stats(ndev, &data[i]);
 }
 
 static u32 mana_get_rx_ring_count(struct net_device *ndev)
-- 
2.43.0



* [PATCH net-next] net: mana: Force full-page RX buffers for 4K page size on specific systems.
From: Dipayaan Roy @ 2026-02-27 10:15 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, dipayanroy

On certain systems configured with 4K PAGE_SIZE, utilizing page_pool
fragments for RX buffers results in a significant throughput regression.
Profiling reveals that this regression correlates with high overhead in the
fragment allocation and reference counting paths on these specific
platforms, rendering the multi-buffer-per-page strategy counterproductive.

To mitigate this, bypass the page_pool fragment path and force a single RX
packet per page allocation when all the following conditions are met:
  1. The system is configured with a 4K PAGE_SIZE.
  2. A processor-specific quirk is detected via SMBIOS Type 4 data.

This approach restores expected line-rate performance by ensuring
predictable RX refill behavior on affected hardware.

There is no behavioral change for systems using larger page sizes
(16K/64K), or for platforms where this processor-specific quirk does not
apply.
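
The two conditions above can be modelled in userspace as follows. This
is a sketch under stated assumptions, not the driver code: strtab_get()
and needs_single_rxbuf() are hypothetical names, the string area is
assumed to follow the usual SMBIOS layout (NUL-terminated strings back
to back, closed by one extra NUL, with 1-based indices), and the page
size is passed in as a parameter instead of using PAGE_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Processor Version strings that get the quirk (taken from the patch). */
static const char *const quirk_tbl[] = {
	"Cobalt 200",
};

/*
 * Return the idx-th (1-based) string of an SMBIOS-style string area, or
 * NULL if idx is 0, out of range, or the string is not terminated in bounds.
 */
static const char *strtab_get(const uint8_t *area, size_t len, uint8_t idx)
{
	const uint8_t *p = area, *end = area + len;
	uint8_t i;

	if (!idx)
		return NULL;

	for (i = 1; i < idx; i++) {
		/* An empty string is the table terminator: idx is out of range. */
		if (p >= end || *p == 0)
			return NULL;
		while (p < end && *p)
			p++;
		if (p >= end)
			return NULL;
		p++;	/* step over the terminating NUL */
	}

	if (p >= end || *p == 0)
		return NULL;
	if (!memchr(p, 0, (size_t)(end - p)))
		return NULL;

	return (const char *)p;
}

/* Both conditions must hold: 4K pages and an exact match in the table. */
static bool needs_single_rxbuf(const char *ver, size_t page_size)
{
	size_t i;

	if (!ver || page_size != 4096)
		return false;

	for (i = 0; i < sizeof(quirk_tbl) / sizeof(quirk_tbl[0]); i++) {
		if (!strcmp(ver, quirk_tbl[i]))
			return true;
	}

	return false;
}
```

Because the match is exact, a version string with a vendor prefix or
trailing spaces would not trigger the quirk in this sketch.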

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 120 ++++++++++++++++++
 drivers/net/ethernet/microsoft/mana/mana_en.c |  23 +++-
 include/net/mana/gdma.h                       |  10 ++
 3 files changed, 151 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 0055c231acf6..26bbe736a770 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -9,6 +9,7 @@
 #include <linux/msi.h>
 #include <linux/irqdomain.h>
 #include <linux/export.h>
+#include <linux/dmi.h>
 
 #include <net/mana/mana.h>
 #include <net/mana/hw_channel.h>
@@ -1955,6 +1956,115 @@ static bool mana_is_pf(unsigned short dev_id)
 	return dev_id == MANA_PF_DEVICE_ID;
 }
 
+/*
+ * Table of Processor Version strings from SMBIOS Type 4 information, for
+ * processors that need the single-RX-buffer-per-page quirk to meet line-rate
+ * performance with ARM64 + 4K pages.
+ * Note: These strings are matched exactly against the SMBIOS version string.
+ */
+static const char * const mana_single_rxbuf_per_page_quirk_tbl[] = {
+	"Cobalt 200",
+};
+
+static const char *smbios_get_string(const struct dmi_header *hdr, u8 idx)
+{
+	const u8 *start, *end;
+	u8 i;
+
+	/* Indexing starts from 1. */
+	if (!idx)
+		return NULL;
+
+	start   = (const u8 *)hdr + hdr->length;
+	end = start + SMBIOS_STR_AREA_MAX;
+
+	for (i = 1; i < idx; i++) {
+		while (start < end && *start)
+			start++;
+		if (start < end)
+			start++;
+		if (start + 1 < end && start[0] == 0 && start[1] == 0)
+			return NULL;
+	}
+
+	if (start >= end || *start == 0)
+		return NULL;
+
+	return (const char *)start;
+}
+
+/* On some systems with 4K PAGE_SIZE, page_pool RX fragments can
+ * trigger a throughput regression. Hence identify those processors
+ * from the extracted SMBIOS table and apply the quirk that forces one
+ * RX buffer per page to avoid the fragment allocation/refcounting
+ * overhead in the RX refill path for those processors only.
+ */
+static bool mana_needs_single_rxbuf_per_page(struct gdma_context *gc)
+{
+	int i = 0;
+	const char *ver = gc->processor_version;
+
+	if (!ver)
+		return false;
+
+	if (PAGE_SIZE != SZ_4K)
+		return false;
+
+	while (i < ARRAY_SIZE(mana_single_rxbuf_per_page_quirk_tbl)) {
+		if (!strcmp(ver, mana_single_rxbuf_per_page_quirk_tbl[i]))
+			return true;
+		i++;
+	}
+
+	return false;
+}
+
+static void mana_get_proc_ver_from_smbios(const struct dmi_header *hdr,
+					  void *data)
+{
+	struct gdma_context *gc = data;
+	const char *ver_str;
+	u8 idx;
+
+	/* We are only looking for Type 4: Processor Information */
+	if (hdr->type != SMBIOS_TYPE_4_PROCESSOR_INFO)
+		return;
+
+	/* Ensure the record is long enough to contain the Processor Version
+	 * field
+	 */
+	if (hdr->length <= SMBIOS_TYPE4_PROC_VERSION_OFFSET)
+		return;
+
+	/* The byte at SMBIOS_TYPE4_PROC_VERSION_OFFSET holds the index of the
+	 * 'Processor Version' string. If found, make a copy of it.
+	 * There could be multiple Type 4 tables so read and copy the
+	 * processor version found the first time.
+	 */
+	idx = ((const u8 *)hdr)[SMBIOS_TYPE4_PROC_VERSION_OFFSET];
+	ver_str = smbios_get_string(hdr, idx);
+	if (ver_str && !gc->processor_version)
+		gc->processor_version = kstrdup(ver_str, GFP_KERNEL);
+}
+
+/* Check and initialize all processor optimizations/quirks here */
+static bool mana_init_processor_optimization(struct gdma_context *gc)
+{
+	bool opt_initialized = false;
+
+	gc->processor_version = NULL;
+	dmi_walk(mana_get_proc_ver_from_smbios, gc);
+	if (!gc->processor_version)
+		return false;
+
+	if (mana_needs_single_rxbuf_per_page(gc)) {
+		gc->force_full_page_rx_buffer = true;
+		opt_initialized = true;
+	}
+
+	return opt_initialized;
+}
+
 static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	struct gdma_context *gc;
@@ -2009,6 +2119,11 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
 							  mana_debugfs_root);
 
+	if (mana_init_processor_optimization(gc))
+		dev_info(&pdev->dev,
+			 "Processor specific optimization initialized on: %s\n",
+			gc->processor_version);
+
 	err = mana_gd_setup(pdev);
 	if (err)
 		goto unmap_bar;
@@ -2051,6 +2166,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	pci_iounmap(pdev, bar0_va);
 free_gc:
 	pci_set_drvdata(pdev, NULL);
+	kfree(gc->processor_version);
+	gc->processor_version = NULL;
 	vfree(gc);
 release_region:
 	pci_release_regions(pdev);
@@ -2106,6 +2223,9 @@ static void mana_gd_remove(struct pci_dev *pdev)
 
 	pci_iounmap(pdev, gc->bar0_va);
 
+	kfree(gc->processor_version);
+	gc->processor_version = NULL;
+
 	vfree(gc);
 
 	pci_release_regions(pdev);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 91c418097284..a53a8921050b 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -748,6 +748,26 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static inline bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	struct gdma_context *gc = apc->ac->gdma_dev->gdma_context;
+
+	/* On some systems with 4K PAGE_SIZE, page_pool RX fragments can
+	 * trigger a throughput regression. Hence force one RX buffer per page
+	 * to avoid the fragment allocation/refcounting overhead in the RX
+	 * refill path for those processors only.
+	 */
+	if (gc->force_full_page_rx_buffer)
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -758,8 +778,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index a59bd4035a99..0ef2d6ac5203 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -9,6 +9,14 @@
 
 #include "shm_channel.h"
 
+#define SMBIOS_STR_AREA_MAX   4096
+
+/* SMBIOS Type 4: Processor Information table */
+#define SMBIOS_TYPE_4_PROCESSOR_INFO 4
+
+/* Byte offset of the Processor Version string number. */
+#define SMBIOS_TYPE4_PROC_VERSION_OFFSET 0x10
+
 #define GDMA_STATUS_MORE_ENTRIES	0x00000105
 #define GDMA_STATUS_CMD_UNSUPPORTED	0xffffffff
 
@@ -436,6 +444,8 @@ struct gdma_context {
 	struct workqueue_struct *service_wq;
 
 	unsigned long		flags;
+	u8			*processor_version;
+	bool			force_full_page_rx_buffer;
 };
 
 static inline bool mana_gd_is_mana(struct gdma_dev *gd)
-- 
2.43.0



* Re: [PATCH net v2] net: mana: Ring doorbell at 4 CQ wraparounds
From: Vadim Fedorenko @ 2026-02-27 10:53 UTC (permalink / raw)
  To: Long Li, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Shradha Gupta, Erni Sri Satya Vennela, linux-hyperv, netdev,
	linux-kernel, stable
In-Reply-To: <20260226192833.1050807-1-longli@microsoft.com>

On 26/02/2026 19:28, Long Li wrote:
> MANA hardware requires at least one doorbell ring every 8 wraparounds
> of the CQ. The driver rings the doorbell as a form of flow control to
> inform hardware that CQEs have been consumed.
> 
> The NAPI poll functions mana_poll_tx_cq() and mana_poll_rx_cq() can
> poll up to CQE_POLLING_BUFFER (512) completions per call. If the CQ
> has fewer than 512 entries, a single poll call can process more than
> 4 wraparounds without ringing the doorbell. The doorbell threshold
> check also uses ">" instead of ">=", delaying the ring by one extra
> CQE beyond 4 wraparounds. Combined, these issues can cause the driver
> to exceed the 8-wraparound hardware limit, leading to missed
> completions and stalled queues.
> 
> Fix this by capping the number of CQEs polled per call to 4 wraparounds
> of the CQ in both TX and RX paths. Also change the doorbell threshold
> from ">" to ">=" so the doorbell is rung as soon as 4 wraparounds are
> reached.
> 
> Cc: stable@vger.kernel.org
> Fixes: 58a63729c957 ("net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings")
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
> v2: Use min() instead of min_t(u32, ...) since queue_size is already u32
>   drivers/net/ethernet/microsoft/mana/mana_en.c | 23 +++++++++++++++----
>   1 file changed, 18 insertions(+), 5 deletions(-)

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
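
For readers following the arithmetic in the quoted description, here is
a small userspace model of the two changes. It is a sketch, not the
driver code: the function names, DOORBELL_WRAP_LIMIT, and the idea of
counting CQEs consumed since the last doorbell are assumptions made for
illustration. Only CQE_POLLING_BUFFER (512) and the 4-wraparound
threshold come from the description.

```c
#include <stdint.h>

#define CQE_POLLING_BUFFER	512u	/* per-call poll limit from the description */
#define DOORBELL_WRAP_LIMIT	4u	/* hypothetical name for the 4-wraparound bound */

/* Max CQEs to poll in one call: never more than 4 wraparounds of the CQ. */
static uint32_t cq_poll_budget(uint32_t cqe_per_cq)
{
	uint32_t cap = cqe_per_cq * DOORBELL_WRAP_LIMIT;

	return cap < CQE_POLLING_BUFFER ? cap : CQE_POLLING_BUFFER;
}

/*
 * Whether the doorbell is due after 'consumed' CQEs since the last ring.
 * Using ">=" rings exactly at 4 wraparounds instead of one CQE later.
 */
static int cq_needs_doorbell(uint32_t consumed, uint32_t cqe_per_cq)
{
	return consumed >= cqe_per_cq * DOORBELL_WRAP_LIMIT;
}
```

With a 64-entry CQ, for instance, the uncapped limit of 512 completions
would span 8 wraparounds in a single call, while the capped budget is
256.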


* [PATCH v1 0/4] Allow order zero pages in page reporting
From: Yuvraj Sakshith @ 2026-02-27 14:06 UTC (permalink / raw)
  To: akpm
  Cc: mst, david, kys, haiyangz, wei.liu, decui, longli, jasowang,
	xuanzhuo, eperezma, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, jackmanb, hannes, ziy, linux-hyperv,
	virtualization, linux-mm, linux-kernel

Today, page reporting sets page_reporting_order in two ways:

(1) page_reporting.page_reporting_order cmdline parameter
(2) A driver can pass the order while registering itself.

In both cases, order zero is ignored by free page reporting
because it is used to set page_reporting_order to a default
value, like MAX_PAGE_ORDER.

In some cases we might want page_reporting_order to be zero.

For instance, when virtio-balloon runs inside a guest with
tiny memory (say, 16MB), it might not be able to find an order-1 page
(or, in the worst case, an order MAX_PAGE_ORDER page) after some uptime.
Page reporting should be able to return order zero pages back for
optimal memory relinquishment.

This series changes the default fallback value from '0' to '-1' in
all clients of free page reporting (hv_balloon and virtio-balloon)
and allows '0' as a valid order in page_reporting_register().
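
The intended selection logic can be sketched in userspace like this. It
is a simplified model, not the kernel function:
pick_reporting_order() and DEFAULT_ORDER_SKETCH are hypothetical names,
signed ints are used throughout for clarity, and max_order and
pageblock_order are passed in as parameters instead of using
MAX_PAGE_ORDER and pageblock_order directly.

```c
/* Hypothetical stand-in for the "no preference" sentinel after this series. */
#define DEFAULT_ORDER_SKETCH	(-1)

/*
 * Pick the reporting order:
 *  - a cmdline value other than -1 takes precedence;
 *  - otherwise a driver order in [0, max_order] is used, so 0 is valid;
 *  - otherwise fall back to pageblock_order.
 */
static int pick_reporting_order(int cmdline_order, int drv_order,
				int max_order, int pageblock_order)
{
	if (cmdline_order != -1)
		return cmdline_order;

	if (drv_order == DEFAULT_ORDER_SKETCH)
		return pageblock_order;

	if (drv_order >= 0 && drv_order <= max_order)
		return drv_order;

	return pageblock_order;
}
```

With the sentinel moved to -1, a driver that passes 0 gets order-zero
reporting, and only a driver that passes the sentinel (or an
out-of-range value) falls back to pageblock_order.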

Changes in v1:
- Introduce PAGE_REPORTING_DEFAULT_ORDER macro (initially set to 0).
- Make use of new macro in drivers (hv_balloon and virtio-balloon)
	working with page reporting.
- Change PAGE_REPORTING_DEFAULT_ORDER to -1 as zero is a valid
	page order that can be requested.

Yuvraj Sakshith (3):
  mm/page_reporting: Allow zero page_reporting_order
  hv_balloon: Change default page reporting order
  virtio_balloon: Set pr_dev.order to new default

 drivers/hv/hv_balloon.c         |  2 +-
 drivers/virtio/virtio_balloon.c | 14 ++++++++++++++
 mm/page_reporting.c             |  2 +-
 3 files changed, 16 insertions(+), 2 deletions(-)

-- 
2.34.1


* [PATCH v1 1/4] page_reporting: add PAGE_REPORTING_DEFAULT_ORDER
From: Yuvraj Sakshith @ 2026-02-27 14:06 UTC (permalink / raw)
  To: akpm
  Cc: mst, david, kys, haiyangz, wei.liu, decui, longli, jasowang,
	xuanzhuo, eperezma, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, jackmanb, hannes, ziy, linux-hyperv,
	virtualization, linux-mm, linux-kernel
In-Reply-To: <20260227140655.360696-1-yuvraj.sakshith@oss.qualcomm.com>

Drivers can pass the order of pages to be reported while
registering themselves. Today, the "use the default" value is a
magic number, 0.

Label this with PAGE_REPORTING_DEFAULT_ORDER and
check for it when the driver is being registered.

Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com>
---
 include/linux/page_reporting.h | 1 +
 mm/page_reporting.c            | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
index fe648dfa3..a7e3e30f2 100644
--- a/include/linux/page_reporting.h
+++ b/include/linux/page_reporting.h
@@ -7,6 +7,7 @@
 
 /* This value should always be a power of 2, see page_reporting_cycle() */
 #define PAGE_REPORTING_CAPACITY		32
+#define PAGE_REPORTING_DEFAULT_ORDER	0
 
 struct page_reporting_dev_info {
 	/* function that alters pages to make them "reported" */
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e61..9ad4fc3f8 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -370,7 +370,8 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
 	 */
 
 	if (page_reporting_order == -1) {
-		if (prdev->order > 0 && prdev->order <= MAX_PAGE_ORDER)
+		if (prdev->order != PAGE_REPORTING_DEFAULT_ORDER &&
+			prdev->order <= MAX_PAGE_ORDER)
 			page_reporting_order = prdev->order;
 		else
 			page_reporting_order = pageblock_order;
-- 
2.34.1


