Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* RE: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Michael Kelley (LINUX) @ 2023-07-21  0:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <20230720211553.GA3615208@hirez.programming.kicks-ass.net>

From: Peter Zijlstra <peterz@infradead.org> Sent: Thursday, July 20, 2023 2:16 PM
> 
> On Thu, Jul 20, 2023 at 01:33:57PM -0700, Michael Kelley wrote:
> > On hardware that supports Indirect Branch Tracking (IBT), Hyper-V VMs
> > with ConfigVersion 9.3 or later support IBT in the guest. However,
> > current versions of Hyper-V have a bug in that there's not an ENDBR64
> > instruction at the beginning of the hypercall page.
> 
> Whoops :/
> 
> > Since hypercalls are
> > made with an indirect call to the hypercall page, all hypercall attempts
> > fail with an exception and Linux panics.
> >
> > A Hyper-V fix is in progress to add ENDBR64. But guard against the Linux
> > panic by clearing X86_FEATURE_IBT if the hypercall page doesn't start
> > with ENDBR. The VM will boot and run without IBT.
> >
> > If future Linux 32-bit kernels were to support IBT, additional hypercall
> > page hackery would be needed to make IBT work for such kernels in a
> > Hyper-V VM.
> 
> There are currently no plans to add IBT support to 32bit.

That's what I thought.

> 
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> > ---
> >  arch/x86/hyperv/hv_init.c | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> >
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 6c04b52..5cbee24 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -14,6 +14,7 @@
> >  #include <asm/apic.h>
> >  #include <asm/desc.h>
> >  #include <asm/sev.h>
> > +#include <asm/ibt.h>
> >  #include <asm/hypervisor.h>
> >  #include <asm/hyperv-tlfs.h>
> >  #include <asm/mshyperv.h>
> > @@ -472,6 +473,26 @@ void __init hyperv_init(void)
> >  	}
> >
> >  	/*
> > +	 * Some versions of Hyper-V that provide IBT in guest VMs have a bug
> > +	 * in that there's no ENDBR64 instruction at the entry to the
> > +	 * hypercall page. Because hypercalls are invoked via an indirect call
> > +	 * to the hypercall page, all hypercall attempts fail when IBT is
> > +	 * enabled, and Linux panics. For such buggy versions, disable IBT.
> > +	 *
> > +	 * Fixed versions of Hyper-V always provide ENDBR64 on the hypercall
> > +	 * page, so if future Linux kernel versions enable IBT for 32-bit
> > +	 * builds, additional hypercall page hackery will be required here
> > +	 * to provide an ENDBR32.
> > +	 */
> > +#ifdef CONFIG_X86_KERNEL_IBT
> > +	if (cpu_feature_enabled(X86_FEATURE_IBT) &&
> > +	    *(u32 *)hv_hypercall_pg != gen_endbr()) {
> > +		setup_clear_cpu_cap(X86_FEATURE_IBT);
> > +		pr_info("Hyper-V: Disabling IBT because of Hyper-V bug\n");
> > +	}
> > +#endif
> 
> pr_warn() perhaps?

I wanted pr_info() so there's an immediate way to check for this
case in the dmesg output if a user complains about IBT not being
enabled when he expects it.   In some sense, the message is temporary
because once the Hyper-V patch is available and users install it,
the message will go away.  The pipeline for the Hyper-V patch is a
bit long, so availability is at least several months away.  This Linux
workaround will be available much faster.  Once it is picked up on
stable branches, we will avoid the situations like we saw where
someone upgraded Fedora 38 from a 6.2 to a 6.3 kernel, and the 6.3
kernel wouldn't boot because it has kernel IBT enabled.

> 
> Other than that, this seems fairly straight forward. One thing I
> wondered about; wouldn't it be possible to re-write the indirect
> hypercall thingies to a direct call? I mean, once we have the hypercall
> page mapped, the address is known right?

Yes, the address is known.  It does not change across things like
hibernation.  But the indirect call instruction is part of an inline assembly
sequence, so the call instructions that need re-writing are scattered
throughout the code. There's also the SEV-SNP case from the
latest version of Tianyu Lan's patch set [1] where vmmcall may be used
instead, based on your recent enhancement for nested ALTERNATIVE.
Re-writing seems like that's more complexity than warranted for a
mostly interim situation until the Hyper-V patch is available and
users install it.

Michael

[1] https://lore.kernel.org/lkml/20230718032304.136888-6-ltykernel@gmail.com/

^ permalink raw reply

* Re: [PATCH v2 1/9] vgacon: rework Kconfig dependencies
From: Michael Ellerman @ 2023-07-21  4:59 UTC (permalink / raw)
  To: Arnd Bergmann, linux-fbdev, Thomas Zimmermann, Helge Deller,
	Javier Martinez Canillas
  Cc: Arnd Bergmann, David S. Miller, K. Y. Srinivasan, Ard Biesheuvel,
	Borislav Petkov, Brian Cain, Catalin Marinas, Christophe Leroy,
	Daniel Vetter, Dave Hansen, David Airlie, Deepak Rawat,
	Dexuan Cui, Dinh Nguyen, Greg Kroah-Hartman, Guo Ren,
	Haiyang Zhang, Huacai Chen, Ingo Molnar,
	John Paul Adrian Glaubitz, Khalid Aziz, Linus Walleij,
	Matt Turner, Max Filippov, Nicholas Piggin, Palmer Dabbelt,
	Russell King, Thomas Bogendoerfer, Thomas Gleixner, WANG Xuerui,
	Wei Liu, Will Deacon, x86, linux-alpha, linux-kernel,
	linux-arm-kernel, linux-efi, linux-csky, linux-hexagon,
	linux-ia64, loongarch, linux-mips, linuxppc-dev, linux-riscv,
	linux-sh, sparclinux, linux-hyperv, dri-devel
In-Reply-To: <20230719123944.3438363-2-arnd@kernel.org>

Arnd Bergmann <arnd@kernel.org> writes:
> From: Arnd Bergmann <arnd@arndb.de>
>
> The list of dependencies here is phrased as an opt-out, but this is missing
> a lot of architectures that don't actually support VGA consoles, and some
> of the entries are stale:
>
>  - powerpc used to support VGA consoles in the old arch/ppc codebase, but
>    the merged arch/powerpc never did

Not disputing this, but how did you come to that conclusion? I grepped
around and couldn't convince myself whether it can work on powerpc or
not. ie. currently it's possible to enable CONFIG_VGA_CONSOLE and
powerpc does have a struct screen_info defined which seems like it would
allow vgacon_startup() to complete.

My only concern is that someone could be using it with Qemu?

cheers

^ permalink raw reply

* Re: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Peter Zijlstra @ 2023-07-21  7:58 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <SN6PR2101MB16933FAC4E09E15D824EB2FDD73FA@SN6PR2101MB1693.namprd21.prod.outlook.com>

On Fri, Jul 21, 2023 at 12:41:35AM +0000, Michael Kelley (LINUX) wrote:

> > Other than that, this seems fairly straight forward. One thing I
> > wondered about; wouldn't it be possible to re-write the indirect
> > hypercall thingies to a direct call? I mean, once we have the hypercall
> > page mapped, the address is known right?
> 
> Yes, the address is known.  It does not change across things like
> hibernation.  But the indirect call instruction is part of an inline assembly
> sequence, so the call instructions that need re-writing are scattered
> throughout the code. There's also the SEV-SNP case from the
> latest version of Tianyu Lan's patch set [1] where vmmcall may be used
> instead, based on your recent enhancement for nested ALTERNATIVE.
> Re-writing seems like that's more complexity than warranted for a
> mostly interim situation until the Hyper-V patch is available and
> users install it.

Well, we have a lot of infrastructure for this already. Specifically
this is very like the paravirt patching.

Also, direct calls are both faster and have less speculation issues, so
it might still be worth looking at.

The way to do something like this would be:


	asm volatile ("   ANNOTATE_RETPOLINE_SAFE	\n\t"
		      "1: call *hv_hypercall_page	\n\t"
		      ".pushsection .hv_call_sites	\n\t"
		      ".long 1b - .			\n\t"
		      ".popsection			\n\t");


And then (see alternative.c for many other examples):


patch_hypercalls()
{
	s32 *s;

	for (s = __hv_call_sites_begin; s < __hv_call_sites_end; s++) {
		void *addr = (void *)s + *s;
		struct insn insn;

		ret = insn_decode_kernel(&insn, addr);
		if (WARN_ON_ONCE(ret < 0))
			continue;

		/*
		 * indirect call: ff 15 disp32
		 * direct call:   2e e8 disp32
		 */
		if (insn.length == 6 &&
		    insn.opcode.bytes[0] == 0xFF &&
		    X86_MODRM_REG(insn.modrm.bytes[0]) == 2) {

			/* verify it was calling hy_hypercall_page */
			if (WARN_ON_ONCE(addr + 6 + insn.displacement.value != &hv_hypercall_page))
				continue;

			/*
			 * write a CS padded direct call -- assumes the
			 * hypercall page is in the 2G immediate range
			 * of the kernel text
			 */
			addr[0] = 0x2e; /* CS prefix */
			addr[1] = CALL_INSN_OPCODE;
			(s32 *)&Addr[2] = *hv_hypercall_page - (addr + 6);
		}
	}
}


See, easy :-)

^ permalink raw reply

* Re: [PATCH v2 1/9] vgacon: rework Kconfig dependencies
From: Arnd Bergmann @ 2023-07-21  8:26 UTC (permalink / raw)
  To: Michael Ellerman, Arnd Bergmann, linux-fbdev, Thomas Zimmermann,
	Helge Deller, Javier Martinez Canillas
  Cc: David S . Miller, K. Y. Srinivasan, Ard Biesheuvel,
	Borislav Petkov, Brian Cain, Catalin Marinas, Christophe Leroy,
	Daniel Vetter, Dave Hansen, Dave Airlie, Deepak Rawat, Dexuan Cui,
	Dinh Nguyen, Greg Kroah-Hartman, guoren, Haiyang Zhang,
	Huacai Chen, Ingo Molnar, John Paul Adrian Glaubitz, Khalid Aziz,
	Linus Walleij, Matt Turner, Max Filippov, Nicholas Piggin,
	Palmer Dabbelt, Russell King, Thomas Bogendoerfer,
	Thomas Gleixner, WANG Xuerui, Wei Liu, Will Deacon, x86,
	linux-alpha, linux-kernel, linux-arm-kernel, linux-efi,
	linux-csky@vger.kernel.org, linux-hexagon, linux-ia64, loongarch,
	linux-mips, linuxppc-dev, linux-riscv, linux-sh, sparclinux,
	linux-hyperv, dri-devel
In-Reply-To: <87pm4lj1w3.fsf@mail.lhotse>

On Fri, Jul 21, 2023, at 06:59, Michael Ellerman wrote:
> Arnd Bergmann <arnd@kernel.org> writes:
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> The list of dependencies here is phrased as an opt-out, but this is missing
>> a lot of architectures that don't actually support VGA consoles, and some
>> of the entries are stale:
>>
>>  - powerpc used to support VGA consoles in the old arch/ppc codebase, but
>>    the merged arch/powerpc never did
>
> Not disputing this, but how did you come to that conclusion? I grepped
> around and couldn't convince myself whether it can work on powerpc or
> not. ie. currently it's possible to enable CONFIG_VGA_CONSOLE and
> powerpc does have a struct screen_info defined which seems like it would
> allow vgacon_startup() to complete.

The VGA console needs both screen_info and vga_con to work. In arch/ppc
we had both, but in arch/powerpc we only retained the screen_info:

$ git grep vga_con v2.6.26 -- arch/ppc arch/ppc64 arch/powerpc
v2.6.26:arch/ppc/platforms/pplus.c:     conswitchp = &vga_con;
v2.6.26:arch/ppc/platforms/prep_setup.c:        conswitchp = &vga_con;

so after arch/ppc was removed, this became impossible to use on both
pplus and prep. These two platforms were also (as far as I can tell)
the only ones to support vga16fb as an alternative to vgacon, but
both platforms were removed later on.

> My only concern is that someone could be using it with Qemu?

I have not yet ruled out anyone using vga16fb on qemu before
commit 0db5b61e0dc07 ("fbdev/vga16fb: Create EGA/VGA devices
in sysfb code"), but I can see that this has been broken for
12 months without anyone complaining about it, since vga16fb
no longer works with the "orig_video_isVGA == 1" setting
in arch/powerpc (the device is not created).

In the qemu sources, I see five powerpc machines that intialize
VGA support: mac_newworld, mac_oldworld, pegasos2, prep, and spapr.
I think we can exclude prep (which was removed from the kernel)
and spapr (64-bit VGA_MAP_MEM() looks broken). I think the
macs always come up in graphical mode and only use
offb/atifb/rivafb/matroxfb but not vga16fb that would require
running the x86 VGA BIOS initialization.

I suppose it's possible to use vga16fb (not vgacon) with
"qemu-system-ppc -M pegasos2 -vga std" if that still boots
at all. Support for pegasos2 hardware appears to have been
removed with commit 04debf21fa174 ("powerpc: Remove core
support for Marvell mv64x60 hostbridges"), but it's possible
that this did not break qemu support if that only uses
devices under arch/powerpc/platforms/chrp/pci.c. I could
not get it to boot, but did not try very hard.

      Arnd

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH v3] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Singh Sengar @ 2023-07-21 12:58 UTC (permalink / raw)
  To: Wei Liu, Saurabh Sengar, dave.hansen@linux.intel.com
  Cc: KY Srinivasan, Haiyang Zhang, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, Michael Kelley (LINUX),
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	hpa@zytor.com
In-Reply-To: <ZJx02GzzoQ9E1TH9@liuwe-devbox-debian-v2>



> -----Original Message-----
> From: Wei Liu <wei.liu@kernel.org>
> Sent: Wednesday, June 28, 2023 11:29 PM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; tglx@linutronix.de; mingo@redhat.com;
> bp@alien8.de; dave.hansen@linux.intel.com; x86@kernel.org; Michael Kelley
> (LINUX) <mikelley@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> hyperv@vger.kernel.org; hpa@zytor.com
> Subject: [EXTERNAL] Re: [PATCH v3] x86/hyperv: add noop functions to
> x86_init mpparse functions
> 
> On Fri, Jun 23, 2023 at 09:28:08AM -0700, Saurabh Sengar wrote:
> > Hyper-V can run VMs at different privilege "levels" known as Virtual
> > Trust Levels (VTL). Sometimes, it chooses to run two different VMs at
> > different levels but they share some of their address space. In such
> > setups VTL2 (higher level VM) has visibility of all of the
> > VTL0 (level 0) memory space.
> >
> > When the CONFIG_X86_MPPARSE is enabled for VTL2, the VTL2 kernel
> > performs a search within the low memory to locate MP tables. However,
> > in systems where VTL0 manages the low memory and may contain valid
> > tables, this scanning can result in incorrect MP table information
> > being provided to the VTL2 kernel, mistakenly considering VTL0's MP
> > table as its own
> >
> > Add noop functions to avoid MP parse scan by VTL2.
> >
> > Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> 
> Hi Dave, are you happy with the commit message?

HI Dave,

If there is no concern, can I get your ack

- Saurabh

> 
> > ---
> > [V3]
> >  - modify commit message.
> >
> >  arch/x86/hyperv/hv_vtl.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c index
> > 85d38b9f3586..db5d2ea39fc0 100644
> > --- a/arch/x86/hyperv/hv_vtl.c
> > +++ b/arch/x86/hyperv/hv_vtl.c
> > @@ -25,6 +25,10 @@ void __init hv_vtl_init_platform(void)
> >  	x86_init.irqs.pre_vector_init = x86_init_noop;
> >  	x86_init.timers.timer_init = x86_init_noop;
> >
> > +	/* Avoid searching for BIOS MP tables */
> > +	x86_init.mpparse.find_smp_config = x86_init_noop;
> > +	x86_init.mpparse.get_smp_config = x86_init_uint_noop;
> > +
> >  	x86_platform.get_wallclock = get_rtc_noop;
> >  	x86_platform.set_wallclock = set_rtc_noop;
> >  	x86_platform.get_nmi_reason = hv_get_nmi_reason;
> > --
> > 2.34.1
> >

^ permalink raw reply

* RE: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Michael Kelley (LINUX) @ 2023-07-21 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <20230721075848.GA3630545@hirez.programming.kicks-ass.net>

From: Peter Zijlstra <peterz@infradead.org> Sent: Friday, July 21, 2023 12:59 AM
> 
> On Fri, Jul 21, 2023 at 12:41:35AM +0000, Michael Kelley (LINUX) wrote:
> 
> > > Other than that, this seems fairly straight forward. One thing I
> > > wondered about; wouldn't it be possible to re-write the indirect
> > > hypercall thingies to a direct call? I mean, once we have the hypercall
> > > page mapped, the address is known right?
> >
> > Yes, the address is known.  It does not change across things like
> > hibernation.  But the indirect call instruction is part of an inline assembly
> > sequence, so the call instructions that need re-writing are scattered
> > throughout the code. There's also the SEV-SNP case from the
> > latest version of Tianyu Lan's patch set [1] where vmmcall may be used
> > instead, based on your recent enhancement for nested ALTERNATIVE.
> > Re-writing seems like that's more complexity than warranted for a
> > mostly interim situation until the Hyper-V patch is available and
> > users install it.
> 
> Well, we have a lot of infrastructure for this already. Specifically
> this is very like the paravirt patching.
> 
> Also, direct calls are both faster and have less speculation issues, so
> it might still be worth looking at.
> 
> The way to do something like this would be:
> 
> 
> 	asm volatile ("   ANNOTATE_RETPOLINE_SAFE	\n\t"
> 		      "1: call *hv_hypercall_page	\n\t"
> 		      ".pushsection .hv_call_sites	\n\t"
> 		      ".long 1b - .			\n\t"
> 		      ".popsection			\n\t");
> 
> 
> And then (see alternative.c for many other examples):
> 
> 
> patch_hypercalls()
> {
> 	s32 *s;
> 
> 	for (s = __hv_call_sites_begin; s < __hv_call_sites_end; s++) {
> 		void *addr = (void *)s + *s;
> 		struct insn insn;
> 
> 		ret = insn_decode_kernel(&insn, addr);
> 		if (WARN_ON_ONCE(ret < 0))
> 			continue;
> 
> 		/*
> 		 * indirect call: ff 15 disp32
> 		 * direct call:   2e e8 disp32
> 		 */
> 		if (insn.length == 6 &&
> 		    insn.opcode.bytes[0] == 0xFF &&
> 		    X86_MODRM_REG(insn.modrm.bytes[0]) == 2) {
> 
> 			/* verify it was calling hy_hypercall_page */
> 			if (WARN_ON_ONCE(addr + 6 + insn.displacement.value != &hv_hypercall_page))
> 				continue;
> 
> 			/*
> 			 * write a CS padded direct call -- assumes the
> 			 * hypercall page is in the 2G immediate range
> 			 * of the kernel text

Probably not true -- the hypercall page has a vmalloc address.

> 			 */
> 			addr[0] = 0x2e; /* CS prefix */
> 			addr[1] = CALL_INSN_OPCODE;
> 			(s32 *)&Addr[2] = *hv_hypercall_page - (addr + 6);
> 		}
> 	}
> }
> 
> 
> See, easy :-)

OK, worth looking into.  This is a corner of the Linux kernel code that
I've never looked at before.  I appreciate the pointers.

Hypercall sites also exist in loadable modules, so would need to hook
into module_finalize() as well.  Processing a new section type looks
straightforward.

But altogether, this feels like more change than should go as a bug
fix to be backported to stable kernels.  It's something to look at for a
future kernel release.

Michael

^ permalink raw reply

* RE: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Michael Kelley (LINUX) @ 2023-07-21 14:05 UTC (permalink / raw)
  To: Michael Kelley (LINUX), Peter Zijlstra
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <SN6PR2101MB16933FAC4E09E15D824EB2FDD73FA@SN6PR2101MB1693.namprd21.prod.outlook.com>

From: Michael Kelley (LINUX) <mikelley@microsoft.com> Sent: Thursday, July 20, 2023 5:42 PM
> 
> From: Peter Zijlstra <peterz@infradead.org> Sent: Thursday, July 20, 2023 2:16 PM
> >
> > > @@ -472,6 +473,26 @@ void __init hyperv_init(void)
> > >  	}
> > >
> > >  	/*
> > > +	 * Some versions of Hyper-V that provide IBT in guest VMs have a bug
> > > +	 * in that there's no ENDBR64 instruction at the entry to the
> > > +	 * hypercall page. Because hypercalls are invoked via an indirect call
> > > +	 * to the hypercall page, all hypercall attempts fail when IBT is
> > > +	 * enabled, and Linux panics. For such buggy versions, disable IBT.
> > > +	 *
> > > +	 * Fixed versions of Hyper-V always provide ENDBR64 on the hypercall
> > > +	 * page, so if future Linux kernel versions enable IBT for 32-bit
> > > +	 * builds, additional hypercall page hackery will be required here
> > > +	 * to provide an ENDBR32.
> > > +	 */
> > > +#ifdef CONFIG_X86_KERNEL_IBT
> > > +	if (cpu_feature_enabled(X86_FEATURE_IBT) &&
> > > +	    *(u32 *)hv_hypercall_pg != gen_endbr()) {
> > > +		setup_clear_cpu_cap(X86_FEATURE_IBT);
> > > +		pr_info("Hyper-V: Disabling IBT because of Hyper-V bug\n");
> > > +	}
> > > +#endif
> >
> > pr_warn() perhaps?
> 
> I wanted pr_info() so there's an immediate way to check for this
> case in the dmesg output if a user complains about IBT not being
> enabled when he expects it.   In some sense, the message is temporary
> because once the Hyper-V patch is available and users install it,
> the message will go away.  The pipeline for the Hyper-V patch is a
> bit long, so availability is at least several months away.  This Linux
> workaround will be available much faster.  Once it is picked up on
> stable branches, we will avoid the situations like we saw where
> someone upgraded Fedora 38 from a 6.2 to a 6.3 kernel, and the 6.3
> kernel wouldn't boot because it has kernel IBT enabled.
> 

I realized in the middle of the night that my reply was nonsense. :-(
pr_warn() makes the message visible when pr_info() might not.  I'm
happy to change to pr_warn().

Michael

^ permalink raw reply

* RE: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: David Laight @ 2023-07-21 14:07 UTC (permalink / raw)
  To: 'Michael Kelley (LINUX)', Peter Zijlstra
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <BYAPR21MB16889A4BD21DA1F8357008FFD73FA@BYAPR21MB1688.namprd21.prod.outlook.com>

...
> I realized in the middle of the night that my reply was nonsense. :-(
> pr_warn() makes the message visible when pr_info() might not.  I'm
> happy to change to pr_warn().

PANIC_ON_WARN??

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply

* RE: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Michael Kelley (LINUX) @ 2023-07-21 14:21 UTC (permalink / raw)
  To: David Laight, Peter Zijlstra
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <533c3cae084f4c2aaf7227e113029d9f@AcuMS.aculab.com>

From: David Laight <David.Laight@ACULAB.COM> Sent: Friday, July 21, 2023 7:07 AM
> 
> ...
> > I realized in the middle of the night that my reply was nonsense. :-(
> > pr_warn() makes the message visible when pr_info() might not.  I'm
> > happy to change to pr_warn().
> 
> PANIC_ON_WARN??
> 

panic_on_warn applies to WARN() and variants.  pr_warn() is unrelated;
it's just kernel logging level 4 vs. logging level 6 for pr_info().

Michael

^ permalink raw reply

* Re: [PATCH 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Peter Zijlstra @ 2023-07-21 18:49 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	stable@vger.kernel.org
In-Reply-To: <BYAPR21MB168878CCD076762A1EA58635D73FA@BYAPR21MB1688.namprd21.prod.outlook.com>

On Fri, Jul 21, 2023 at 02:00:35PM +0000, Michael Kelley (LINUX) wrote:

> > Well, we have a lot of infrastructure for this already. Specifically
> > this is very like the paravirt patching.
> > 
> > Also, direct calls are both faster and have less speculation issues, so
> > it might still be worth looking at.
> > 
> > The way to do something like this would be:
> > 
> > 
> > 	asm volatile ("   ANNOTATE_RETPOLINE_SAFE	\n\t"
> > 		      "1: call *hv_hypercall_page	\n\t"
> > 		      ".pushsection .hv_call_sites	\n\t"
> > 		      ".long 1b - .			\n\t"
> > 		      ".popsection			\n\t");
> > 
> > 
> > And then (see alternative.c for many other examples):
> > 
> > 
> > patch_hypercalls()
> > {
> > 	s32 *s;
> > 
> > 	for (s = __hv_call_sites_begin; s < __hv_call_sites_end; s++) {
> > 		void *addr = (void *)s + *s;
> > 		struct insn insn;
> > 
> > 		ret = insn_decode_kernel(&insn, addr);
> > 		if (WARN_ON_ONCE(ret < 0))
> > 			continue;
> > 
> > 		/*
> > 		 * indirect call: ff 15 disp32
> > 		 * direct call:   2e e8 disp32
> > 		 */
> > 		if (insn.length == 6 &&
> > 		    insn.opcode.bytes[0] == 0xFF &&
> > 		    X86_MODRM_REG(insn.modrm.bytes[0]) == 2) {
> > 
> > 			/* verify it was calling hy_hypercall_page */
> > 			if (WARN_ON_ONCE(addr + 6 + insn.displacement.value != &hv_hypercall_page))
> > 				continue;
> > 
> > 			/*
> > 			 * write a CS padded direct call -- assumes the
> > 			 * hypercall page is in the 2G immediate range
> > 			 * of the kernel text
> 
> Probably not true -- the hypercall page has a vmalloc address.

See module_alloc(), that uses vmalloc but constrains the address to stay
within the 2G immediate address limit.

> > 			 */
> > 			addr[0] = 0x2e; /* CS prefix */
> > 			addr[1] = CALL_INSN_OPCODE;
> > 			(s32 *)&Addr[2] = *hv_hypercall_page - (addr + 6);
			*(s32 *)...
> > 		}
> > 	}
> > }
> > 
> > 
> > See, easy :-)
> 
> OK, worth looking into.  This is a corner of the Linux kernel code that
> I've never looked at before.  I appreciate the pointers.

No problem, I've been doing too much of this the past few years :-)

> Hypercall sites also exist in loadable modules, so would need to hook
> into module_finalize() as well.  Processing a new section type looks
> straightforward.

Yep,

> But altogether, this feels like more change than should go as a bug
> fix to be backported to stable kernels.  It's something to look at for a
> future kernel release.

Agreed!

^ permalink raw reply

* [PATCH V3,net-next] net: mana: Add page pool for RX buffers
From: Haiyang Zhang @ 2023-07-21 19:05 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, decui, kys, paulros, olaf, vkuznets, davem, wei.liu,
	edumazet, kuba, pabeni, leon, longli, ssengar, linux-rdma, daniel,
	john.fastabend, bpf, ast, sharmaajay, hawk, tglx, shradhagupta,
	linux-kernel

Add page pool for RX buffers for faster buffer cycle and reduce CPU
usage.

The standard page pool API is used.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
V3:
Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
V2:
Use the standard page pool API as suggested by Jesper Dangaard Brouer

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
 include/net/mana/mana.h                       |  3 +
 2 files changed, 78 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a499e460594b..4307f25f8c7a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1414,8 +1414,8 @@ static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 	return skb;
 }
 
-static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
-			struct mana_rxq *rxq)
+static void mana_rx_skb(void *buf_va, bool from_pool,
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
@@ -1448,6 +1448,9 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	if (!skb)
 		goto drop;
 
+	if (from_pool)
+		skb_mark_for_recycle(skb);
+
 	skb->dev = napi->dev;
 
 	skb->protocol = eth_type_trans(skb, ndev);
@@ -1498,9 +1501,14 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	u64_stats_update_end(&rx_stats->syncp);
 
 drop:
-	WARN_ON_ONCE(rxq->xdp_save_va);
-	/* Save for reuse */
-	rxq->xdp_save_va = buf_va;
+	if (from_pool) {
+		page_pool_recycle_direct(rxq->page_pool,
+					 virt_to_head_page(buf_va));
+	} else {
+		WARN_ON_ONCE(rxq->xdp_save_va);
+		/* Save for reuse */
+		rxq->xdp_save_va = buf_va;
+	}
 
 	++ndev->stats.rx_dropped;
 
@@ -1508,11 +1516,13 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool is_napi)
+			     dma_addr_t *da, bool *from_pool, bool is_napi)
 {
 	struct page *page;
 	void *va;
 
+	*from_pool = false;
+
 	/* Reuse XDP dropped page if available */
 	if (rxq->xdp_save_va) {
 		va = rxq->xdp_save_va;
@@ -1533,17 +1543,22 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 			return NULL;
 		}
 	} else {
-		page = dev_alloc_page();
+		page = page_pool_dev_alloc_pages(rxq->page_pool);
 		if (!page)
 			return NULL;
 
+		*from_pool = true;
 		va = page_to_virt(page);
 	}
 
 	*da = dma_map_single(dev, va + rxq->headroom, rxq->datasize,
 			     DMA_FROM_DEVICE);
 	if (dma_mapping_error(dev, *da)) {
-		put_page(virt_to_head_page(va));
+		if (*from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, is_napi);
+		else
+			put_page(virt_to_head_page(va));
+
 		return NULL;
 	}
 
@@ -1552,21 +1567,25 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf)
+			       struct mana_recv_buf_oob *rxoob, void **old_buf,
+			       bool *old_fp)
 {
+	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, true);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, true);
 	if (!va)
 		return;
 
 	dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 			 DMA_FROM_DEVICE);
 	*old_buf = rxoob->buf_va;
+	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
+	rxoob->from_pool = from_pool;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -1580,6 +1599,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct device *dev = gc->dev;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
+	bool old_fp;
 
 	apc = netdev_priv(ndev);
 
@@ -1622,12 +1642,12 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	rxbuf_oob = &rxq->rx_oobs[curr];
 	WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf);
+	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
 
 	/* Unsuccessful refill will have old_buf == NULL.
 	 * In this case, mana_rx_skb() will drop the packet.
 	 */
-	mana_rx_skb(old_buf, oob, rxq);
+	mana_rx_skb(old_buf, old_fp, oob, rxq);
 
 drop:
 	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
@@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
 
 	if (rxq->xdp_flush)
 		xdp_do_flush();
+
+	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
 }
 
 static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
@@ -1881,6 +1903,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	struct mana_recv_buf_oob *rx_oob;
 	struct device *dev = gc->dev;
 	struct napi_struct *napi;
+	struct page *page;
 	int i;
 
 	if (!rxq)
@@ -1913,10 +1936,18 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 		dma_unmap_single(dev, rx_oob->sgl[0].address,
 				 rx_oob->sgl[0].size, DMA_FROM_DEVICE);
 
-		put_page(virt_to_head_page(rx_oob->buf_va));
+		page = virt_to_head_page(rx_oob->buf_va);
+
+		if (rx_oob->from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, false);
+		else
+			put_page(page);
+
 		rx_oob->buf_va = NULL;
 	}
 
+	page_pool_destroy(rxq->page_pool);
+
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
@@ -1927,18 +1958,20 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
 
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, false);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, false);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
+	rx_oob->from_pool = from_pool;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
@@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq *rxq)
 	return 0;
 }
 
+static int mana_create_page_pool(struct mana_rxq *rxq)
+{
+	struct page_pool_params pprm = {};
+	int ret;
+
+	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
+	pprm.napi = &rxq->rx_cq.napi;
+
+	rxq->page_pool = page_pool_create(&pprm);
+
+	if (IS_ERR(rxq->page_pool)) {
+		ret = PTR_ERR(rxq->page_pool);
+		rxq->page_pool = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 					u32 rxq_idx, struct mana_eq *eq,
 					struct net_device *ndev)
@@ -2037,6 +2089,13 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	mana_get_rxbuf_cfg(ndev->mtu, &rxq->datasize, &rxq->alloc_size,
 			   &rxq->headroom);
 
+	/* Create page pool for RX queue */
+	err = mana_create_page_pool(rxq);
+	if (err) {
+		netdev_err(ndev, "Create page pool err:%d\n", err);
+		goto out;
+	}
+
 	err = mana_alloc_rx_wqe(apc, rxq, &rq_size, &cq_size);
 	if (err)
 		goto out;
@@ -2108,8 +2167,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	WARN_ON(xdp_rxq_info_reg(&rxq->xdp_rxq, ndev, rxq_idx,
 				 cq->napi.napi_id));
-	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq,
-					   MEM_TYPE_PAGE_SHARED, NULL));
+	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
+					   rxq->page_pool));
 
 	napi_enable(&cq->napi);
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 024ad8ddb27e..b12859511839 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -280,6 +280,7 @@ struct mana_recv_buf_oob {
 	struct gdma_wqe_request wqe_req;
 
 	void *buf_va;
+	bool from_pool; /* allocated from a page pool */
 
 	/* SGL of the buffer going to be sent has part of the work request. */
 	u32 num_sge;
@@ -330,6 +331,8 @@ struct mana_rxq {
 	bool xdp_flush;
 	int xdp_rc; /* XDP redirect return code */
 
+	struct page_pool *page_pool;
+
 	/* MUST BE THE LAST MEMBER:
 	 * Each receive buffer has an associated mana_recv_buf_oob.
 	 */
-- 
2.25.1


^ permalink raw reply related

* [PATCH v2 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Michael Kelley @ 2023-07-22  4:51 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, hpa,
	peterz, x86, linux-kernel, linux-hyperv
  Cc: mikelley, stable

On hardware that supports Indirect Branch Tracking (IBT), Hyper-V VMs
with ConfigVersion 9.3 or later support IBT in the guest. However,
current versions of Hyper-V have a bug in that there's not an ENDBR64
instruction at the beginning of the hypercall page. Since hypercalls are
made with an indirect call to the hypercall page, all hypercall attempts
fail with an exception and Linux panics.

A Hyper-V fix is in progress to add ENDBR64. But guard against the Linux
panic by clearing X86_FEATURE_IBT if the hypercall page doesn't start
with ENDBR. The VM will boot and run without IBT.

If future Linux 32-bit kernels were to support IBT, additional hypercall
page hackery would be needed to make IBT work for such kernels in a
Hyper-V VM.

Cc: stable@vger.kernel.org
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---

Changes in v2:
* Use pr_warn() instead of pr_info() [Peter Zijlstra]

 arch/x86/hyperv/hv_init.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 6c04b52..5cbee24 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -14,6 +14,7 @@
 #include <asm/apic.h>
 #include <asm/desc.h>
 #include <asm/sev.h>
+#include <asm/ibt.h>
 #include <asm/hypervisor.h>
 #include <asm/hyperv-tlfs.h>
 #include <asm/mshyperv.h>
@@ -472,6 +473,26 @@ void __init hyperv_init(void)
 	}
 
 	/*
+	 * Some versions of Hyper-V that provide IBT in guest VMs have a bug
+	 * in that there's no ENDBR64 instruction at the entry to the
+	 * hypercall page. Because hypercalls are invoked via an indirect call
+	 * to the hypercall page, all hypercall attempts fail when IBT is
+	 * enabled, and Linux panics. For such buggy versions, disable IBT.
+	 *
+	 * Fixed versions of Hyper-V always provide ENDBR64 on the hypercall
+	 * page, so if future Linux kernel versions enable IBT for 32-bit
+	 * builds, additional hypercall page hackery will be required here
+	 * to provide an ENDBR32.
+	 */
+#ifdef CONFIG_X86_KERNEL_IBT
+	if (cpu_feature_enabled(X86_FEATURE_IBT) &&
+	    *(u32 *)hv_hypercall_pg != gen_endbr()) {
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+		pr_warn("Hyper-V: Disabling IBT because of Hyper-V bug\n");
+	}
+#endif
+
+	/*
 	 * hyperv_init() is called before LAPIC is initialized: see
 	 * apic_intr_mode_init() -> x86_platform.apic_post_init() and
 	 * apic_bsp_setup() -> setup_local_APIC(). The direct-mode STIMER
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH RFC net-next v5 07/14] virtio/vsock: add common datagram send path
From: Arseniy Krasnov @ 2023-07-22  8:16 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v5-7-581bd37fdb26@bytedance.com>



On 19.07.2023 03:50, Bobby Eshleman wrote:
> This commit implements the common function
> virtio_transport_dgram_enqueue for enqueueing datagrams. It does not add
> usage in either vhost or virtio yet.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  net/vmw_vsock/virtio_transport_common.c | 76 ++++++++++++++++++++++++++++++++-
>  1 file changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index ffcbdd77feaa..3bfaff758433 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -819,7 +819,81 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
>  			       struct msghdr *msg,
>  			       size_t dgram_len)
>  {
> -	return -EOPNOTSUPP;
> +	/* Here we are only using the info struct to retain style uniformity
> +	 * and to ease future refactoring and merging.
> +	 */
> +	struct virtio_vsock_pkt_info info_stack = {
> +		.op = VIRTIO_VSOCK_OP_RW,
> +		.msg = msg,
> +		.vsk = vsk,
> +		.type = VIRTIO_VSOCK_TYPE_DGRAM,
> +	};
> +	const struct virtio_transport *t_ops;
> +	struct virtio_vsock_pkt_info *info;
> +	struct sock *sk = sk_vsock(vsk);
> +	struct virtio_vsock_hdr *hdr;
> +	u32 src_cid, src_port;
> +	struct sk_buff *skb;
> +	void *payload;
> +	int noblock;
> +	int err;
> +
> +	info = &info_stack;

I think 'info' assignment could be moved below, to the place where it is used
first time.

> +
> +	if (dgram_len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)
> +		return -EMSGSIZE;
> +
> +	t_ops = virtio_transport_get_ops(vsk);
> +	if (unlikely(!t_ops))
> +		return -EFAULT;
> +
> +	/* Unlike some of our other sending functions, this function is not
> +	 * intended for use without a msghdr.
> +	 */
> +	if (WARN_ONCE(!msg, "vsock dgram bug: no msghdr found for dgram enqueue\n"))
> +		return -EFAULT;

Sorry, but is that possible? I thought 'msg' is always provided by general socket layer (e.g. before
af_vsock.c code) and can't be NULL for DGRAM. Please correct me if i'm wrong.

Also I see, that in af_vsock.c , 'vsock_dgram_sendmsg()' dereferences 'msg' for checking MSG_OOB without any
checks (before calling transport callback - this function in case of virtio). So I think if we want to keep
this type of check - such check must be placed in af_vsock.c or somewhere before first dereference of this pointer.

> +
> +	noblock = msg->msg_flags & MSG_DONTWAIT;
> +
> +	/* Use sock_alloc_send_skb to throttle by sk_sndbuf. This helps avoid
> +	 * triggering the OOM.
> +	 */
> +	skb = sock_alloc_send_skb(sk, dgram_len + VIRTIO_VSOCK_SKB_HEADROOM,
> +				  noblock, &err);
> +	if (!skb)
> +		return err;
> +
> +	skb_reserve(skb, VIRTIO_VSOCK_SKB_HEADROOM);
> +
> +	src_cid = t_ops->transport.get_local_cid();
> +	src_port = vsk->local_addr.svm_port;
> +
> +	hdr = virtio_vsock_hdr(skb);
> +	hdr->type	= cpu_to_le16(info->type);
> +	hdr->op		= cpu_to_le16(info->op);
> +	hdr->src_cid	= cpu_to_le64(src_cid);
> +	hdr->dst_cid	= cpu_to_le64(remote_addr->svm_cid);
> +	hdr->src_port	= cpu_to_le32(src_port);
> +	hdr->dst_port	= cpu_to_le32(remote_addr->svm_port);
> +	hdr->flags	= cpu_to_le32(info->flags);
> +	hdr->len	= cpu_to_le32(dgram_len);
> +
> +	skb_set_owner_w(skb, sk);
> +
> +	payload = skb_put(skb, dgram_len);
> +	err = memcpy_from_msg(payload, msg, dgram_len);
> +	if (err)
> +		return err;

Do we need free allocated skb here ?

> +
> +	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> +					 remote_addr->svm_cid,
> +					 remote_addr->svm_port,
> +					 dgram_len,
> +					 info->type,
> +					 info->op,
> +					 0);
> +
> +	return t_ops->send_pkt(skb);
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
>  
> 

Thanks, Arseniy

^ permalink raw reply

* Re: [PATCH RFC net-next v5 11/14] vhost/vsock: implement datagram support
From: Arseniy Krasnov @ 2023-07-22  8:42 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v5-11-581bd37fdb26@bytedance.com>



On 19.07.2023 03:50, Bobby Eshleman wrote:
> This commit implements datagram support for vhost/vsock by teaching
> vhost to use the common virtio transport datagram functions.
> 
> If the virtio RX buffer is too small, then the transmission is
> abandoned, the packet dropped, and EHOSTUNREACH is added to the socket's
> error queue.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c    | 62 +++++++++++++++++++++++++++++++++++++++++++++---
>  net/vmw_vsock/af_vsock.c |  5 +++-
>  2 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index d5d6a3c3f273..da14260c6654 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -8,6 +8,7 @@
>   */
>  #include <linux/miscdevice.h>
>  #include <linux/atomic.h>
> +#include <linux/errqueue.h>
>  #include <linux/module.h>
>  #include <linux/mutex.h>
>  #include <linux/vmalloc.h>
> @@ -32,7 +33,8 @@
>  enum {
>  	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>  			       (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> -			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> +			       (1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> +			       (1ULL << VIRTIO_VSOCK_F_DGRAM)
>  };
>  
>  enum {
> @@ -56,6 +58,7 @@ struct vhost_vsock {
>  	atomic_t queued_replies;
>  
>  	u32 guest_cid;
> +	bool dgram_allow;
>  	bool seqpacket_allow;
>  };
>  
> @@ -86,6 +89,32 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>  	return NULL;
>  }
>  
> +/* Claims ownership of the skb, do not free the skb after calling! */
> +static void
> +vhost_transport_error(struct sk_buff *skb, int err)
> +{
> +	struct sock_exterr_skb *serr;
> +	struct sock *sk = skb->sk;
> +	struct sk_buff *clone;
> +
> +	serr = SKB_EXT_ERR(skb);
> +	memset(serr, 0, sizeof(*serr));
> +	serr->ee.ee_errno = err;
> +	serr->ee.ee_origin = SO_EE_ORIGIN_NONE;
> +
> +	clone = skb_clone(skb, GFP_KERNEL);

May for skb which is error carrier we can use 'sock_omalloc()', not 'skb_clone()' ? TCP uses skb
allocated by this function as carriers of error structure. I guess 'skb_clone()' also clones data of origin,
but i think that there is no need in data as we insert it to error queue of the socket.

What do You think?

> +	if (!clone)
> +		return;

What will happen here 'if (!clone)' ? skb will leak as it was removed from queue?

> +
> +	if (sock_queue_err_skb(sk, clone))
> +		kfree_skb(clone);
> +
> +	sk->sk_err = err;
> +	sk_error_report(sk);
> +
> +	kfree_skb(skb);
> +}
> +
>  static void
>  vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  			    struct vhost_virtqueue *vq)
> @@ -160,9 +189,15 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  		hdr = virtio_vsock_hdr(skb);
>  
>  		/* If the packet is greater than the space available in the
> -		 * buffer, we split it using multiple buffers.
> +		 * buffer, we split it using multiple buffers for connectible
> +		 * sockets and drop the packet for datagram sockets.
>  		 */
>  		if (payload_len > iov_len - sizeof(*hdr)) {
> +			if (le16_to_cpu(hdr->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> +				vhost_transport_error(skb, EHOSTUNREACH);
> +				continue;
> +			}
> +
>  			payload_len = iov_len - sizeof(*hdr);
>  
>  			/* As we are copying pieces of large packet's buffer to
> @@ -394,6 +429,7 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
>  	return val < vq->num;
>  }
>  
> +static bool vhost_transport_dgram_allow(u32 cid, u32 port);
>  static bool vhost_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport vhost_transport = {
> @@ -410,7 +446,8 @@ static struct virtio_transport vhost_transport = {
>  		.cancel_pkt               = vhost_transport_cancel_pkt,
>  
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> -		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_allow              = vhost_transport_dgram_allow,
> +		.dgram_addr_init          = virtio_transport_dgram_addr_init,
>  
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
>  		.stream_dequeue           = virtio_transport_stream_dequeue,
> @@ -443,6 +480,22 @@ static struct virtio_transport vhost_transport = {
>  	.send_pkt = vhost_transport_send_pkt,
>  };
>  
> +static bool vhost_transport_dgram_allow(u32 cid, u32 port)
> +{
> +	struct vhost_vsock *vsock;
> +	bool dgram_allow = false;
> +
> +	rcu_read_lock();
> +	vsock = vhost_vsock_get(cid);
> +
> +	if (vsock)
> +		dgram_allow = vsock->dgram_allow;
> +
> +	rcu_read_unlock();
> +
> +	return dgram_allow;
> +}
> +
>  static bool vhost_transport_seqpacket_allow(u32 remote_cid)
>  {
>  	struct vhost_vsock *vsock;
> @@ -799,6 +852,9 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>  	if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
>  		vsock->seqpacket_allow = true;
>  
> +	if (features & (1ULL << VIRTIO_VSOCK_F_DGRAM))
> +		vsock->dgram_allow = true;
> +
>  	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>  		vq = &vsock->vqs[i];
>  		mutex_lock(&vq->mutex);
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index e73f3b2c52f1..449ed63ac2b0 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -1427,9 +1427,12 @@ int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>  		return prot->recvmsg(sk, msg, len, flags, NULL);
>  #endif
>  
> -	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> +	if (unlikely(flags & MSG_OOB))
>  		return -EOPNOTSUPP;
>  
> +	if (unlikely(flags & MSG_ERRQUEUE))
> +		return sock_recv_errqueue(sk, msg, len, SOL_VSOCK, 0);
> +

Sorry, but I get build error here, because SOL_VSOCK in undefined. I think it should be added to
include/linux/socket.h and to uapi files also for future use in userspace.

Also Stefano Garzarella <sgarzare@redhat.com> suggested to add define something like VSOCK_RECVERR,
in the same way as IP_RECVERR, and use it as last parameter of 'sock_recv_errqueue()'.

>  	transport = vsk->transport;
>  
>  	/* Retrieve the head sk_buff from the socket's receive queue. */
> 

Thanks, Arseniy

^ permalink raw reply

* Re: [PATCH RFC net-next v5 13/14] virtio/vsock: implement datagram support
From: Arseniy Krasnov @ 2023-07-22  8:45 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v5-13-581bd37fdb26@bytedance.com>



On 19.07.2023 03:50, Bobby Eshleman wrote:
> This commit implements datagram support for virtio/vsock by teaching
> virtio to use the general virtio transport ->dgram_addr_init() function
> and implementation a new version of ->dgram_allow().
> 
> Additionally, it drops virtio_transport_dgram_allow() as an exported
> symbol because it is no longer used in other transports.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  include/linux/virtio_vsock.h            |  1 -
>  net/vmw_vsock/virtio_transport.c        | 24 +++++++++++++++++++++++-
>  net/vmw_vsock/virtio_transport_common.c |  6 ------
>  3 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index b3856b8a42b3..d0a4f08b12c1 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -211,7 +211,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
>  u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
>  bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
>  bool virtio_transport_stream_allow(u32 cid, u32 port);
> -bool virtio_transport_dgram_allow(u32 cid, u32 port);
>  void virtio_transport_dgram_addr_init(struct sk_buff *skb,
>  				      struct sockaddr_vm *addr);
>  
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index ac2126c7dac5..713718861bd4 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -63,6 +63,7 @@ struct virtio_vsock {
>  
>  	u32 guest_cid;
>  	bool seqpacket_allow;
> +	bool dgram_allow;
>  };
>  
>  static u32 virtio_transport_get_local_cid(void)
> @@ -413,6 +414,7 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
>  	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>  }
>  
> +static bool virtio_transport_dgram_allow(u32 cid, u32 port);

May be add body here? Without prototyping? Same for loopback and vhost.

>  static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport virtio_transport = {
> @@ -430,6 +432,7 @@ static struct virtio_transport virtio_transport = {
>  
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
>  		.dgram_allow              = virtio_transport_dgram_allow,
> +		.dgram_addr_init          = virtio_transport_dgram_addr_init,
>  
>  		.stream_dequeue           = virtio_transport_stream_dequeue,
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
> @@ -462,6 +465,21 @@ static struct virtio_transport virtio_transport = {
>  	.send_pkt = virtio_transport_send_pkt,
>  };
>  
> +static bool virtio_transport_dgram_allow(u32 cid, u32 port)
> +{
> +	struct virtio_vsock *vsock;
> +	bool dgram_allow;
> +
> +	dgram_allow = false;
> +	rcu_read_lock();
> +	vsock = rcu_dereference(the_virtio_vsock);
> +	if (vsock)
> +		dgram_allow = vsock->dgram_allow;
> +	rcu_read_unlock();
> +
> +	return dgram_allow;
> +}
> +
>  static bool virtio_transport_seqpacket_allow(u32 remote_cid)
>  {
>  	struct virtio_vsock *vsock;
> @@ -655,6 +673,9 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_SEQPACKET))
>  		vsock->seqpacket_allow = true;
>  
> +	if (virtio_has_feature(vdev, VIRTIO_VSOCK_F_DGRAM))
> +		vsock->dgram_allow = true;
> +
>  	vdev->priv = vsock;
>  
>  	ret = virtio_vsock_vqs_init(vsock);
> @@ -747,7 +768,8 @@ static struct virtio_device_id id_table[] = {
>  };
>  
>  static unsigned int features[] = {
> -	VIRTIO_VSOCK_F_SEQPACKET
> +	VIRTIO_VSOCK_F_SEQPACKET,
> +	VIRTIO_VSOCK_F_DGRAM
>  };
>  
>  static struct virtio_driver virtio_vsock_driver = {
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 96118e258097..77898f5325cd 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -783,12 +783,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>  
> -bool virtio_transport_dgram_allow(u32 cid, u32 port)
> -{
> -	return false;
> -}
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> -
>  int virtio_transport_connect(struct vsock_sock *vsk)
>  {
>  	struct virtio_vsock_pkt_info info = {
> 

Thanks, Arseniy

^ permalink raw reply

* Re: [PATCH RFC net-next v5 00/14] virtio/vsock: support datagrams
From: Arseniy Krasnov @ 2023-07-22  8:51 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf, Jiang Wang
In-Reply-To: <20230413-b4-vsock-dgram-v5-0-581bd37fdb26@bytedance.com>

Hello Bobby!

Thanks for this patchset! I left some comments and continue review and tests in
the next few days

Thanks, Arseniy

On 19.07.2023 03:50, Bobby Eshleman wrote:
> Hey all!
> 
> This series introduces support for datagrams to virtio/vsock.
> 
> It is a spin-off (and smaller version) of this series from the summer:
>   https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/
> 
> Please note that this is an RFC and should not be merged until
> associated changes are made to the virtio specification, which will
> follow after discussion from this series.
> 
> Another aside, the v4 of the series has only been mildly tested with a
> run of tools/testing/vsock/vsock_test. Some code likely needs cleaning
> up, but I'm hoping to get some of the design choices agreed upon before
> spending too much time making it pretty.
> 
> This series first supports datagrams in a basic form for virtio, and
> then optimizes the sendpath for all datagram transports.
> 
> The result is a very fast datagram communication protocol that
> outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
> of multi-threaded workload samples.
> 
> For those that are curious, some summary data comparing UDP and VSOCK
> DGRAM (N=5):
> 
> 	vCPUS: 16
> 	virtio-net queues: 16
> 	payload size: 4KB
> 	Setup: bare metal + vm (non-nested)
> 
> 	UDP: 287.59 MB/s
> 	VSOCK DGRAM: 509.2 MB/s
> 
> Some notes about the implementation...
> 
> This datagram implementation forces datagrams to self-throttle according
> to the threshold set by sk_sndbuf. It behaves similar to the credits
> used by streams in its effect on throughput and memory consumption, but
> it is not influenced by the receiving socket as credits are.
> 
> The device drops packets silently.
> 
> As discussed previously, this series introduces datagrams and defers
> fairness to future work. See discussion in v2 for more context around
> datagrams, fairness, and this implementation.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
> Changes in v5:
> - teach vhost to drop dgram when a datagram exceeds the receive buffer
>   - now uses MSG_ERRQUEUE and depends on Arseniy's zerocopy patch:
> 	"vsock: read from socket's error queue"
> - replace multiple ->dgram_* callbacks with single ->dgram_addr_init()
>   callback
> - refactor virtio dgram skb allocator to reduce conflicts w/ zerocopy series
> - add _fallback/_FALLBACK suffix to dgram transport variables/macros
> - add WARN_ONCE() for table_size / VSOCK_HASH issue
> - add static to vsock_find_bound_socket_common
> - dedupe code in vsock_dgram_sendmsg() using module_got var
> - drop concurrent sendmsg() for dgram and defer to future series
> - Add more tests
>   - test EHOSTUNREACH in errqueue
>   - test stream + dgram address collision
> - improve clarity of dgram msg bounds test code
> - Link to v4: https://lore.kernel.org/r/20230413-b4-vsock-dgram-v4-0-0cebbb2ae899@bytedance.com
> 
> Changes in v4:
> - style changes
>   - vsock: use sk_vsock(vsk) in vsock_dgram_recvmsg instead of
>     &sk->vsk
>   - vsock: fix xmas tree declaration
>   - vsock: fix spacing issues
>   - virtio/vsock: virtio_transport_recv_dgram returns void because err
>     unused
> - sparse analysis warnings/errors
>   - virtio/vsock: fix unitialized skerr on destroy
>   - virtio/vsock: fix uninitialized err var on goto out
>   - vsock: fix declarations that need static
>   - vsock: fix __rcu annotation order
> - bugs
>   - vsock: fix null ptr in remote_info code
>   - vsock/dgram: make transport_dgram a fallback instead of first
>     priority
>   - vsock: remove redundant rcu read lock acquire in getname()
> - tests
>   - add more tests (message bounds and more)
>   - add vsock_dgram_bind() helper
>   - add vsock_dgram_connect() helper
> 
> Changes in v3:
> - Support multi-transport dgram, changing logic in connect/bind
>   to support VMCI case
> - Support per-pkt transport lookup for sendto() case
> - Fix dgram_allow() implementation
> - Fix dgram feature bit number (now it is 3)
> - Fix binding so dgram and connectible (cid,port) spaces are
>   non-overlapping
> - RCU protect transport ptr so connect() calls never leave
>   a lockless read of the transport and remote_addr are always
>   in sync
> - Link to v2: https://lore.kernel.org/r/20230413-b4-vsock-dgram-v2-0-079cc7cee62e@bytedance.com
> 
> ---
> Bobby Eshleman (13):
>       af_vsock: generalize vsock_dgram_recvmsg() to all transports
>       af_vsock: refactor transport lookup code
>       af_vsock: support multi-transport datagrams
>       af_vsock: generalize bind table functions
>       af_vsock: use a separate dgram bind table
>       virtio/vsock: add VIRTIO_VSOCK_TYPE_DGRAM
>       virtio/vsock: add common datagram send path
>       af_vsock: add vsock_find_bound_dgram_socket()
>       virtio/vsock: add common datagram recv path
>       virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
>       vhost/vsock: implement datagram support
>       vsock/loopback: implement datagram support
>       virtio/vsock: implement datagram support
> 
> Jiang Wang (1):
>       test/vsock: add vsock dgram tests
> 
>  drivers/vhost/vsock.c                   |  64 ++-
>  include/linux/virtio_vsock.h            |  10 +-
>  include/net/af_vsock.h                  |  14 +-
>  include/uapi/linux/virtio_vsock.h       |   2 +
>  net/vmw_vsock/af_vsock.c                | 281 ++++++++++---
>  net/vmw_vsock/hyperv_transport.c        |  13 -
>  net/vmw_vsock/virtio_transport.c        |  26 +-
>  net/vmw_vsock/virtio_transport_common.c | 190 +++++++--
>  net/vmw_vsock/vmci_transport.c          |  60 +--
>  net/vmw_vsock/vsock_loopback.c          |  10 +-
>  tools/testing/vsock/util.c              | 141 ++++++-
>  tools/testing/vsock/util.h              |   6 +
>  tools/testing/vsock/vsock_test.c        | 680 ++++++++++++++++++++++++++++++++
>  13 files changed, 1320 insertions(+), 177 deletions(-)
> ---
> base-commit: 37cadc266ebdc7e3531111c2b3304fa01b2131e8
> change-id: 20230413-b4-vsock-dgram-3b6eba6a64e5
> 
> Best regards,

^ permalink raw reply

* Re: [PATCH RFC net-next v5 03/14] af_vsock: support multi-transport datagrams
From: Arseniy Krasnov @ 2023-07-22 21:53 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers
  Cc: Dan Carpenter, Simon Horman, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, bpf
In-Reply-To: <20230413-b4-vsock-dgram-v5-3-581bd37fdb26@bytedance.com>



On 19.07.2023 03:50, Bobby Eshleman wrote:
> This patch adds support for multi-transport datagrams.
> 
> This includes:
> - Per-packet lookup of transports when using sendto(sockaddr_vm)
> - Selecting H2G or G2H transport using VMADDR_FLAG_TO_HOST and CID in
>   sockaddr_vm
> - rename VSOCK_TRANSPORT_F_DGRAM to VSOCK_TRANSPORT_F_DGRAM_FALLBACK
> - connect() now assigns the transport for (similar to connectible
>   sockets)
> 
> To preserve backwards compatibility with VMCI, some important changes
> are made. The "transport_dgram" / VSOCK_TRANSPORT_F_DGRAM is changed to
> be used for dgrams only if there is not yet a g2h or h2g transport that
> has been registered that can transmit the packet. If there is a g2h/h2g
> transport for that remote address, then that transport will be used and
> not "transport_dgram". This essentially makes "transport_dgram" a
> fallback transport for when h2g/g2h has not yet gone online, and so it
> is renamed "transport_dgram_fallback". VMCI implements this transport.
> 
> The logic around "transport_dgram" needs to be retained to prevent
> breaking VMCI:
> 
> 1) VMCI datagrams existed prior to h2g/g2h and so operate under a
>    different paradigm. When the vmci transport comes online, it registers
>    itself with the DGRAM feature, but not H2G/G2H. Only later when the
>    transport has more information about its environment does it register
>    H2G or G2H.  In the case that a datagram socket is created after
>    VSOCK_TRANSPORT_F_DGRAM registration but before G2H/H2G registration,
>    the "transport_dgram" transport is the only registered transport and so
>    needs to be used.
> 
> 2) VMCI seems to require a special message be sent by the transport when a
>    datagram socket calls bind(). Under the h2g/g2h model, the transport
>    is selected using the remote_addr which is set by connect(). At
>    bind time there is no remote_addr because often no connect() has been
>    called yet: the transport is null. Therefore, with a null transport
>    there doesn't seem to be any good way for a datagram socket to tell the
>    VMCI transport that it has just had bind() called upon it.
> 
> With the new fallback logic, after H2G/G2H comes online the socket layer
> will access the VMCI transport via transport_{h2g,g2h}. Prior to H2G/G2H
> coming online, the socket layer will access the VMCI transport via
> "transport_dgram_fallback".
> 
> Only transports with a special datagram fallback use-case such as VMCI
> need to register VSOCK_TRANSPORT_F_DGRAM_FALLBACK.
> 
> Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> ---
>  drivers/vhost/vsock.c                   |  1 -
>  include/linux/virtio_vsock.h            |  2 --
>  include/net/af_vsock.h                  | 10 +++---
>  net/vmw_vsock/af_vsock.c                | 64 ++++++++++++++++++++++++++-------
>  net/vmw_vsock/hyperv_transport.c        |  6 ----
>  net/vmw_vsock/virtio_transport.c        |  1 -
>  net/vmw_vsock/virtio_transport_common.c |  7 ----
>  net/vmw_vsock/vmci_transport.c          |  2 +-
>  net/vmw_vsock/vsock_loopback.c          |  1 -
>  9 files changed, 58 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index ae8891598a48..d5d6a3c3f273 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -410,7 +410,6 @@ static struct virtio_transport vhost_transport = {
>  		.cancel_pkt               = vhost_transport_cancel_pkt,
>  
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
> -		.dgram_bind               = virtio_transport_dgram_bind,
>  		.dgram_allow              = virtio_transport_dgram_allow,
>  
>  		.stream_enqueue           = virtio_transport_stream_enqueue,
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 18cbe8d37fca..7632552bee58 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -211,8 +211,6 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
>  u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
>  bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
>  bool virtio_transport_stream_allow(u32 cid, u32 port);
> -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> -				struct sockaddr_vm *addr);
>  bool virtio_transport_dgram_allow(u32 cid, u32 port);
>  
>  int virtio_transport_connect(struct vsock_sock *vsk);
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 305d57502e89..f6a0ca9d7c3e 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -96,13 +96,13 @@ struct vsock_transport_send_notify_data {
>  
>  /* Transport features flags */
>  /* Transport provides host->guest communication */
> -#define VSOCK_TRANSPORT_F_H2G		0x00000001
> +#define VSOCK_TRANSPORT_F_H2G			0x00000001
>  /* Transport provides guest->host communication */
> -#define VSOCK_TRANSPORT_F_G2H		0x00000002
> -/* Transport provides DGRAM communication */
> -#define VSOCK_TRANSPORT_F_DGRAM		0x00000004
> +#define VSOCK_TRANSPORT_F_G2H			0x00000002
> +/* Transport provides fallback for DGRAM communication */
> +#define VSOCK_TRANSPORT_F_DGRAM_FALLBACK	0x00000004
>  /* Transport provides local (loopback) communication */
> -#define VSOCK_TRANSPORT_F_LOCAL		0x00000008
> +#define VSOCK_TRANSPORT_F_LOCAL			0x00000008
>  
>  struct vsock_transport {
>  	struct module *module;
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index ae5ac5531d96..26c97b33d55a 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -139,8 +139,8 @@ struct proto vsock_proto = {
>  static const struct vsock_transport *transport_h2g;
>  /* Transport used for guest->host communication */
>  static const struct vsock_transport *transport_g2h;
> -/* Transport used for DGRAM communication */
> -static const struct vsock_transport *transport_dgram;
> +/* Transport used as a fallback for DGRAM communication */
> +static const struct vsock_transport *transport_dgram_fallback;
>  /* Transport used for local communication */
>  static const struct vsock_transport *transport_local;
>  static DEFINE_MUTEX(vsock_register_mutex);
> @@ -439,6 +439,18 @@ vsock_connectible_lookup_transport(unsigned int cid, __u8 flags)
>  	return transport;
>  }
>  
> +static const struct vsock_transport *
> +vsock_dgram_lookup_transport(unsigned int cid, __u8 flags)
> +{
> +	const struct vsock_transport *transport;
> +
> +	transport = vsock_connectible_lookup_transport(cid, flags);
> +	if (transport)
> +		return transport;
> +
> +	return transport_dgram_fallback;
> +}
> +
>  /* Assign a transport to a socket and call the .init transport callback.
>   *
>   * Note: for connection oriented socket this must be called when vsk->remote_addr
> @@ -475,7 +487,8 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
>  
>  	switch (sk->sk_type) {
>  	case SOCK_DGRAM:
> -		new_transport = transport_dgram;
> +		new_transport = vsock_dgram_lookup_transport(remote_cid,
> +							     remote_flags);

I'm a little bit confused about this: 
1) Let's create SOCK_DGRAM socket using vsock_create()
2) for SOCK_DGRAM it calls 'vsock_assign_transport()' and we go here, remote_cid == -1
3) I guess 'vsock_dgram_lookup_transport()' calls logic from 0002 and returns h2g for such remote cid, which is not
   correct I think...

Please correct me if i'm wrong

Thanks, Arseniy

>  		break;
>  	case SOCK_STREAM:
>  	case SOCK_SEQPACKET:
> @@ -692,6 +705,9 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
>  static int __vsock_bind_dgram(struct vsock_sock *vsk,
>  			      struct sockaddr_vm *addr)
>  {
> +	if (!vsk->transport || !vsk->transport->dgram_bind)
> +		return -EINVAL;
> +
>  	return vsk->transport->dgram_bind(vsk, addr);
>  }
>  
> @@ -1162,6 +1178,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>  	struct vsock_sock *vsk;
>  	struct sockaddr_vm *remote_addr;
>  	const struct vsock_transport *transport;
> +	bool module_got = false;
>  
>  	if (msg->msg_flags & MSG_OOB)
>  		return -EOPNOTSUPP;
> @@ -1173,19 +1190,34 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>  
>  	lock_sock(sk);
>  
> -	transport = vsk->transport;
> -
>  	err = vsock_auto_bind(vsk);
>  	if (err)
>  		goto out;
>  
> -
>  	/* If the provided message contains an address, use that.  Otherwise
>  	 * fall back on the socket's remote handle (if it has been connected).
>  	 */
>  	if (msg->msg_name &&
>  	    vsock_addr_cast(msg->msg_name, msg->msg_namelen,
>  			    &remote_addr) == 0) {
> +		transport = vsock_dgram_lookup_transport(remote_addr->svm_cid,
> +							 remote_addr->svm_flags);
> +		if (!transport) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (!try_module_get(transport->module)) {
> +			err = -ENODEV;
> +			goto out;
> +		}
> +
> +		/* When looking up a transport dynamically and acquiring a
> +		 * reference on the module, we need to remember to release the
> +		 * reference later.
> +		 */
> +		module_got = true;
> +
>  		/* Ensure this address is of the right type and is a valid
>  		 * destination.
>  		 */
> @@ -1200,6 +1232,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>  	} else if (sock->state == SS_CONNECTED) {
>  		remote_addr = &vsk->remote_addr;
>  
> +		transport = vsk->transport;
>  		if (remote_addr->svm_cid == VMADDR_CID_ANY)
>  			remote_addr->svm_cid = transport->get_local_cid();
>  
> @@ -1224,6 +1257,8 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
>  	err = transport->dgram_enqueue(vsk, remote_addr, msg, len);
>  
>  out:
> +	if (module_got)
> +		module_put(transport->module);
>  	release_sock(sk);
>  	return err;
>  }
> @@ -1256,13 +1291,18 @@ static int vsock_dgram_connect(struct socket *sock,
>  	if (err)
>  		goto out;
>  
> +	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
> +
> +	err = vsock_assign_transport(vsk, NULL);
> +	if (err)
> +		goto out;
> +
>  	if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
>  					 remote_addr->svm_port)) {
>  		err = -EINVAL;
>  		goto out;
>  	}
>  
> -	memcpy(&vsk->remote_addr, remote_addr, sizeof(vsk->remote_addr));
>  	sock->state = SS_CONNECTED;
>  
>  	/* sock map disallows redirection of non-TCP sockets with sk_state !=
> @@ -2487,7 +2527,7 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>  
>  	t_h2g = transport_h2g;
>  	t_g2h = transport_g2h;
> -	t_dgram = transport_dgram;
> +	t_dgram = transport_dgram_fallback;
>  	t_local = transport_local;
>  
>  	if (features & VSOCK_TRANSPORT_F_H2G) {
> @@ -2506,7 +2546,7 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>  		t_g2h = t;
>  	}
>  
> -	if (features & VSOCK_TRANSPORT_F_DGRAM) {
> +	if (features & VSOCK_TRANSPORT_F_DGRAM_FALLBACK) {
>  		if (t_dgram) {
>  			err = -EBUSY;
>  			goto err_busy;
> @@ -2524,7 +2564,7 @@ int vsock_core_register(const struct vsock_transport *t, int features)
>  
>  	transport_h2g = t_h2g;
>  	transport_g2h = t_g2h;
> -	transport_dgram = t_dgram;
> +	transport_dgram_fallback = t_dgram;
>  	transport_local = t_local;
>  
>  err_busy:
> @@ -2543,8 +2583,8 @@ void vsock_core_unregister(const struct vsock_transport *t)
>  	if (transport_g2h == t)
>  		transport_g2h = NULL;
>  
> -	if (transport_dgram == t)
> -		transport_dgram = NULL;
> +	if (transport_dgram_fallback == t)
> +		transport_dgram_fallback = NULL;
>  
>  	if (transport_local == t)
>  		transport_local = NULL;
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index 7f1ea434656d..c29000f2612a 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -551,11 +551,6 @@ static void hvs_destruct(struct vsock_sock *vsk)
>  	kfree(hvs);
>  }
>  
> -static int hvs_dgram_bind(struct vsock_sock *vsk, struct sockaddr_vm *addr)
> -{
> -	return -EOPNOTSUPP;
> -}
> -
>  static int hvs_dgram_enqueue(struct vsock_sock *vsk,
>  			     struct sockaddr_vm *remote, struct msghdr *msg,
>  			     size_t dgram_len)
> @@ -826,7 +821,6 @@ static struct vsock_transport hvs_transport = {
>  	.connect                  = hvs_connect,
>  	.shutdown                 = hvs_shutdown,
>  
> -	.dgram_bind               = hvs_dgram_bind,
>  	.dgram_enqueue            = hvs_dgram_enqueue,
>  	.dgram_allow              = hvs_dgram_allow,
>  
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index 66edffdbf303..ac2126c7dac5 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -428,7 +428,6 @@ static struct virtio_transport virtio_transport = {
>  		.shutdown                 = virtio_transport_shutdown,
>  		.cancel_pkt               = virtio_transport_cancel_pkt,
>  
> -		.dgram_bind               = virtio_transport_dgram_bind,
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
>  		.dgram_allow              = virtio_transport_dgram_allow,
>  
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 01ea1402ad40..ffcbdd77feaa 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -781,13 +781,6 @@ bool virtio_transport_stream_allow(u32 cid, u32 port)
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
>  
> -int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> -				struct sockaddr_vm *addr)
> -{
> -	return -EOPNOTSUPP;
> -}
> -EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> -
>  bool virtio_transport_dgram_allow(u32 cid, u32 port)
>  {
>  	return false;
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 0bbbdb222245..857b0461f856 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -2072,7 +2072,7 @@ static int __init vmci_transport_init(void)
>  	/* Register only with dgram feature, other features (H2G, G2H) will be
>  	 * registered when the first host or guest becomes active.
>  	 */
> -	err = vsock_core_register(&vmci_transport, VSOCK_TRANSPORT_F_DGRAM);
> +	err = vsock_core_register(&vmci_transport, VSOCK_TRANSPORT_F_DGRAM_FALLBACK);
>  	if (err < 0)
>  		goto err_unsubscribe;
>  
> diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
> index 2a59dd177c74..278235ea06c4 100644
> --- a/net/vmw_vsock/vsock_loopback.c
> +++ b/net/vmw_vsock/vsock_loopback.c
> @@ -61,7 +61,6 @@ static struct virtio_transport loopback_transport = {
>  		.shutdown                 = virtio_transport_shutdown,
>  		.cancel_pkt               = vsock_loopback_cancel_pkt,
>  
> -		.dgram_bind               = virtio_transport_dgram_bind,
>  		.dgram_enqueue            = virtio_transport_dgram_enqueue,
>  		.dgram_allow              = virtio_transport_dgram_allow,
>  
> 

^ permalink raw reply

* Re: [PATCH 1/1] scsi: storvsc: Limit max_sectors for virtual Fibre Channel devices
From: Martin K. Petersen @ 2023-07-23 20:42 UTC (permalink / raw)
  To: kys, longli, wei.liu, decui, jejb, linux-hyperv, linux-kernel,
	linux-scsi, Michael Kelley
  Cc: Martin K . Petersen, stable
In-Reply-To: <1689887102-32806-1-git-send-email-mikelley@microsoft.com>

On Thu, 20 Jul 2023 14:05:02 -0700, Michael Kelley wrote:

> The Hyper-V host is queried to get the max transfer size that it
> supports, and this value is used to set max_sectors for the synthetic
> SCSI controller.  However, this max transfer size may be too large
> for virtual Fibre Channel devices, which are limited to 512 Kbytes.
> If a larger transfer size is used with a vFC device, Hyper-V always
> returns an error, and storvsc logs a message like this where the SRB
> status and SCSI status are both zero:
> 
> [...]

Applied to 6.5/scsi-fixes, thanks!

[1/1] scsi: storvsc: Limit max_sectors for virtual Fibre Channel devices
      https://git.kernel.org/mkp/scsi/c/010c1e1c5741

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH v2 1/1] x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction
From: Wei Liu @ 2023-07-23 23:08 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, hpa,
	peterz, x86, linux-kernel, linux-hyperv, stable
In-Reply-To: <1690001476-98594-1-git-send-email-mikelley@microsoft.com>

On Fri, Jul 21, 2023 at 09:51:16PM -0700, Michael Kelley wrote:
> On hardware that supports Indirect Branch Tracking (IBT), Hyper-V VMs
> with ConfigVersion 9.3 or later support IBT in the guest. However,
> current versions of Hyper-V have a bug in that there's not an ENDBR64
> instruction at the beginning of the hypercall page. Since hypercalls are
> made with an indirect call to the hypercall page, all hypercall attempts
> fail with an exception and Linux panics.
> 
> A Hyper-V fix is in progress to add ENDBR64. But guard against the Linux
> panic by clearing X86_FEATURE_IBT if the hypercall page doesn't start
> with ENDBR. The VM will boot and run without IBT.
> 
> If future Linux 32-bit kernels were to support IBT, additional hypercall
> page hackery would be needed to make IBT work for such kernels in a
> Hyper-V VM.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>

Applied to hyperv-fixes. Thanks.

^ permalink raw reply

* Re: [PATCH] clocksource: Fix warnings in mshyperv.h
From: Wei Liu @ 2023-07-23 23:15 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: huzhi001@208suo.com, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, tglx@linutronix.de,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	hpa@zytor.com, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BYAPR21MB16885FB2E77FC9ADB1C8B48DD736A@BYAPR21MB1688.namprd21.prod.outlook.com>

On Wed, Jul 12, 2023 at 09:06:30PM +0000, Michael Kelley (LINUX) wrote:
> From: huzhi001@208suo.com <huzhi001@208suo.com> Sent: Wednesday, July 12, 2023 4:23 AM
> > 
> > The following checkpatch warnings are removed:
> > WARNING: Use #include <linux/io.h> instead of <asm/io.h>
> 
> The "Subject:" of the patch should probably start with "x86/hyperv",
> not "clocksource".  Usually I look back at the commit history of a
> particular file and try to be consistent with the Subject: prefix that
> has been used in the past.  "x86/hyperv" is typical for this
> include file.
> 
> Other than that,
> 
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>

There is only one warning fixed.

I changed the subject line and commit message.

Applied to hyperv-fixes. Thanks.

> > 
> > Signed-off-by: ZhiHu <huzhi001@208suo.com>
> > ---
> >   arch/x86/include/asm/mshyperv.h | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/mshyperv.h
> > b/arch/x86/include/asm/mshyperv.h
> > index 88d9ef98e087..fa83d88e4c99 100644
> > --- a/arch/x86/include/asm/mshyperv.h
> > +++ b/arch/x86/include/asm/mshyperv.h
> > @@ -5,7 +5,7 @@
> >   #include <linux/types.h>
> >   #include <linux/nmi.h>
> >   #include <linux/msi.h>
> > -#include <asm/io.h>
> > +#include <linux/io.h>
> >   #include <asm/hyperv-tlfs.h>
> >   #include <asm/nospec-branch.h>
> >   #include <asm/paravirt.h>

^ permalink raw reply

* Re: [PATCH] vmbus_testing: fix wrong python syntax for integer value comparison
From: Wei Liu @ 2023-07-23 23:19 UTC (permalink / raw)
  To: Ani Sinha
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	linux-hyperv, linux-kernel
In-Reply-To: <761F02E9-7A80-486B-8CB4-B5E067D7F587@redhat.com>

On Mon, Jul 17, 2023 at 01:13:25PM +0530, Ani Sinha wrote:
> 
> 
> > On 05-Jul-2023, at 7:14 PM, Ani Sinha <anisinha@redhat.com> wrote:
> > 
> > It is incorrect in python to compare integer values using the "is" keyword.
> > The "is" keyword in python is used to compare references to two objects,
> > not their values. Newer version of python3 (version 3.8) throws a warning
> > when such incorrect comparison is made. For value comparison, "==" should
> > be used.
> > 
> > Fix this in the code and suppress the following warning:
> > 
> > /usr/sbin/vmbus_testing:167: SyntaxWarning: "is" with a literal. Did you mean "=="?
> 
> Ping …
> 

Applied to hyperv-fixes. Thanks.

^ permalink raw reply

* [PATCH AUTOSEL 6.4 20/58] RDMA/mana_ib: Use v2 version of cfg_rx_steer_req to enable RX coalescing
From: Sasha Levin @ 2023-07-24  1:12 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Long Li, Jason Gunthorpe, Sasha Levin, sharmaajay, kys, haiyangz,
	wei.liu, decui, davem, edumazet, kuba, pabeni, tglx, alardam,
	shradhagupta, linux-rdma, linux-hyperv, netdev
In-Reply-To: <20230724011338.2298062-1-sashal@kernel.org>

From: Long Li <longli@microsoft.com>

[ Upstream commit 2145328515c8fa9b8a9f7889250bc6c032f2a0e6 ]

With RX coalescing, one CQE entry can be used to indicate multiple packets
on the receive queue. This saves processing time and PCI bandwidth over
the CQ.

The MANA Ethernet driver also uses the v2 version of the protocol. It
doesn't use RX coalescing and its behavior is not changed.

Link: https://lore.kernel.org/r/1684045095-31228-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/infiniband/hw/mana/qp.c               | 5 ++++-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 5 ++++-
 include/net/mana/mana.h                       | 4 +++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 54b61930a7fdb..4b3b5b274e849 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -13,7 +13,7 @@ static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
 				      u8 *rx_hash_key)
 {
 	struct mana_port_context *mpc = netdev_priv(ndev);
-	struct mana_cfg_rx_steer_req *req = NULL;
+	struct mana_cfg_rx_steer_req_v2 *req;
 	struct mana_cfg_rx_steer_resp resp = {};
 	mana_handle_t *req_indir_tab;
 	struct gdma_context *gc;
@@ -33,6 +33,8 @@ static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
 	mana_gd_init_req_hdr(&req->hdr, MANA_CONFIG_VPORT_RX, req_buf_size,
 			     sizeof(resp));
 
+	req->hdr.req.msg_version = GDMA_MESSAGE_V2;
+
 	req->vport = mpc->port_handle;
 	req->rx_enable = 1;
 	req->update_default_rxobj = 1;
@@ -46,6 +48,7 @@ static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
 	req->num_indir_entries = MANA_INDIRECT_TABLE_SIZE;
 	req->indir_tab_offset = sizeof(*req);
 	req->update_indir_tab = true;
+	req->cqe_coalescing_enable = 1;
 
 	req_indir_tab = (mana_handle_t *)(req + 1);
 	/* The ind table passed to the hardware must have
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a5..7f4e861e398e6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -972,7 +972,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 				   bool update_tab)
 {
 	u16 num_entries = MANA_INDIRECT_TABLE_SIZE;
-	struct mana_cfg_rx_steer_req *req = NULL;
+	struct mana_cfg_rx_steer_req_v2 *req;
 	struct mana_cfg_rx_steer_resp resp = {};
 	struct net_device *ndev = apc->ndev;
 	mana_handle_t *req_indir_tab;
@@ -987,6 +987,8 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	mana_gd_init_req_hdr(&req->hdr, MANA_CONFIG_VPORT_RX, req_buf_size,
 			     sizeof(resp));
 
+	req->hdr.req.msg_version = GDMA_MESSAGE_V2;
+
 	req->vport = apc->port_handle;
 	req->num_indir_entries = num_entries;
 	req->indir_tab_offset = sizeof(*req);
@@ -996,6 +998,7 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	req->update_hashkey = update_key;
 	req->update_indir_tab = update_tab;
 	req->default_rxobj = apc->default_rxobj;
+	req->cqe_coalescing_enable = 0;
 
 	if (update_key)
 		memcpy(&req->hashkey, apc->hashkey, MANA_HASH_KEY_SIZE);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 9eef199728454..024ad8ddb27e5 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -579,7 +579,7 @@ struct mana_fence_rq_resp {
 }; /* HW DATA */
 
 /* Configure vPort Rx Steering */
-struct mana_cfg_rx_steer_req {
+struct mana_cfg_rx_steer_req_v2 {
 	struct gdma_req_hdr hdr;
 	mana_handle_t vport;
 	u16 num_indir_entries;
@@ -592,6 +592,8 @@ struct mana_cfg_rx_steer_req {
 	u8 reserved;
 	mana_handle_t default_rxobj;
 	u8 hashkey[MANA_HASH_KEY_SIZE];
+	u8 cqe_coalescing_enable;
+	u8 reserved2[7];
 }; /* HW DATA */
 
 struct mana_cfg_rx_steer_resp {
-- 
2.39.2


^ permalink raw reply related

* [PATCH V5 net-next] net: mana: Configure hwc timeout from hardware
From: Souradeep Chakrabarti @ 2023-07-24  5:38 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
	linux-hyperv, netdev, linux-kernel, linux-rdma
  Cc: schakrabarti, Souradeep Chakrabarti

At present hwc timeout value is a fixed value. This patch sets the hwc
timeout from the hardware. It now uses a new hardware capability
GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG to query and set the value
in hwc_timeout.

Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V4 -> V5:
* Replaced dev_err in mana_hwc_send_request with dev_dbg.
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 30 ++++++++++++++++++-
 .../net/ethernet/microsoft/mana/hw_channel.c  | 25 +++++++++++++++-
 include/net/mana/gdma.h                       | 20 ++++++++++++-
 include/net/mana/hw_channel.h                 |  5 ++++
 4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..4537a70e30d4 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -106,6 +106,25 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	return 0;
 }
 
+static int mana_gd_query_hwc_timeout(struct pci_dev *pdev, u32 *timeout_val)
+{
+	struct gdma_context *gc = pci_get_drvdata(pdev);
+	struct gdma_query_hwc_timeout_resp resp = {};
+	struct gdma_query_hwc_timeout_req req = {};
+	int err;
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_HWC_TIMEOUT,
+			     sizeof(req), sizeof(resp));
+	req.timeout_ms = *timeout_val;
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+	if (err || resp.hdr.status)
+		return err ? err : -EPROTO;
+
+	*timeout_val = resp.timeout_ms;
+
+	return 0;
+}
+
 static int mana_gd_detect_devices(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -879,8 +898,11 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_verify_ver_resp resp = {};
 	struct gdma_verify_ver_req req = {};
+	struct hw_channel_context *hwc;
 	int err;
 
+	hwc = gc->hwc.driver_data;
+
 	mana_gd_init_req_hdr(&req.hdr, GDMA_VERIFY_VF_DRIVER_VERSION,
 			     sizeof(req), sizeof(resp));
 
@@ -907,7 +929,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 			err, resp.hdr.status);
 		return err ? err : -EPROTO;
 	}
-
+	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
+		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
+		if (err) {
+			dev_err(gc->dev, "Failed to set the hwc timeout %d\n", err);
+			return err;
+		}
+	}
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 2bd1d74021f7..35b5371d89bb 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -174,7 +174,25 @@ static void mana_hwc_init_event_handler(void *ctx, struct gdma_queue *q_self,
 		complete(&hwc->hwc_init_eqe_comp);
 		break;
 
+	case GDMA_EQE_HWC_SOC_RECONFIG_DATA:
+		type_data.as_uint32 = event->details[0];
+		type = type_data.type;
+		val = type_data.value;
+
+		switch (type) {
+		case HWC_DATA_CFG_HWC_TIMEOUT:
+			hwc->hwc_timeout = val;
+			break;
+
+		default:
+			dev_warn(hwc->dev, "Received unknown reconfig type %u\n", type);
+			break;
+		}
+
+		break;
+
 	default:
+		dev_warn(hwc->dev, "Received unknown gdma event %u\n", event->type);
 		/* Ignore unknown events, which should never happen. */
 		break;
 	}
@@ -704,6 +722,7 @@ int mana_hwc_create_channel(struct gdma_context *gc)
 	gd->pdid = INVALID_PDID;
 	gd->doorbell = INVALID_DOORBELL;
 
+	hwc->hwc_timeout = HW_CHANNEL_WAIT_RESOURCE_TIMEOUT_MS;
 	/* mana_hwc_init_queues() only creates the required data structures,
 	 * and doesn't touch the HWC device.
 	 */
@@ -770,6 +789,8 @@ void mana_hwc_destroy_channel(struct gdma_context *gc)
 	hwc->gdma_dev->doorbell = INVALID_DOORBELL;
 	hwc->gdma_dev->pdid = INVALID_PDID;
 
+	hwc->hwc_timeout = 0;
+
 	kfree(hwc);
 	gc->hwc.driver_data = NULL;
 	gc->hwc.gdma_context = NULL;
@@ -818,6 +839,7 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 		dest_vrq = hwc->pf_dest_vrq_id;
 		dest_vrcq = hwc->pf_dest_vrcq_id;
 	}
+	dev_dbg(hwc->dev, "HWC: timeout %u ms\n", hwc->hwc_timeout);
 
 	err = mana_hwc_post_tx_wqe(txq, tx_wr, dest_vrq, dest_vrcq, false);
 	if (err) {
@@ -825,7 +847,8 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 		goto out;
 	}
 
-	if (!wait_for_completion_timeout(&ctx->comp_event, 30 * HZ)) {
+	if (!wait_for_completion_timeout(&ctx->comp_event,
+					 (hwc->hwc_timeout / 1000) * HZ)) {
 		dev_err(hwc->dev, "HWC: Request timed out!\n");
 		err = -ETIMEDOUT;
 		goto out;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 96c120160f15..88b6ef7ce1a6 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -33,6 +33,7 @@ enum gdma_request_type {
 	GDMA_DESTROY_PD			= 30,
 	GDMA_CREATE_MR			= 31,
 	GDMA_DESTROY_MR			= 32,
+	GDMA_QUERY_HWC_TIMEOUT		= 84, /* 0x54 */
 };
 
 #define GDMA_RESOURCE_DOORBELL_PAGE	27
@@ -57,6 +58,8 @@ enum gdma_eqe_type {
 	GDMA_EQE_HWC_INIT_EQ_ID_DB	= 129,
 	GDMA_EQE_HWC_INIT_DATA		= 130,
 	GDMA_EQE_HWC_INIT_DONE		= 131,
+	GDMA_EQE_HWC_SOC_RECONFIG	= 132,
+	GDMA_EQE_HWC_SOC_RECONFIG_DATA	= 133,
 };
 
 enum {
@@ -531,10 +534,12 @@ enum {
  * so the driver is able to reliably support features like busy_poll.
  */
 #define GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX BIT(2)
+#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
 
 #define GDMA_DRV_CAP_FLAGS1 \
 	(GDMA_DRV_CAP_FLAG_1_EQ_SHARING_MULTI_VPORT | \
-	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX)
+	 GDMA_DRV_CAP_FLAG_1_NAPI_WKDONE_FIX | \
+	 GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG)
 
 #define GDMA_DRV_CAP_FLAGS2 0
 
@@ -664,6 +669,19 @@ struct gdma_disable_queue_req {
 	u32 alloc_res_id_on_creation;
 }; /* HW DATA */
 
+/* GDMA_QUERY_HWC_TIMEOUT */
+struct gdma_query_hwc_timeout_req {
+	struct gdma_req_hdr hdr;
+	u32 timeout_ms;
+	u32 reserved;
+};
+
+struct gdma_query_hwc_timeout_resp {
+	struct gdma_resp_hdr hdr;
+	u32 timeout_ms;
+	u32 reserved;
+};
+
 enum atb_page_size {
 	ATB_PAGE_SIZE_4K,
 	ATB_PAGE_SIZE_8K,
diff --git a/include/net/mana/hw_channel.h b/include/net/mana/hw_channel.h
index 6a757a6e2732..3d3b5c881bc1 100644
--- a/include/net/mana/hw_channel.h
+++ b/include/net/mana/hw_channel.h
@@ -23,6 +23,10 @@
 #define HWC_INIT_DATA_PF_DEST_RQ_ID	10
 #define HWC_INIT_DATA_PF_DEST_CQ_ID	11
 
+#define HWC_DATA_CFG_HWC_TIMEOUT 1
+
+#define HW_CHANNEL_WAIT_RESOURCE_TIMEOUT_MS 30000
+
 /* Structures labeled with "HW DATA" are exchanged with the hardware. All of
  * them are naturally aligned and hence don't need __packed.
  */
@@ -182,6 +186,7 @@ struct hw_channel_context {
 
 	u32 pf_dest_vrq_id;
 	u32 pf_dest_vrcq_id;
+	u32 hwc_timeout;
 
 	struct hwc_caller_ctx *caller_ctx;
 };
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH V3,net-next] net: mana: Add page pool for RX buffers
From: Jesper Dangaard Brouer @ 2023-07-24 11:29 UTC (permalink / raw)
  To: Haiyang Zhang, linux-hyperv, netdev
  Cc: brouer, decui, kys, paulros, olaf, vkuznets, davem, wei.liu,
	edumazet, kuba, pabeni, leon, longli, ssengar, linux-rdma, daniel,
	john.fastabend, bpf, ast, sharmaajay, hawk, tglx, shradhagupta,
	linux-kernel, Ilias Apalodimas, Jesper Dangaard Brouer
In-Reply-To: <1689966321-17337-1-git-send-email-haiyangz@microsoft.com>



On 21/07/2023 21.05, Haiyang Zhang wrote:
> Add page pool for RX buffers for faster buffer cycle and reduce CPU
> usage.
> 
> The standard page pool API is used.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> V3:
> Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
> V2:
> Use the standard page pool API as suggested by Jesper Dangaard Brouer
> 
> ---
>   drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
>   include/net/mana/mana.h                       |  3 +
>   2 files changed, 78 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index a499e460594b..4307f25f8c7a 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[...]
> @@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
>   
>   	if (rxq->xdp_flush)
>   		xdp_do_flush();
> +
> +	page_pool_nid_changed(rxq->page_pool, numa_mem_id());

I don't think this page_pool_nid_changed() called is needed, if you do
as I suggest below (nid = NUMA_NO_NODE).


>   }
>   
>   static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
[...]

> @@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq *rxq)
>   	return 0;
>   }
>   
> +static int mana_create_page_pool(struct mana_rxq *rxq)
> +{
> +	struct page_pool_params pprm = {};

You are implicitly assigning NUMA node id zero.

> +	int ret;
> +
> +	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
> +	pprm.napi = &rxq->rx_cq.napi;

You likely want to assign pprm.nid to NUMA_NO_NODE

  pprm.nid = NUMA_NO_NODE;

For most drivers it is recommended to assign ``NUMA_NO_NODE`` (value -1)
as the NUMA ID to ``pp_params.nid``. When ``CONFIG_NUMA`` is enabled
this setting will automatically select the (preferred) NUMA node (via
``numa_mem_id()``) based on where NAPI RX-processing is currently
running. The effect is that page_pool will only use recycled memory when
NUMA node match running CPU. This assumes CPU refilling driver RX-ring
will also run RX-NAPI.

If a driver want more control over the NUMA node memory selection,
drivers can assign (``pp_params.nid``) something else than
`NUMA_NO_NODE`` and runtime adjust via function ``page_pool_nid_changed()``.

I will update [1] with this info.
  - Docs [1] 
https://kernel.org/doc/html/latest/networking/page_pool.html#registration


> +
> +	rxq->page_pool = page_pool_create(&pprm);
> +
> +	if (IS_ERR(rxq->page_pool)) {
> +		ret = PTR_ERR(rxq->page_pool);
> +		rxq->page_pool = NULL;
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +


^ permalink raw reply

* Re: [PATCH v2 1/9] vgacon: rework Kconfig dependencies
From: Geert Uytterhoeven @ 2023-07-24 12:07 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Michael Ellerman, Arnd Bergmann, linux-fbdev, Thomas Zimmermann,
	Helge Deller, Javier Martinez Canillas, David S . Miller,
	K. Y. Srinivasan, Ard Biesheuvel, Borislav Petkov, Brian Cain,
	Catalin Marinas, Christophe Leroy, Daniel Vetter, Dave Hansen,
	Dave Airlie, Deepak Rawat, Dexuan Cui, Dinh Nguyen,
	Greg Kroah-Hartman, guoren, Haiyang Zhang, Huacai Chen,
	Ingo Molnar, John Paul Adrian Glaubitz, Khalid Aziz,
	Linus Walleij, Matt Turner, Max Filippov, Nicholas Piggin,
	Palmer Dabbelt, Russell King, Thomas Bogendoerfer,
	Thomas Gleixner, WANG Xuerui, Wei Liu, Will Deacon, x86,
	linux-alpha, linux-kernel, linux-arm-kernel, linux-efi,
	linux-csky@vger.kernel.org, linux-hexagon, linux-ia64, loongarch,
	linux-mips, linuxppc-dev, linux-riscv, linux-sh, sparclinux,
	linux-hyperv, dri-devel
In-Reply-To: <19631e74-415e-4dcb-b79d-33dcf03d2dfc@app.fastmail.com>

Hi Arnd,

On Fri, Jul 21, 2023 at 10:29 AM Arnd Bergmann <arnd@arndb.de> wrote:
> On Fri, Jul 21, 2023, at 06:59, Michael Ellerman wrote:
> > Arnd Bergmann <arnd@kernel.org> writes:
> >> From: Arnd Bergmann <arnd@arndb.de>
> >>
> >> The list of dependencies here is phrased as an opt-out, but this is missing
> >> a lot of architectures that don't actually support VGA consoles, and some
> >> of the entries are stale:
> >>
> >>  - powerpc used to support VGA consoles in the old arch/ppc codebase, but
> >>    the merged arch/powerpc never did
> >
> > Not disputing this, but how did you come to that conclusion? I grepped
> > around and couldn't convince myself whether it can work on powerpc or
> > not. ie. currently it's possible to enable CONFIG_VGA_CONSOLE and
> > powerpc does have a struct screen_info defined which seems like it would
> > allow vgacon_startup() to complete.
>
> The VGA console needs both screen_info and vga_con to work. In arch/ppc
> we had both, but in arch/powerpc we only retained the screen_info:
>
> $ git grep vga_con v2.6.26 -- arch/ppc arch/ppc64 arch/powerpc
> v2.6.26:arch/ppc/platforms/pplus.c:     conswitchp = &vga_con;
> v2.6.26:arch/ppc/platforms/prep_setup.c:        conswitchp = &vga_con;
>
> so after arch/ppc was removed, this became impossible to use on both
> pplus and prep. These two platforms were also (as far as I can tell)
> the only ones to support vga16fb as an alternative to vgacon, but
> both platforms were removed later on.

I did use vgacon and vga16fb on CHRP on a second video card
(initialized using Gabriel Paubert's x86 BIOS emulator), but that was
definitely before the advent of arch/powerpc/.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox