* [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn
@ 2025-03-25 16:48 Uros Bizjak
2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Uros Bizjak @ 2025-03-25 16:48 UTC (permalink / raw)
To: x86, linux-kernel
Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin
On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction
appears to have a false dependency on the destination register. Even
though the instruction only writes to it, the instruction will wait
until the destination is ready before executing. This false dependency
was fixed for Cannon Lake (and later) processors.
Fix false dependency by clearing the destination register first.
The x86_64 defconfig object size increases by 779 bytes:
text data bss dec hex filename
27341418 4643015 814852 32799285 1f47a35 vmlinux-old.o
27342197 4643015 814852 32800064 1f47d40 vmlinux-new.o
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
arch/x86/include/asm/arch_hweight.h | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
index cbc6157f0b4b..aa0b3bd309fc 100644
--- a/arch/x86/include/asm/arch_hweight.h
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -4,12 +4,21 @@
#include <asm/cpufeatures.h>
+/*
+ * On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction
+ * appears to have a false dependency on the destination register. Even
+ * though the instruction only writes to it, the instruction will wait
+ * until destination is ready before executing. This false dependency
+ * was fixed for Cannon Lake (and later) processors.
+ */
+#define ASM_FORCE_CLR "xorl %k[cnt], %k[cnt]\n\t"
+
#ifdef CONFIG_64BIT
#define REG_IN "D"
-#define REG_OUT "a"
+#define ASM_CLR ASM_FORCE_CLR
#else
#define REG_IN "a"
-#define REG_OUT "a"
+#define ASM_CLR
#endif
static __always_inline unsigned int __arch_hweight32(unsigned int w)
@@ -18,8 +27,9 @@ static __always_inline unsigned int __arch_hweight32(unsigned int w)
asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
"call __sw_hweight32",
- "popcntl %[val], %[cnt]", X86_FEATURE_POPCNT)
- : [cnt] "=" REG_OUT (res), ASM_CALL_CONSTRAINT
+ ASM_CLR "popcntl %[val], %[cnt]",
+ X86_FEATURE_POPCNT)
+ : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
: [val] REG_IN (w));
return res;
@@ -48,8 +58,9 @@ static __always_inline unsigned long __arch_hweight64(__u64 w)
asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
"call __sw_hweight64",
- "popcntq %[val], %[cnt]", X86_FEATURE_POPCNT)
- : [cnt] "=" REG_OUT (res), ASM_CALL_CONSTRAINT
+ ASM_CLR "popcntq %[val], %[cnt]",
+ X86_FEATURE_POPCNT)
+ : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
: [val] REG_IN (w));
return res;
--
2.42.0
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak
@ 2025-03-25 16:48 ` Uros Bizjak
  2025-03-25 17:11   ` Borislav Petkov
  2025-03-25 21:56   ` Ingo Molnar
  2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov
  2025-03-25 21:50 ` Ingo Molnar
  2 siblings, 2 replies; 25+ messages in thread
From: Uros Bizjak @ 2025-03-25 16:48 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin

Emit naked POPCNT instruction when available with X86_NATIVE_CPU
option. The compiler is not bound by ABI when emitting the instruction
without the fallback call to __sw_hweight{32,64}() library function
and has much more freedom to allocate input and output operands,
including memory input operand.

The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
shrinks by 599 bytes:

  add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
  Total: Before=22710531, After=22709932, chg -0.00%

The asm changes from e.g.:

   3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
   3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
   3bfa8: 90                  	nop
   3bfa9: 90                  	nop

with:

     34b: 31 c0               	xor    %eax,%eax
     34d: f3 48 0f b8 c7      	popcnt %rdi,%rax

in the .altinstr_replacement section

to:

   3bfdc: 31 c0               	xor    %eax,%eax
   3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
   3bfe5: 00 00

where there is no need for an entry in the .altinstr_replacement
section, shrinking all text sections by 9476 bytes:

     text	   data	    bss	    dec	    hex	filename
 27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
 27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/arch_hweight.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
index aa0b3bd309fc..d39710e57531 100644
--- a/arch/x86/include/asm/arch_hweight.h
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -25,13 +25,18 @@ static __always_inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
+#ifdef __POPCNT__
+	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
+		    : [cnt] "=&r" (res)
+		    : [val] ASM_INPUT_RM (w));
+#else
 	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
 				"call __sw_hweight32",
 				ASM_CLR "popcntl %[val], %[cnt]",
 				X86_FEATURE_POPCNT)
 		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
 		    : [val] REG_IN (w));
-
+#endif
 	return res;
 }
 
@@ -56,13 +61,18 @@ static __always_inline unsigned long __arch_hweight64(__u64 w)
 {
 	unsigned long res;
 
+#ifdef __POPCNT__
+	asm_inline (ASM_FORCE_CLR "popcntq %[val], %[cnt]"
+		    : [cnt] "=&r" (res)
+		    : [val] ASM_INPUT_RM (w));
+#else
 	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
 				"call __sw_hweight64",
 				ASM_CLR "popcntq %[val], %[cnt]",
 				X86_FEATURE_POPCNT)
 		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
 		    : [val] REG_IN (w));
-
+#endif
 	return res;
 }
 
 #endif /* CONFIG_X86_32 */
-- 
2.42.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
@ 2025-03-25 17:11   ` Borislav Petkov
  2025-03-30 15:15     ` Uros Bizjak
  2025-03-25 21:56   ` Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2025-03-25 17:11 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 05:48:38PM +0100, Uros Bizjak wrote:
> +#ifdef __POPCNT__
> +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +		    : [cnt] "=&r" (res)
> +		    : [val] ASM_INPUT_RM (w));
> +#else
>  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>  				"call __sw_hweight32",
>  				ASM_CLR "popcntl %[val], %[cnt]",
>  				X86_FEATURE_POPCNT)
>  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>  		    : [val] REG_IN (w));
> -
> +#endif

A whopping 599 bytes which makes the asm more ugly.

Not worth the effort IMO.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 17:11   ` Borislav Petkov
@ 2025-03-30 15:15     ` Uros Bizjak
  2025-03-30 17:31       ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2025-03-30 15:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 6:11 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Mar 25, 2025 at 05:48:38PM +0100, Uros Bizjak wrote:
> > +#ifdef __POPCNT__
> > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > +		    : [cnt] "=&r" (res)
> > +		    : [val] ASM_INPUT_RM (w));
> > +#else
> >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> >  				"call __sw_hweight32",
> >  				ASM_CLR "popcntl %[val], %[cnt]",
> >  				X86_FEATURE_POPCNT)
> >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> >  		    : [val] REG_IN (w));
> > -
> > +#endif
>
> A whopping 599 bytes which makes the asm more ugly.
>
> Not worth the effort IMO.

You missed this part:

--q--
... where there is no need for an entry in the .altinstr_replacement
section, shrinking all text sections by 9476 bytes:

     text	   data	    bss	    dec	    hex	filename
 27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
 27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o
--/q--

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 15:15     ` Uros Bizjak
@ 2025-03-30 17:31       ` Borislav Petkov
  2025-03-30 18:47         ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2025-03-30 17:31 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Sun, Mar 30, 2025 at 05:15:16PM +0200, Uros Bizjak wrote:
> You missed this part:
>
> --q--
> ... where there is no need for an entry in the .altinstr_replacement

Not really - .altinstr* memory gets jettisoned after boot.

> section, shrinking all text sections by 9476 bytes:

So if anything, this saves a whopping ~9K disk space and makes hweight*
an unreadable mess.

So no, thanks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 17:31       ` Borislav Petkov
@ 2025-03-30 18:47         ` Ingo Molnar
  2025-03-30 19:06           ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2025-03-30 18:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

* Borislav Petkov <bp@alien8.de> wrote:

> On Sun, Mar 30, 2025 at 05:15:16PM +0200, Uros Bizjak wrote:
> > You missed this part:
> >
> > --q--
> > ... where there is no need for an entry in the .altinstr_replacement
>
> Not really - .altinstr* memory gets jettisoned after boot.
>
> > section, shrinking all text sections by 9476 bytes:
>
> So if anything, this saves a whopping ~9K disk space and makes
> hweight* an unreadable mess.

What unreadable mess? The proposed patch is:

+#ifdef __POPCNT__
+	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
+		    : [cnt] "=&r" (res)
+		    : [val] ASM_INPUT_RM (w));
+#else
 	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
 				"call __sw_hweight32",
 				ASM_CLR "popcntl %[val], %[cnt]",
 				X86_FEATURE_POPCNT)
 		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
 		    : [val] REG_IN (w));
-
+#endif

Which is 3 straightforward lines of assembly code and a straightforward
#ifdef.

My main objection is different: if __POPCNT__ isn't defined during the
kernel build of major Linux distros, then this optimization almost
doesn't exist to our users. And I don't think it's defined.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 18:47         ` Ingo Molnar
@ 2025-03-30 19:06           ` Borislav Petkov
  2025-03-30 19:20             ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2025-03-30 19:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

On Sun, Mar 30, 2025 at 08:47:31PM +0200, Ingo Molnar wrote:
> +#ifdef __POPCNT__
> +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +		    : [cnt] "=&r" (res)
> +		    : [val] ASM_INPUT_RM (w));
> +#else
>  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>  				"call __sw_hweight32",
>  				ASM_CLR "popcntl %[val], %[cnt]",
>  				X86_FEATURE_POPCNT)
>  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>  		    : [val] REG_IN (w));
> -
> +#endif

That ifdeffery.

The alternative only is fine as this is the usual way we do those insns.

The ifdeffery around it is ugly and is pushing it and it would be fine
if it would bring anything but it doesn't. It is making the code ugly
for no good reason whatsoever.

> Which is 3 straightforward lines of assembly code and a straightforward #ifdef.

And they bring what exactly?

I haven't seen anything besides some super minor, completely pointless,
hm, "savings". So much so that the uglification of the function is not
worth it in the *least*.

When I look at that code, I need to wonder now, is the __POPCNT__
defined or not?

> My main objection is different: if __POPCNT__ isn't defined during the
> kernel build of major Linux distros, then this optimization almost
> doesn't exist to our users. And I don't think it's defined.

Yah, that too.

This whole effort is a total waste of time and energy.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 19:06           ` Borislav Petkov
@ 2025-03-30 19:20             ` Ingo Molnar
  2025-03-30 19:28               ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2025-03-30 19:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

* Borislav Petkov <bp@alien8.de> wrote:

> On Sun, Mar 30, 2025 at 08:47:31PM +0200, Ingo Molnar wrote:
> > +#ifdef __POPCNT__
> > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > +		    : [cnt] "=&r" (res)
> > +		    : [val] ASM_INPUT_RM (w));
> > +#else
> >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> >  				"call __sw_hweight32",
> >  				ASM_CLR "popcntl %[val], %[cnt]",
> >  				X86_FEATURE_POPCNT)
> >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> >  		    : [val] REG_IN (w));
> > -
> > +#endif
>
> That ifdeffery.
>
> The alternative only is fine as this is the usual way we do those
> insns.
>
> The ifdeffery around it is ugly and is pushing it and it would be
> fine if it would bring anything but it doesn't. It is making the code
> ugly for no good reason whatsoever.

Tangible code size reduction, if it can be realized, is definitely
'something', so your claim is simply false.

> > Which is 3 straightforward lines of assembly code and a
> > straightforward #ifdef.
>
> And they bring what exactly?
>
> I haven't seen anything besides some super minor, completely
> pointless, hm, "savings". So much so that the uglification of the
> function is not worth it in the *least*.

Even 0.5K of .text reduction is a tangible benefit. The kernel's 35
years long history comprises of literally over a million patches, which
were small and inconsequential 99% of the time.

> > My main objection is different: if __POPCNT__ isn't defined during
> > the kernel build of major Linux distros, then this optimization
> > almost doesn't exist to our users. And I don't think it's defined.
>
> Yah, that too.
>
> This whole effort is a total waste of time and energy.

We don't know yet for sure, but I don't think an absolutist "can't do"
approach is very productive.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 19:20             ` Ingo Molnar
@ 2025-03-30 19:28               ` Borislav Petkov
  0 siblings, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2025-03-30 19:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

On Sun, Mar 30, 2025 at 09:20:28PM +0200, Ingo Molnar wrote:
> Tangible code size reduction, if it can be realized, is definitely
> 'something', so your claim is simply false.

I don't think you're reading my mails.

What "tangible code size reduction"? 9K vmlinux size? For sections
which are getting jettisoned after boot?

So now the rule is what: make the asm an unreadable mess just to save
9K disk space?

I call that bullshit. A total and stinking one.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
  2025-03-25 17:11   ` Borislav Petkov
@ 2025-03-25 21:56   ` Ingo Molnar
  2025-03-29  9:19     ` Uros Bizjak
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2025-03-25 21:56 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

* Uros Bizjak <ubizjak@gmail.com> wrote:

> Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> option. The compiler is not bound by ABI when emitting the instruction
> without the fallback call to __sw_hweight{32,64}() library function
> and has much more freedom to allocate input and output operands,
> including memory input operand.
>
> The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> shrinks by 599 bytes:
>
>   add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
>   Total: Before=22710531, After=22709932, chg -0.00%
>
> The asm changes from e.g.:
>
>    3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
>    3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
>    3bfa8: 90                  	nop
>    3bfa9: 90                  	nop
>
> with:
>
>      34b: 31 c0               	xor    %eax,%eax
>      34d: f3 48 0f b8 c7      	popcnt %rdi,%rax
>
> in the .altinstr_replacement section
>
> to:
>
>    3bfdc: 31 c0               	xor    %eax,%eax
>    3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
>    3bfe5: 00 00
>
> where there is no need for an entry in the .altinstr_replacement
> section, shrinking all text sections by 9476 bytes:
>
>      text	   data	    bss	    dec	    hex	filename
>  27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
>  27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o

> +#ifdef __POPCNT__
> +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +		    : [cnt] "=&r" (res)
> +		    : [val] ASM_INPUT_RM (w));
> +#else
>  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>  				"call __sw_hweight32",
>  				ASM_CLR "popcntl %[val], %[cnt]",
>  				X86_FEATURE_POPCNT)
>  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>  		    : [val] REG_IN (w));

So a better optimization I think would be to declare and implement
__sw_hweight32 with a different, less intrusive function call ABI that
mirrors that of the instruction in essence, so that we optimize for the
overwhelmingly common case of having the POPCNT instruction.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 21:56   ` Ingo Molnar
@ 2025-03-29  9:19     ` Uros Bizjak
  2025-03-29 11:00       ` David Laight
                         ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Uros Bizjak @ 2025-03-29 9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> > option. The compiler is not bound by ABI when emitting the instruction
> > without the fallback call to __sw_hweight{32,64}() library function
> > and has much more freedom to allocate input and output operands,
> > including memory input operand.
> >
> > The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> > shrinks by 599 bytes:
> >
> >   add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
> >   Total: Before=22710531, After=22709932, chg -0.00%
> >
> > The asm changes from e.g.:
> >
> >    3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
> >    3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
> >    3bfa8: 90                  	nop
> >    3bfa9: 90                  	nop
> >
> > with:
> >
> >      34b: 31 c0               	xor    %eax,%eax
> >      34d: f3 48 0f b8 c7      	popcnt %rdi,%rax
> >
> > in the .altinstr_replacement section
> >
> > to:
> >
> >    3bfdc: 31 c0               	xor    %eax,%eax
> >    3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
> >    3bfe5: 00 00
> >
> > where there is no need for an entry in the .altinstr_replacement
> > section, shrinking all text sections by 9476 bytes:
> >
> >      text	   data	    bss	    dec	    hex	filename
> >  27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
> >  27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o
>
> > +#ifdef __POPCNT__
> > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > +		    : [cnt] "=&r" (res)
> > +		    : [val] ASM_INPUT_RM (w));
> > +#else
> >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> >  				"call __sw_hweight32",
> >  				ASM_CLR "popcntl %[val], %[cnt]",
> >  				X86_FEATURE_POPCNT)
> >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> >  		    : [val] REG_IN (w));
>
> So a better optimization I think would be to declare and implement
> __sw_hweight32 with a different, less intrusive function call ABI that

With an external function, the ABI specifies the location of input
argument and function result. Unless we want to declare the whole
function as asm() inline function (with some 20 instructions), we have
to specify the location of function arguments and where the function
result is to be found in the asm() that calls the external function.
Register allocator then uses this information to move arguments to the
right place before the call.

The above approach, when used to emulate an insn, has a drawback.
When the instruction is available as an alternative, it still has
fixed input and output registers, forced by the ABI of the function
call. Register allocator has to move registers unnecessarily to
satisfy the constraints of the function call, not the instruction
itself.

The proposed solution builds on the fact that with -march=native (and
also when -mpopcnt is specified on the command line), the compiler
signals the availability of certain ISA extensions by defining the
corresponding macro. We can use this macro to relax the constraints to
fit the instruction, not the ABI of the fallback function call. On
x86, we can also access memory directly, avoiding clobbering a
temporary input register.

Without the fix for (obsolete) false dependency, the change becomes simply:

#ifdef __POPCNT__
	asm ("popcntl %[val], %[cnt]"
	     : [cnt] "=r" (res)
	     : [val] ASM_INPUT_RM (w));
#else

and besides the reported savings of 600 bytes in the .text section
also allows the register allocator to schedule registers (and input
arguments from memory) more optimally, not counting additional 9k
saved space in the alternative section.

The patch is also an example of how -march=native enables further
optimizations involving additional ISAs.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-29  9:19     ` Uros Bizjak
@ 2025-03-29 11:00       ` David Laight
  2025-03-30  7:49         ` Uros Bizjak
  0 siblings, 1 reply; 25+ messages in thread
From: David Laight @ 2025-03-29 11:00 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, H. Peter Anvin

On Sat, 29 Mar 2025 10:19:37 +0100
Uros Bizjak <ubizjak@gmail.com> wrote:

> On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > > Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> > > option. The compiler is not bound by ABI when emitting the instruction
> > > without the fallback call to __sw_hweight{32,64}() library function
> > > and has much more freedom to allocate input and output operands,
> > > including memory input operand.
> > >
> > > The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> > > shrinks by 599 bytes:
> > >
> > >   add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
> > >   Total: Before=22710531, After=22709932, chg -0.00%
> > >
> > > The asm changes from e.g.:
> > >
> > >    3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
> > >    3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
> > >    3bfa8: 90                  	nop
> > >    3bfa9: 90                  	nop
> > >
> > > with:
> > >
> > >      34b: 31 c0               	xor    %eax,%eax
> > >      34d: f3 48 0f b8 c7      	popcnt %rdi,%rax
> > >
> > > in the .altinstr_replacement section
> > >
> > > to:
> > >
> > >    3bfdc: 31 c0               	xor    %eax,%eax
> > >    3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
> > >    3bfe5: 00 00
> > >
> > > where there is no need for an entry in the .altinstr_replacement
> > > section, shrinking all text sections by 9476 bytes:
> > >
> > >      text	   data	    bss	    dec	    hex	filename
> > >  27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
> > >  27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o
> >
> > > +#ifdef __POPCNT__
> > > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > > +		    : [cnt] "=&r" (res)
> > > +		    : [val] ASM_INPUT_RM (w));
> > > +#else
> > >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> > >  				"call __sw_hweight32",
> > >  				ASM_CLR "popcntl %[val], %[cnt]",
> > >  				X86_FEATURE_POPCNT)
> > >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> > >  		    : [val] REG_IN (w));
> >
> > So a better optimization I think would be to declare and implement
> > __sw_hweight32 with a different, less intrusive function call ABI that
>
> With an external function, the ABI specifies the location of input
> argument and function result. Unless we want to declare the whole
> function as asm() inline function (with some 20 instructions), we have
> to specify the location of function arguments and where the function
> result is to be found in the asm() that calls the external function.
> Register allocator then uses this information to move arguments to the
> right place before the call.
>
> The above approach, when used to emulate an insn, has a drawback.
> When the instruction is available as an alternative, it still has
> fixed input and output registers, forced by the ABI of the function
> call. Register allocator has to move registers unnecessarily to
> satisfy the constraints of the function call, not the instruction
> itself.

Forcing the argument into a fixed register won't make much difference
to execution time.
Just a bit more work for the instruction decoder and a few more bytes
of I-cache.
(Register-register moves can be zero clocks.)
In many cases (but not as many as you might hope for) the compiler
back-tracks the input register requirement to the instruction that
generates the value.

In this case the called function needs two writeable registers.
I think you can tell gcc the input is invalidated and the output
is 'early clobber' so that the registers are different.

> The proposed solution builds on the fact that with -march=native (and
> also when -mpopcnt is specified on the command line) , the compiler
> signals the availability of certain ISA by defining the corresponding
> definition. We can use this definition to relax the constraints to fit
> the instruction, not the ABI of the fallback function call. On x86, we
> can also access memory directly, avoiding clobbering a temporary input
> register.
>
> Without the fix for (obsolete) false dependency, the change becomes simply:
>
> #ifdef __POPCNT__
>	asm ("popcntl %[val], %[cnt]"
>	     : [cnt] "=r" (res)
>	     : [val] ASM_INPUT_RM (w));
> #else
>
> and besides the reported savings of 600 bytes in the .text section
> also allows the register allocator to schedule registers (and input
> arguments from memory) more optimally, not counting additional 9k
> saved space in the alternative section.
>
> The patch is also an example, how -march=native enables further
> optimizations involving additional ISAs.

To my mind it would be better to be able to specify the oldest cpu
type the build should support.
Either by actual cpu type (eg 'skylake' or 'zen2') or maybe by
a specific instruction (eg popcnt).
The scripts would then determine the appropriate compiler flags
and any extra -Dvar to generate appropriate code.

The arch/x86/Kconfig.cpu seems to be missing options to select
between 64bit cpus.
That would also be the place to add CONFIG defines that mirror the
X86_FEATURE_xxx flags.

That would be more flexible.

	David

>
> Thanks,
> Uros.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 11:00 ` David Laight @ 2025-03-30 7:49 ` Uros Bizjak 2025-03-30 18:02 ` David Laight 0 siblings, 1 reply; 25+ messages in thread From: Uros Bizjak @ 2025-03-30 7:49 UTC (permalink / raw) To: David Laight Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sat, Mar 29, 2025 at 12:00 PM David Laight <david.laight.linux@gmail.com> wrote: > > On Sat, 29 Mar 2025 10:19:37 +0100 > Uros Bizjak <ubizjak@gmail.com> wrote: > > > On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > Emit naked POPCNT instruction when available with X86_NATIVE_CPU > > > > option. The compiler is not bound by ABI when emitting the instruction > > > > without the fallback call to __sw_hweight{32,64}() library function > > > > and has much more freedom to allocate input and output operands, > > > > including memory input operand. 
> > > > > > > > The code size of x86_64 defconfig (with X86_NATIVE_CPU option) > > > > shrinks by 599 bytes: > > > > > > > > add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599) > > > > Total: Before=22710531, After=22709932, chg -0.00% > > > > > > > > The asm changes from e.g.: > > > > > > > > 3bf9c: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi > > > > 3bfa3: e8 00 00 00 00 call 3bfa8 <...> > > > > 3bfa8: 90 nop > > > > 3bfa9: 90 nop > > > > > > > > with: > > > > > > > > 34b: 31 c0 xor %eax,%eax > > > > 34d: f3 48 0f b8 c7 popcnt %rdi,%rax > > > > > > > > in the .altinstr_replacement section > > > > > > > > to: > > > > > > > > 3bfdc: 31 c0 xor %eax,%eax > > > > 3bfde: f3 48 0f b8 05 00 00 popcnt 0x0(%rip),%rax > > > > 3bfe5: 00 00 > > > > > > > > where there is no need for an entry in the .altinstr_replacement > > > > section, shrinking all text sections by 9476 bytes: > > > > > > > > text data bss dec hex filename > > > > 27267068 4643047 814852 32724967 1f357e7 vmlinux-old.o > > > > 27257592 4643047 814852 32715491 1f332e3 vmlinux-new.o > > > > > > > +#ifdef __POPCNT__ > > > > + asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]" > > > > + : [cnt] "=&r" (res) > > > > + : [val] ASM_INPUT_RM (w)); > > > > +#else > > > > asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE > > > > "call __sw_hweight32", > > > > ASM_CLR "popcntl %[val], %[cnt]", > > > > X86_FEATURE_POPCNT) > > > > : [cnt] "=a" (res), ASM_CALL_CONSTRAINT > > > > : [val] REG_IN (w)); > > > > > > So a better optimization I think would be to declare and implement > > > __sw_hweight32 with a different, less intrusive function call ABI that > > > > With an external function, the ABI specifies the location of input > > argument and function result. Unless we want to declare the whole > > function as asm() inline function (with some 20 instructions), we have > > to specify the location of function arguments and where the function > > result is to be found in the asm() that calls the external function. 
> > Register allocator then uses this information to move arguments to the > > right place before the call. > > > > The above approach, when used to emulate an insn, has a drawback. > > When the instruction is available as an alternative, it still has > > fixed input and output registers, forced by the ABI of the function > > call. Register allocator has to move registers unnecessarily to > > satisfy the constraints of the function call, not the instruction > > itself. > > Forcing the argument into a fixed register won't make much difference > to execution time. > Just a bit more work for the instruction decoder and a few more bytes > of I-cache. > (Register-register moves can be zero clocks.) > In many cases (but not as many as you might hope for) the compiler > back-tracks the input register requirement to the instruction that > generates the value. I'm afraid I don't fully understand what you mean by "back-tracking the input register requirement". However, with: asm("insn %0, %1" : "=r" (out) : "r" (in)); the compiler is not obliged to match input with output, although many times it does so (especially when input argument is dead). To avoid false dependence on the output, we should force the compiler to always match input and output: asm("insn %0, %1" : "=r" (out) : "0" (in)); and this will resolve false dependence (input register obviously needs to be ready before insn) at the expense of an extra move instruction in front of the insn in case input is not dead. This is unfortunately not possible when one of the alternatives is a function call, where location of input and output arguments is specified by ABI. > In this case the called function needs two writeable registers. > I think you can tell gcc the input is invalidated and the output > is 'early clobber' so that the register are different. 
Yes, my first patch used this approach, where output operand is cleared first: asm("xorl %0, %0; popcntl %1, %0" : "=&r" (out) : "rm" (in)); Please note that "earlyclobbered" output reg can't be matched with input reg, or with any reg that forms the address. > > The proposed solution builds on the fact that with -march=native (and > > also when -mpopcnt is specified on the command line) , the compiler > > signals the availability of certain ISA by defining the corresponding > > definition. We can use this definition to relax the constraints to fit > > the instruction, not the ABI of the fallback function call. On x86, we > > can also access memory directly, avoiding clobbering a temporary input > > register. > > > > Without the fix for (obsolete) false dependency, the change becomes simply: > > > > #ifdef __POPCNT__ > > asm ("popcntl %[val], %[cnt]" > > : [cnt] "=r" (res) > > : [val] ASM_INPUT_RM (w)); > > #else > > > > and besides the reported savings of 600 bytes in the .text section > > also allows the register allocator to schedule registers (and input > > arguments from memory) more optimally, not counting additional 9k > > saved space in the alternative section. > > > > The patch is also an example, how -march=native enables further > > optimizations involving additional ISAs. > > To my mind it would be better to be able to specify oldest cpu > type the build should support. > Either by actual cpu type (eg 'skylake' or 'zen2') or maybe by > a specific instruction (eg popcnt). > The scripts would then determine the appropriate compiler flags > and any extra -Dvar to generate appropriate code. Please note that with -march=native the compiler driver ("gcc") does this for you. -march=native expands to a series of -m compile flags (you can see these flags by passing -### to gcc) and each flag defines corresponding ISA macro when set. E.g., passing -mpopcnt defines __POPCNT__ macro. 
These macros can be used instead of -Dvar for conditional compilation that depends on -m ISA flags, passed to the compiler ("cc1"). > The arch/x86/Kconfig.cpu seems to be missing options to select > between 64bit cpus. > That would also be the place to add CONFIG defines that mirror the > X86_FEATURE_xxx flags. While the above -march=native expands in the driver, setting e.g. -march=skylake enables CPU capabilities in the compiler itself. However, using -march=...cpu... also sets corresponding ISA macros, so the proposed approach does not exclude Kconfig.cpu options. Automatically set ISA macros can be used instead of X86_FEATURE_xxx flags for maximum flexibility. Thanks, Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
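The ISA-macro approach described in this message can be sketched in plain userspace C (a minimal sketch, not the kernel code: the function name and the fallback loop are illustrative, and the kernel's real fallback is __sw_hweight32()):

```c
#include <assert.h>

/*
 * Sketch: when the compiler itself advertises POPCNT via the __POPCNT__
 * macro (-mpopcnt, or -march=native on a POPCNT-capable CPU), emit the
 * naked instruction with the relaxed "rm" input and "=r" output
 * constraints; otherwise fall back to portable C.
 */
static unsigned int hweight32_sketch(unsigned int w)
{
#if defined(__POPCNT__) && (defined(__x86_64__) || defined(__i386__))
	unsigned int res;

	asm ("popcntl %1, %0" : "=r" (res) : "rm" (w));
	return res;
#else
	unsigned int cnt = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		cnt++;
	}
	return cnt;
#endif
}
```

Built with gcc -mpopcnt (or -march=native on a POPCNT-capable machine) the asm branch is compiled in; on any other target or flag set the portable branch is used.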
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 7:49 ` Uros Bizjak @ 2025-03-30 18:02 ` David Laight 0 siblings, 0 replies; 25+ messages in thread From: David Laight @ 2025-03-30 18:02 UTC (permalink / raw) To: Uros Bizjak Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sun, 30 Mar 2025 09:49:43 +0200 Uros Bizjak <ubizjak@gmail.com> wrote: > On Sat, Mar 29, 2025 at 12:00 PM David Laight > <david.laight.linux@gmail.com> wrote: > > > > On Sat, 29 Mar 2025 10:19:37 +0100 > > Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > > > > > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > > > Emit naked POPCNT instruction when available with X86_NATIVE_CPU > > > > > option. The compiler is not bound by ABI when emitting the instruction > > > > > without the fallback call to __sw_hweight{32,64}() library function > > > > > and has much more freedom to allocate input and output operands, > > > > > including memory input operand. 
> > > > > > > > > > The code size of x86_64 defconfig (with X86_NATIVE_CPU option) > > > > > shrinks by 599 bytes: > > > > > > > > > > add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599) > > > > > Total: Before=22710531, After=22709932, chg -0.00% > > > > > > > > > > The asm changes from e.g.: > > > > > > > > > > 3bf9c: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi > > > > > 3bfa3: e8 00 00 00 00 call 3bfa8 <...> > > > > > 3bfa8: 90 nop > > > > > 3bfa9: 90 nop > > > > > > > > > > with: > > > > > > > > > > 34b: 31 c0 xor %eax,%eax > > > > > 34d: f3 48 0f b8 c7 popcnt %rdi,%rax > > > > > > > > > > in the .altinstr_replacement section > > > > > > > > > > to: > > > > > > > > > > 3bfdc: 31 c0 xor %eax,%eax > > > > > 3bfde: f3 48 0f b8 05 00 00 popcnt 0x0(%rip),%rax > > > > > 3bfe5: 00 00 > > > > > > > > > > where there is no need for an entry in the .altinstr_replacement > > > > > section, shrinking all text sections by 9476 bytes: > > > > > > > > > > text data bss dec hex filename > > > > > 27267068 4643047 814852 32724967 1f357e7 vmlinux-old.o > > > > > 27257592 4643047 814852 32715491 1f332e3 vmlinux-new.o > > > > > > > > > +#ifdef __POPCNT__ > > > > > + asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]" > > > > > + : [cnt] "=&r" (res) > > > > > + : [val] ASM_INPUT_RM (w)); > > > > > +#else > > > > > asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE > > > > > "call __sw_hweight32", > > > > > ASM_CLR "popcntl %[val], %[cnt]", > > > > > X86_FEATURE_POPCNT) > > > > > : [cnt] "=a" (res), ASM_CALL_CONSTRAINT > > > > > : [val] REG_IN (w)); > > > > > > > > So a better optimization I think would be to declare and implement > > > > __sw_hweight32 with a different, less intrusive function call ABI that > > > > > > With an external function, the ABI specifies the location of input > > > argument and function result. 
Unless we want to declare the whole > > > function as asm() inline function (with some 20 instructions), we have > > > to specify the location of function arguments and where the function > > > result is to be found in the asm() that calls the external function. > > > Register allocator then uses this information to move arguments to the > > > right place before the call. > > > > > > The above approach, when used to emulate an insn, has a drawback. > > > When the instruction is available as an alternative, it still has > > > fixed input and output registers, forced by the ABI of the function > > > call. Register allocator has to move registers unnecessarily to > > > satisfy the constraints of the function call, not the instruction > > > itself. > > > > Forcing the argument into a fixed register won't make much difference > > to execution time. > > Just a bit more work for the instruction decoder and a few more bytes > > of I-cache. > > (Register-register moves can be zero clocks.) > > In many cases (but not as many as you might hope for) the compiler > > back-tracks the input register requirement to the instruction that > > generates the value. > > I'm afraid I don't fully understand what you mean by "back-tracking > the input register requirement". If the asm block requires an input in %rdx then the instruction that creates the value would be expected to put it into %rdx ready for the asm block. Even if it doesn't a register-register move is often implemented without using the ALU by 'register renaming' (there is an indirection between the register 'number' the code uses and the physical latches that hold the value, multiple copies of (say) %rax can be live at the same time). > However, with: > > asm("insn %0, %1" : "=r" (out) : "r" (in)); > > the compiler is not obliged to match input with output, although many > times it does so (especially when input argument is dead). 
To avoid > false dependence on the output, we should force the compiler to always > match input and output: > > asm("insn %0, %1" : "=r" (out) : "0" (in)); I'd expect the compiler to generate better code if it is allowed to use separate registers for the input and output. It may be able to use the input value again. There is no 'dependency' on the output register (unless the instruction only updates the low bits). > > and this will resolve false dependence (input register obviously needs > to be ready before insn) at the expense of an extra move instruction > in front of the insn in case input is not dead. This is unfortunately > not possible when one of the alternatives is a function call, where > location of input and output arguments is specified by ABI. > > > In this case the called function needs two writeable registers. > > I think you can tell gcc the input is invalidated and the output > > is 'early clobber' so that the registers are different. > > Yes, my first patch used this approach, where output operand is cleared first: That clearing of the output serves an entirely different purpose. > > asm("xorl %0, %0; popcntl %1, %0" : "=&r" (out) : "rm" (in)); > > Please note that "earlyclobbered" output reg can't be matched with > input reg, or with any reg that forms the address. But you want them to be different. The called function needs to perform multiple 'shift and add' sequences. To do that it needs a scratch register. So if the asm block requires the input in %rdx, puts the output in %rax and destroys %rdx, you can write a function that doesn't need any other registers. If you try really hard you can make the called function depend on the register the compiler selects - and write a different copy for each. Not worth it here. David ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 9:19 ` Uros Bizjak 2025-03-29 11:00 ` David Laight @ 2025-03-29 23:10 ` H. Peter Anvin 2025-03-30 6:54 ` Uros Bizjak 2025-03-30 9:56 ` Ingo Molnar 2 siblings, 1 reply; 25+ messages in thread From: H. Peter Anvin @ 2025-03-29 23:10 UTC (permalink / raw) To: Uros Bizjak, Ingo Molnar Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen On March 29, 2025 2:19:37 AM PDT, Uros Bizjak <ubizjak@gmail.com> wrote: >On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote: >> >> >> * Uros Bizjak <ubizjak@gmail.com> wrote: >> >> > Emit naked POPCNT instruction when available with X86_NATIVE_CPU >> > option. The compiler is not bound by ABI when emitting the instruction >> > without the fallback call to __sw_hweight{32,64}() library function >> > and has much more freedom to allocate input and output operands, >> > including memory input operand. >> > >> > The code size of x86_64 defconfig (with X86_NATIVE_CPU option) >> > shrinks by 599 bytes: >> > >> > add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599) >> > Total: Before=22710531, After=22709932, chg -0.00% >> > >> > The asm changes from e.g.: >> > >> > 3bf9c: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi >> > 3bfa3: e8 00 00 00 00 call 3bfa8 <...> >> > 3bfa8: 90 nop >> > 3bfa9: 90 nop >> > >> > with: >> > >> > 34b: 31 c0 xor %eax,%eax >> > 34d: f3 48 0f b8 c7 popcnt %rdi,%rax >> > >> > in the .altinstr_replacement section >> > >> > to: >> > >> > 3bfdc: 31 c0 xor %eax,%eax >> > 3bfde: f3 48 0f b8 05 00 00 popcnt 0x0(%rip),%rax >> > 3bfe5: 00 00 >> > >> > where there is no need for an entry in the .altinstr_replacement >> > section, shrinking all text sections by 9476 bytes: >> > >> > text data bss dec hex filename >> > 27267068 4643047 814852 32724967 1f357e7 vmlinux-old.o >> > 27257592 4643047 814852 32715491 1f332e3 vmlinux-new.o >> >> > +#ifdef __POPCNT__ >> > + asm_inline (ASM_FORCE_CLR 
"popcntl %[val], %[cnt]" >> > + : [cnt] "=&r" (res) >> > + : [val] ASM_INPUT_RM (w)); >> > +#else >> > asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE >> > "call __sw_hweight32", >> > ASM_CLR "popcntl %[val], %[cnt]", >> > X86_FEATURE_POPCNT) >> > : [cnt] "=a" (res), ASM_CALL_CONSTRAINT >> > : [val] REG_IN (w)); >> >> So a better optimization I think would be to declare and implement >> __sw_hweight32 with a different, less intrusive function call ABI that > >With an external function, the ABI specifies the location of input >argument and function result. Unless we want to declare the whole >function as asm() inline function (with some 20 instructions), we have >to specify the location of function arguments and where the function >result is to be found in the asm() that calls the external function. >Register allocator then uses this information to move arguments to the >right place before the call. > >The above approach, when used to emulate an insn, has a drawback. >When the instruction is available as an alternative, it still has >fixed input and output registers, forced by the ABI of the function >call. Register allocator has to move registers unnecessarily to >satisfy the constraints of the function call, not the instruction >itself. > >The proposed solution builds on the fact that with -march=native (and >also when -mpopcnt is specified on the command line) , the compiler >signals the availability of certain ISA by defining the corresponding >definition. We can use this definition to relax the constraints to fit >the instruction, not the ABI of the fallback function call. On x86, we >can also access memory directly, avoiding clobbering a temporary input >register. 
> >Without the fix for (obsolete) false dependency, the change becomes simply: > >#ifdef __POPCNT__ > asm ("popcntl %[val], %[cnt]" > : [cnt] "=r" (res) > : [val] ASM_INPUT_RM (w)); >#else > >and besides the reported savings of 600 bytes in the .text section >also allows the register allocator to schedule registers (and input >arguments from memory) more optimally, not counting additional 9k >saved space in the alternative section. > >The patch is also an example, how -march=native enables further >optimizations involving additional ISAs. > >Thanks, >Uros. > If you have __POPCNT__ defined, could you not simply use __builtin_popcount()? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 23:10 ` H. Peter Anvin @ 2025-03-30 6:54 ` Uros Bizjak 0 siblings, 0 replies; 25+ messages in thread From: Uros Bizjak @ 2025-03-30 6:54 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen On Sun, Mar 30, 2025 at 12:10 AM H. Peter Anvin <hpa@zytor.com> wrote: > > On March 29, 2025 2:19:37 AM PDT, Uros Bizjak <ubizjak@gmail.com> wrote: > > > >The proposed solution builds on the fact that with -march=native (and > >also when -mpopcnt is specified on the command line) , the compiler > >signals the availability of certain ISA by defining the corresponding > >definition. We can use this definition to relax the constraints to fit > >the instruction, not the ABI of the fallback function call. On x86, we > >can also access memory directly, avoiding clobbering a temporary input > >register. > > > >Without the fix for (obsolete) false dependency, the change becomes simply: > > > >#ifdef __POPCNT__ > > asm ("popcntl %[val], %[cnt]" > > : [cnt] "=r" (res) > > : [val] ASM_INPUT_RM (w)); > >#else > > > >and besides the reported savings of 600 bytes in the .text section > >also allows the register allocator to schedule registers (and input > >arguments from memory) more optimally, not counting additional 9k > >saved space in the alternative section. > > > >The patch is also an example, how -march=native enables further > >optimizations involving additional ISAs. > > If you have __POPCNT__ defined, could you not simply use __builtin_popcnt()? We can use it, but then the compiler (at least GCC) will start to emit false dependency fixups for POPCNT instructions (which we don't want; TZCNT has the same problem, where we agreed that it is not worth fixing for 10 years old cpus [1]). Please note that __builtin functions are not strictly required to be inlined and can generate a library call [2]. 
I have been burned by this fact with __builtin_parity(), so IMO the safest way in case of POPCNT insn is to use ISA macros only to determine the availability of ISA. Also note that optimizations with constant arguments are already performed elsewhere (c.f. <asm-generic/bitops/const_hweight.h>) and __arch_hweight() receives only variable arguments, so optimization opportunities with __builtin function are greatly reduced. [1] https://lore.kernel.org/lkml/CAFULd4ZzoW+vP_pa1hEF--gvsG8yaPLU8S7oBkJBZLP4Tirepw@mail.gmail.com/ [2] https://lore.kernel.org/linux-mm/CAKbZUD0N7bkuw_Le3Pr9o1V2BjjcY_YiLm8a8DPceubTdZ00GQ@mail.gmail.com/ Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
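For reference, the GCC/Clang builtin is spelled __builtin_popcount(), with __builtin_popcountll() for 64-bit operands; a minimal sketch of its use (wrapper names are illustrative). Whether it expands to a POPCNT instruction, to inline bit-twiddling, or to a libgcc helper call depends entirely on the target flags, which is exactly the concern raised in this reply:

```c
#include <assert.h>

/* __builtin_popcount() takes unsigned int; the ll variant covers
 * unsigned long long.  The emitted code (POPCNT insn, open-coded
 * arithmetic, or a library call) is the compiler's choice. */
static unsigned int hw32_builtin(unsigned int w)
{
	return (unsigned int)__builtin_popcount(w);
}

static unsigned int hw64_builtin(unsigned long long w)
{
	return (unsigned int)__builtin_popcountll(w);
}
```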
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 9:19 ` Uros Bizjak 2025-03-29 11:00 ` David Laight 2025-03-29 23:10 ` H. Peter Anvin @ 2025-03-30 9:56 ` Ingo Molnar 2025-03-30 16:07 ` Uros Bizjak 2 siblings, 1 reply; 25+ messages in thread From: Ingo Molnar @ 2025-03-30 9:56 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin * Uros Bizjak <ubizjak@gmail.com> wrote: > > So a better optimization I think would be to declare and implement > > __sw_hweight32 with a different, less intrusive function call ABI > > that > > With an external function, the ABI specifies the location of input > argument and function result. This is all within the kernel, and __sw_hweight32() is implemented in the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros are fully under our control as well - so we have full control over the calling convention. I.e. in principle there's no need for the __sw_hweight32 function utilized by ALTERNATIVE() to be a C-call-ABI external function with all its call-clobbering constraints that disturb register state affected by the C-call-ABI. (RDI RSI RDX RCX R8 R9) The calling convention used is the kernel's choice, which we can re-evaluate. For example, we could make a version of __sw_hweight32 that is a largely no-clobber function that only touches a single register, which receives its input in RAX and returns the result to RAX (as usual), and saves/restores everything else. This pushes overhead into the uncommon case (__sw_hweight32 users) and reduces register pressure on the calling site. I'm not saying it's *worth* it for POPCNTL emulation alone: - The code generation benefits might or might not be there. Needs to be examined. - There may be some trouble with on-stack red zones used by the compiler, if the compiler doesn't know that a call was done.
- Plus rolling a different calling convention down the alternatives patching macros will have some maintenance overhead side effects. Possibly other usecases need to be found as well for this to be worth it. But I wanted to bust the false assumption you seem to be making about C-call-ABI constraints. Thanks, Ingo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 9:56 ` Ingo Molnar @ 2025-03-30 16:07 ` Uros Bizjak 2025-03-30 18:15 ` David Laight 2025-03-30 18:54 ` Ingo Molnar 0 siblings, 2 replies; 25+ messages in thread From: Uros Bizjak @ 2025-03-30 16:07 UTC (permalink / raw) To: Ingo Molnar Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sun, Mar 30, 2025 at 11:56 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > So a better optimization I think would be to declare and implement > > > __sw_hweight32 with a different, less intrusive function call ABI > > > that > > > > With an external function, the ABI specifies the location of input > > argument and function result. > > This is all within the kernel, and __sw_hweight32() is implemented in > the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros > are fully under our control as well - so we have full control over the > calling convention. There is a minor issue with a generic prototype in <linux/bitops.h>, where we declare: extern unsigned int __sw_hweight32(unsigned int w); extern unsigned long __sw_hweight64(__u64 w); This creates a bit of mixup, so perhaps it is better to define and use an x86 specific function name. With this issue out of the way, we can use %eax/%rax as inout register for x86_32 and x86_64 targets, and remove: #ifdef CONFIG_64BIT #define REG_IN "D" #define REG_OUT "a" #else #define REG_IN "a" #define REG_OUT "a" #endif from <arch/x86/include/asm/arch_hweight.h> In parallel, we should remove x86_64 conditional compilation from arch/x86/lib/hweight.S: SYM_FUNC_START(__sw_hweight32) #ifdef CONFIG_X86_64 movl %edi, %eax # w #endif __ASM_SIZE(push,) %__ASM_REG(dx) movl %eax, %edx # w -> t ... > Ie. 
in principle there's no need for the __sw_hweight32 function > utilized by ALTERNATIVE() to be a C-call-ABI external function with all > its call-clobbering constraints that disturb register state affected > by the C-call-ABI. (RDI RSI RDX RCX R8 R9) > > The calling convention used is the kernel's choice, which we can > re-evaluate. > > For example, we could make a version of __sw_hweight32 that is a > largely no-clobber function that only touches a single register, which > receives its input in RAX and returns the result to RAX (as usual), and > saves/restores everything else. This pushes overhead into the uncommon > case (__sw_hweight32 users) and reduces register pressure on the > calling site. > > I'm not saying it's *worth* it for POPCNTL emulation alone: > > - The code generation benefits might or might not be there. Needs to > be examined. Matching inputs with output will actually make the instruction "destructive", so the compiler will have to copy the input argument when it won't die in the instruction. This is not desirable. The code generation benefits would come from relaxing the input constraint to "rm", which is not possible with a fallback function. > > - There may be some trouble with on-stack red zones used by the > compiler, if the compiler doesn't know that a call was done. The kernel is currently compiled with -mno-red-zone; gcc-15 introduces a special "redzone" clobber to disable the red zone in the function that includes asm() when/if the kernel starts using the red zone. > - Plus rolling a different calling convention down the alternatives > patching macros will have some maintenance overhead side effects. > Possibly other usecases need to be found as well for this to be > worth it. I think that adding a __POPCNT__ version (similar to my original patch) would bring the most benefit, because we could use "rm" input and "=r" output registers, without the constraints enforced by the fallback function call.
This is only possible with the new -march=native functionality. > But I wanted to bust the false assumption you seem to be making about > C-call-ABI constraints. Thanks, Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
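The software fallback discussed in this subthread (the roughly 20-instruction __sw_hweight32) implements the usual branch-free SWAR reduction; a C sketch of the same algorithm with the standard masks (cf. the generic lib/hweight.c; the function name here is illustrative):

```c
#include <assert.h>

/*
 * Classic SWAR population count: sum bits in 2-bit, then 4-bit, then
 * 8-bit groups, and finally add the four byte sums with one multiply.
 * About a dozen arithmetic operations, no branches.
 */
static unsigned int sw_hweight32_sketch(unsigned int w)
{
	unsigned int res;

	res = w - ((w >> 1) & 0x55555555u);			/* 2-bit sums */
	res = (res & 0x33333333u) + ((res >> 2) & 0x33333333u);	/* 4-bit sums */
	res = (res + (res >> 4)) & 0x0f0f0f0fu;			/* 8-bit sums */
	return (res * 0x01010101u) >> 24;			/* add bytes  */
}
```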
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 16:07 ` Uros Bizjak @ 2025-03-30 18:15 ` David Laight 2025-03-30 22:44 ` H. Peter Anvin 2025-03-30 18:54 ` Ingo Molnar 1 sibling, 1 reply; 25+ messages in thread From: David Laight @ 2025-03-30 18:15 UTC (permalink / raw) To: Uros Bizjak Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sun, 30 Mar 2025 18:07:30 +0200 Uros Bizjak <ubizjak@gmail.com> wrote: ... > The kernel is currently compiled with -mno-red-zone, gcc-15 introduces > special "redzone" clobber to disable red-zone in the function that > includes asm() when/if the kernel starts using redzone. I really don't understand the point of a stack 'red-zone'. In any function that contains memory writes the cost of a 'sub $n,%rsp' at the top and corresponding add at the bottom is surely noise. David. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 18:15 ` David Laight @ 2025-03-30 22:44 ` H. Peter Anvin 0 siblings, 0 replies; 25+ messages in thread From: H. Peter Anvin @ 2025-03-30 22:44 UTC (permalink / raw) To: David Laight, Uros Bizjak Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen On March 30, 2025 11:15:44 AM PDT, David Laight <david.laight.linux@gmail.com> wrote: >On Sun, 30 Mar 2025 18:07:30 +0200 >Uros Bizjak <ubizjak@gmail.com> wrote: > >... >> The kernel is currently compiled with -mno-red-zone, gcc-15 introduces >> special "redzone" clobber to disable red-zone in the function that >> includes asm() when/if the kernel starts using redzone. > >I really don't understand the point of a stack 'red-zone'. >In any function that contains memory writes the cost of a >'sub $n,%rsp' at the top and corresponding add at the bottom >is surely noise. > > David. Perhaps you should check out the performance numbers instead. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 16:07 ` Uros Bizjak 2025-03-30 18:15 ` David Laight @ 2025-03-30 18:54 ` Ingo Molnar 1 sibling, 0 replies; 25+ messages in thread From: Ingo Molnar @ 2025-03-30 18:54 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin * Uros Bizjak <ubizjak@gmail.com> wrote: > On Sun, Mar 30, 2025 at 11:56 AM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > So a better optimization I think would be to declare and implement > > > > __sw_hweight32 with a different, less intrusive function call ABI > > > > that > > > > > > With an external function, the ABI specifies the location of input > > > argument and function result. > > > > This is all within the kernel, and __sw_hweight32() is implemented in > > the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros > > are fully under our control as well - so we have full control over the > > calling convention. > > There is a minor issue with a generic prototype in <linux/bitops.h>, > where we declare: > > extern unsigned int __sw_hweight32(unsigned int w); > extern unsigned long __sw_hweight64(__u64 w); > > This creates a bit of mixup, so perhaps it is better to define and use > an x86 specific function name. Yes, I alluded to this complication: > > For example, we could make a version of __sw_hweight32 that is a > > largely no-clobber function that only touches a single register, which That version of __sw_hweight32 would be a different symbol. > > I'm not saying it's *worth* it for POPCNTL emulation alone: > > > > - The code generation benefits might or might not be there. Needs to > > be examined. > > Matching inputs with output will actually make the instruction > "destructive", so the compiler will have to copy the input argument > when it won't die in the instruction. This is not desirable. 
Yeah, absolutely - it was mainly a demonstration that even single-clobber functions are possible. (There are even zero-clobber functions, like __fentry__.) > I think that adding a __POPCNT__ version (similar to my original > patch) would bring the most benefit, because we could use "rm" input > and "=r" output registers, without any constraints, enforced by > fallback function call. This is only possible with a new > -march=native functionality. Yeah, -march=native might be nice for local tinkering, but it won't reach 99.999% of Linux users - so it's immaterial to this particular discussion. Also, is POPCNTL the best example for this? Are there no other, more frequently used ALTERNATIVE() patching sites with function call alternatives that disturb the register state of important kernel functions? (And I don't know the answer.) Thanks, Ingo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn 2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak 2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak @ 2025-03-25 17:09 ` Borislav Petkov 2025-03-25 17:17 ` Uros Bizjak 2025-03-25 21:50 ` Ingo Molnar 2 siblings, 1 reply; 25+ messages in thread From: Borislav Petkov @ 2025-03-25 17:09 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen, H. Peter Anvin On Tue, Mar 25, 2025 at 05:48:37PM +0100, Uros Bizjak wrote: > +/* > + * On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction > + * appears to have a false dependency on the destination register. Even > + * though the instruction only writes to it, the instruction will wait > + * until destination is ready before executing. This false dependency > + * was fixed for Cannon Lake (and later) processors. Any official documentation about that? Any performance numbers to justify that change? Because if it doesn't matter, why do it in the first place? Especially if you're doing this XORing now for *everyone* - not just the affected parties. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn 2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov @ 2025-03-25 17:17 ` Uros Bizjak 2025-03-25 17:44 ` Borislav Petkov 0 siblings, 1 reply; 25+ messages in thread From: Uros Bizjak @ 2025-03-25 17:17 UTC (permalink / raw) To: Borislav Petkov Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen, H. Peter Anvin On Tue, Mar 25, 2025 at 6:10 PM Borislav Petkov <bp@alien8.de> wrote: > > On Tue, Mar 25, 2025 at 05:48:37PM +0100, Uros Bizjak wrote: > > +/* > > + * On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction > > + * appears to have a false dependency on the destination register. Even > > + * though the instruction only writes to it, the instruction will wait > > + * until destination is ready before executing. This false dependency > > + * was fixed for Cannon Lake (and later) processors. > > Any official documentation about that? Please see [1], errata 026. [1] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/8th-gen-core-spec-update.pdf > > Any performance numbers to justify that change? There is a lot of performance analysis at [2] and [3]. [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011 [3] https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati > Because if it doesn't matter, why do it in the first place? Especially if > you're doing this XORing now for *everyone* - not just the affected parties. Thanks, Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn 2025-03-25 17:17 ` Uros Bizjak @ 2025-03-25 17:44 ` Borislav Petkov 0 siblings, 0 replies; 25+ messages in thread From: Borislav Petkov @ 2025-03-25 17:44 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen, H. Peter Anvin On Tue, Mar 25, 2025 at 06:17:01PM +0100, Uros Bizjak wrote: > Please see [1], errata 026. Then next time put at least a reference to such an official source. Because right now it reads like a wild guess. > There is a lot of performance analysis at [2] and [3]. You mean the commit message should not mention it - people should use a search engine to find that?! Lemme quote our documentation for you: "Quantify optimizations and trade-offs. If you claim improvements in performance, memory consumption, stack footprint, or binary size, include numbers that back them up. But also describe non-obvious costs. Optimizations usually aren’t free but trade-offs between CPU, memory, and readability; or, when it comes to heuristics, between different workloads. Describe the expected downsides of your optimization so that the reviewer can weigh costs against benefits." From: https://kernel.org/doc/html/latest/process/submitting-patches.html Please peruse that document when you get a chance. So my question still is: does that change have *any* *impact* on *kernel* workloads? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn
  2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak
  2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
  2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov
@ 2025-03-25 21:50 ` Ingo Molnar
  2 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2025-03-25 21:50 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

* Uros Bizjak <ubizjak@gmail.com> wrote:

> On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction
> appears to have a false dependency on the destination register. Even
> though the instruction only writes to it, the instruction will wait
> until destination is ready before executing. This false dependency
> was fixed for Cannon Lake (and later) processors.
>
> Fix false dependency by clearing the destination register first.
>
> The x86_64 defconfig object size increases by 779 bytes:
>
>   text      data     bss     dec       hex      filename
>   27341418  4643015  814852  32799285  1f47a35  vmlinux-old.o
>   27342197  4643015  814852  32800064  1f47d40  vmlinux-new.o

I don't think adding an instruction for an old-microarchitecture weakness
that has been fixed in new hardware already is worth bloating the kernel.

Cannon Lake was released in 2018, 7 years ago. It will be 1-2 years until
such a change percolates to Linux users, and by that time the
microarchitecture with the fix (Cannon Lake) will be a decade old, and
a majority of Intel CPU users will be using it.

So I don't think this particular change is worth it, unless the false
dependency can be quantified to have a huge impact on pre-Cannon-Lake
CPUs - which I don't think it is.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
end of thread, other threads:[~2025-03-30 22:44 UTC | newest]

Thread overview: 25+ messages (links below jump to the message on this page):
2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak
2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
2025-03-25 17:11 ` Borislav Petkov
2025-03-30 15:15 ` Uros Bizjak
2025-03-30 17:31 ` Borislav Petkov
2025-03-30 18:47 ` Ingo Molnar
2025-03-30 19:06 ` Borislav Petkov
2025-03-30 19:20 ` Ingo Molnar
2025-03-30 19:28 ` Borislav Petkov
2025-03-25 21:56 ` Ingo Molnar
2025-03-29  9:19 ` Uros Bizjak
2025-03-29 11:00 ` David Laight
2025-03-30  7:49 ` Uros Bizjak
2025-03-30 18:02 ` David Laight
2025-03-29 23:10 ` H. Peter Anvin
2025-03-30  6:54 ` Uros Bizjak
2025-03-30  9:56 ` Ingo Molnar
2025-03-30 16:07 ` Uros Bizjak
2025-03-30 18:15 ` David Laight
2025-03-30 22:44 ` H. Peter Anvin
2025-03-30 18:54 ` Ingo Molnar
2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov
2025-03-25 17:17 ` Uros Bizjak
2025-03-25 17:44 ` Borislav Petkov
2025-03-25 21:50 ` Ingo Molnar