* [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn
@ 2025-03-25 16:48 Uros Bizjak
2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Uros Bizjak @ 2025-03-25 16:48 UTC (permalink / raw)
To: x86, linux-kernel
Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin
On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction
appears to have a false dependency on the destination register. Even
though the instruction only writes to it, the instruction will wait
until the destination is ready before executing. This false dependency
was fixed for Cannon Lake (and later) processors.
Fix false dependency by clearing the destination register first.
The x86_64 defconfig object size increases by 779 bytes:
text data bss dec hex filename
27341418 4643015 814852 32799285 1f47a35 vmlinux-old.o
27342197 4643015 814852 32800064 1f47d40 vmlinux-new.o
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
arch/x86/include/asm/arch_hweight.h | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
index cbc6157f0b4b..aa0b3bd309fc 100644
--- a/arch/x86/include/asm/arch_hweight.h
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -4,12 +4,21 @@
#include <asm/cpufeatures.h>
+/*
+ * On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction
+ * appears to have a false dependency on the destination register. Even
+ * though the instruction only writes to it, the instruction will wait
+ * until destination is ready before executing. This false dependency
+ * was fixed for Cannon Lake (and later) processors.
+ */
+#define ASM_FORCE_CLR "xorl %k[cnt], %k[cnt]\n\t"
+
#ifdef CONFIG_64BIT
#define REG_IN "D"
-#define REG_OUT "a"
+#define ASM_CLR ASM_FORCE_CLR
#else
#define REG_IN "a"
-#define REG_OUT "a"
+#define ASM_CLR
#endif
static __always_inline unsigned int __arch_hweight32(unsigned int w)
@@ -18,8 +27,9 @@ static __always_inline unsigned int __arch_hweight32(unsigned int w)
asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
"call __sw_hweight32",
- "popcntl %[val], %[cnt]", X86_FEATURE_POPCNT)
- : [cnt] "=" REG_OUT (res), ASM_CALL_CONSTRAINT
+ ASM_CLR "popcntl %[val], %[cnt]",
+ X86_FEATURE_POPCNT)
+ : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
: [val] REG_IN (w));
return res;
@@ -48,8 +58,9 @@ static __always_inline unsigned long __arch_hweight64(__u64 w)
asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
"call __sw_hweight64",
- "popcntq %[val], %[cnt]", X86_FEATURE_POPCNT)
- : [cnt] "=" REG_OUT (res), ASM_CALL_CONSTRAINT
+ ASM_CLR "popcntq %[val], %[cnt]",
+ X86_FEATURE_POPCNT)
+ : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
: [val] REG_IN (w));
return res;
--
2.42.0
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak
@ 2025-03-25 16:48 ` Uros Bizjak
  2025-03-25 17:11   ` Borislav Petkov
  2025-03-25 21:56   ` Ingo Molnar
  2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov
  2025-03-25 21:50 ` Ingo Molnar
  2 siblings, 2 replies; 25+ messages in thread
From: Uros Bizjak @ 2025-03-25 16:48 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin

Emit naked POPCNT instruction when available with X86_NATIVE_CPU
option. The compiler is not bound by ABI when emitting the instruction
without the fallback call to __sw_hweight{32,64}() library function
and has much more freedom to allocate input and output operands,
including memory input operand.

The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
shrinks by 599 bytes:

  add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
  Total: Before=22710531, After=22709932, chg -0.00%

The asm changes from e.g.:

   3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
   3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
   3bfa8: 90                  	nop
   3bfa9: 90                  	nop

with:

     34b: 31 c0               	xor    %eax,%eax
     34d: f3 48 0f b8 c7      	popcnt %rdi,%rax

in the .altinstr_replacement section

to:

   3bfdc: 31 c0               	xor    %eax,%eax
   3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
   3bfe5: 00 00

where there is no need for an entry in the .altinstr_replacement
section, shrinking all text sections by 9476 bytes:

     text	   data	    bss	    dec	    hex	filename
 27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
 27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/arch_hweight.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/arch_hweight.h b/arch/x86/include/asm/arch_hweight.h
index aa0b3bd309fc..d39710e57531 100644
--- a/arch/x86/include/asm/arch_hweight.h
+++ b/arch/x86/include/asm/arch_hweight.h
@@ -25,13 +25,18 @@ static __always_inline unsigned int __arch_hweight32(unsigned int w)
 {
 	unsigned int res;
 
+#ifdef __POPCNT__
+	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
+		    : [cnt] "=&r" (res)
+		    : [val] ASM_INPUT_RM (w));
+#else
 	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
 				"call __sw_hweight32",
 				ASM_CLR "popcntl %[val], %[cnt]",
 				X86_FEATURE_POPCNT)
 		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
 		    : [val] REG_IN (w));
-
+#endif
 	return res;
 }
 
@@ -56,13 +61,18 @@ static __always_inline unsigned long __arch_hweight64(__u64 w)
 {
 	unsigned long res;
 
+#ifdef __POPCNT__
+	asm_inline (ASM_FORCE_CLR "popcntq %[val], %[cnt]"
+		    : [cnt] "=&r" (res)
+		    : [val] ASM_INPUT_RM (w));
+#else
 	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
 				"call __sw_hweight64",
 				ASM_CLR "popcntq %[val], %[cnt]",
 				X86_FEATURE_POPCNT)
 		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
 		    : [val] REG_IN (w));
-
+#endif
 	return res;
 }
 
 #endif /* CONFIG_X86_32 */
-- 
2.42.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
@ 2025-03-25 17:11   ` Borislav Petkov
  2025-03-30 15:15     ` Uros Bizjak
  2025-03-25 21:56   ` Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2025-03-25 17:11 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 05:48:38PM +0100, Uros Bizjak wrote:
> +#ifdef __POPCNT__
> +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +		    : [cnt] "=&r" (res)
> +		    : [val] ASM_INPUT_RM (w));
> +#else
>  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>  				"call __sw_hweight32",
>  				ASM_CLR "popcntl %[val], %[cnt]",
>  				X86_FEATURE_POPCNT)
>  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>  		    : [val] REG_IN (w));
> -
> +#endif

A whopping 599 bytes which makes the asm more ugly.

Not worth the effort IMO.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 17:11   ` Borislav Petkov
@ 2025-03-30 15:15     ` Uros Bizjak
  2025-03-30 17:31       ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Uros Bizjak @ 2025-03-30 15:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 6:11 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, Mar 25, 2025 at 05:48:38PM +0100, Uros Bizjak wrote:
> > +#ifdef __POPCNT__
> > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > +		    : [cnt] "=&r" (res)
> > +		    : [val] ASM_INPUT_RM (w));
> > +#else
> >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> >  				"call __sw_hweight32",
> >  				ASM_CLR "popcntl %[val], %[cnt]",
> >  				X86_FEATURE_POPCNT)
> >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> >  		    : [val] REG_IN (w));
> > -
> > +#endif
>
> A whopping 599 bytes which makes the asm more ugly.
>
> Not worth the effort IMO.

You missed this part:

--q--
... where there is no need for an entry in the .altinstr_replacement
section, shrinking all text sections by 9476 bytes:

     text	   data	    bss	    dec	    hex	filename
 27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
 27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o
--/q--

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 15:15     ` Uros Bizjak
@ 2025-03-30 17:31       ` Borislav Petkov
  2025-03-30 18:47         ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2025-03-30 17:31 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Sun, Mar 30, 2025 at 05:15:16PM +0200, Uros Bizjak wrote:
> You missed this part:
>
> --q--
> ... where there is no need for an entry in the .altinstr_replacement

Not really - .altinstr* memory gets jettisoned after boot.

> section, shrinking all text sections by 9476 bytes:

So if anything, this saves a whopping ~9K disk space and makes hweight*
an unreadable mess.

So no, thanks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 17:31       ` Borislav Petkov
@ 2025-03-30 18:47         ` Ingo Molnar
  2025-03-30 19:06           ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2025-03-30 18:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

* Borislav Petkov <bp@alien8.de> wrote:

> On Sun, Mar 30, 2025 at 05:15:16PM +0200, Uros Bizjak wrote:
> > You missed this part:
> >
> > --q--
> > ... where there is no need for an entry in the .altinstr_replacement
>
> Not really - .altinstr* memory gets jettisoned after boot.
>
> > section, shrinking all text sections by 9476 bytes:
>
> So if anything, this saves a whopping ~9K disk space and makes
> hweight* an unreadable mess.

What unreadable mess? The proposed patch is:

+#ifdef __POPCNT__
+	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
+		    : [cnt] "=&r" (res)
+		    : [val] ASM_INPUT_RM (w));
+#else
 	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
 				"call __sw_hweight32",
 				ASM_CLR "popcntl %[val], %[cnt]",
 				X86_FEATURE_POPCNT)
 		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
 		    : [val] REG_IN (w));
-
+#endif

Which is 3 straightforward lines of assembly code and a straightforward
#ifdef.

My main objection is different: if __POPCNT__ isn't defined during the
kernel build of major Linux distros, then this optimization almost
doesn't exist to our users. And I don't think it's defined.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 18:47         ` Ingo Molnar
@ 2025-03-30 19:06           ` Borislav Petkov
  2025-03-30 19:20             ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2025-03-30 19:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

On Sun, Mar 30, 2025 at 08:47:31PM +0200, Ingo Molnar wrote:
> +#ifdef __POPCNT__
> +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +		    : [cnt] "=&r" (res)
> +		    : [val] ASM_INPUT_RM (w));
> +#else
>  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>  				"call __sw_hweight32",
>  				ASM_CLR "popcntl %[val], %[cnt]",
>  				X86_FEATURE_POPCNT)
>  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>  		    : [val] REG_IN (w));
> -
> +#endif

That ifdeffery.

The alternative only is fine as this is the usual way we do those insns.

The ifdeffery around it is ugly and is pushing it and it would be fine
if it would bring anything but it doesn't. It is making the code ugly
for no good reason whatsoever.

> Which is 3 straightforward lines of assembly code and a straightforward #ifdef.

And they bring what exactly?

I haven't seen anything besides some super minor, completely pointless,
hm, "savings". So much so that the uglification of the function is not
worth it in the *least*.

When I look at that code, I need to wonder now, is the __POPCNT__
defined or not?

> My main objection is different: if __POPCNT__ isn't defined during the
> kernel build of major Linux distros, then this optimization almost
> doesn't exist to our users. And I don't think it's defined.

Yah, that too.

This whole effort is a total waste of time and energy.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 19:06           ` Borislav Petkov
@ 2025-03-30 19:20             ` Ingo Molnar
  2025-03-30 19:28               ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2025-03-30 19:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

* Borislav Petkov <bp@alien8.de> wrote:

> On Sun, Mar 30, 2025 at 08:47:31PM +0200, Ingo Molnar wrote:
> > +#ifdef __POPCNT__
> > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > +		    : [cnt] "=&r" (res)
> > +		    : [val] ASM_INPUT_RM (w));
> > +#else
> >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> >  				"call __sw_hweight32",
> >  				ASM_CLR "popcntl %[val], %[cnt]",
> >  				X86_FEATURE_POPCNT)
> >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> >  		    : [val] REG_IN (w));
> > -
> > +#endif
>
> That ifdeffery.
>
> The alternative only is fine as this is the usual way we do those
> insns.
>
> The ifdeffery around it is ugly and is pushing it and it would be
> fine if it would bring anything but it doesn't. It is making the code
> ugly for no good reason whatsoever.

Tangible code size reduction, if it can be realized, is definitely
'something', so your claim is simply false.

> > Which is 3 straightforward lines of assembly code and a
> > straightforward #ifdef.
>
> And they bring what exactly?
>
> I haven't seen anything besides some super minor, completely
> pointless, hm, "savings". So much so that the uglification of the
> function is not worth it in the *least*.

Even 0.5K of .text reduction is a tangible benefit. The kernel's 35
years long history comprises of literally over a million patches, which
were small and inconsequential 99% of the time.

> > My main objection is different: if __POPCNT__ isn't defined during
> > the kernel build of major Linux distros, then this optimization
> > almost doesn't exist to our users. And I don't think it's defined.
>
> Yah, that too.
>
> This whole effort is a total waste of time and energy.

We don't know yet for sure, but I don't think an absolutist "can't do"
approach is very productive.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-30 19:20             ` Ingo Molnar
@ 2025-03-30 19:28               ` Borislav Petkov
  0 siblings, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2025-03-30 19:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Uros Bizjak, x86, linux-kernel, Thomas Gleixner, Dave Hansen,
	H. Peter Anvin

On Sun, Mar 30, 2025 at 09:20:28PM +0200, Ingo Molnar wrote:
> Tangible code size reduction, if it can be realized, is definitely
> 'something', so your claim is simply false.

I don't think you're reading my mails.

What "tangible code size reduction"? 9K vmlinux size? For sections
which are getting jettisoned after boot?

So now the rule is what: make the asm an unreadable mess just to save
9K disk space?

I call that bullshit. A total and stinking one.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
  2025-03-25 17:11   ` Borislav Petkov
@ 2025-03-25 21:56   ` Ingo Molnar
  2025-03-29  9:19     ` Uros Bizjak
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2025-03-25 21:56 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

* Uros Bizjak <ubizjak@gmail.com> wrote:

> Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> option. The compiler is not bound by ABI when emitting the instruction
> without the fallback call to __sw_hweight{32,64}() library function
> and has much more freedom to allocate input and output operands,
> including memory input operand.
>
> The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> shrinks by 599 bytes:
>
>   add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
>   Total: Before=22710531, After=22709932, chg -0.00%
>
> The asm changes from e.g.:
>
>    3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
>    3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
>    3bfa8: 90                  	nop
>    3bfa9: 90                  	nop
>
> with:
>
>      34b: 31 c0               	xor    %eax,%eax
>      34d: f3 48 0f b8 c7      	popcnt %rdi,%rax
>
> in the .altinstr_replacement section
>
> to:
>
>    3bfdc: 31 c0               	xor    %eax,%eax
>    3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
>    3bfe5: 00 00
>
> where there is no need for an entry in the .altinstr_replacement
> section, shrinking all text sections by 9476 bytes:
>
>      text	   data	    bss	    dec	    hex	filename
>  27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
>  27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o

> +#ifdef __POPCNT__
> +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> +		    : [cnt] "=&r" (res)
> +		    : [val] ASM_INPUT_RM (w));
> +#else
>  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
>  				"call __sw_hweight32",
>  				ASM_CLR "popcntl %[val], %[cnt]",
>  				X86_FEATURE_POPCNT)
>  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
>  		    : [val] REG_IN (w));

So a better optimization I think would be to declare and implement
__sw_hweight32 with a different, less intrusive function call ABI that
mirrors that of the instruction in essence, so that we optimize for the
overwhelmingly common case of having the POPCNT instruction.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-25 21:56   ` Ingo Molnar
@ 2025-03-29  9:19     ` Uros Bizjak
  2025-03-29 11:00       ` David Laight
                         ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Uros Bizjak @ 2025-03-29 9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> > option. The compiler is not bound by ABI when emitting the instruction
> > without the fallback call to __sw_hweight{32,64}() library function
> > and has much more freedom to allocate input and output operands,
> > including memory input operand.
> >
> > The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> > shrinks by 599 bytes:
> >
> >   add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
> >   Total: Before=22710531, After=22709932, chg -0.00%
> >
> > The asm changes from e.g.:
> >
> >    3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
> >    3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
> >    3bfa8: 90                  	nop
> >    3bfa9: 90                  	nop
> >
> > with:
> >
> >      34b: 31 c0               	xor    %eax,%eax
> >      34d: f3 48 0f b8 c7      	popcnt %rdi,%rax
> >
> > in the .altinstr_replacement section
> >
> > to:
> >
> >    3bfdc: 31 c0               	xor    %eax,%eax
> >    3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
> >    3bfe5: 00 00
> >
> > where there is no need for an entry in the .altinstr_replacement
> > section, shrinking all text sections by 9476 bytes:
> >
> >      text	   data	    bss	    dec	    hex	filename
> >  27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
> >  27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o
>
> > +#ifdef __POPCNT__
> > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > +		    : [cnt] "=&r" (res)
> > +		    : [val] ASM_INPUT_RM (w));
> > +#else
> >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> >  				"call __sw_hweight32",
> >  				ASM_CLR "popcntl %[val], %[cnt]",
> >  				X86_FEATURE_POPCNT)
> >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> >  		    : [val] REG_IN (w));
>
> So a better optimization I think would be to declare and implement
> __sw_hweight32 with a different, less intrusive function call ABI that

With an external function, the ABI specifies the location of input
argument and function result. Unless we want to declare the whole
function as asm() inline function (with some 20 instructions), we have
to specify the location of function arguments and where the function
result is to be found in the asm() that calls the external function.
Register allocator then uses this information to move arguments to the
right place before the call.

The above approach, when used to emulate an insn, has a drawback.
When the instruction is available as an alternative, it still has
fixed input and output registers, forced by the ABI of the function
call. Register allocator has to move registers unnecessarily to
satisfy the constraints of the function call, not the instruction
itself.

The proposed solution builds on the fact that with -march=native (and
also when -mpopcnt is specified on the command line), the compiler
signals the availability of certain ISA extensions by defining the
corresponding macro. We can use this macro to relax the constraints to
fit the instruction, not the ABI of the fallback function call. On
x86, we can also access memory directly, avoiding clobbering a
temporary input register.

Without the fix for (obsolete) false dependency, the change becomes simply:

#ifdef __POPCNT__
	asm ("popcntl %[val], %[cnt]"
	     : [cnt] "=r" (res)
	     : [val] ASM_INPUT_RM (w));
#else

and besides the reported savings of 600 bytes in the .text section
also allows the register allocator to schedule registers (and input
arguments from memory) more optimally, not counting additional 9k
saved space in the alternative section.

The patch is also an example of how -march=native enables further
optimizations involving additional ISAs.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option
  2025-03-29  9:19     ` Uros Bizjak
@ 2025-03-29 11:00       ` David Laight
  2025-03-30  7:49         ` Uros Bizjak
  0 siblings, 1 reply; 25+ messages in thread
From: David Laight @ 2025-03-29 11:00 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, H. Peter Anvin

On Sat, 29 Mar 2025 10:19:37 +0100
Uros Bizjak <ubizjak@gmail.com> wrote:

> On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > > Emit naked POPCNT instruction when available with X86_NATIVE_CPU
> > > option. The compiler is not bound by ABI when emitting the instruction
> > > without the fallback call to __sw_hweight{32,64}() library function
> > > and has much more freedom to allocate input and output operands,
> > > including memory input operand.
> > >
> > > The code size of x86_64 defconfig (with X86_NATIVE_CPU option)
> > > shrinks by 599 bytes:
> > >
> > >   add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599)
> > >   Total: Before=22710531, After=22709932, chg -0.00%
> > >
> > > The asm changes from e.g.:
> > >
> > >    3bf9c: 48 8b 3d 00 00 00 00	mov    0x0(%rip),%rdi
> > >    3bfa3: e8 00 00 00 00      	call   3bfa8 <...>
> > >    3bfa8: 90                  	nop
> > >    3bfa9: 90                  	nop
> > >
> > > with:
> > >
> > >      34b: 31 c0               	xor    %eax,%eax
> > >      34d: f3 48 0f b8 c7      	popcnt %rdi,%rax
> > >
> > > in the .altinstr_replacement section
> > >
> > > to:
> > >
> > >    3bfdc: 31 c0               	xor    %eax,%eax
> > >    3bfde: f3 48 0f b8 05 00 00	popcnt 0x0(%rip),%rax
> > >    3bfe5: 00 00
> > >
> > > where there is no need for an entry in the .altinstr_replacement
> > > section, shrinking all text sections by 9476 bytes:
> > >
> > >      text	   data	    bss	    dec	    hex	filename
> > >  27267068	4643047	 814852	32724967	1f357e7	vmlinux-old.o
> > >  27257592	4643047	 814852	32715491	1f332e3	vmlinux-new.o
> >
> > > +#ifdef __POPCNT__
> > > +	asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]"
> > > +		    : [cnt] "=&r" (res)
> > > +		    : [val] ASM_INPUT_RM (w));
> > > +#else
> > >  	asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE
> > >  				"call __sw_hweight32",
> > >  				ASM_CLR "popcntl %[val], %[cnt]",
> > >  				X86_FEATURE_POPCNT)
> > >  		    : [cnt] "=a" (res), ASM_CALL_CONSTRAINT
> > >  		    : [val] REG_IN (w));
> >
> > So a better optimization I think would be to declare and implement
> > __sw_hweight32 with a different, less intrusive function call ABI that
>
> With an external function, the ABI specifies the location of input
> argument and function result. Unless we want to declare the whole
> function as asm() inline function (with some 20 instructions), we have
> to specify the location of function arguments and where the function
> result is to be found in the asm() that calls the external function.
> Register allocator then uses this information to move arguments to the
> right place before the call.
>
> The above approach, when used to emulate an insn, has a drawback.
> When the instruction is available as an alternative, it still has
> fixed input and output registers, forced by the ABI of the function
> call. Register allocator has to move registers unnecessarily to
> satisfy the constraints of the function call, not the instruction
> itself.

Forcing the argument into a fixed register won't make much difference
to execution time.
Just a bit more work for the instruction decoder and a few more bytes
of I-cache.
(Register-register moves can be zero clocks.)
In many cases (but not as many as you might hope for) the compiler
back-tracks the input register requirement to the instruction that
generates the value.

In this case the called function needs two writeable registers.
I think you can tell gcc the input is invalidated and the output
is 'early clobber' so that the registers are different.

> The proposed solution builds on the fact that with -march=native (and
> also when -mpopcnt is specified on the command line) , the compiler
> signals the availability of certain ISA by defining the corresponding
> definition. We can use this definition to relax the constraints to fit
> the instruction, not the ABI of the fallback function call. On x86, we
> can also access memory directly, avoiding clobbering a temporary input
> register.
>
> Without the fix for (obsolete) false dependency, the change becomes simply:
>
> #ifdef __POPCNT__
>	asm ("popcntl %[val], %[cnt]"
>	     : [cnt] "=r" (res)
>	     : [val] ASM_INPUT_RM (w));
> #else
>
> and besides the reported savings of 600 bytes in the .text section
> also allows the register allocator to schedule registers (and input
> arguments from memory) more optimally, not counting additional 9k
> saved space in the alternative section.
>
> The patch is also an example, how -march=native enables further
> optimizations involving additional ISAs.

To my mind it would be better to be able to specify the oldest cpu
type the build should support.
Either by actual cpu type (eg 'skylake' or 'zen2') or maybe by
a specific instruction (eg popcnt).
The scripts would then determine the appropriate compiler flags
and any extra -Dvar to generate appropriate code.

The arch/x86/Kconfig.cpu seems to be missing options to select
between 64bit cpus.
That would also be the place to add CONFIG defines that mirror the
X86_FEATURE_xxx flags.

That would be more flexible.

	David

>
> Thanks,
> Uros.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 11:00 ` David Laight @ 2025-03-30 7:49 ` Uros Bizjak 2025-03-30 18:02 ` David Laight 0 siblings, 1 reply; 25+ messages in thread From: Uros Bizjak @ 2025-03-30 7:49 UTC (permalink / raw) To: David Laight Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sat, Mar 29, 2025 at 12:00 PM David Laight <david.laight.linux@gmail.com> wrote: > > On Sat, 29 Mar 2025 10:19:37 +0100 > Uros Bizjak <ubizjak@gmail.com> wrote: > > > On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > Emit naked POPCNT instruction when available with X86_NATIVE_CPU > > > > option. The compiler is not bound by ABI when emitting the instruction > > > > without the fallback call to __sw_hweight{32,64}() library function > > > > and has much more freedom to allocate input and output operands, > > > > including memory input operand. 
> > > > > > > > The code size of x86_64 defconfig (with X86_NATIVE_CPU option) > > > > shrinks by 599 bytes: > > > > > > > > add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599) > > > > Total: Before=22710531, After=22709932, chg -0.00% > > > > > > > > The asm changes from e.g.: > > > > > > > > 3bf9c: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi > > > > 3bfa3: e8 00 00 00 00 call 3bfa8 <...> > > > > 3bfa8: 90 nop > > > > 3bfa9: 90 nop > > > > > > > > with: > > > > > > > > 34b: 31 c0 xor %eax,%eax > > > > 34d: f3 48 0f b8 c7 popcnt %rdi,%rax > > > > > > > > in the .altinstr_replacement section > > > > > > > > to: > > > > > > > > 3bfdc: 31 c0 xor %eax,%eax > > > > 3bfde: f3 48 0f b8 05 00 00 popcnt 0x0(%rip),%rax > > > > 3bfe5: 00 00 > > > > > > > > where there is no need for an entry in the .altinstr_replacement > > > > section, shrinking all text sections by 9476 bytes: > > > > > > > > text data bss dec hex filename > > > > 27267068 4643047 814852 32724967 1f357e7 vmlinux-old.o > > > > 27257592 4643047 814852 32715491 1f332e3 vmlinux-new.o > > > > > > > +#ifdef __POPCNT__ > > > > + asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]" > > > > + : [cnt] "=&r" (res) > > > > + : [val] ASM_INPUT_RM (w)); > > > > +#else > > > > asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE > > > > "call __sw_hweight32", > > > > ASM_CLR "popcntl %[val], %[cnt]", > > > > X86_FEATURE_POPCNT) > > > > : [cnt] "=a" (res), ASM_CALL_CONSTRAINT > > > > : [val] REG_IN (w)); > > > > > > So a better optimization I think would be to declare and implement > > > __sw_hweight32 with a different, less intrusive function call ABI that > > > > With an external function, the ABI specifies the location of input > > argument and function result. Unless we want to declare the whole > > function as asm() inline function (with some 20 instructions), we have > > to specify the location of function arguments and where the function > > result is to be found in the asm() that calls the external function. 
> > Register allocator then uses this information to move arguments to the > > right place before the call. > > > > The above approach, when used to emulate an insn, has a drawback. > > When the instruction is available as an alternative, it still has > > fixed input and output registers, forced by the ABI of the function > > call. Register allocator has to move registers unnecessarily to > > satisfy the constraints of the function call, not the instruction > > itself. > > Forcing the argument into a fixed register won't make much difference > to execution time. > Just a bit more work for the instruction decoder and a few more bytes > of I-cache. > (Register-register moves can be zero clocks.) > In many cases (but not as many as you might hope for) the compiler > back-tracks the input register requirement to the instruction that > generates the value. I'm afraid I don't fully understand what you mean by "back-tracking the input register requirement". However, with: asm("insn %0, %1" : "=r" (out) : "r" (in)); the compiler is not obliged to match input with output, although many times it does so (especially when input argument is dead). To avoid false dependence on the output, we should force the compiler to always match input and output: asm("insn %0, %1" : "=r" (out) : "0" (in)); and this will resolve false dependence (input register obviously needs to be ready before insn) at the expense of an extra move instruction in front of the insn in case input is not dead. This is unfortunately not possible when one of the alternatives is a function call, where location of input and output arguments is specified by ABI. > In this case the called function needs two writeable registers. > I think you can tell gcc the input is invalidated and the output > is 'early clobber' so that the register are different. 
Yes, my first patch used this approach, where output operand is cleared first: asm("xorl %0, %0; popcntl %1, %0" : "=&r" (out) : "rm" (in)); Please note that "earlyclobbered" output reg can't be matched with input reg, or with any reg that forms the address. > > The proposed solution builds on the fact that with -march=native (and > > also when -mpopcnt is specified on the command line) , the compiler > > signals the availability of certain ISA by defining the corresponding > > definition. We can use this definition to relax the constraints to fit > > the instruction, not the ABI of the fallback function call. On x86, we > > can also access memory directly, avoiding clobbering a temporary input > > register. > > > > Without the fix for (obsolete) false dependency, the change becomes simply: > > > > #ifdef __POPCNT__ > > asm ("popcntl %[val], %[cnt]" > > : [cnt] "=r" (res) > > : [val] ASM_INPUT_RM (w)); > > #else > > > > and besides the reported savings of 600 bytes in the .text section > > also allows the register allocator to schedule registers (and input > > arguments from memory) more optimally, not counting additional 9k > > saved space in the alternative section. > > > > The patch is also an example, how -march=native enables further > > optimizations involving additional ISAs. > > To my mind it would be better to be able to specify oldest cpu > type the build should support. > Either by actual cpu type (eg 'skylake' or 'zen2') or maybe by > a specific instruction (eg popcnt). > The scripts would then determine the appropriate compiler flags > and any extra -Dvar to generate appropriate code. Please note that with -march=native the compiler driver ("gcc") does this for you. -march=native expands to a series of -m compile flags (you can see these flags by passing -### to gcc) and each flag defines corresponding ISA macro when set. E.g., passing -mpopcnt defines __POPCNT__ macro. 
These macros can be used instead of -Dvar for conditional compilation that depends on -m ISA flags, passed to the compiler ("cc1"). > The arch/x86/Kconfig.cpu seems to be missing options to select > between 64bit cpus. > That would also be the place to add CONFIG defines that mirror the > X86_FEATURE_xxx flags. While the above -march=native expands in the driver, setting e.g. -march=skylake enables CPU capabilities in the compiler itself. However, using -march=...cpu... also sets corresponding ISA macros, so the proposed approach does not exclude Kconfig.cpu options. Automatically set ISA macros can be used instead of X86_FEATURE_xxx flags for maximum flexibility. Thanks, Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
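The ISA-macro approach described in this message can be sketched in plain userspace C (a minimal sketch, not the kernel code: the function name and the fallback loop are illustrative, and the kernel's real fallback is __sw_hweight32()):

```c
#include <assert.h>

/*
 * Sketch: when the compiler itself advertises POPCNT via the __POPCNT__
 * macro (-mpopcnt, or -march=native on a POPCNT-capable CPU), emit the
 * naked instruction with the relaxed "rm" input and "=r" output
 * constraints; otherwise fall back to portable C.
 */
static unsigned int hweight32_sketch(unsigned int w)
{
#if defined(__POPCNT__) && (defined(__x86_64__) || defined(__i386__))
	unsigned int res;

	asm ("popcntl %1, %0" : "=r" (res) : "rm" (w));
	return res;
#else
	unsigned int cnt = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		cnt++;
	}
	return cnt;
#endif
}
```

Built with gcc -mpopcnt (or -march=native on a POPCNT-capable machine) the asm branch is compiled in; on any other target or flag set the portable branch is used.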
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 7:49 ` Uros Bizjak @ 2025-03-30 18:02 ` David Laight 0 siblings, 0 replies; 25+ messages in thread From: David Laight @ 2025-03-30 18:02 UTC (permalink / raw) To: Uros Bizjak Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sun, 30 Mar 2025 09:49:43 +0200 Uros Bizjak <ubizjak@gmail.com> wrote: > On Sat, Mar 29, 2025 at 12:00 PM David Laight > <david.laight.linux@gmail.com> wrote: > > > > On Sat, 29 Mar 2025 10:19:37 +0100 > > Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > > > > > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > > > Emit naked POPCNT instruction when available with X86_NATIVE_CPU > > > > > option. The compiler is not bound by ABI when emitting the instruction > > > > > without the fallback call to __sw_hweight{32,64}() library function > > > > > and has much more freedom to allocate input and output operands, > > > > > including memory input operand. 
> > > > > > > > > > The code size of x86_64 defconfig (with X86_NATIVE_CPU option) > > > > > shrinks by 599 bytes: > > > > > > > > > > add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599) > > > > > Total: Before=22710531, After=22709932, chg -0.00% > > > > > > > > > > The asm changes from e.g.: > > > > > > > > > > 3bf9c: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi > > > > > 3bfa3: e8 00 00 00 00 call 3bfa8 <...> > > > > > 3bfa8: 90 nop > > > > > 3bfa9: 90 nop > > > > > > > > > > with: > > > > > > > > > > 34b: 31 c0 xor %eax,%eax > > > > > 34d: f3 48 0f b8 c7 popcnt %rdi,%rax > > > > > > > > > > in the .altinstr_replacement section > > > > > > > > > > to: > > > > > > > > > > 3bfdc: 31 c0 xor %eax,%eax > > > > > 3bfde: f3 48 0f b8 05 00 00 popcnt 0x0(%rip),%rax > > > > > 3bfe5: 00 00 > > > > > > > > > > where there is no need for an entry in the .altinstr_replacement > > > > > section, shrinking all text sections by 9476 bytes: > > > > > > > > > > text data bss dec hex filename > > > > > 27267068 4643047 814852 32724967 1f357e7 vmlinux-old.o > > > > > 27257592 4643047 814852 32715491 1f332e3 vmlinux-new.o > > > > > > > > > +#ifdef __POPCNT__ > > > > > + asm_inline (ASM_FORCE_CLR "popcntl %[val], %[cnt]" > > > > > + : [cnt] "=&r" (res) > > > > > + : [val] ASM_INPUT_RM (w)); > > > > > +#else > > > > > asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE > > > > > "call __sw_hweight32", > > > > > ASM_CLR "popcntl %[val], %[cnt]", > > > > > X86_FEATURE_POPCNT) > > > > > : [cnt] "=a" (res), ASM_CALL_CONSTRAINT > > > > > : [val] REG_IN (w)); > > > > > > > > So a better optimization I think would be to declare and implement > > > > __sw_hweight32 with a different, less intrusive function call ABI that > > > > > > With an external function, the ABI specifies the location of input > > > argument and function result. 
Unless we want to declare the whole > > > function as asm() inline function (with some 20 instructions), we have > > > to specify the location of function arguments and where the function > > > result is to be found in the asm() that calls the external function. > > > Register allocator then uses this information to move arguments to the > > > right place before the call. > > > > > > The above approach, when used to emulate an insn, has a drawback. > > > When the instruction is available as an alternative, it still has > > > fixed input and output registers, forced by the ABI of the function > > > call. Register allocator has to move registers unnecessarily to > > > satisfy the constraints of the function call, not the instruction > > > itself. > > > > Forcing the argument into a fixed register won't make much difference > > to execution time. > > Just a bit more work for the instruction decoder and a few more bytes > > of I-cache. > > (Register-register moves can be zero clocks.) > > In many cases (but not as many as you might hope for) the compiler > > back-tracks the input register requirement to the instruction that > > generates the value. > > I'm afraid I don't fully understand what you mean by "back-tracking > the input register requirement". If the asm block requires an input in %rdx then the instruction that creates the value would be expected to put it into %rdx ready for the asm block. Even if it doesn't a register-register move is often implemented without using the ALU by 'register renaming' (there is an indirection between the register 'number' the code uses and the physical latches that hold the value, multiple copies of (say) %rax can be live at the same time). > However, with: > > asm("insn %0, %1" : "=r" (out) : "r" (in)); > > the compiler is not obliged to match input with output, although many > times it does so (especially when input argument is dead). 
To avoid > false dependence on the output, we should force the compiler to always > match input and output: > > asm("insn %0, %1" : "=r" (out) : "0" (in)); I'd expect the compiler to generate better code if it is allowed to use separate registers for the input and output. It may be able to use the input value again. There is no 'dependency' on the output register (unless the instruction only updates the low bits). > > and this will resolve false dependence (input register obviously needs > to be ready before insn) at the expense of an extra move instruction > in front of the insn in case input is not dead. This is unfortunately > not possible when one of the alternatives is a function call, where > location of input and output arguments is specified by ABI. > > > In this case the called function needs two writeable registers. > > I think you can tell gcc the input is invalidated and the output > > is 'early clobber' so that the registers are different. > > Yes, my first patch used this approach, where output operand is cleared first: That clearing of the output serves an entirely different purpose. > > asm("xorl %0, %0; popcntl %1, %0" : "=&r" (out) : "rm" (in)); > > Please note that "earlyclobbered" output reg can't be matched with > input reg, or with any reg that forms the address. But you want them to be different. The called function needs to perform multiple 'shift and add' sequences. To do that it needs a scratch register. So if the asm block requires the input in %rdx, puts the output in %rax and destroys %rdx, you can write a function that doesn't need any other registers. If you try really hard you can make the called function depend on the register the compiler selects - and write a different copy for each. Not worth it here. David ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 9:19 ` Uros Bizjak 2025-03-29 11:00 ` David Laight @ 2025-03-29 23:10 ` H. Peter Anvin 2025-03-30 6:54 ` Uros Bizjak 2025-03-30 9:56 ` Ingo Molnar 2 siblings, 1 reply; 25+ messages in thread From: H. Peter Anvin @ 2025-03-29 23:10 UTC (permalink / raw) To: Uros Bizjak, Ingo Molnar Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen On March 29, 2025 2:19:37 AM PDT, Uros Bizjak <ubizjak@gmail.com> wrote: >On Tue, Mar 25, 2025 at 10:56 PM Ingo Molnar <mingo@kernel.org> wrote: >> >> >> * Uros Bizjak <ubizjak@gmail.com> wrote: >> >> > Emit naked POPCNT instruction when available with X86_NATIVE_CPU >> > option. The compiler is not bound by ABI when emitting the instruction >> > without the fallback call to __sw_hweight{32,64}() library function >> > and has much more freedom to allocate input and output operands, >> > including memory input operand. >> > >> > The code size of x86_64 defconfig (with X86_NATIVE_CPU option) >> > shrinks by 599 bytes: >> > >> > add/remove: 0/0 grow/shrink: 45/197 up/down: 843/-1442 (-599) >> > Total: Before=22710531, After=22709932, chg -0.00% >> > >> > The asm changes from e.g.: >> > >> > 3bf9c: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi >> > 3bfa3: e8 00 00 00 00 call 3bfa8 <...> >> > 3bfa8: 90 nop >> > 3bfa9: 90 nop >> > >> > with: >> > >> > 34b: 31 c0 xor %eax,%eax >> > 34d: f3 48 0f b8 c7 popcnt %rdi,%rax >> > >> > in the .altinstr_replacement section >> > >> > to: >> > >> > 3bfdc: 31 c0 xor %eax,%eax >> > 3bfde: f3 48 0f b8 05 00 00 popcnt 0x0(%rip),%rax >> > 3bfe5: 00 00 >> > >> > where there is no need for an entry in the .altinstr_replacement >> > section, shrinking all text sections by 9476 bytes: >> > >> > text data bss dec hex filename >> > 27267068 4643047 814852 32724967 1f357e7 vmlinux-old.o >> > 27257592 4643047 814852 32715491 1f332e3 vmlinux-new.o >> >> > +#ifdef __POPCNT__ >> > + asm_inline (ASM_FORCE_CLR 
"popcntl %[val], %[cnt]" >> > + : [cnt] "=&r" (res) >> > + : [val] ASM_INPUT_RM (w)); >> > +#else >> > asm_inline (ALTERNATIVE(ANNOTATE_IGNORE_ALTERNATIVE >> > "call __sw_hweight32", >> > ASM_CLR "popcntl %[val], %[cnt]", >> > X86_FEATURE_POPCNT) >> > : [cnt] "=a" (res), ASM_CALL_CONSTRAINT >> > : [val] REG_IN (w)); >> >> So a better optimization I think would be to declare and implement >> __sw_hweight32 with a different, less intrusive function call ABI that > >With an external function, the ABI specifies the location of input >argument and function result. Unless we want to declare the whole >function as asm() inline function (with some 20 instructions), we have >to specify the location of function arguments and where the function >result is to be found in the asm() that calls the external function. >Register allocator then uses this information to move arguments to the >right place before the call. > >The above approach, when used to emulate an insn, has a drawback. >When the instruction is available as an alternative, it still has >fixed input and output registers, forced by the ABI of the function >call. Register allocator has to move registers unnecessarily to >satisfy the constraints of the function call, not the instruction >itself. > >The proposed solution builds on the fact that with -march=native (and >also when -mpopcnt is specified on the command line) , the compiler >signals the availability of certain ISA by defining the corresponding >definition. We can use this definition to relax the constraints to fit >the instruction, not the ABI of the fallback function call. On x86, we >can also access memory directly, avoiding clobbering a temporary input >register. 
> >Without the fix for (obsolete) false dependency, the change becomes simply: > >#ifdef __POPCNT__ > asm ("popcntl %[val], %[cnt]" > : [cnt] "=r" (res) > : [val] ASM_INPUT_RM (w)); >#else > >and besides the reported savings of 600 bytes in the .text section >also allows the register allocator to schedule registers (and input >arguments from memory) more optimally, not counting additional 9k >saved space in the alternative section. > >The patch is also an example, how -march=native enables further >optimizations involving additional ISAs. > >Thanks, >Uros. > If you have __POPCNT__ defined, could you not simply use __builtin_popcount()? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 23:10 ` H. Peter Anvin @ 2025-03-30 6:54 ` Uros Bizjak 0 siblings, 0 replies; 25+ messages in thread From: Uros Bizjak @ 2025-03-30 6:54 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen On Sun, Mar 30, 2025 at 12:10 AM H. Peter Anvin <hpa@zytor.com> wrote: > > On March 29, 2025 2:19:37 AM PDT, Uros Bizjak <ubizjak@gmail.com> wrote: > > > >The proposed solution builds on the fact that with -march=native (and > >also when -mpopcnt is specified on the command line) , the compiler > >signals the availability of certain ISA by defining the corresponding > >definition. We can use this definition to relax the constraints to fit > >the instruction, not the ABI of the fallback function call. On x86, we > >can also access memory directly, avoiding clobbering a temporary input > >register. > > > >Without the fix for (obsolete) false dependency, the change becomes simply: > > > >#ifdef __POPCNT__ > > asm ("popcntl %[val], %[cnt]" > > : [cnt] "=r" (res) > > : [val] ASM_INPUT_RM (w)); > >#else > > > >and besides the reported savings of 600 bytes in the .text section > >also allows the register allocator to schedule registers (and input > >arguments from memory) more optimally, not counting additional 9k > >saved space in the alternative section. > > > >The patch is also an example, how -march=native enables further > >optimizations involving additional ISAs. > > If you have __POPCNT__ defined, could you not simply use __builtin_popcnt()? We can use it, but then the compiler (at least GCC) will start to emit false dependency fixups for POPCNT instructions (which we don't want; TZCNT has the same problem, where we agreed that it is not worth fixing for 10 years old cpus [1]). Please note that __builtin functions are not strictly required to be inlined and can generate a library call [2]. 
I have been burned by this fact with __builtin_parity(), so IMO the safest way in case of POPCNT insn is to use ISA macros only to determine the availability of ISA. Also note that optimizations with constant arguments are already performed elsewhere (c.f. <asm-generic/bitops/const_hweight.h>) and __arch_hweight() receives only variable arguments, so optimization opportunities with __builtin function are greatly reduced. [1] https://lore.kernel.org/lkml/CAFULd4ZzoW+vP_pa1hEF--gvsG8yaPLU8S7oBkJBZLP4Tirepw@mail.gmail.com/ [2] https://lore.kernel.org/linux-mm/CAKbZUD0N7bkuw_Le3Pr9o1V2BjjcY_YiLm8a8DPceubTdZ00GQ@mail.gmail.com/ Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
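For reference, the GCC/Clang builtin is spelled __builtin_popcount(), with __builtin_popcountll() for 64-bit operands; a minimal sketch of its use (wrapper names are illustrative). Whether it expands to a POPCNT instruction, to inline bit-twiddling, or to a libgcc helper call depends entirely on the target flags, which is exactly the concern raised in this reply:

```c
#include <assert.h>

/* __builtin_popcount() takes unsigned int; the ll variant covers
 * unsigned long long.  The emitted code (POPCNT insn, open-coded
 * arithmetic, or a library call) is the compiler's choice. */
static unsigned int hw32_builtin(unsigned int w)
{
	return (unsigned int)__builtin_popcount(w);
}

static unsigned int hw64_builtin(unsigned long long w)
{
	return (unsigned int)__builtin_popcountll(w);
}
```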
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-29 9:19 ` Uros Bizjak 2025-03-29 11:00 ` David Laight 2025-03-29 23:10 ` H. Peter Anvin @ 2025-03-30 9:56 ` Ingo Molnar 2025-03-30 16:07 ` Uros Bizjak 2 siblings, 1 reply; 25+ messages in thread From: Ingo Molnar @ 2025-03-30 9:56 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin * Uros Bizjak <ubizjak@gmail.com> wrote: > > So a better optimization I think would be to declare and implement > > __sw_hweight32 with a different, less intrusive function call ABI > > that > > With an external function, the ABI specifies the location of input > argument and function result. This is all within the kernel, and __sw_hweight32() is implemented in the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros are fully under our control as well - so we have full control over the calling convention. I.e. in principle there's no need for the __sw_hweight32 function utilized by ALTERNATIVE() to be a C-call-ABI external function with all its call-clobbering constraints that disturb register state affected by the C-call-ABI. (RDI RSI RDX RCX R8 R9) The calling convention used is the kernel's choice, which we can re-evaluate. For example, we could make a version of __sw_hweight32 that is a largely no-clobber function that only touches a single register, which receives its input in RAX and returns the result to RAX (as usual), and saves/restores everything else. This pushes overhead into the uncommon case (__sw_hweight32 users) and reduces register pressure on the calling site. I'm not saying it's *worth* it for POPCNTL emulation alone: - The code generation benefits might or might not be there. Needs to be examined. - There may be some trouble with on-stack red zones used by the compiler, if the compiler doesn't know that a call was done.
- Plus rolling a different calling convention down the alternatives patching macros will have some maintenance overhead side effects. Possibly other usecases need to be found as well for this to be worth it. But I wanted to bust the false assumption you seem to be making about C-call-ABI constraints. Thanks, Ingo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 9:56 ` Ingo Molnar @ 2025-03-30 16:07 ` Uros Bizjak 2025-03-30 18:15 ` David Laight 2025-03-30 18:54 ` Ingo Molnar 0 siblings, 2 replies; 25+ messages in thread From: Uros Bizjak @ 2025-03-30 16:07 UTC (permalink / raw) To: Ingo Molnar Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sun, Mar 30, 2025 at 11:56 AM Ingo Molnar <mingo@kernel.org> wrote: > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > So a better optimization I think would be to declare and implement > > > __sw_hweight32 with a different, less intrusive function call ABI > > > that > > > > With an external function, the ABI specifies the location of input > > argument and function result. > > This is all within the kernel, and __sw_hweight32() is implemented in > the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros > are fully under our control as well - so we have full control over the > calling convention. There is a minor issue with a generic prototype in <linux/bitops.h>, where we declare: extern unsigned int __sw_hweight32(unsigned int w); extern unsigned long __sw_hweight64(__u64 w); This creates a bit of mixup, so perhaps it is better to define and use an x86 specific function name. With this issue out of the way, we can use %eax/%rax as inout register for x86_32 and x86_64 targets, and remove: #ifdef CONFIG_64BIT #define REG_IN "D" #define REG_OUT "a" #else #define REG_IN "a" #define REG_OUT "a" #endif from <arch/x86/include/asm/arch_hweight.h> In parallel, we should remove x86_64 conditional compilation from arch/x86/lib/hweight.S: SYM_FUNC_START(__sw_hweight32) #ifdef CONFIG_X86_64 movl %edi, %eax # w #endif __ASM_SIZE(push,) %__ASM_REG(dx) movl %eax, %edx # w -> t ... > Ie. 
in principle there's no need for the __sw_hweight32 function > utilized by ALTERNATIVE() to be a C-call-ABI external function with all > its call-clobbering constraints that disturb register state affected > by the C-call-ABI. (RDI RSI RDX RCX R8 R9) > > The calling convention used is the kernel's choice, which we can > re-evaluate. > > For example, we could make a version of __sw_hweight32 that is a > largely no-clobber function that only touches a single register, which > receives its input in RAX and returns the result to RAX (as usual), and > saves/restores everything else. This pushes overhead into the uncommon > case (__sw_hweight32 users) and reduces register pressure on the > calling site. > > I'm not saying it's *worth* it for POPCNTL emulation alone: > > - The code generation benefits might or might not be there. Needs to > be examined. Matching inputs with output will actually make the instruction "destructive", so the compiler will have to copy the input argument when it won't die in the instruction. This is not desirable. The code generation benefits would come from relaxing the input constraint to "rm", which is not possible with a fallback function. > > - There may be some trouble with on-stack red zones used by the > compiler, if the compiler doesn't know that a call was done. The kernel is currently compiled with -mno-red-zone; gcc-15 introduces a special "redzone" clobber to disable the red zone in the function that includes asm() when/if the kernel starts using the red zone. > - Plus rolling a different calling convention down the alternatives > patching macros will have some maintenance overhead side effects. > Possibly other usecases need to be found as well for this to be > worth it. I think that adding a __POPCNT__ version (similar to my original patch) would bring the most benefit, because we could use "rm" input and "=r" output registers, without the constraints enforced by the fallback function call.
This is only possible with the new -march=native functionality. > But I wanted to bust the false assumption you seem to be making about > C-call-ABI constraints. Thanks, Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
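The software fallback discussed in this subthread (the roughly 20-instruction __sw_hweight32) implements the usual branch-free SWAR reduction; a C sketch of the same algorithm with the standard masks (cf. the generic lib/hweight.c; the function name here is illustrative):

```c
#include <assert.h>

/*
 * Classic SWAR population count: sum bits in 2-bit, then 4-bit, then
 * 8-bit groups, and finally add the four byte sums with one multiply.
 * About a dozen arithmetic operations, no branches.
 */
static unsigned int sw_hweight32_sketch(unsigned int w)
{
	unsigned int res;

	res = w - ((w >> 1) & 0x55555555u);			/* 2-bit sums */
	res = (res & 0x33333333u) + ((res >> 2) & 0x33333333u);	/* 4-bit sums */
	res = (res + (res >> 4)) & 0x0f0f0f0fu;			/* 8-bit sums */
	return (res * 0x01010101u) >> 24;			/* add bytes  */
}
```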
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 16:07 ` Uros Bizjak @ 2025-03-30 18:15 ` David Laight 2025-03-30 22:44 ` H. Peter Anvin 2025-03-30 18:54 ` Ingo Molnar 1 sibling, 1 reply; 25+ messages in thread From: David Laight @ 2025-03-30 18:15 UTC (permalink / raw) To: Uros Bizjak Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin On Sun, 30 Mar 2025 18:07:30 +0200 Uros Bizjak <ubizjak@gmail.com> wrote: ... > The kernel is currently compiled with -mno-red-zone, gcc-15 introduces > special "redzone" clobber to disable red-zone in the function that > includes asm() when/if the kernel starts using redzone. I really don't understand the point of a stack 'red-zone'. In any function that contains memory writes the cost of a 'sub $n,%rsp' at the top and corresponding add at the bottom is surely noise. David. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 18:15 ` David Laight @ 2025-03-30 22:44 ` H. Peter Anvin 0 siblings, 0 replies; 25+ messages in thread From: H. Peter Anvin @ 2025-03-30 22:44 UTC (permalink / raw) To: David Laight, Uros Bizjak Cc: Ingo Molnar, x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen On March 30, 2025 11:15:44 AM PDT, David Laight <david.laight.linux@gmail.com> wrote: >On Sun, 30 Mar 2025 18:07:30 +0200 >Uros Bizjak <ubizjak@gmail.com> wrote: > >... >> The kernel is currently compiled with -mno-red-zone, gcc-15 introduces >> special "redzone" clobber to disable red-zone in the function that >> includes asm() when/if the kernel starts using redzone. > >I really don't understand the point of a stack 'red-zone'. >In any function that contains memory writes the cost of a >'sub $n,%rsp' at the top and corresponding add at the bottom >is surely noise. > > David. Perhaps you should check out the performance numbers instead. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option 2025-03-30 16:07 ` Uros Bizjak 2025-03-30 18:15 ` David Laight @ 2025-03-30 18:54 ` Ingo Molnar 1 sibling, 0 replies; 25+ messages in thread From: Ingo Molnar @ 2025-03-30 18:54 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin * Uros Bizjak <ubizjak@gmail.com> wrote: > On Sun, Mar 30, 2025 at 11:56 AM Ingo Molnar <mingo@kernel.org> wrote: > > > > > > * Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > So a better optimization I think would be to declare and implement > > > > __sw_hweight32 with a different, less intrusive function call ABI > > > > that > > > > > > With an external function, the ABI specifies the location of input > > > argument and function result. > > > > This is all within the kernel, and __sw_hweight32() is implemented in > > the kernel as well, entirely in assembly, and the ALTERNATIVE*() macros > > are fully under our control as well - so we have full control over the > > calling convention. > > There is a minor issue with a generic prototype in <linux/bitops.h>, > where we declare: > > extern unsigned int __sw_hweight32(unsigned int w); > extern unsigned long __sw_hweight64(__u64 w); > > This creates a bit of mixup, so perhaps it is better to define and use > an x86 specific function name. Yes, I alluded to this complication: > > For example, we could make a version of __sw_hweight32 that is a > > largely no-clobber function that only touches a single register, which That version of __sw_hweight32 would be a different symbol. > > I'm not saying it's *worth* it for POPCNTL emulation alone: > > > > - The code generation benefits might or might not be there. Needs to > > be examined. > > Matching inputs with output will actually make the instruction > "destructive", so the compiler will have to copy the input argument > when it won't die in the instruction. This is not desirable. 
Yeah, absolutely - it was mainly a demonstration that even single-clobber functions are possible. (There are even zero-clobber functions, like __fentry__.) > I think that adding a __POPCNT__ version (similar to my original > patch) would bring the most benefit, because we could use "rm" input > and "=r" output registers, without any constraints, enforced by > fallback function call. This is only possible with a new > -march=native functionality. Yeah, -march=native might be nice for local tinkering, but it won't reach 99.999% of Linux users - so it's immaterial to this particular discussion. Also, is POPCNTL the best example for this? Are there no other, more frequently used ALTERNATIVE() patching sites with function call alternatives that disturb the register state of important kernel functions? (And I don't know the answer.) Thanks, Ingo ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn 2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak 2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak @ 2025-03-25 17:09 ` Borislav Petkov 2025-03-25 17:17 ` Uros Bizjak 2025-03-25 21:50 ` Ingo Molnar 2 siblings, 1 reply; 25+ messages in thread From: Borislav Petkov @ 2025-03-25 17:09 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen, H. Peter Anvin On Tue, Mar 25, 2025 at 05:48:37PM +0100, Uros Bizjak wrote: > +/* > + * On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction > + * appears to have a false dependency on the destination register. Even > + * though the instruction only writes to it, the instruction will wait > + * until destination is ready before executing. This false dependency > + * was fixed for Cannon Lake (and later) processors. Any official documentation about that? Any performance numbers to justify that change? Because if it doesn't matter, why do it in the first place? Especially if you're doing this XORing now for *everyone* - not just the affected parties. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn 2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov @ 2025-03-25 17:17 ` Uros Bizjak 2025-03-25 17:44 ` Borislav Petkov 0 siblings, 1 reply; 25+ messages in thread From: Uros Bizjak @ 2025-03-25 17:17 UTC (permalink / raw) To: Borislav Petkov Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen, H. Peter Anvin On Tue, Mar 25, 2025 at 6:10 PM Borislav Petkov <bp@alien8.de> wrote: > > On Tue, Mar 25, 2025 at 05:48:37PM +0100, Uros Bizjak wrote: > > +/* > > + * On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction > > + * appears to have a false dependency on the destination register. Even > > + * though the instruction only writes to it, the instruction will wait > > + * until destination is ready before executing. This false dependency > > + * was fixed for Cannon Lake (and later) processors. > > Any official documentation about that? Please see [1], errata 026. [1] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/8th-gen-core-spec-update.pdf > > Any performance numbers to justify that change? There is a lot of performance analysis at [2] and [3]. [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011 [3] https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviati > Because if it doesn't matter, why do it in the first place? Especially if > you're doing this XORing now for *everyone* - not just the affected parties. Thanks, Uros. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn 2025-03-25 17:17 ` Uros Bizjak @ 2025-03-25 17:44 ` Borislav Petkov 0 siblings, 0 replies; 25+ messages in thread From: Borislav Petkov @ 2025-03-25 17:44 UTC (permalink / raw) To: Uros Bizjak Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen, H. Peter Anvin On Tue, Mar 25, 2025 at 06:17:01PM +0100, Uros Bizjak wrote: > Please see [1], errata 026. Then next time put at least a reference to such an official source. Because right now it reads like a wild guess. > There is a lot of performance analysis at [2] and [3]. You mean the commit message should not mention it - people should use a search engine to find that?! Lemme quote our documentation for you: "Quantify optimizations and trade-offs. If you claim improvements in performance, memory consumption, stack footprint, or binary size, include numbers that back them up. But also describe non-obvious costs. Optimizations usually aren’t free but trade-offs between CPU, memory, and readability; or, when it comes to heuristics, between different workloads. Describe the expected downsides of your optimization so that the reviewer can weigh costs against benefits." From: https://kernel.org/doc/html/latest/process/submitting-patches.html Please peruse that document when you get a chance. So my question still is: does that change have *any* *impact* on *kernel* workloads? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn
  2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak
  2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
  2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov
@ 2025-03-25 21:50 ` Ingo Molnar
  2 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2025-03-25 21:50 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

* Uros Bizjak <ubizjak@gmail.com> wrote:

> On Sandy/Ivy Bridge and later Intel processors, the POPCNT instruction
> appears to have a false dependency on the destination register. Even
> though the instruction only writes to it, the instruction will wait
> until destination is ready before executing. This false dependency
> was fixed for Cannon Lake (and later) processors.
>
> Fix false dependency by clearing the destination register first.
>
> The x86_64 defconfig object size increases by 779 bytes:
>
>   text      data     bss     dec       hex      filename
>   27341418  4643015  814852  32799285  1f47a35  vmlinux-old.o
>   27342197  4643015  814852  32800064  1f47d40  vmlinux-new.o

I don't think adding an instruction for an old-microarchitecture weakness
that has been fixed in new hardware already is worth bloating the kernel.

Cannon Lake was released in 2018, 7 years ago. It will be 1-2 years until
such a change percolates to Linux users, and by that time the
microarchitecture with the fix (Cannon Lake) will be a decade old, and
a majority of Intel CPU users will be using it.

So I don't think this particular change is worth it, unless the false
dependency can be quantified to have a huge impact on pre-Cannon-Lake
CPUs - which I don't think it is.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread
end of thread, other threads:[~2025-03-30 22:44 UTC | newest]

Thread overview: 25+ messages (links below jump to the message on this page):
2025-03-25 16:48 [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Uros Bizjak
2025-03-25 16:48 ` [PATCH -tip 2/2] x86/hweight: Use POPCNT when available with X86_NATIVE_CPU option Uros Bizjak
2025-03-25 17:11 ` Borislav Petkov
2025-03-30 15:15 ` Uros Bizjak
2025-03-30 17:31 ` Borislav Petkov
2025-03-30 18:47 ` Ingo Molnar
2025-03-30 19:06 ` Borislav Petkov
2025-03-30 19:20 ` Ingo Molnar
2025-03-30 19:28 ` Borislav Petkov
2025-03-25 21:56 ` Ingo Molnar
2025-03-29  9:19 ` Uros Bizjak
2025-03-29 11:00 ` David Laight
2025-03-30  7:49 ` Uros Bizjak
2025-03-30 18:02 ` David Laight
2025-03-29 23:10 ` H. Peter Anvin
2025-03-30  6:54 ` Uros Bizjak
2025-03-30  9:56 ` Ingo Molnar
2025-03-30 16:07 ` Uros Bizjak
2025-03-30 18:15 ` David Laight
2025-03-30 22:44 ` H. Peter Anvin
2025-03-30 18:54 ` Ingo Molnar
2025-03-25 17:09 ` [PATCH -tip 1/2] x86/hweight: Fix false output register dependency of POPCNT insn Borislav Petkov
2025-03-25 17:17 ` Uros Bizjak
2025-03-25 17:44 ` Borislav Petkov
2025-03-25 21:50 ` Ingo Molnar