public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h>
@ 2025-03-25 17:52 Uros Bizjak
  2025-03-25 17:52 ` [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn Uros Bizjak
  2025-03-25 21:48 ` [tip: x86/asm] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h> tip-bot2 for Uros Bizjak
  0 siblings, 2 replies; 9+ messages in thread
From: Uros Bizjak @ 2025-03-25 17:52 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin

Current minimum required version of binutils is 2.25,
which supports TZCNT instruction mnemonic.

Replace "REP; BSF" in variable__{ffs,ffz}() function
with this proper mnemonic.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/bitops.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 100413aff640..bbaf75ea6703 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -248,7 +248,7 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
 
 static __always_inline unsigned long variable__ffs(unsigned long word)
 {
-	asm("rep; bsf %1,%0"
+	asm("tzcnt %1,%0"
 		: "=r" (word)
 		: ASM_INPUT_RM (word));
 	return word;
@@ -267,7 +267,7 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
 
 static __always_inline unsigned long variable_ffz(unsigned long word)
 {
-	asm("rep; bsf %1,%0"
+	asm("tzcnt %1,%0"
 		: "=r" (word)
 		: "r" (~word));
 	return word;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-25 17:52 [PATCH 1/2] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h> Uros Bizjak
@ 2025-03-25 17:52 ` Uros Bizjak
  2025-03-25 17:59   ` Borislav Petkov
                     ` (2 more replies)
  2025-03-25 21:48 ` [tip: x86/asm] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h> tip-bot2 for Uros Bizjak
  1 sibling, 3 replies; 9+ messages in thread
From: Uros Bizjak @ 2025-03-25 17:52 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin

On Haswell and later Intel processors, the TZCNT instruction appears
to have a false dependency on the destination register. Even though
the instruction only writes to it, the instruction will wait until
destination is ready before executing. This false dependency
was fixed for Skylake (and later) processors.

Fix false dependency by clearing the destination register first.

The x86_64 defconfig object size increases by 4215 bytes:

	    text           data     bss      dec            hex filename
	27342396        4642999  814852 32800247        1f47df7 vmlinux-old.o
	27346611        4643015  814852 32804478        1f48e7e vmlinux-new.o

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/bitops.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index bbaf75ea6703..7e3d1cc97c5a 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -248,8 +248,9 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
 
 static __always_inline unsigned long variable__ffs(unsigned long word)
 {
-	asm("tzcnt %1,%0"
-		: "=r" (word)
+	asm("xor %k0,%k0\n\t" /* avoid false dependency on dest register */
+	    "tzcnt %1,%0"
+		: "=&r" (word)
 		: ASM_INPUT_RM (word));
 	return word;
 }
@@ -267,8 +268,9 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
 
 static __always_inline unsigned long variable_ffz(unsigned long word)
 {
-	asm("tzcnt %1,%0"
-		: "=r" (word)
+	asm("xor %k0,%k0\n\t" /* avoid false dependency on dest register */
+	    "tzcnt %1,%0"
+		: "=&r" (word)
 		: "r" (~word));
 	return word;
 }
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-25 17:52 ` [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn Uros Bizjak
@ 2025-03-25 17:59   ` Borislav Petkov
  2025-03-25 18:29   ` H. Peter Anvin
  2025-03-25 21:43   ` Ingo Molnar
  2 siblings, 0 replies; 9+ messages in thread
From: Borislav Petkov @ 2025-03-25 17:59 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin

On Tue, Mar 25, 2025 at 06:52:02PM +0100, Uros Bizjak wrote:
> On Haswell and later Intel processors, the TZCNT instruction appears
> to have a false dependency on the destination register. Even though
> the instruction only writes to it, the instruction will wait until
> destination is ready before executing. This false dependency
> was fixed for Skylake (and later) processors.
> 
> Fix false dependency by clearing the destination register first.

Same questions as about the POPCNT patch.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-25 17:52 ` [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn Uros Bizjak
  2025-03-25 17:59   ` Borislav Petkov
@ 2025-03-25 18:29   ` H. Peter Anvin
  2025-03-25 21:43   ` Ingo Molnar
  2 siblings, 0 replies; 9+ messages in thread
From: H. Peter Anvin @ 2025-03-25 18:29 UTC (permalink / raw)
  To: Uros Bizjak, x86, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen

On March 25, 2025 10:52:02 AM PDT, Uros Bizjak <ubizjak@gmail.com> wrote:
>On Haswell and later Intel processors, the TZCNT instruction appears
>to have a false dependency on the destination register. Even though
>the instruction only writes to it, the instruction will wait until
>destination is ready before executing. This false dependency
>was fixed for Skylake (and later) processors.
>
>Fix false dependency by clearing the destination register first.
>
>The x86_64 defconfig object size increases by 4215 bytes:
>
>	    text           data     bss      dec            hex filename
>	27342396        4642999  814852 32800247        1f47df7 vmlinux-old.o
>	27346611        4643015  814852 32804478        1f48e7e vmlinux-new.o
>
>Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
>Cc: Thomas Gleixner <tglx@linutronix.de>
>Cc: Ingo Molnar <mingo@kernel.org>
>Cc: Borislav Petkov <bp@alien8.de>
>Cc: Dave Hansen <dave.hansen@linux.intel.com>
>Cc: "H. Peter Anvin" <hpa@zytor.com>
>---
> arch/x86/include/asm/bitops.h | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
>diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
>index bbaf75ea6703..7e3d1cc97c5a 100644
>--- a/arch/x86/include/asm/bitops.h
>+++ b/arch/x86/include/asm/bitops.h
>@@ -248,8 +248,9 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
> 
> static __always_inline unsigned long variable__ffs(unsigned long word)
> {
>-	asm("tzcnt %1,%0"
>-		: "=r" (word)
>+	asm("xor %k0,%k0\n\t" /* avoid false dependency on dest register */
>+	    "tzcnt %1,%0"
>+		: "=&r" (word)
> 		: ASM_INPUT_RM (word));
> 	return word;
> }
>@@ -267,8 +268,9 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
> 
> static __always_inline unsigned long variable_ffz(unsigned long word)
> {
>-	asm("tzcnt %1,%0"
>-		: "=r" (word)
>+	asm("xor %k0,%k0\n\t" /* avoid false dependency on dest register */
>+	    "tzcnt %1,%0"
>+		: "=&r" (word)
> 		: "r" (~word));
> 	return word;
> }

Is xor better there than a mov (making it a copy of the source)? It might be more frequently fusable.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-25 17:52 ` [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn Uros Bizjak
  2025-03-25 17:59   ` Borislav Petkov
  2025-03-25 18:29   ` H. Peter Anvin
@ 2025-03-25 21:43   ` Ingo Molnar
  2025-03-26  8:46     ` Uros Bizjak
  2 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2025-03-25 21:43 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin


* Uros Bizjak <ubizjak@gmail.com> wrote:

> On Haswell and later Intel processors, the TZCNT instruction appears
> to have a false dependency on the destination register. Even though
> the instruction only writes to it, the instruction will wait until
> destination is ready before executing. This false dependency
> was fixed for Skylake (and later) processors.
> 
> Fix false dependency by clearing the destination register first.
> 
> The x86_64 defconfig object size increases by 4215 bytes:
> 
> 	    text           data     bss      dec            hex filename
> 	27342396        4642999  814852 32800247        1f47df7 vmlinux-old.o
> 	27346611        4643015  814852 32804478        1f48e7e vmlinux-new.o

Yeah, so Skylake was released in 2015, about a decade ago.

So we'd be making the kernel larger for an unquantified 
micro-optimization for CPUs that almost nobody uses anymore.
That's a bad trade-off.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [tip: x86/asm] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h>
  2025-03-25 17:52 [PATCH 1/2] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h> Uros Bizjak
  2025-03-25 17:52 ` [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn Uros Bizjak
@ 2025-03-25 21:48 ` tip-bot2 for Uros Bizjak
  1 sibling, 0 replies; 9+ messages in thread
From: tip-bot2 for Uros Bizjak @ 2025-03-25 21:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Uros Bizjak, Ingo Molnar, Andy Lutomirski, Brian Gerst,
	Juergen Gross, H. Peter Anvin, Linus Torvalds, Josh Poimboeuf,
	x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     0717b1392dc7e3f350e5a5d25ea794aa92210684
Gitweb:        https://git.kernel.org/tip/0717b1392dc7e3f350e5a5d25ea794aa92210684
Author:        Uros Bizjak <ubizjak@gmail.com>
AuthorDate:    Tue, 25 Mar 2025 18:52:01 +01:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 25 Mar 2025 22:38:29 +01:00

x86/bitops: Use TZCNT mnemonic in <asm/bitops.h>

Current minimum required version of binutils is 2.25,
which supports TZCNT instruction mnemonic.

Replace "REP; BSF" in variable__{ffs,ffz}() function
with this proper mnemonic.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250325175215.330659-1-ubizjak@gmail.com
---
 arch/x86/include/asm/bitops.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 100413a..bbaf75e 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -248,7 +248,7 @@ arch_test_bit_acquire(unsigned long nr, const volatile unsigned long *addr)
 
 static __always_inline unsigned long variable__ffs(unsigned long word)
 {
-	asm("rep; bsf %1,%0"
+	asm("tzcnt %1,%0"
 		: "=r" (word)
 		: ASM_INPUT_RM (word));
 	return word;
@@ -267,7 +267,7 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
 
 static __always_inline unsigned long variable_ffz(unsigned long word)
 {
-	asm("rep; bsf %1,%0"
+	asm("tzcnt %1,%0"
 		: "=r" (word)
 		: "r" (~word));
 	return word;

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-25 21:43   ` Ingo Molnar
@ 2025-03-26  8:46     ` Uros Bizjak
  2025-03-28 22:27       ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Uros Bizjak @ 2025-03-26  8:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 2055 bytes --]

On Tue, Mar 25, 2025 at 10:43 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > On Haswell and later Intel processors, the TZCNT instruction appears
> > to have a false dependency on the destination register. Even though
> > the instruction only writes to it, the instruction will wait until
> > destination is ready before executing. This false dependency
> > was fixed for Skylake (and later) processors.
> >
> > Fix false dependency by clearing the destination register first.
> >
> > The x86_64 defconfig object size increases by 4215 bytes:
> >
> >           text           data     bss      dec            hex filename
> >       27342396        4642999  814852 32800247        1f47df7 vmlinux-old.o
> >       27346611        4643015  814852 32804478        1f48e7e vmlinux-new.o
>
> Yeah, so Skylake was released in 2015, about a decade ago.
>
> So we'd be making the kernel larger for an unquantified
> micro-optimization for CPUs that almost nobody uses anymore.
> That's a bad trade-off.

Yes, 4.2k seems a bit excessive. OTOH, I'd not say that the issue is a
micro-optimization, it is bordering on the hardware bug.

An alternative (also answering Peter's question) would be to make
instruction dependent only on the input (so, matching input and output
operand, making insn destructive), requiring MOV if the input operand
is to be reused. A quick test (x86_64 defconfig) with the attached
patch shows much less impact:

  text    data     bss     dec     hex filename
27564120        4642423  814852 33021395        1f7ddd3 vmlinux-old.o
27564679        4642423  814852 33021954        1f7e002 vmlinux-new.o

The increase is now 559 bytes, which IMHO is an acceptable trade-off.
Please also note that:

       asm("tzcnt %1,%0"
               : "=r" (word)
               : "r" (~word));
       return word;

emits NOT insn that is also destructive and already requires copying
of the input operand if the operand is to be reused.

Thanks,
Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 699 bytes --]

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index bbaf75ea6703..378194687bec 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -250,7 +250,7 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
 {
 	asm("tzcnt %1,%0"
 		: "=r" (word)
-		: ASM_INPUT_RM (word));
+		: "0" (word)); /* avoid false output dependency */
 	return word;
 }
 
@@ -267,10 +267,7 @@ static __always_inline unsigned long variable__ffs(unsigned long word)
 
 static __always_inline unsigned long variable_ffz(unsigned long word)
 {
-	asm("tzcnt %1,%0"
-		: "=r" (word)
-		: "r" (~word));
-	return word;
+	return variable__ffs(~word);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-26  8:46     ` Uros Bizjak
@ 2025-03-28 22:27       ` Ingo Molnar
  2025-03-29  8:48         ` Uros Bizjak
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2025-03-28 22:27 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin


* Uros Bizjak <ubizjak@gmail.com> wrote:

> On Tue, Mar 25, 2025 at 10:43 PM Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > > On Haswell and later Intel processors, the TZCNT instruction appears
> > > to have a false dependency on the destination register. Even though
> > > the instruction only writes to it, the instruction will wait until
> > > destination is ready before executing. This false dependency
> > > was fixed for Skylake (and later) processors.
> > >
> > > Fix false dependency by clearing the destination register first.
> > >
> > > The x86_64 defconfig object size increases by 4215 bytes:
> > >
> > >           text           data     bss      dec            hex filename
> > >       27342396        4642999  814852 32800247        1f47df7 vmlinux-old.o
> > >       27346611        4643015  814852 32804478        1f48e7e vmlinux-new.o
> >
> > Yeah, so Skylake was released in 2015, about a decade ago.
> >
> > So we'd be making the kernel larger for an unquantified
> > micro-optimization for CPUs that almost nobody uses anymore.
> > That's a bad trade-off.
> 
> Yes, 4.2k seems a bit excessive. OTOH, I'd not say that the issue is 
> a micro-optimization, it is bordering on the hardware bug.

Has this been quantified, and do we really care about the 
micro-performance of ~10-year old CPUs, especially at the
expense of modern CPUs?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn
  2025-03-28 22:27       ` Ingo Molnar
@ 2025-03-29  8:48         ` Uros Bizjak
  0 siblings, 0 replies; 9+ messages in thread
From: Uros Bizjak @ 2025-03-29  8:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, linux-kernel, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

On Fri, Mar 28, 2025 at 11:28 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@gmail.com> wrote:
>
> > On Tue, Mar 25, 2025 at 10:43 PM Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > >
> > > * Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > > On Haswell and later Intel processors, the TZCNT instruction appears
> > > > to have a false dependency on the destination register. Even though
> > > > the instruction only writes to it, the instruction will wait until
> > > > destination is ready before executing. This false dependency
> > > > was fixed for Skylake (and later) processors.
> > > >
> > > > Fix false dependency by clearing the destination register first.
> > > >
> > > > The x86_64 defconfig object size increases by 4215 bytes:
> > > >
> > > >           text           data     bss      dec            hex filename
> > > >       27342396        4642999  814852 32800247        1f47df7 vmlinux-old.o
> > > >       27346611        4643015  814852 32804478        1f48e7e vmlinux-new.o
> > >
> > > Yeah, so Skylake was released in 2015, about a decade ago.
> > >
> > > So we'd be making the kernel larger for an unquantified
> > > micro-optimization for CPUs that almost nobody uses anymore.
> > > That's a bad trade-off.
> >
> > Yes, 4.2k seems a bit excessive. OTOH, I'd not say that the issue is
> > a micro-optimization, it is bordering on the hardware bug.
>
> Has this been quantified, and do we really care about the
> micro-performance of ~10-year old CPUs, especially at the
> expense of modern CPUs?

No, although the change would be a one liner now. Without specially
crafted benchmark loops the impact is not noticeable and typical
kernel usage of these instructions is not that sensitive on
destination.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-03-29  8:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-25 17:52 [PATCH 1/2] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h> Uros Bizjak
2025-03-25 17:52 ` [PATCH 2/2] x86/bitops: Fix false output register dependency of TZCNT insn Uros Bizjak
2025-03-25 17:59   ` Borislav Petkov
2025-03-25 18:29   ` H. Peter Anvin
2025-03-25 21:43   ` Ingo Molnar
2025-03-26  8:46     ` Uros Bizjak
2025-03-28 22:27       ` Ingo Molnar
2025-03-29  8:48         ` Uros Bizjak
2025-03-25 21:48 ` [tip: x86/asm] x86/bitops: Use TZCNT mnemonic in <asm/bitops.h> tip-bot2 for Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox