* [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
@ 2025-02-09 19:10 David Laight
2025-02-09 19:32 ` Linus Torvalds
0 siblings, 1 reply; 7+ messages in thread
From: David Laight @ 2025-02-09 19:10 UTC
To: x86, linux-kernel, Linus Torvalds
Cc: David Laight, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Catalin Marinas, Mathieu Desnoyers,
Josh Poimboeuf, Andi Kleen, Dan Williams, linux-arch, Kees Cook,
kernel-hardening
When barrier_nospec() was added the definition was copied from the
one used to synchronise rdtsc.
On very old cpu rdtsc was a synchronising instruction.
When this changed, X86_FEATURE_LFENCE_RDTSC (and a MFENCE copy) were
(probably) added so lfence/mfence could be added to synchronise rdtsc.
For old cpu (I think the code checks XMM2) no barrier was added.
I'm not sure why that code was used for barrier_nospec().
I'm sure it should actually be rmb() with the fallback to a
locked memory access on old cpu.
In any case all x86-64 cpu support XMM2 and lfence so there is
no point using alternative().
Separate the 32bit and 64bit definitions but leave the barrier
missing on old 32bit cpu.
Signed-off-by: David Laight <david.laight.linux@gmail.com>
---
arch/x86/include/asm/barrier.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7b44b3c4cce1..7eecce9bf4fe 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -45,7 +45,11 @@
 	__mask; })
 
 /* Prevent speculative execution past this barrier. */
-#define barrier_nospec() alternative("", "lfence", X86_FEATURE_LFENCE_RDTSC)
+#ifdef CONFIG_X86_32
+#define barrier_nospec() alternative("", "lfence", X86_FEATURE_XMM2)
+#else
+#define barrier_nospec() __rmb()
+#endif
 
 #define __dma_rmb() barrier()
 #define __dma_wmb() barrier()
--
2.39.5
* Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
2025-02-09 19:10 [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence David Laight
@ 2025-02-09 19:32 ` Linus Torvalds
2025-02-09 21:40 ` David Laight
0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2025-02-09 19:32 UTC
To: David Laight
Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Catalin Marinas, Mathieu Desnoyers,
Josh Poimboeuf, Andi Kleen, Dan Williams, linux-arch, Kees Cook,
kernel-hardening
On Sun, 9 Feb 2025 at 11:10, David Laight <david.laight.linux@gmail.com> wrote:
>
> +#define barrier_nospec() __rmb()
This is one of those "it happens to work, but it's wrong" things.
Just make it explicit that it's "lfence" in the current implementation.
Is __rmb() also an lfence? Yes. And that's actually very confusing
too. Because on x86, a regular read barrier is a no-op, and the "main"
rmb definition is actually this:
#define __dma_rmb() barrier()
#define __smp_rmb() dma_rmb()
so that it's only a compiler barrier.
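For readers less familiar with the distinction, a compiler-only barrier can be sketched in user space roughly as below. This is a model, not the kernel header: demo_compiler_barrier(), struct demo_msg, demo_publish() and demo_consume() are hypothetical names, and the empty asm with a "memory" clobber assumes a GCC- or Clang-compatible compiler. No instruction is emitted; the compiler is merely forbidden from moving memory accesses across the barrier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's barrier(): no instruction is
 * emitted, the compiler just may not move memory accesses across it. */
#define demo_compiler_barrier() __asm__ __volatile__("" ::: "memory")

struct demo_msg {
	int payload;
	int ready;
};

/* Writer: the payload store stays ahead of the flag store in program order. */
static inline void demo_publish(struct demo_msg *m, int value)
{
	assert(m != NULL);
	m->payload = value;
	demo_compiler_barrier();	/* plays the role of a write barrier on x86 */
	*(volatile int *)&m->ready = 1;
}

/* Reader: returns 1 and fills *out once the flag has been seen, else 0. */
static inline int demo_consume(const struct demo_msg *m, int *out)
{
	assert(m != NULL && out != NULL);
	if (!*(const volatile int *)&m->ready)
		return 0;
	demo_compiler_barrier();	/* plays the role of a read barrier on x86 */
	*out = m->payload;
	return 1;
}
```

On a weakly ordered architecture the same pattern would need real fence instructions; the sketch is only meant to show why x86 gets away without them.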
And yes, __rmb() exists as the architecture-specific helper for "I
need to synchronize with unordered IO accesses" and is purely about
driver IO.
We should have called it "relaxed_rmb()" or "io_rmb()" or something
like that, but the IO memory ordering issues actually came up before
the modern SMP ordering issues, so due to that historical thing,
"rmb()" ends up being about the IO ordering.
It's confusing, I know. And historical. And too painful to change
because it all works and lots of people know the rules (except looking
around, it seems possibly the sunrpc code is confused, and uses
"rmb()" for SMP synchronization)
But basically a barrier_nospec() is not a IO read barrier, and an IO
read barrier is not a barrier_nospec().
They just happen to be implemented using the same instruction because
an existing instruction - that nobody uses in normal situations -
ended up effectively doing what that nospec barrier needed to do.
And some day in the future, maybe even that implementation equivalence
ends up going away again, and we end up with new barrier instructions
that depend on new CPU capabilities (or fake software capabilities:
kernel bootup flags that say "don't bother with the nospec
barriers").
So please keep the __rmb() and the barrier_nospec() separate, don't
tie them together. They just have *soo* many differences, both
conceptual and practical.
Linus
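A sketch of what keeping the two apart could look like, written as a user-space model rather than as a proposed kernel change. All the demo_ names and the struct demo_dev device are hypothetical, the non-x86 branch is only a placeholder so that the file builds elsewhere, and the inline asm assumes GCC or Clang. The point is just that each barrier gets its own definition, so one can change later without dragging the other along.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
/* Speculation barrier: spelled out as lfence, not borrowed from a read barrier. */
#define demo_barrier_nospec()	__asm__ __volatile__("lfence" ::: "memory")
/* IO read barrier: the same instruction today, but with its own definition. */
#define demo_io_rmb()		__asm__ __volatile__("lfence" ::: "memory")
#else
/* Placeholders (compiler barriers only) so the sketch builds on other targets. */
#define demo_barrier_nospec()	__asm__ __volatile__("" ::: "memory")
#define demo_io_rmb()		__asm__ __volatile__("" ::: "memory")
#endif

/* Hypothetical device: the status word must be read before the data word. */
struct demo_dev {
	volatile uint32_t status;
	volatile uint32_t data;
};

/* Driver-style use: ordering two device reads is the IO barrier's job. */
static inline int demo_dev_read(const struct demo_dev *dev, uint32_t *out)
{
	assert(dev != NULL && out != NULL);
	if (!(dev->status & 1u))
		return 0;	/* no data available */
	demo_io_rmb();		/* status read is done before the data read */
	*out = dev->data;
	return 1;
}
```

Nothing in demo_dev_read() cares about speculation, and nothing that calls demo_barrier_nospec() cares about IO ordering, which is the argument for not defining one in terms of the other.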
* Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
2025-02-09 19:32 ` Linus Torvalds
@ 2025-02-09 21:40 ` David Laight
2025-02-09 21:57 ` Linus Torvalds
0 siblings, 1 reply; 7+ messages in thread
From: David Laight @ 2025-02-09 21:40 UTC
To: Linus Torvalds
Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Catalin Marinas, Mathieu Desnoyers,
Josh Poimboeuf, Andi Kleen, Dan Williams, linux-arch, Kees Cook,
kernel-hardening
On Sun, 9 Feb 2025 11:32:32 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 9 Feb 2025 at 11:10, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > +#define barrier_nospec() __rmb()
>
> This is one of those "it happens to work, but it's wrong" things.
>
> Just make it explicit that it's "lfence" in the current implementation.
Easily done.
Any idea what the one used to synchronise rdtsc should be?
'lfence' is the right instruction (give or take), but it isn't
a speculation issue.
It really is 'wait for all memory accesses to finish' to give
a sensible(ish) answer for cycle timing.
And on old cpu you want nothing - not a locked memory access.
>
> Is __rmb() also an lfence? Yes. And that's actually very confusing
> too. Because on x86, a regular read barrier is a no-op, and the "main"
> rmb definition is actually this:
>
> #define __dma_rmb() barrier()
> #define __smp_rmb() dma_rmb()
>
> so that it's only a compiler barrier.
I couldn't work out why __smp_mb() is so much stronger than the rmb()
and wmb() forms - I presume there is history there I wasn't looking for.
> And yes, __rmb() exists as the architecture-specific helper for "I
> need to synchronize with unordered IO accesses" and is purely about
> driver IO.
I'd missed the history of it being IO related.
...
> And some day in the future, maybe even that implementation equivalence
> ends up going away again, and we end up with new barrier instructions
> that depend on new CPU capabilities (or fake software capabilities:
> kernel bootup flags that say "don't bother with the nospec
> barriers").
Actually there is already the cpu flag to treat addresses with the top
bit set as 'supervisor' in the initial address decode - rather than
checking the page table in parallel with the d-cache accesses.
When that hits real silicon then patching out the barrier_nospec()
lfence would make sense.
There is also your kernel build machine where you don't care.
So compiling them out or boot patching them out is a real option.
This does make it more clear that the rdtsc code has the wrong barrier.
> So please keep the __rmb() and the barrier_nospec() separate, don't
> tie them together. They just have *soo* many differences, both
> conceptual and practical.
A simple V2 :-)
David
>
> Linus
* Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
2025-02-09 21:40 ` David Laight
@ 2025-02-09 21:57 ` Linus Torvalds
2025-02-10 1:09 ` Rik van Riel
2025-02-10 4:29 ` Andi Kleen
0 siblings, 2 replies; 7+ messages in thread
From: Linus Torvalds @ 2025-02-09 21:57 UTC
To: David Laight
Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Catalin Marinas, Mathieu Desnoyers,
Josh Poimboeuf, Andi Kleen, Dan Williams, linux-arch, Kees Cook,
kernel-hardening
On Sun, 9 Feb 2025 at 13:40, David Laight <david.laight.linux@gmail.com> wrote:
>
> Any idea what the one used to synchronise rdtsc should be?
> 'lfence' is the right instruction (give or take), but it isn't
> a speculation issue.
> It really is 'wait for all memory accesses to finish' to give
> a sensible(ish) answer for cycle timing.
No, even that is actually very different.
What happened was that 'lfence' was designed and documented - and
named - as a memory fencing thing, but the *implementation* of it was
basically about the front-end pipeline.
IOW, ignore the name or the documentation. Think of "lfence" as a
"this stops the pipeline until all previous instructions have
retired". Because that is what it *is*.
So it's basically a synchronization instruction *regardless* of memory accesses.
Which is why it was then used for the rdtsc serialization - it
basically says "don't *actually* read the TSC until you've finished
everything you've started".
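As a rough user-space illustration of that use, the pairing looks something like the following. demo_rdtsc_ordered() and the other demo_ names are made up for this example (the kernel has its own helper), the inline asm assumes GCC or Clang, and the clock() fallback is only there so the sketch builds on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
/* lfence first: earlier instructions complete before the TSC is read. */
static inline uint64_t demo_rdtsc_ordered(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("lfence\n\trdtsc"
			     : "=a"(lo), "=d"(hi)
			     :
			     : "memory");
	return ((uint64_t)hi << 32) | lo;
}
#else
/* Placeholder for other targets: processor time from clock(), not a TSC. */
static inline uint64_t demo_rdtsc_ordered(void)
{
	return (uint64_t)clock();
}
#endif

/* Hypothetical workload for the timing helper below. */
static void demo_busy_work(void)
{
	volatile unsigned int sink = 0;

	for (unsigned int i = 0; i < 1000; i++)
		sink += i;
}

/* Ticks spent in fn(), bracketed by two ordered counter reads. */
static inline uint64_t demo_time_call(void (*fn)(void))
{
	uint64_t start, end;

	assert(fn != NULL);
	start = demo_rdtsc_ordered();
	fn();
	end = demo_rdtsc_ordered();
	return end - start;
}
```

Calling demo_time_call(demo_busy_work) gives a tick count for the loop. Without the lfence the second read could be taken before the loop has actually finished, which is the situation the serialization is meant to avoid.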
And which is why it ended up being used for speculation control, even
though the instructions it serializes are *not* necessarily memory
accesses at all, but things like the address conditional that precedes
it.
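The usual shape of that pattern, sketched in user space under some assumptions: demo_barrier_nospec() and demo_table_lookup() are hypothetical names, the inline asm assumes GCC or Clang, and the non-x86 branch is a compiler barrier only, so that the file still builds there. The conditional being serialized is the bounds check; the load after it is the thing that must not start early.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__)
/* Explicit lfence: later instructions wait until earlier ones are done. */
#define demo_barrier_nospec() __asm__ __volatile__("lfence" ::: "memory")
#else
/* Placeholder so the sketch builds elsewhere; not a speculation barrier. */
#define demo_barrier_nospec() __asm__ __volatile__("" ::: "memory")
#endif

/* Returns table[idx], or `fallback` when idx is out of range. */
static inline int demo_table_lookup(const int *table, size_t len,
				    size_t idx, int fallback)
{
	assert(table != NULL);
	if (idx >= len)
		return fallback;
	/* The bounds check above has to finish before the load below starts. */
	demo_barrier_nospec();
	return table[idx];
}
```

Architecturally the function behaves the same with or without the barrier; the difference is only in what the CPU is allowed to do speculatively with an out-of-range idx.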
So the speculation control use is literally "wait for the previous
conditional branches to retire before continuing". Yes, the
"continuing" tends to be a load, but that's almost incidental.
> And on old cpu you want nothing - not a locked memory access.
Well, back in the day, those locked instructions did the same thing.
> I couldn't work out why __smp_mb() is so much stronger than the rmb()
> and wmb() forms - I presume there is history there I wasn't looking for.
So on x86, both read and write barriers are complete no-ops, because
all reads are ordered, and all writes are ordered. So those only need
compiler barriers to guarantee that the compiler itself doesn't
re-order them.
(Side note: earlier reads are also guaranteed to happen before later
writes, so it's really only writes that can be delayed past reads, but
we don't have a barrier for that situation anyway. Also note that all
of this is not "real" ordering, but only a guarantee that the
user-visible semantics are AS IF they were actually ordered - if
things are local in cache, ordering doesn't matter because no external
CPU can *see* what the ordering was).
So basically the only memory barriers that matter on x86 are the full
"smp_mb()" that orders reads vs writes, and the ordering for
non-ordered accesses used for IO.
And then lfence is basically used for non-memory ordering too.
Linus
* Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
2025-02-09 21:57 ` Linus Torvalds
@ 2025-02-10 1:09 ` Rik van Riel
2025-02-10 2:15 ` H. Peter Anvin
2025-02-10 4:29 ` Andi Kleen
1 sibling, 1 reply; 7+ messages in thread
From: Rik van Riel @ 2025-02-10 1:09 UTC
To: Linus Torvalds, David Laight
Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Catalin Marinas, Mathieu Desnoyers,
Josh Poimboeuf, Andi Kleen, Dan Williams, linux-arch, Kees Cook,
kernel-hardening, Paul E.McKenney
On Sun, 2025-02-09 at 13:57 -0800, Linus Torvalds wrote:
>
> So on x86, both read and write barriers are complete no-ops, because
> all reads are ordered, and all writes are ordered.
Given that this thread started with a reference
to rdtsc, it may be worth keeping in mind that
rdtsc reads themselves do not always appear to
be ordered.
Paul and I spotted some occasional "backwards
TSC values" from the CSD lock instrumentation code,
which went away when using ordered TSC reads:
https://lkml.iu.edu/hypermail/linux/kernel/2410.1/03202.html
I guess maybe a TSC read does not follow all the same
rules as a memory read, sometimes?
--
All Rights Reversed.
* Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
2025-02-10 1:09 ` Rik van Riel
@ 2025-02-10 2:15 ` H. Peter Anvin
0 siblings, 0 replies; 7+ messages in thread
From: H. Peter Anvin @ 2025-02-10 2:15 UTC
To: Rik van Riel, Linus Torvalds, David Laight
Cc: x86, linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Catalin Marinas, Mathieu Desnoyers, Josh Poimboeuf,
Andi Kleen, Dan Williams, linux-arch, Kees Cook, kernel-hardening,
Paul E.McKenney
On February 9, 2025 5:09:51 PM PST, Rik van Riel <riel@surriel.com> wrote:
>On Sun, 2025-02-09 at 13:57 -0800, Linus Torvalds wrote:
>>
>> So on x86, both read and write barriers are complete no-ops, because
>> all reads are ordered, and all writes are ordered.
>
>Given that this thread started with a reference
>to rdtsc, it may be worth keeping in mind that
>rdtsc reads themselves do not always appear to
>be ordered.
>
>Paul and I spotted some occasional "backwards
>TSC values" from the CSD lock instrumentation code,
>which went away when using ordered TSC reads:
>
>https://lkml.iu.edu/hypermail/linux/kernel/2410.1/03202.html
>
>I guess maybe a TSC read does not follow all the same
>rules as a memory read, sometimes?
>
It probably doesn't, at least on some uarches.
* Re: [PATCH 1/1] x86: In x86-64 barrier_nospec can always be lfence
2025-02-09 21:57 ` Linus Torvalds
2025-02-10 1:09 ` Rik van Riel
@ 2025-02-10 4:29 ` Andi Kleen
1 sibling, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2025-02-10 4:29 UTC
To: Linus Torvalds
Cc: David Laight, x86, linux-kernel, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Catalin Marinas,
Mathieu Desnoyers, Josh Poimboeuf, Dan Williams, linux-arch,
Kees Cook, kernel-hardening
> So on x86, both read and write barriers are complete no-ops, because
> all reads are ordered, and all writes are ordered. So those only need
> compiler barriers to guarantee that the compiler itself doesn't
> re-order them.
>
> (Side note: earlier reads are also guaranteed to happen before later
> writes, so it's really only writes that can be delayed past reads, but
> we don't have a barrier for that situation anyway. Also note that all
> of this is not "real" ordering, but only a guarantee that the
> user-visible semantics are AS IF they were actually ordered - if
> things are local in cache, ordering doesn't matter because no external
> CPU can *see* what the ordering was).
However in the local case *FENCE still orders, so it's actually not a
nop. Just normally you can't tell the difference in ordering semantics,
but it's visible in side effects like RDTSC.
-Andi