* [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
@ 2024-10-12 4:09 Josh Poimboeuf
2024-10-12 8:48 ` Andrew Cooper
0 siblings, 1 reply; 51+ messages in thread
From: Josh Poimboeuf @ 2024-10-12 4:09 UTC (permalink / raw)
To: x86
Cc: Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta,
Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds,
Michael Ellerman, linuxppc-dev, Andrew Cooper, Mark Rutland
For x86-64, the barrier_nospec() in copy_from_user() is overkill and
painfully slow. Instead, use pointer masking to force the user pointer
to a non-kernel value even in speculative paths.
While at it, harden the x86 implementations of raw_copy_to_user() and
clear_user(): a write in a mispredicted access_ok() branch to a
user-controlled kernel address can populate the rest of the affected
cache line with kernel data.
To avoid regressing powerpc, move the barrier_nospec() to the powerpc
raw_copy_from_user() implementation so there's no functional change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/powerpc/include/asm/uaccess.h | 2 ++
arch/x86/include/asm/uaccess_64.h | 4 +++-
arch/x86/lib/getuser.S | 2 +-
arch/x86/lib/putuser.S | 2 +-
include/linux/uaccess.h | 6 ------
5 files changed, 7 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 4f5a46a77fa2..12abb8bf5eda 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -7,6 +7,7 @@
#include <asm/extable.h>
#include <asm/kup.h>
#include <asm/asm-compat.h>
+#include <asm/barrier.h>
#ifdef __powerpc64__
/* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
@@ -341,6 +342,7 @@ static inline unsigned long raw_copy_from_user(void *to,
{
unsigned long ret;
+ barrier_nospec();
allow_read_from_user(from, n);
ret = __copy_tofrom_user((__force void __user *)to, from, n);
prevent_read_from_user(from, n);
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index afce8ee5d7b7..39199eef26be 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -133,12 +133,14 @@ copy_user_generic(void *to, const void *from, unsigned long len)
static __always_inline __must_check unsigned long
raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
{
+ src = mask_user_address(src);
return copy_user_generic(dst, (__force void *)src, size);
}
static __always_inline __must_check unsigned long
raw_copy_to_user(void __user *dst, const void *src, unsigned long size)
{
+ dst = mask_user_address(dst);
return copy_user_generic((__force void *)dst, src, size);
}
@@ -197,7 +199,7 @@ static __always_inline __must_check unsigned long __clear_user(void __user *addr
static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
{
if (__access_ok(to, n))
- return __clear_user(to, n);
+ return __clear_user(mask_user_address(to), n);
return n;
}
#endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index d066aecf8aeb..094224ec9dca 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -39,7 +39,7 @@
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
- mov %rax, %rdx
+ mov %rax, %rdx /* mask_user_address() */
sar $63, %rdx
or %rdx, %rax
.else
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 975c9c18263d..09b7e37934ab 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -34,7 +34,7 @@
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
- mov %rcx, %rbx
+ mov %rcx, %rbx /* mask_user_address() */
sar $63, %rbx
or %rbx, %rcx
.else
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 39c7cf82b0c2..dda9725a9559 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -160,12 +160,6 @@ _inline_copy_from_user(void *to, const void __user *from, unsigned long n)
unsigned long res = n;
might_fault();
if (!should_fail_usercopy() && likely(access_ok(from, n))) {
- /*
- * Ensure that bad access_ok() speculation will not
- * lead to nasty side effects *after* the copy is
- * finished:
- */
- barrier_nospec();
instrument_copy_from_user_before(to, from, n);
res = raw_copy_from_user(to, from, n);
instrument_copy_from_user_after(to, from, n, res);
--
2.46.2
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 4:09 [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() Josh Poimboeuf
@ 2024-10-12 8:48 ` Andrew Cooper
2024-10-12 14:09 ` Josh Poimboeuf
2024-10-12 15:44 ` Linus Torvalds
0 siblings, 2 replies; 51+ messages in thread
From: Andrew Cooper @ 2024-10-12 8:48 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta,
Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds,
Michael Ellerman, linuxppc-dev, Mark Rutland
On 12/10/2024 5:09 am, Josh Poimboeuf wrote:
> For x86-64, the barrier_nospec() in copy_from_user() is overkill and
> painfully slow. Instead, use pointer masking to force the user pointer
> to a non-kernel value even in speculative paths.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
You do realise mask_user_address() is unsafe under speculation on AMD
systems?
Had the mask_user_address() patch been put for review, this feedback
would have been given then.
AMD needs to arrange for bit 47 (bit 56 with LA57) to be the one
saturated by shifting, not bit 63.
As it stands, you're reintroducing the very problem barrier_nospec() was
introduced to mitigate.
~Andrew
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 8:48 ` Andrew Cooper
@ 2024-10-12 14:09 ` Josh Poimboeuf
2024-10-12 14:21 ` Borislav Petkov
2024-10-12 14:26 ` Andrew Cooper
2024-10-12 15:44 ` Linus Torvalds
1 sibling, 2 replies; 51+ messages in thread
From: Josh Poimboeuf @ 2024-10-12 14:09 UTC (permalink / raw)
To: Andrew Cooper
Cc: x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra,
Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar,
Linus Torvalds, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, Oct 12, 2024 at 09:48:57AM +0100, Andrew Cooper wrote:
> On 12/10/2024 5:09 am, Josh Poimboeuf wrote:
> > For x86-64, the barrier_nospec() in copy_from_user() is overkill and
> > painfully slow. Instead, use pointer masking to force the user pointer
> > to a non-kernel value even in speculative paths.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
>
> You do realise mask_user_address() is unsafe under speculation on AMD
> systems?
>
> Had the mask_user_address() patch been put for review, this feedback
> would have been given then.
>
>
> > AMD needs to arrange for bit 47 (bit 56 with LA57) to be the one
> saturated by shifting, not bit 63.
Ok... why?
--
Josh
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 14:09 ` Josh Poimboeuf
@ 2024-10-12 14:21 ` Borislav Petkov
2024-10-12 15:58 ` Linus Torvalds
2024-10-12 14:26 ` Andrew Cooper
1 sibling, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2024-10-12 14:21 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrew Cooper, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta,
Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds,
Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, Oct 12, 2024 at 09:09:23AM -0500, Josh Poimboeuf wrote:
> On Sat, Oct 12, 2024 at 09:48:57AM +0100, Andrew Cooper wrote:
> > On 12/10/2024 5:09 am, Josh Poimboeuf wrote:
> > > For x86-64, the barrier_nospec() in copy_from_user() is overkill and
> > > painfully slow. Instead, use pointer masking to force the user pointer
> > > to a non-kernel value even in speculative paths.
> > >
> > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> >
> > You do realise mask_user_address() is unsafe under speculation on AMD
> > systems?
> >
> > Had the mask_user_address() patch been put for review, this feedback
> > would have been given then.
> >
> >
> > AMD needs to arrange for bit 47 (bit 56 with LA57) to be the one
> > saturated by shifting, not bit 63.
>
> Ok... why?
I've been working on a fix for this. Still WIP.
Author: Borislav Petkov (AMD) <bp@alien8.de>
Date: Sat Oct 12 00:18:10 2024 +0200
x86/uaccess: Fix user access masking on AMD
Commit
2865baf54077 ("x86: support user address masking instead of non-speculative conditional")
started doing a data-dependent mask on the user address in order to
speed up the fast unsafe user access patterns.
However, AMD CPUs use only 48 or 56 bits, respectively, when doing
transient TLB accesses, and a non-canonical kernel address supplied by
userspace can still result in a TLB hit and access kernel data transiently,
leading to a potential spectre v1 leak, see bulletin Link below.
Therefore, use the most-significant address bit when doing the masking
and prevent such transient TLB hits.
Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index afce8ee5d7b7..73c7a6bf1468 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -58,7 +58,9 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
* user_access_begin that can avoid the fencing. This only works
* for dense accesses starting at the address.
*/
-#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63)))
+#define mask_user_address(x) ((typeof(x)) \
+ ((long)(x) | ((long)(x) << (63 - __VIRTUAL_MASK_SHIFT) >> 63)))
+
#define masked_user_access_begin(x) ({ \
__auto_type __masked_ptr = (x); \
__masked_ptr = mask_user_address(__masked_ptr); \
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 14:09 ` Josh Poimboeuf
2024-10-12 14:21 ` Borislav Petkov
@ 2024-10-12 14:26 ` Andrew Cooper
1 sibling, 0 replies; 51+ messages in thread
From: Andrew Cooper @ 2024-10-12 14:26 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra,
Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar,
Linus Torvalds, Michael Ellerman, linuxppc-dev, Mark Rutland
On 12/10/2024 3:09 pm, Josh Poimboeuf wrote:
> On Sat, Oct 12, 2024 at 09:48:57AM +0100, Andrew Cooper wrote:
>> On 12/10/2024 5:09 am, Josh Poimboeuf wrote:
>>> For x86-64, the barrier_nospec() in copy_from_user() is overkill and
>>> painfully slow. Instead, use pointer masking to force the user pointer
>>> to a non-kernel value even in speculative paths.
>>>
>>> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
>> You do realise mask_user_address() is unsafe under speculation on AMD
>> systems?
>>
>> Had the mask_user_address() patch been put for review, this feedback
>> would have been given then.
>>
>>
>> AMD needs to arrange for bit 47 (bit 56 with LA57) to be the one
>> saturated by shifting, not bit 63.
> Ok... why?
CVE-2020-12965
https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html
~Andrew
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 8:48 ` Andrew Cooper
2024-10-12 14:09 ` Josh Poimboeuf
@ 2024-10-12 15:44 ` Linus Torvalds
2024-10-12 17:23 ` Andrew Cooper
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-12 15:44 UTC (permalink / raw)
To: Andrew Cooper
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, 12 Oct 2024 at 01:49, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> You do realise mask_user_address() is unsafe under speculation on AMD
> systems?
That had *better* not be true.
> Had the mask_user_address() patch been put for review, this feedback
> would have been given then.
That's BS.
Why? Look at commit b19b74bc99b1 ("x86/mm: Rework address range check
in get_user() and put_user()").
This mask_user_address() thing is how we've been doing a regular
get/put_user() for the last 18 months. It's *exactly* the same
pattern:
mov %rax, %rdx
sar $63, %rdx
or %rdx, %rax
ie we saturate the sign bit.
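A minimal C model of what those three instructions compute (it mirrors the
mask_user_address() macro quoted earlier in the thread; the function name is
illustrative):

	/*
	 * Replicate the sign bit across the register and OR it back in:
	 * user-half pointers (bit 63 clear) pass through unchanged, while
	 * kernel-half pointers (bit 63 set) become all-ones and will fault
	 * when dereferenced.
	 */
	static inline unsigned long sign_saturate_ptr(unsigned long addr)
	{
		long sign = (long)addr >> 63;	/* 0 for user half, -1 for kernel half */

		return addr | (unsigned long)sign;
	}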
> AMD needs to arrange for bit 47 (bit 56 with LA57) to be the one
> saturated by shifting, not bit 63.
>
> As it stands, you're reintroducing the very problem barrier_nospec() was
> introduced to mitigate.
If that's true, we have much bigger issues. And it has nothing to do
with the new address masking macro, that just exposed existing logic.
Linus
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 14:21 ` Borislav Petkov
@ 2024-10-12 15:58 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-12 15:58 UTC (permalink / raw)
To: Borislav Petkov
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, 12 Oct 2024 at 07:21, Borislav Petkov <bp@alien8.de> wrote:
>
> Commit
>
> 2865baf54077 ("x86: support user address masking instead of non-speculative conditional")
No.
This started long before. Again, see commit b19b74bc99b1 ("x86/mm:
Rework address range check in get_user() and put_user()") and the code
it generates.
get_user() results in plain calls to __get_user_X, where X is the
size. No barriers. And __get_user_X() does that exact same thing.
And no, your suggested patch looks very suspicious:
+#define mask_user_address(x) ((typeof(x)) \
+ ((long)(x) | ((long)(x) << (63 - __VIRTUAL_MASK_SHIFT) >> 63)))
that does no masking at all on a kernel address, so you can feed it
random kernel addresses and it will just access them,
Linus
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 15:44 ` Linus Torvalds
@ 2024-10-12 17:23 ` Andrew Cooper
2024-10-12 17:44 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2024-10-12 17:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On 12/10/2024 4:44 pm, Linus Torvalds wrote:
> On Sat, 12 Oct 2024 at 01:49, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> You do realise mask_user_address() is unsafe under speculation on AMD
>> systems?
> That had *better* not be true.
Yeah I'd prefer it wasn't true either.
>> Had the mask_user_address() patch been put for review, this feedback
>> would have been given then.
> That's BS.
>
> Why? Look at commit b19b74bc99b1 ("x86/mm: Rework address range check
> in get_user() and put_user()").
That looks like 3 Intel tags and 0 AMD tags.
But ok, I didn't spot this one, and it looks unsafe too. It was not
reviewed by anyone that had a reasonable expectation to know AMD's
microarchitectural behaviour.
Previously, the STAC protected against bad prediction of the JAE and
prevented dereferencing the pointer if it was greater than TASK_SIZE.
Importantly for the issue at hand, the calculation against TASK_SIZE
excluded the whole non-canonical region.
> This mask_user_address() thing is how we've been doing a regular
> get/put_user() for the last 18 months. It's *exactly* the same
> pattern:
>
> mov %rax, %rdx
> sar $63, %rdx
> or %rdx, %rax
>
> ie we saturate the sign bit.
This logic is asymmetric.
For an address in the upper half (canonical or non-canonical), it ORs
with -1 and fully replaces the prior address.
For an address in the lower half (canonical or non-canonical), it leaves
the value intact, as either canonical or non-canonical.
Then the pointer is architecturally dereferenced, relying on catching
#PF/#GP for the slow path. Architecturally, this is safe.
Micro-architecturally though, AMD CPUs use bit 47, not 63, in the TLB
lookup. This behaviour dates from the K8, and is exposed somewhat in
the virt extensions.
When userspace passes in a non-canonical pointer in the low half of the
address space but with bit 47 set, it will be considered a high-half
pointer when sent for TLB lookup, and the pagetables say it's a
supervisor mapping, so the memory access will be permitted to go ahead
speculatively. Only later does the pipeline realise the address was
non-canonical and raise #GP.
This lets userspace directly target and load anything cacheable in the
kernel mappings. It's not as easy to exploit as Meltdown on Intel, but
it is known behaviour, and has been the subject of academic work for 4 years.
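A minimal sketch of the address class being described, assuming 4-level
paging (the value and variable names are illustrative):

	/*
	 * Bit 63 clear, so the sign-bit mask leaves it alone; bit 47 set, so
	 * the affected parts treat it as a high-half address during the
	 * transient TLB lookup.  Architecturally it is non-canonical and
	 * raises #GP.
	 */
	unsigned long crafted = 0x0000800000000000UL;

	/* mask_user_address(): (long)crafted >> 63 == 0, so the OR is a no-op */
	unsigned long masked = crafted | (unsigned long)((long)crafted >> 63);

	/* masked == crafted: the speculative window still sees bit 47 set */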
~Andrew
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 17:23 ` Andrew Cooper
@ 2024-10-12 17:44 ` Linus Torvalds
2024-10-13 0:53 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-12 17:44 UTC (permalink / raw)
To: Andrew Cooper
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, 12 Oct 2024 at 10:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>
> This logic is asymmetric.
>
> For an address in the upper half (canonical or non-canonical), it ORs
> with -1 and fully replaces the prior address.
Right. The point is that non-canonical addresses will fault, and
kernel addresses are guaranteed to fault.
And the assumption was that any fault will be sufficient to hide the
result, because otherwise you have meltdown all over again.
> When userspace passes in a non-canonical pointer in the low half of the
> address space but with bit 47 set, it will be considered a high-half
> pointer when sent for TLB lookup, and the pagetables say it's a
> supervisor mapping, so the memory access will be permitted to go ahead
> speculatively. Only later does the pipeline realise the address was
> non-canonical and raise #GP.
>
> This lets userspace directly target and load anything cacheable in the
> kernel mappings. It's not as easy to exploit as Meltdown on Intel, but
> it known behaviour, and been the subject of academic work for 4 years.
It sure was never talked about in kernel circles. I checked my email
archives, and neither
CVE-2020-12965
nor that
https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html
is anywhere in my emails, nor does lore.kernel.org find them anywhere either.
Anyway, what's the speculation window size like?
Linus
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-12 17:44 ` Linus Torvalds
@ 2024-10-13 0:53 ` Linus Torvalds
2024-10-13 1:21 ` Linus Torvalds
2024-10-14 11:56 ` Kirill A. Shutemov
0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-13 0:53 UTC (permalink / raw)
To: Andrew Cooper
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, 12 Oct 2024 at 10:44, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, what's the speculation window size like?
Note that this is important basically because we do *NOT* want to
check the address against TASK_SIZE_MAX like we used to, because not
only is TASK_SIZE_MAX not a compile-time constant, but with linear
address masking, people actually *want* to use addresses that are in
the non-canonical range.
IOW, see also
arch/x86/include/asm/uaccess_64.h
and notice how the x86-64 __access_ok() check *also* does the whole
"top bit set" thing (iow, see __access_ok()).
IOW, this actually goes even further back than the commit I mentioned
earlier - it goes back to commit 6014bc27561f ("x86-64: make
access_ok() independent of LAM") because without the sign bit trick,
LAM is a complete disaster.
So no, the address masking can not depend on things like
__VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account
too. Not that I know if there are any CPU's out there that actually
have LAM enabled.
Linus
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-13 0:53 ` Linus Torvalds
@ 2024-10-13 1:21 ` Linus Torvalds
2024-10-14 3:54 ` Josh Poimboeuf
2024-10-14 11:56 ` Kirill A. Shutemov
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-13 1:21 UTC (permalink / raw)
To: Andrew Cooper
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, 12 Oct 2024 at 17:53, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So no, the address masking can not depend on things like
> __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account
> too. Not that I know if there are any CPU's out there that actually
> have LAM enabled.
Lunar Lake and Arrow Lake, apparently.
One thing that may make this all moot is that the value loaded from a
possible non-canonical range won't actually be used until after we've
done a "CLAC".
And at least judging by the performance of STAC/CLAC on my machines,
those instructions will likely have stopped any speculation cold.
So maybe the "what is the actual cycle latency of detecting the
faulting instruction" really is the core question here.
Linus
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-13 1:21 ` Linus Torvalds
@ 2024-10-14 3:54 ` Josh Poimboeuf
2024-10-14 6:50 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Josh Poimboeuf @ 2024-10-14 3:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On Sat, Oct 12, 2024 at 06:21:12PM -0700, Linus Torvalds wrote:
> On Sat, 12 Oct 2024 at 17:53, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So no, the address masking can not depend on things like
> > __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account
> > too. Not that I know if there are any CPU's out there that actually
> > have LAM enabled.
If I understand correctly, LAM bits are for the benefit of SW and are
ignored by HW? Can we just mask those bits off?
> And at least judging by the performance of STAC/CLAC on my machines,
> those instructions will likely have stopped any speculation cold.
>
> So maybe the "what is the actual cycle latency of detecting the
> faulting instruction" really is the core question here.
I think I remember reading that STAC/CLAC can't necessarily be relied on
as a speculation barrier, depending on microarchitectural details. It
might be safest to assume we can't rely on that. Masking is relatively
cheap anyway.
It sounds like we may need to go back to a more traditional access_ok()
which masks off the LAM bits (right?) and then compares with
TASK_SIZE_MAX. While TASK_SIZE_MAX is no longer necessarily a
compile-time constant for the CONFIG_X86_5LEVEL case, it's no worse than
a single "mov" instruction in task_size_max().
So far I have something like the below which is completely untested and
probably actively wrong in some places.
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 39199eef26be..92c9f2847217 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -50,57 +50,43 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
* The virtual address space space is logically divided into a kernel
* half and a user half. When cast to a signed type, user pointers
* are positive and kernel pointers are negative.
+ *
+ * TODO make sure users are taking AMD non-canonical bit speculation bug into
+ * account
*/
#define valid_user_address(x) ((__force long)(x) >= 0)
+#ifndef untagged_addr
+#define untagged_addr(addr) (addr)
+#endif
+
+static inline bool __access_ok(const void __user *ptr, unsigned long size)
+{
+ unsigned long addr = (__force unsigned long)untagged_addr(ptr);
+ unsigned long limit = TASK_SIZE_MAX;
+
+ return (size <= limit) && addr <= (limit - size);
+}
+#define __access_ok __access_ok
+
+/*
+ * Called after access_ok(), so architecturally this is a valid user pointer.
+ * Force it that way for the speculative case.
+ */
+#define mask_user_address(x) \
+ ((typeof(x))((__force unsigned long)(x) & ((1UL << __VIRTUAL_MASK_SHIFT) - 1)))
+
/*
* Masking the user address is an alternative to a conditional
* user_access_begin that can avoid the fencing. This only works
* for dense accesses starting at the address.
*/
-#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63)))
-#define masked_user_access_begin(x) ({ \
- __auto_type __masked_ptr = (x); \
- __masked_ptr = mask_user_address(__masked_ptr); \
- __uaccess_begin(); __masked_ptr; })
-
-/*
- * User pointers can have tag bits on x86-64. This scheme tolerates
- * arbitrary values in those bits rather then masking them off.
- *
- * Enforce two rules:
- * 1. 'ptr' must be in the user half of the address space
- * 2. 'ptr+size' must not overflow into kernel addresses
- *
- * Note that addresses around the sign change are not valid addresses,
- * and will GP-fault even with LAM enabled if the sign bit is set (see
- * "CR3.LAM_SUP" that can narrow the canonicality check if we ever
- * enable it, but not remove it entirely).
- *
- * So the "overflow into kernel addresses" does not imply some sudden
- * exact boundary at the sign bit, and we can allow a lot of slop on the
- * size check.
- *
- * In fact, we could probably remove the size check entirely, since
- * any kernel accesses will be in increasing address order starting
- * at 'ptr', and even if the end might be in kernel space, we'll
- * hit the GP faults for non-canonical accesses before we ever get
- * there.
- *
- * That's a separate optimization, for now just handle the small
- * constant case.
- */
-static inline bool __access_ok(const void __user *ptr, unsigned long size)
+static __must_check __always_inline void __user *masked_user_access_begin(void __user *ptr, size_t len)
{
- if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
- return valid_user_address(ptr);
- } else {
- unsigned long sum = size + (__force unsigned long)ptr;
-
- return valid_user_address(sum) && sum >= (__force unsigned long)ptr;
- }
+ if (unlikely(!__access_ok(ptr,len)))
+ return (__force void __user *)NULL;
+ return mask_user_address(ptr);
}
-#define __access_ok __access_ok
/*
* Copy To/From Userspace
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 094224ec9dca..7e09708642e2 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -35,14 +35,16 @@
#include <asm/asm.h>
#include <asm/smap.h>
-#define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
-
-.macro check_range size:req
+.macro check_range size:req check=1
.if IS_ENABLED(CONFIG_X86_64)
- mov %rax, %rdx /* mask_user_address() */
- sar $63, %rdx
- or %rdx, %rax
-.else
+ // FIXME take size into account?
+ ALTERNATIVE "movq $0x7fffffffffff, %rdx", "movq $0xffffffffffffff, %rdx", X86_FEATURE_LA57
+.if \check == 1
+ cmp %rdx, %rax /* access_ok() */
+ ja .Lbad_get_user
+.endif
+ and %rdx, %rax /* mask_user_address() */
+.else /* 32 bit */
cmp $TASK_SIZE_MAX-\size+1, %eax
jae .Lbad_get_user
sbb %edx, %edx /* array_index_mask_nospec() */
@@ -105,10 +107,10 @@ SYM_FUNC_START(__get_user_8)
SYM_FUNC_END(__get_user_8)
EXPORT_SYMBOL(__get_user_8)
-/* .. and the same for __get_user, just without the range checks */
+/* .. and the same for __get_user, just without the access_ok() checks */
SYM_FUNC_START(__get_user_nocheck_1)
+ check_range size=1 check=0
ASM_STAC
- ASM_BARRIER_NOSPEC
UACCESS movzbl (%_ASM_AX),%edx
xor %eax,%eax
ASM_CLAC
@@ -117,8 +119,8 @@ SYM_FUNC_END(__get_user_nocheck_1)
EXPORT_SYMBOL(__get_user_nocheck_1)
SYM_FUNC_START(__get_user_nocheck_2)
+ check_range size=2 check=0
ASM_STAC
- ASM_BARRIER_NOSPEC
UACCESS movzwl (%_ASM_AX),%edx
xor %eax,%eax
ASM_CLAC
@@ -127,8 +129,8 @@ SYM_FUNC_END(__get_user_nocheck_2)
EXPORT_SYMBOL(__get_user_nocheck_2)
SYM_FUNC_START(__get_user_nocheck_4)
+ check_range size=4 check=0
ASM_STAC
- ASM_BARRIER_NOSPEC
UACCESS movl (%_ASM_AX),%edx
xor %eax,%eax
ASM_CLAC
@@ -137,8 +139,8 @@ SYM_FUNC_END(__get_user_nocheck_4)
EXPORT_SYMBOL(__get_user_nocheck_4)
SYM_FUNC_START(__get_user_nocheck_8)
+ check_range size=8 check=0
ASM_STAC
- ASM_BARRIER_NOSPEC
#ifdef CONFIG_X86_64
UACCESS movq (%_ASM_AX),%rdx
#else
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 09b7e37934ab..dde3b7825717 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -32,12 +32,15 @@
* as they get called from within inline assembly.
*/
-.macro check_range size:req
+.macro check_range size:req check=1
.if IS_ENABLED(CONFIG_X86_64)
- mov %rcx, %rbx /* mask_user_address() */
- sar $63, %rbx
- or %rbx, %rcx
-.else
+ ALTERNATIVE "movq $0x7fffffffffff, %rdx", "movq $0xffffffffffffff, %rdx", X86_FEATURE_LA57
+.if \check == 1
+ cmp %rdx, %rax /* access_ok() */
+ ja .Lbad_put_user
+.endif
+ and %rdx, %rax /* mask_user_address() */
+.else /* 32 bit */
cmp $TASK_SIZE_MAX-\size+1, %ecx
jae .Lbad_put_user
.endif
@@ -55,6 +58,7 @@ SYM_FUNC_END(__put_user_1)
EXPORT_SYMBOL(__put_user_1)
SYM_FUNC_START(__put_user_nocheck_1)
+ check_range size=1 check=0
ASM_STAC
2: movb %al,(%_ASM_CX)
xor %ecx,%ecx
@@ -74,6 +78,7 @@ SYM_FUNC_END(__put_user_2)
EXPORT_SYMBOL(__put_user_2)
SYM_FUNC_START(__put_user_nocheck_2)
+ check_range size=2 check=0
ASM_STAC
4: movw %ax,(%_ASM_CX)
xor %ecx,%ecx
@@ -93,6 +98,7 @@ SYM_FUNC_END(__put_user_4)
EXPORT_SYMBOL(__put_user_4)
SYM_FUNC_START(__put_user_nocheck_4)
+ check_range size=4 check=0
ASM_STAC
6: movl %eax,(%_ASM_CX)
xor %ecx,%ecx
@@ -115,6 +121,7 @@ SYM_FUNC_END(__put_user_8)
EXPORT_SYMBOL(__put_user_8)
SYM_FUNC_START(__put_user_nocheck_8)
+ check_range size=8 check=0
ASM_STAC
9: mov %_ASM_AX,(%_ASM_CX)
#ifdef CONFIG_X86_32
diff --git a/fs/select.c b/fs/select.c
index a77907faf2b4..2ca7ad555342 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -777,9 +777,11 @@ static inline int get_sigset_argpack(struct sigset_argpack *to,
{
// the path is hot enough for overhead of copy_from_user() to matter
if (from) {
- if (can_do_masked_user_access())
- from = masked_user_access_begin(from);
- else if (!user_read_access_begin(from, sizeof(*from)))
+ if (can_do_masked_user_access()) {
+ from = masked_user_access_begin(from, sizeof(*from));
+ if (!from)
+ return -EFAULT;
+ } else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(to->p, &from->p, Efault);
unsafe_get_user(to->size, &from->size, Efault);
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index dda9725a9559..fe11b3fb755a 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -37,7 +37,7 @@
#define can_do_masked_user_access() 1
#else
#define can_do_masked_user_access() 0
- #define masked_user_access_begin(src) NULL
+ #define masked_user_access_begin(ptr, len) NULL
#endif
/*
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 989a12a67872..ce5eaaed2365 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -120,15 +120,6 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
if (unlikely(count <= 0))
return 0;
- if (can_do_masked_user_access()) {
- long retval;
-
- src = masked_user_access_begin(src);
- retval = do_strncpy_from_user(dst, src, count, count);
- user_read_access_end();
- return retval;
- }
-
max_addr = TASK_SIZE_MAX;
src_addr = (unsigned long)untagged_addr(src);
if (likely(src_addr < max_addr)) {
@@ -144,11 +135,17 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
kasan_check_write(dst, count);
check_object_size(dst, count, false);
- if (user_read_access_begin(src, max)) {
- retval = do_strncpy_from_user(dst, src, count, max);
- user_read_access_end();
- return retval;
+ if (can_do_masked_user_access()) {
+ src = masked_user_access_begin(src, max);
+ if (!src)
+ return -EFAULT;
+ } else if (!user_read_access_begin(src, max)) {
+ return -EFAULT;
}
+
+ retval = do_strncpy_from_user(dst, src, count, max);
+ user_read_access_end();
+ return retval;
}
return -EFAULT;
}
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 6e489f9e90f1..c6066906704a 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -96,15 +96,6 @@ long strnlen_user(const char __user *str, long count)
if (unlikely(count <= 0))
return 0;
- if (can_do_masked_user_access()) {
- long retval;
-
- str = masked_user_access_begin(str);
- retval = do_strnlen_user(str, count, count);
- user_read_access_end();
- return retval;
- }
-
max_addr = TASK_SIZE_MAX;
src_addr = (unsigned long)untagged_addr(str);
if (likely(src_addr < max_addr)) {
@@ -118,11 +109,17 @@ long strnlen_user(const char __user *str, long count)
if (max > count)
max = count;
- if (user_read_access_begin(str, max)) {
- retval = do_strnlen_user(str, count, max);
- user_read_access_end();
- return retval;
+ if (can_do_masked_user_access()) {
+ str = masked_user_access_begin(str, max);
+ if (!str)
+ return 0;
+ } else if (!user_read_access_begin(str, max)) {
+ return 0;
}
+
+ retval = do_strnlen_user(str, count, max);
+ user_read_access_end();
+ return retval;
}
return 0;
}
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 3:54 ` Josh Poimboeuf
@ 2024-10-14 6:50 ` Linus Torvalds
2024-10-14 9:59 ` David Laight
` (2 more replies)
0 siblings, 3 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-14 6:50 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
[-- Attachment #1: Type: text/plain, Size: 7423 bytes --]
On Sun, 13 Oct 2024 at 20:54, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> If I understand correctly, LAM bits are for the benefit of SW and are
> ignored by HW? Can we just mask those bits off?
Yes. But then you waste time on the masking, particularly if it
then causes silly extra overhead just to get the mask.
That was why the whole original LAM patches were dismissed - not
because an "and" instruction to remove the LAM bits would be
expensive, but because when you do static branching depending on it
and then load the mask from memory, the code goes from nice simple
straight-line efficient code to a horrible mess.
The early LAM patches admittedly made it even worse because it did a
per-thread mask etc, so it really got quite nasty and wasn't just
*one* static jump to a constant. But still..
> > So maybe the "what is the actual cycle latency of detecting the
> > faulting instruction" really is the core question here.
>
> I think I remember reading that STAC/CLAC can't necessarily be relied on
> as a speculation barrier, depending on microarchitectural details. It
> might be safest to assume we can't rely on that. Masking is relatively
> cheap anyway.
The WHOLE BUG is all about "microarchitectural details".
So arguing that STAC is a microarchitectural detail and not
architecturally guaranteed isn't an argument.
Because architecturally, this bug simply does not exist, and so some
architectural definition simply doesn't matter.
So all that matters is whether the broken microarchitectures that used
the wrong bit for testing also have a STAC that basically hides the
bug.
Christ. The hardware should have used a bit that is *MEANINGFUL*,
namely bit #63, not some random meaningless bit that just happens to
be one of the bits that is then checked for canonicality.
And it's so annoying, because from a *hardware* perspective, bit #63
vs bit #48 is completely irrelevant. It's literally just a wire choice.
But from an architectural perspective, bit #63 is not only the
*actual* bit that is the real difference ("kernel is at the top of
the address space") but for software, bit #48 is fundamentally harder
to test.
IOW, it's completely the wrong effing bit to test.
Honestly, the Intel meltdown thing I at least understood *why* it happened.
This AMD "use meaningless bit #48" bug just strikes me as active malice.
The paper I found for this bug doesn't go into any details of what the
cycle count issues are until the full address is verified, and isn't
very clear about which micro-architectures are affected. It says "Zen+"
and "Zen 2", from a quick read.
The AMD note doesn't say even that. Is this *all* Zen cores? If not,
when did it get fixed?
> So far I have something like the below which is completely untested and
> probably actively wrong in some places.
This is worse than actively wrong. It's just adding insult to injury.
> +static inline bool __access_ok(const void __user *ptr, unsigned long size)
> +{
> + unsigned long addr = (__force unsigned long)untagged_addr(ptr);
> + unsigned long limit = TASK_SIZE_MAX;
We're not encouraging this complete microarchitectural incompetence by
using something expensive like TASK_SIZE_MAX, which actually expands
to be conditional on pgtable_l5_enabled() and a shift (maybe the
compiler ends up moving the shift into both sides of the
conditional?).
Which isn't even the right thing, because presumably as far as the
HARDWARE is concerned, the actual width of the TLB is what gets
tested, and isn't actually dependent on whether 5-level paging is
actually *enabled* or not, but on whether the hardware *supports*
5-level paging.
I guess as long as we aren't using the high bits, we can just clear
all of them (so if we just always clear bit #57 when we also clear
#48, we are ok), but it still seems wrong in addition to being
pointlessly expensive.
> +#define mask_user_address(x) \
> + ((typeof(x))((__force unsigned long)(x) & ((1UL << __VIRTUAL_MASK_SHIFT) - 1)))
No. Again, that code makes it look like it's some nice simple
constant. It's not. __VIRTUAL_MASK_SHIFT is that conditional thing
based on pgtable_l5_enabled(). So now you have static branches etc
craziness instead of having a nice constant.
We can fix this, but no, we're not going "hardware people did
something incredibly stupid, so now to balance things out, *we* will
do something equally stupid".
We are going to be better than that.
So the way to fix this properly - if it even needs fixing - is to just
have an actual assembler alternate that does something like this in
get_user.S, and does the same for the C code for mask_user_address().
And it needs a *big* comment about the stupidity of hardware checking
the wrong bit that has no semantic meaning except for it (too late!)
being tested for being canonical with the actual bit that matters (ie
bit #63).
And again, the whole "if it even needs fixing" is a big thing. Maybe
STAC is just taking long enough that the canonicality check *has* been
done. We know the STAC isn't a serializing instruction, but we also
*do* know that STAC sure as hell will synchronize at least to some
degree with memory access permission testing, because that's literally
the whole and only point of it.
(Of course, if the AC bit was just passed along from the front end and
tagged all the instructions, the actual CLAC/STAC instructions
wouldn't need to serialize with actual instruction execution, but if
that was the case they wouldn't be as expensive as I see them being in
profiles, so we know the uarch isn't doing something that clever).
Christ. I thought I was over being annoyed by hardware bugs. But this
hardware bug is just annoying in how *stupid* it is.
Anyway, the attached patch
(a) only fixes get_user() - I assume put_user() doesn't even need
this, because the store will go into the store buffer, and certainly
be killed before it gets anywhere else?
(b) only does the asm side, the mask_user_address() would need to do
the same thing using the C version: alternative()
(c) is entirely untested, and I might have gotten the constants wrong
or some other logic wrong.
but at least the code it generates doesn't look actively like garbage.
It generates something like this:
mov %rax,%rdx
sar $0x3f,%rdx
or %rdx,%rax
movabs $0x80007fffffffffff,%rdx
and %rdx,%rax
which looks about as good as it gets (assuming I didn't screw up the constant).
The "or" still sets all bits when it's one of the real kernel
addresses), but the "and" will now guarantee that the end result is
canonical for positive addresses, and guaranteed non-canonical - and
for this AMD bug, a zero bit 48/57 - for the invalid kernel range.
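A rough C model of that five-instruction sequence (4-level constant shown;
illustrative only, not the final patch):

	static inline unsigned long or_and_mask(unsigned long addr)
	{
		/* sar/or: saturate the sign bit */
		addr |= (unsigned long)((long)addr >> 63);

		/* and: clear bits 47-62, keep bit 63 and the low 47 bits */
		addr &= 0x80007fffffffffffUL;

		return addr;
	}

	/*
	 * User-half pointers come back with bits 47-62 (including any LAM tag
	 * bits) cleared; kernel-half pointers become 0x80007fffffffffff, which
	 * is non-canonical and has bit 47 clear.
	 */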
Yes, it basically means that as far as the kernel is concerned,
"everything is LAM".
The kernel would actually accept all positive addresses, and only
fault (now with a non-canonical GP fault) when users try to use
negative addresses.
Which is arguably also quite horrendous, but it's effectively what
having LAM enabled would do in hardware anyway, so hey, it's "forward
looking". Bah.
And we could make the masking constants be 0xfeff7fffffffffff and
0xfeffffffffffffff, and *only* mask off bit 48/57.
And we could even make the whole "and" be conditional on the AMD bug
with a new X86_FEATURE_CANONICAL_LEAK thing.
So there are certainly options in this area.
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 673 bytes --]
arch/x86/lib/getuser.S | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index d066aecf8aeb..7d5730aa18b8 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,11 +37,17 @@
#define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
+#define X86_CANONICAL_MASK ALTERNATIVE \
+ "movq $0x80007fffffffffff,%rdx", \
+ "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
+
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
mov %rax, %rdx
sar $63, %rdx
or %rdx, %rax
+ X86_CANONICAL_MASK
+ and %rdx,%rax
.else
cmp $TASK_SIZE_MAX-\size+1, %eax
jae .Lbad_get_user
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 6:50 ` Linus Torvalds
@ 2024-10-14 9:59 ` David Laight
2024-10-14 11:20 ` Linus Torvalds
2024-10-14 11:12 ` Andrew Cooper
2024-10-14 12:30 ` Kirill A. Shutemov
2 siblings, 1 reply; 51+ messages in thread
From: David Laight @ 2024-10-14 9:59 UTC (permalink / raw)
To: 'Linus Torvalds', Josh Poimboeuf
Cc: Andrew Cooper, x86@kernel.org, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org,
Mark Rutland
...
> > If I understand correctly, LAM bits are for the benefit of SW and are
> > ignored by HW? Can we just mask those bits off?
>
> Yes. But then you waste time on the masking, but particularly if it
> then causes silly extra overhead just to get the mask.
Isn't LAM just plain stupid unless the hardware validates the bits
against the TLB?
You start with a nice big sparse address space (potentially 63 bits of it)
where things can be spread out to make 'random' addresses likely to
fault and then alias 32k addresses onto each memory location.
Sounds brain-dead to me.
If you could set the stack pointer to a 'high' address and have the
hardware check that the TLB was for that alias then you'd get
reasonable stack overflow checking.
...
> namely bit #63, not some random meaningless bit that just happens to
> be one of the bits that is then checked for canonicality.
>
> And it's so annoying, because from a *hardware* perspective, bit #63
> vs bit #48 is completely irrelevant. It's literally just a wire choice
>
> But from an architectural perspective, bit #63 is not only the
> *actual* bit that is the real difference ("kernel is at the top of
> the address space") but for software, bit #48 is fundamentally harder
> to test.
Doesn't ARM64 have the same issue?
I'm sure I also remember some other architectural feature that
extends the valid virtual addresses beyond 48 bits?
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 6:50 ` Linus Torvalds
2024-10-14 9:59 ` David Laight
@ 2024-10-14 11:12 ` Andrew Cooper
2024-10-14 12:30 ` Kirill A. Shutemov
2 siblings, 0 replies; 51+ messages in thread
From: Andrew Cooper @ 2024-10-14 11:12 UTC (permalink / raw)
To: Linus Torvalds, Josh Poimboeuf
Cc: x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra,
Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar,
Michael Ellerman, linuxppc-dev, Mark Rutland
On 14/10/2024 7:50 am, Linus Torvalds wrote:
> On Sun, 13 Oct 2024 at 20:54, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>> If I understand correctly, LAM bits are for the benefit of SW and are
>> ignored by HW?
That is what is written in the ISE today.
But, I'll note that on ARM, MTE (Memory Tagging Extension) uses the same
bits that TBI (Top Byte Ignore) made available to software.
>>> So maybe the "what is the actual cycle latency of detecting the
>>> faulting instruction" really is the core question here.
I'm afraid I will have to defer that to AMD. I have not experimented
personally.
>> I think I remember reading that STAC/CLAC can't necessarily be relied on
>> as a speculation barrier, depending on microarchitectural details.
STAC and CLAC are not LFENCEs.
They happen to be on implementations where AC is not a renamed flag
(i.e. can't be updated speculatively), but both vendors want the
flexibility in the future.
Therefore, it is my recollection that the vendors agreed that STAC/CLAC
would prevent memory accesses from occurring under the wrong AC,
although I can't actually find this written down anywhere.
I think we need to chase some architects until a statement or two fall out.
>> It
>> might be safest to assume we can't rely on that. Masking is relatively
>> cheap anyway.
> The WHOLE BUG is all about "microarchitectural details".
>
> So arguing that STAC is a microarchitectural detail and not
> architecturally guaranteed isn't an argument.
>
> Because architecturally, this bug simply does not exist, and so some
> architectural definition simply doesn't matter.
>
> So all that matters is whether the broken microarchitectures that used
> the wrong bit for testing also have a STAC that basically hides the
> bug.
>
> Christ. The hardware should have used a bit that is *MEANINGFUL*,
> namely bit #63, not some random meaningless bit that just happens to
> be one of the bits that is then checked for canonicality.
>
> And it's so annoying, because from a *hardware* perspective, bit #63
> vs bit #48 is completely irrelevant. It's literally just a wire choice
>
> But from an architectural perspective, bit #63 is not only the
> *actual* bit that is the real difference ("kernel is at the top of
> the address space") but for software, bit #48 is fundamentally harder
> to test.
>
> IOW, it's completely the wrong effing bit to test.
>
> Honestly, the Intel meltdown thing I at least understood *why* it happened.
>
> This AMD "use meaningless bit #48" bug just strikes me as active malice.
I think this is unfair.
This was an implementation decision taken probably around the millennium
(given lead times to get the K8 out in 2003).
There is a reason why - given they're architecturally equivalent - the
inside bit is more useful than the outside bit. Some AMD CPUs really
only have 48 bits of storage for pointers, and it really does simplify
circuitry.
Furthermore, since when have software concerns (one bit being harder to
test than others) ever factored into hardware design? It's sad but true.
From further up the email chain:
> It sure was never talked about in kernel circles.
Quite possibly not. At the time the Linux accessors were safe, so where
ought it to have been discussed?
It's not appropriate for an encrypted list, nor security@. AMD did
deliver it to software vendors via their normal mechanism.
~Andrew
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 9:59 ` David Laight
@ 2024-10-14 11:20 ` Linus Torvalds
2024-10-14 14:40 ` David Laight
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-14 11:20 UTC (permalink / raw)
To: David Laight
Cc: Josh Poimboeuf, Andrew Cooper, x86@kernel.org, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman,
linuxppc-dev@lists.ozlabs.org, Mark Rutland
On Mon, 14 Oct 2024 at 02:59, David Laight <David.Laight@aculab.com> wrote:
>
> Isn't LAM just plain stupid unless the hardware validates the bits
> against the TLB?
What? No. You can't do that. At some point, the TLB isn't filled, and
then you have to do the access with lots of linear address bits
masked.
> Doesn't ARM64 have the same issue?
Yes. They have different bits that they'd need to test for this. They
have top-byte-ignore, so they can't use the sign bit.
On arm64, the address masking might end up looking something like this
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d31e86ef6377
instead.
Linus
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-13 0:53 ` Linus Torvalds
2024-10-13 1:21 ` Linus Torvalds
@ 2024-10-14 11:56 ` Kirill A. Shutemov
1 sibling, 0 replies; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-14 11:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sat, Oct 12, 2024 at 05:53:19PM -0700, Linus Torvalds wrote:
> On Sat, 12 Oct 2024 at 10:44, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Anyway, what's the speculation window size like?
>
> Note that this is important basically because we do *NOT* want to
> check the address against TASK_SIZE_MAX like we used to, because not
> only is TASK_SIZE_MAX not a compile-time constant, but with linear
> address masking, people actually *want* to use addresses that are in
> the non-canonical range.
>
> IOW, see also
>
> arch/x86/include/asm/uaccess_64.h
>
> and notice how the x86-64 __access_ok() check *also_ does the whole
> "top bit set" thing (iow, see __access_ok()).
>
> IOW, this actually goes even further back than the commit I mentioned
> earlier - it goes back to commit 6014bc27561f ("x86-64: make
> access_ok() independent of LAM") because without the sign bit trick,
> LAM is a complete disaster.
>
> So no, the address masking can not depend on things like
> __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account
> too. Not that I know if there are any CPU's out there that actually
> have LAM enabled.
Actually LAM is fine with the __VIRTUAL_MASK_SHIFT check. LAM enforces bit
47 (or 56 for 5-level paging) to be equal to bit 63. Otherwise it is
a canonicality violation.
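A minimal sketch of that rule for the 4-level case (illustrative, not kernel
code): bits 48-62 may carry LAM tag bits, but bit 47 must still match bit 63.

	static inline int lam_canonical_la48(unsigned long addr)
	{
		return ((addr >> 47) & 1) == (addr >> 63);
	}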
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 6:50 ` Linus Torvalds
2024-10-14 9:59 ` David Laight
2024-10-14 11:12 ` Andrew Cooper
@ 2024-10-14 12:30 ` Kirill A. Shutemov
2024-10-14 15:39 ` Andrew Cooper
2024-10-14 16:55 ` Linus Torvalds
2 siblings, 2 replies; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-14 12:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sun, Oct 13, 2024 at 11:50:55PM -0700, Linus Torvalds wrote:
> Anyway, the attached patch
>
> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index d066aecf8aeb..7d5730aa18b8 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -37,11 +37,17 @@
>
> #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
>
> +#define X86_CANONICAL_MASK ALTERNATIVE \
> + "movq $0x80007fffffffffff,%rdx", \
> + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
> +
> .macro check_range size:req
> .if IS_ENABLED(CONFIG_X86_64)
> mov %rax, %rdx
> sar $63, %rdx
> or %rdx, %rax
> + X86_CANONICAL_MASK
> + and %rdx,%rax
> .else
> cmp $TASK_SIZE_MAX-\size+1, %eax
> jae .Lbad_get_user
Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do
this unconditionally instead of masking:
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index d066aecf8aeb..86d4511520b1 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,9 +37,14 @@
#define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
+#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
+ "shl $(64 - 48), %rdx", \
+ "shl $(64 - 57), %rdx", X86_FEATURE_LA57
+
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
mov %rax, %rdx
+ SHIFT_LEFT_TO_MSB
sar $63, %rdx
or %rdx, %rax
.else
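An illustrative C model of that variant for the 4-level case (shift left by
64 - 48 = 16 so that bit 47 becomes the sign bit before it is saturated):

	static inline unsigned long mask_on_bit47(unsigned long addr)
	{
		long sign = (long)(addr << 16) >> 63;	/* replicate bit 47 */

		return addr | (unsigned long)sign;
	}

	/* bit 47 clear: pointer unchanged; bit 47 set: forced to all-ones */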
--
Kiryl Shutsemau / Kirill A. Shutemov
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 11:20 ` Linus Torvalds
@ 2024-10-14 14:40 ` David Laight
0 siblings, 0 replies; 51+ messages in thread
From: David Laight @ 2024-10-14 14:40 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: Josh Poimboeuf, Andrew Cooper, x86@kernel.org, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman,
linuxppc-dev@lists.ozlabs.org, Mark Rutland
From: Linus Torvalds
> Sent: 14 October 2024 12:21
>
> On Mon, 14 Oct 2024 at 02:59, David Laight <David.Laight@aculab.com> wrote:
> >
> > Isn't LAM just plain stupid unless the hardware validates the bits
> > against the TLB?
>
> What? No. You can't do that. At some point, the TLB isn't filled, and
> then you have to do the access with lots of linear address bits
> masked.
And the 'high' bits get reloaded from the PTE.
(Although I suspect there isn't space for them.)
Reloading the high bits from the VA that caused the fault would
be 'interesting' - certainly better than just ignoring them.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 12:30 ` Kirill A. Shutemov
@ 2024-10-14 15:39 ` Andrew Cooper
2024-10-15 10:00 ` Borislav Petkov
2024-10-20 22:44 ` Josh Poimboeuf
2024-10-14 16:55 ` Linus Torvalds
1 sibling, 2 replies; 51+ messages in thread
From: Andrew Cooper @ 2024-10-14 15:39 UTC (permalink / raw)
To: Kirill A. Shutemov, Linus Torvalds
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On 14/10/2024 1:30 pm, Kirill A. Shutemov wrote:
> On Sun, Oct 13, 2024 at 11:50:55PM -0700, Linus Torvalds wrote:
>> Anyway, the attached patch
>>
>> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
>> index d066aecf8aeb..7d5730aa18b8 100644
>> --- a/arch/x86/lib/getuser.S
>> +++ b/arch/x86/lib/getuser.S
>> @@ -37,11 +37,17 @@
>>
>> #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
>>
>> +#define X86_CANONICAL_MASK ALTERNATIVE \
>> + "movq $0x80007fffffffffff,%rdx", \
>> + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
>> +
>> .macro check_range size:req
>> .if IS_ENABLED(CONFIG_X86_64)
>> mov %rax, %rdx
>> sar $63, %rdx
>> or %rdx, %rax
>> + X86_CANONICAL_MASK
>> + and %rdx,%rax
>> .else
>> cmp $TASK_SIZE_MAX-\size+1, %eax
>> jae .Lbad_get_user
> Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do
> this unconditionally instead of masking:
>
> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index d066aecf8aeb..86d4511520b1 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -37,9 +37,14 @@
>
> #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
>
> +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
> + "shl $(64 - 48), %rdx", \
> + "shl $(64 - 57), %rdx", X86_FEATURE_LA57
> +
> .macro check_range size:req
> .if IS_ENABLED(CONFIG_X86_64)
> mov %rax, %rdx
> + SHIFT_LEFT_TO_MSB
> sar $63, %rdx
> or %rdx, %rax
> .else
That looks like it ought to DTRT in some cases, but I'll definitely ask
AMD for confirmation.
But, I expect it will malfunction on newer hardware when
CONFIG_X86_5LEVEL=n, because it causes Linux to explicitly ignore the
LA57 bit. That can be fixed by changing how CONFIG_X86_5LEVEL works.
I also expect it will malfunction under virt on an LA57-capable system
running a VM in LA48 mode (this time, Linux doesn't get to see the
relevant uarch detail), and I have no good suggestion here.
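(For anyone trying to follow the quoted trick at the C level, here is a
throwaway user-space sketch of the shl/sar/or sequence for the 4-level (LA48)
case. Purely illustrative - the function name is made up, it relies on
arithmetic right shift as gcc/clang implement it, and it is not the proposed
kernel code.)

#include <stdint.h>
#include <stdio.h>

static uint64_t force_canonical_48(uint64_t addr)
{
	int64_t v = (int64_t)(addr << (64 - 48)); /* move bit 47 into the MSB */
	v >>= 63;                                 /* 0 if bit 47 clear, -1 if set */
	return addr | (uint64_t)v;                /* "kernel-ish" -> all ones */
}

int main(void)
{
	/* plausible user pointer: bit 47 clear, comes back unchanged */
	printf("%016llx\n", (unsigned long long)force_canonical_48(0x00007f1234567000ull));
	/* plausible kernel pointer: bit 47 set, comes back as all ones */
	printf("%016llx\n", (unsigned long long)force_canonical_48(0xffff888012345000ull));
	return 0;
}

The LA57 variant just shifts by (64 - 57) so that bit 56 is the bit that gets
replicated instead.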
~Andrew
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 12:30 ` Kirill A. Shutemov
2024-10-14 15:39 ` Andrew Cooper
@ 2024-10-14 16:55 ` Linus Torvalds
2024-10-16 16:10 ` Linus Torvalds
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-14 16:55 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, 14 Oct 2024 at 05:30, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do
> this unconditionally instead of masking:
>
> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index d066aecf8aeb..86d4511520b1 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -37,9 +37,14 @@
>
> #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
>
> +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
> + "shl $(64 - 48), %rdx", \
> + "shl $(64 - 57), %rdx", X86_FEATURE_LA57
That's certainly a lot smaller than the big constant, which is good.
But I would want some AMD architect to actually tell us *which* cores
this affects, and just what the cycle latency of the canonicality
check really is.
For example, the whole point of the address masking is that the access
itself isn't actually conditional, and we want (and expect) the
exception in the invalid address case (which very much includes the
non-canonicality).
So either the attacker has to find an unrelated conditional that
precedes the whole user access sequence (ie not just the CLAC
afterwards, but the STAC and the address masking that precedes the
access), *or* the attacker has to be able to deal with whatever signal
noise comes from taking the GP exception, handling the signal and
returning to user space.
The exception handling will almost certainly have destroyed any signal
in L1 cache, so some flush+reload attack is going to be very very
inconvenient. And we could easily make it much worse by just adding
extra work to the GP exception handling.
And that is all in *addition* to having the sequence of the
non-canonical access followed by the "clac" and then you have to find
a dependent load to leak the data you loaded through the non-canonical
address. And all *that* needs to happen before the non-canonical
address is actually checked.
End result: I don't think this is actually a real scenario in the
first place. I would want some AMD person to really check.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 15:39 ` Andrew Cooper
@ 2024-10-15 10:00 ` Borislav Petkov
2024-10-20 22:44 ` Josh Poimboeuf
1 sibling, 0 replies; 51+ messages in thread
From: Borislav Petkov @ 2024-10-15 10:00 UTC (permalink / raw)
To: Andrew Cooper
Cc: Kirill A. Shutemov, Linus Torvalds, Josh Poimboeuf, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, Oct 14, 2024 at 04:39:26PM +0100, Andrew Cooper wrote:
> But, I expect it will malfunction on newer hardware when
> CONFIG_X86_5LEVEL=n, because it causes Linux to explicitly ignore the
> LA57 bit. That can be fixed by changing how CONFIG_X86_5LEVEL works.
https://lore.kernel.org/all/80734605-1926-4ac7-9c63-006fe3ea6b6a@amd.com/
But there was some lmbench regression...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 16:55 ` Linus Torvalds
@ 2024-10-16 16:10 ` Linus Torvalds
2024-10-16 22:02 ` Andrew Cooper
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-16 16:10 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, 14 Oct 2024 at 09:55, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 14 Oct 2024 at 05:30, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do
> > this unconditionally instead of masking:
> >
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index d066aecf8aeb..86d4511520b1 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -37,9 +37,14 @@
> >
> > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
> >
> > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
> > + "shl $(64 - 48), %rdx", \
> > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57
>
> That's certainly a lot smaller than the big constant, which is good.
Actually, having thought this over, I think we can simplify things
more, and not do any shifts AT ALL.
IOW, the whole "mask non-canonical bits" model I think works on its
own even without the "shift sign bit and or" logic.
So it can replace all the shifting entirely (much less the double
shifting), and we can have something like this:
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,11 +37,14 @@
#define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
+#define X86_CANONICAL_MASK ALTERNATIVE \
+ "movq $0x80007fffffffffff,%rdx", \
+ "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
+
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
- mov %rax, %rdx
- sar $63, %rdx
- or %rdx, %rax
+ X86_CANONICAL_MASK /* mask into %rdx */
+ and %rdx,%rax
.else
cmp $TASK_SIZE_MAX-\size+1, %eax
jae .Lbad_get_user
Yes, it does end up having that big constant (not great), but makes up
for it by avoiding the register move instruction (3 bytes) and two
shift instructions (4 bytes each), so while the "load big constant"
instruction is a big 10-byte instruction, it's actually one byte
shorter than the old code was.
It also avoids a couple of data dependencies, so it's both smaller and simpler.
An invalid address with the high bit set will become non-canonical, so
it will cause an exception as expected.
But it avoids the AMD issue, because while it's non-canonical, it
always clears that bit 48 (or 57), so it will never hit a kernel
TLB/cache tag.
Of course, if some other core then does the sane thing, and uses bit
#63 (instead of #48/57), and still has the pseudo-meltdown behavior,
then the non-canonical address generated by the plain 'and' will end
up having the same old issue.
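(To make the masking concrete, a throwaway user-space sketch of what the LA48
constant above does to a couple of addresses - illustrative only, the example
addresses are invented:)

#include <stdint.h>
#include <stdio.h>

#define CANONICAL_MASK_LA48 0x80007fffffffffffull

int main(void)
{
	uint64_t user   = 0x00007f1234567000ull;  /* plausible user pointer */
	uint64_t kernel = 0xffff888012345000ull;  /* plausible kernel pointer */

	/* user pointers have bits 47-62 clear, so the 'and' leaves them alone */
	printf("%016llx\n", (unsigned long long)(user & CANONICAL_MASK_LA48));

	/*
	 * A kernel pointer keeps bit 63 but loses bits 47-62, so the result is
	 * non-canonical (it faults) and never carries a kernel TLB/cache tag.
	 */
	printf("%016llx\n", (unsigned long long)(kernel & CANONICAL_MASK_LA48));
	return 0;
}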
But in (related) news, the address masking trick really does matter:
https://lore.kernel.org/all/202410161557.5b87225e-oliver.sang@intel.com/
which reports that ridiculous 6.8% improvement just from avoiding the
barrier in user_access_begin().
So that barrier really *is* very expensive. Surprisingly so.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-16 16:10 ` Linus Torvalds
@ 2024-10-16 22:02 ` Andrew Cooper
2024-10-16 22:13 ` Kirill A. Shutemov
2024-10-16 22:32 ` Linus Torvalds
0 siblings, 2 replies; 51+ messages in thread
From: Andrew Cooper @ 2024-10-16 22:02 UTC (permalink / raw)
To: Linus Torvalds, Kirill A. Shutemov
Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland
On 16/10/2024 5:10 pm, Linus Torvalds wrote:
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -37,11 +37,14 @@
>
> #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
>
> +#define X86_CANONICAL_MASK ALTERNATIVE \
> + "movq $0x80007fffffffffff,%rdx", \
> + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
> +
> .macro check_range size:req
> .if IS_ENABLED(CONFIG_X86_64)
> - mov %rax, %rdx
> - sar $63, %rdx
> - or %rdx, %rax
> + X86_CANONICAL_MASK /* mask into %rdx */
> + and %rdx,%rax
That doesn't have the same semantics, does it?
Consider userspace passing an otherwise-good pointer with bit 60 set.
Previously that would have resulted in a failure, whereas now it will
succeed.
If AMD think it's appropriate, then what you probably want is the real
branch as per before (to maintain architectural user behaviour), and
then use a trick such as this one in place of the LFENCE for speed in
the common case.
> But in (related) news, the address masking trick really does matter:
>
> https://lore.kernel.org/all/202410161557.5b87225e-oliver.sang@intel.com/
>
> which reports that ridiculous 6.8% improvement just from avoiding the
> barrier in user_access_begin().
>
> So that barrier really *is* very expensive. Surprisingly so.
7% performance is what it costs to maintain the security barrier we were
sold originally.
Then it's a lot of time, arguing in public, and wild guesswork over
explicitly undocumented properties that the vendors would prefer not to
discuss, to try and claw some of that 7% back while keeping the security
barrier intact for the benefit of users who want it both secure and fast.
Forgive me if I think that we (the SW people) are getting the raw end of
the deal here, while vendors keep selling more and more expensive chips
that don't work safely.
~Andrew
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-16 22:02 ` Andrew Cooper
@ 2024-10-16 22:13 ` Kirill A. Shutemov
2024-10-16 22:34 ` Linus Torvalds
2024-10-16 22:32 ` Linus Torvalds
1 sibling, 1 reply; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-16 22:13 UTC (permalink / raw)
To: Andrew Cooper
Cc: Linus Torvalds, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, Oct 16, 2024 at 11:02:56PM +0100, Andrew Cooper wrote:
> On 16/10/2024 5:10 pm, Linus Torvalds wrote:
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -37,11 +37,14 @@
> >
> > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
> >
> > +#define X86_CANONICAL_MASK ALTERNATIVE \
> > + "movq $0x80007fffffffffff,%rdx", \
> > + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
> > +
> > .macro check_range size:req
> > .if IS_ENABLED(CONFIG_X86_64)
> > - mov %rax, %rdx
> > - sar $63, %rdx
> > - or %rdx, %rax
> > + X86_CANONICAL_MASK /* mask into %rdx */
> > + and %rdx,%rax
>
> That doesn't have the same semantics, does it?
>
> Consider userspace passing an otherwise-good pointer with bit 60 set.
> Previously that would have resulted in a failure, whereas now it will
> succeed.
It is worse than that. If we get LAM_SUP enabled (there's a KASAN patchset
in the works), this check will allow arbitrary kernel addresses.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-16 22:02 ` Andrew Cooper
2024-10-16 22:13 ` Kirill A. Shutemov
@ 2024-10-16 22:32 ` Linus Torvalds
2024-10-17 11:00 ` Borislav Petkov
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-16 22:32 UTC (permalink / raw)
To: Andrew Cooper
Cc: Kirill A. Shutemov, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, 16 Oct 2024 at 15:03, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> That doesn't have the same semantics, does it?
Correct. It just basically makes all positive addresses be force-canonicalized.
> If AMD think it's appropriate, then what you probably want is the real
> branch as per before (to maintain architectural user behaviour), and
> then use a trick such as this one in place of the LFENCE for speed in
> the common case.
The problem with the branch is that it really can only branch on the
sign bit - because of LAM.
So with LAM, those bits are pretty much ignored anyway.
> > So that barrier really *is* very expensive. Surprisingly so.
>
> 7% performance is what it costs to maintain the security barrier we were
> sold originally.
Absolutely. And the masking was something that basically says "we get
it all back" (with "all" being just this part, of course - never mind
all the other workarounds).
> Forgive me if I think that we (the SW people) are getting the raw end of
> the deal here, while vendors keep selling more and more expensive chips
> that don't work safely.
I'm 100% with you.
My preference would actually be to do nothing, on the assumption that
the AMD issue is actually impossible to trigger (due to CLAC/STAC
serializing memory address checks - which the timings certainly imply
they do).
But if we have to do magic bit masking, I'd rather it be *fast* magic
bit masking.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-16 22:13 ` Kirill A. Shutemov
@ 2024-10-16 22:34 ` Linus Torvalds
2024-10-28 11:29 ` Kirill A. Shutemov
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-16 22:34 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, 16 Oct 2024 at 15:13, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> It is worse than that. If we get LAM_SUP enabled (there's a KASAN patchset
> in the works), this check will allow arbitrary kernel addresses.
Ugh. I haven't seen the LAM_SUP patches.
But yeah, I assume any LAM_SUP model would basically then make the GP
fault due to non-canonical addresses go away.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-16 22:32 ` Linus Torvalds
@ 2024-10-17 11:00 ` Borislav Petkov
0 siblings, 0 replies; 51+ messages in thread
From: Borislav Petkov @ 2024-10-17 11:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, Kirill A. Shutemov, Josh Poimboeuf, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, Oct 16, 2024 at 03:32:32PM -0700, Linus Torvalds wrote:
> My preference would actually be to do nothing, on the assumption that
> the AMD issue is actually impossible to trigger (due to CLAC/STAC
> serializing memory address checks - which the timings certainly imply
> they do).
I'm verifying that assumption...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-14 15:39 ` Andrew Cooper
2024-10-15 10:00 ` Borislav Petkov
@ 2024-10-20 22:44 ` Josh Poimboeuf
2024-10-20 22:59 ` Linus Torvalds
1 sibling, 1 reply; 51+ messages in thread
From: Josh Poimboeuf @ 2024-10-20 22:44 UTC (permalink / raw)
To: Andrew Cooper
Cc: Kirill A. Shutemov, Linus Torvalds, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, Oct 14, 2024 at 04:39:26PM +0100, Andrew Cooper wrote:
> On 14/10/2024 1:30 pm, Kirill A. Shutemov wrote:
> > +++ b/arch/x86/lib/getuser.S
> > @@ -37,9 +37,14 @@
> > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
> > + "shl $(64 - 48), %rdx", \
> > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57
> > +
> > .macro check_range size:req
> > .if IS_ENABLED(CONFIG_X86_64)
> > mov %rax, %rdx
> > + SHIFT_LEFT_TO_MSB
> > sar $63, %rdx
> > or %rdx, %rax
> > .else
>
> That looks like it ought to DTRT in some cases, but I'll definitely ask
> AMD for confirmation.
>
> But, I expect it will malfunction on newer hardware when
> CONFIG_X86_5LEVEL=n, because it causes Linux to explicitly ignore the
> LA57 bit. That can be fixed by changing how CONFIG_X86_5LEVEL works.
>
> I also expect it will malfunction under virt on an LA57-capable system
> running a VM in LA48 mode (this time, Linux doesn't get to see the
> relevant uarch detail), and I have no good suggestion here.
BTW the paper [1] nonchalantly mentions:
"All Intel CPUs that are vulnerable to MDS attacks inherently have the
same flaw described here."
Anyway, I'd really like to make forward progress on getting rid of the
LFENCEs in copy_from_user() and __get_user(), so until if/when we hear
back from both vendors, how about we avoid noncanonical exceptions
altogether (along with the edge cases mentioned above) and do something
like the below?
Sure, it could maybe be optimized by a few bytes if we were given more
concrete recommendations, but that can be done later if/when that
happens. In the meantime we'd have no canonical exception worries and
can use a similar strategy to get rid of the LFENCEs.
[1] https://arxiv.org/ftp/arxiv/papers/2108/2108.10771.pdf
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 112e88ebd07d..dfc6881eb785 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -41,12 +41,21 @@
"shl $(64 - 57), %rdx", X86_FEATURE_LA57, \
"", ALT_NOT(X86_BUG_CANONICAL)
#ifdef CONFIG_X86_5LEVEL
#define LOAD_TASK_SIZE_MINUS_N(n) \
ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
__stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
#else
#define LOAD_TASK_SIZE_MINUS_N(n) \
mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
#endif
.macro check_range size
.if IS_ENABLED(CONFIG_X86_64)
/* If above TASK_SIZE_MAX, convert to all 1's */
LOAD_TASK_SIZE_MINUS_N(size - 1)
cmp %rax, %rdx
sbb %rdx, %rdx
or %rdx, %rax
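(A rough C model of what that check_range does, with the 4-level constants and
invented names - illustrative only, not the actual kernel code:)

#include <stdint.h>
#include <stdio.h>

#define FAKE_TASK_SIZE_MAX ((1ull << 47) - 4096)

static uint64_t clamp_user_ptr(uint64_t addr, uint64_t size)
{
	uint64_t limit = FAKE_TASK_SIZE_MAX - (size - 1);

	/* 'cmp %rax,%rdx; sbb %rdx,%rdx' yields all ones iff addr > limit */
	uint64_t mask = (addr > limit) ? ~0ull : 0;

	return addr | mask;  /* anything above the limit becomes all ones */
}

int main(void)
{
	/* in-range user pointer: unchanged */
	printf("%016llx\n", (unsigned long long)clamp_user_ptr(0x00007f1234567000ull, 8));
	/* kernel pointer: forced to all ones, so no non-canonical value is ever formed */
	printf("%016llx\n", (unsigned long long)clamp_user_ptr(0xffff888012345000ull, 8));
	return 0;
}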
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-20 22:44 ` Josh Poimboeuf
@ 2024-10-20 22:59 ` Linus Torvalds
2024-10-20 23:11 ` Josh Poimboeuf
2024-10-21 10:48 ` Kirill A. Shutemov
0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-20 22:59 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Anyway, I'd really like to make forward progress on getting rid of the
> LFENCEs in copy_from_user() and __get_user(), so until if/when we hear
> back from both vendors, how about we avoid noncanonical exceptions
> altogether (along with the edge cases mentioned above) and do something
> like the below?
That doesn't work for LAM at _all_.
So at a minimum, you need to then say "for LAM enabled CPU's we do the
'shift sign bit' trick".
Hopefully any LAM-capable CPU doesn't have this issue?
And I still think that clac/stac has to serialize with surrounding
memory operations, making this all moot.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-20 22:59 ` Linus Torvalds
@ 2024-10-20 23:11 ` Josh Poimboeuf
2024-10-20 23:14 ` Josh Poimboeuf
2024-10-20 23:35 ` Linus Torvalds
2024-10-21 10:48 ` Kirill A. Shutemov
1 sibling, 2 replies; 51+ messages in thread
From: Josh Poimboeuf @ 2024-10-20 23:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > Anyway, I'd really like to make forward progress on getting rid of the
> > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear
> > back from both vendors, how about we avoid noncanonical exceptions
> > altogether (along with the edge cases mentioned above) and do something
> > like the below?
>
> That doesn't work for LAM at _all_.
Argh, ok.
> So at a minimum, you need to then say "for LAM enabled CPU's we do the
> 'shift sign bit' trick".
Something like below to wipe out the LAM bits beforehand?
I'm probably overlooking something else as there are a lot of annoying
details here...
> Hopefully any LAM-capable CPU doesn't have this issue?
>
> And I still think that clac/stac has to serialize with surrounding
> memory operations, making this all moot.
Until it's s/think/know/ can we please put something in place?
#define FORCE_CANONICAL \
ALTERNATIVE_2 \
"shl $(64 - 48), %rdx", \
"shl $(64 - 57), %rdx", X86_FEATURE_LA57, \
"", ALT_NOT(X86_FEATURE_LAM)
#ifdef CONFIG_X86_5LEVEL
#define LOAD_TASK_SIZE_MINUS_N(n) \
ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
__stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
#else
#define LOAD_TASK_SIZE_MINUS_N(n) \
mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
#endif
.macro check_range size
.if IS_ENABLED(CONFIG_X86_64)
FORCE_CANONICAL
/* If above TASK_SIZE_MAX, convert to all 1's */
LOAD_TASK_SIZE_MINUS_N(size-1)
cmp %rax, %rdx
sbb %rdx, %rdx
or %rdx, %rax
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-20 23:11 ` Josh Poimboeuf
@ 2024-10-20 23:14 ` Josh Poimboeuf
2024-10-20 23:35 ` Linus Torvalds
1 sibling, 0 replies; 51+ messages in thread
From: Josh Poimboeuf @ 2024-10-20 23:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sun, Oct 20, 2024 at 04:11:25PM -0700, Josh Poimboeuf wrote:
> #define FORCE_CANONICAL \
> ALTERNATIVE_2 \
> "shl $(64 - 48), %rdx", \
> "shl $(64 - 57), %rdx", X86_FEATURE_LA57, \
^^^^
these should be rax
> "", ALT_NOT(X86_FEATURE_LAM)
>
> #ifdef CONFIG_X86_5LEVEL
> #define LOAD_TASK_SIZE_MINUS_N(n) \
> ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
> __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
> #else
> #define LOAD_TASK_SIZE_MINUS_N(n) \
> mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> #endif
>
> .macro check_range size
> .if IS_ENABLED(CONFIG_X86_64)
> FORCE_CANONICAL
> /* If above TASK_SIZE_MAX, convert to all 1's */
> LOAD_TASK_SIZE_MINUS_N(size-1)
> cmp %rax, %rdx
> sbb %rdx, %rdx
> or %rdx, %rax
--
Josh
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-20 23:11 ` Josh Poimboeuf
2024-10-20 23:14 ` Josh Poimboeuf
@ 2024-10-20 23:35 ` Linus Torvalds
2024-10-23 9:44 ` Borislav Petkov
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-20 23:35 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sun, 20 Oct 2024 at 16:11, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Until it's s/think/know/ can we please put something in place?
Honestly, I'm pretty damn fed up with buggy hardware and completely
theoretical attacks that have never actually shown themselves to be
used in practice.
So I think this time we push back on the hardware people and tell them
it's *THEIR* damn problem, and if they can't even be bothered to say
yay-or-nay, we just sit tight.
Because dammit, let's put the onus on where the blame lies, and not
just take any random shit from bad hardware and say "oh, but it
*might* be a problem".
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-20 22:59 ` Linus Torvalds
2024-10-20 23:11 ` Josh Poimboeuf
@ 2024-10-21 10:48 ` Kirill A. Shutemov
2024-10-22 2:36 ` Linus Torvalds
2024-10-22 8:16 ` Pawan Gupta
1 sibling, 2 replies; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-21 10:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland, Alexander Shishkin
On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > Anyway, I'd really like to make forward progress on getting rid of the
> > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear
> > back from both vendors, how about we avoid noncanonical exceptions
> > altogether (along with the edge cases mentioned above) and do something
> > like the below?
>
> That doesn't work for LAM at _all_.
>
> So at a minimum, you need to then say "for LAM enabled CPU's we do the
> 'shift sign bit' trick".
>
> Hopefully any LAM-capable CPU doesn't have this issue?
LAM brings own speculation issues[1] that is going to be addressed by
LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it
never got applied for some reason.
[1] https://download.vusec.net/papers/slam_sp24.pdf
[2] https://lore.kernel.org/all/20240710160655.3402786-1-alexander.shishkin@linux.intel.com
[3] https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gupta@linux.intel.com
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-21 10:48 ` Kirill A. Shutemov
@ 2024-10-22 2:36 ` Linus Torvalds
2024-10-22 10:33 ` Kirill A. Shutemov
2024-10-22 8:16 ` Pawan Gupta
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-22 2:36 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland, Alexander Shishkin
On Mon, 21 Oct 2024 at 03:48, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> LAM brings own speculation issues[1] that is going to be addressed by
> LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it
> never got applied for some reason.
Bah. I think the reason was that nobody knew what hardware - if any -
actually had it.
I did find an Intel "Product Brief" for "Intel® Xeon® 6 processors
with E-cores" that claim both LASS and LAM, but that one puts LAM as a
"security feature", so that seems a bit confused. That's Redwood Cove
cores, I think.
I guess that should mean that Meteor Lake ("Core Ultra") should also
have it? Or is it then turned off because the E cores don't do it?
I do find LASS - but not LAM - mentioned in the "Intel Core Ultra (PS
series)" datasheet.
And there's an Intel Core Ultra 200V datasheet (1 of 2) that does
mention both LAM and LASS. So I guess now it *really* exists.
Can somebody confirm?
But yeah, I guess we should just do this:
> [3] https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gupta@linux.intel.com
for now.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-21 10:48 ` Kirill A. Shutemov
2024-10-22 2:36 ` Linus Torvalds
@ 2024-10-22 8:16 ` Pawan Gupta
2024-10-22 10:44 ` Kirill A. Shutemov
1 sibling, 1 reply; 51+ messages in thread
From: Pawan Gupta @ 2024-10-22 8:16 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Linus Torvalds, Josh Poimboeuf, Andrew Cooper, x86,
Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland, Alexander Shishkin
On Mon, Oct 21, 2024 at 01:48:15PM +0300, Kirill A. Shutemov wrote:
> On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote:
> > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > >
> > > Anyway, I'd really like to make forward progress on getting rid of the
> > > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear
> > > back from both vendors, how about we avoid noncanonical exceptions
> > > altogether (along with the edge cases mentioned above) and do something
> > > like the below?
> >
> > That doesn't work for LAM at _all_.
> >
> > So at a minimum, you need to then say "for LAM enabled CPU's we do the
> > 'shift sign bit' trick".
> >
> > Hopefully any LAM-capable CPU doesn't have this issue?
>
> LAM brings own speculation issues[1] that is going to be addressed by
> LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it
> never got applied for some reason.
As I understand it, the issue with LAM is that it allows speculative
translation of non-canonical user addresses. It does not allow data fetches
from those addresses. The SLAM attack uses LAM to form a TLB-based disclosure
gadget. In essence, SLAM needs to chain another speculative vulnerability to
form a successful attack. If the data accessed by a chained speculative
vulnerability is interpreted as a pointer, LAM may allow the translation of
a non-canonical user address. This is especially true for ASCII characters,
which have the MSB masked off.
I don't really know how this could be a problem for something like
copy_from_user() that ensures the canonicality of bit 63.
AFAIK, Sierra Forest, Lunar Lake and Arrow Lake support LAM. To be on safe
side, it is better to keep LAM disabled until LASS arrives.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-22 2:36 ` Linus Torvalds
@ 2024-10-22 10:33 ` Kirill A. Shutemov
0 siblings, 0 replies; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-22 10:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland, Alexander Shishkin
On Mon, Oct 21, 2024 at 07:36:50PM -0700, Linus Torvalds wrote:
> On Mon, 21 Oct 2024 at 03:48, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > LAM brings own speculation issues[1] that is going to be addressed by
> > LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it
> > never got applied for some reason.
>
> Bah., I think the reason was that nobody knew what - if any - any hardware.
>
> I did find an Intel "Product Brief" for "Intel® Xeon® 6 processors
> with E-cores" that claim both LASS and LAM, but that one puts LAM as a
> "security feature", so that seems a bit confused. That's Redwood Cove
> cores, I think.
>
> I guess that should mean that Meteor Lake ("Core Ultra") should also
> have it? Or is it then turned off because the E cores don't do it?
No, Meteor Lake supports neither LAM nor LASS.
So far, according to the SDM, it is supported by Sierra Forest (E-core based
Xeons) and Lunar Lake.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-22 8:16 ` Pawan Gupta
@ 2024-10-22 10:44 ` Kirill A. Shutemov
0 siblings, 0 replies; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-22 10:44 UTC (permalink / raw)
To: Pawan Gupta
Cc: Linus Torvalds, Josh Poimboeuf, Andrew Cooper, x86,
Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland, Alexander Shishkin
On Tue, Oct 22, 2024 at 01:16:58AM -0700, Pawan Gupta wrote:
> On Mon, Oct 21, 2024 at 01:48:15PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote:
> > > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > >
> > > > Anyway, I'd really like to make forward progress on getting rid of the
> > > > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear
> > > > back from both vendors, how about we avoid noncanonical exceptions
> > > > altogether (along with the edge cases mentioned above) and do something
> > > > like the below?
> > >
> > > That doesn't work for LAM at _all_.
> > >
> > > So at a minimum, you need to then say "for LAM enabled CPU's we do the
> > > 'shift sign bit' trick".
> > >
> > > Hopefully any LAM-capable CPU doesn't have this issue?
> >
> > LAM brings own speculation issues[1] that is going to be addressed by
> > LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it
> > never got applied for some reason.
>
> As I understand it, the issue with LAM is that it allows speculative
> translation of non-canonical user addresses. It does not allow data fetches
> from those addresses. The SLAM attack uses LAM to form a TLB-based disclosure
> gadget. In essence, SLAM needs to chain another speculative vulnerability to
> form a successful attack. If the data accessed by a chained speculative
> vulnerability is interpreted as a pointer, LAM may allow the translation of
> a non-canonical user address. This is especially true for ASCII characters,
> which have the MSB masked off.
Right.
> I don't really know how this could be a problem for something like
> copy_from_user() that ensures the canonicality of bit 63.
It doesn't affect copy_from/to_user() and put/get_user(), but it allows
other kernel code (outside a STAC/CLAC block) to speculatively
translate addresses and reveal secrets that way.
> AFAIK, Sierra Forest, Lunar Lake and Arrow Lake support LAM. To be on safe
> side, it is better to keep LAM disabled until LASS arrives.
Dave has applied it to tip/urgent. I expect it to be merged soon.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-20 23:35 ` Linus Torvalds
@ 2024-10-23 9:44 ` Borislav Petkov
2024-10-23 19:17 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2024-10-23 9:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Sun, Oct 20, 2024 at 04:35:21PM -0700, Linus Torvalds wrote:
> So I think this time we push back on the hardware people and tell them
> it's *THEIR* damn problem, and if they can't even be bothered to say
> yay-or-nay, we just sit tight.
So I was able to get some info: CLAC/STAC in everything Zen4 and older is
serializing so there's no issue there.
Zen5 is a different story and AC is being renamed there and thus,
non-serializing. So we'd need to do something there, perhaps something
like Josh's patch...
But the good thing is we can alternative it out only for those machines, apart
from the rest.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-23 9:44 ` Borislav Petkov
@ 2024-10-23 19:17 ` Linus Torvalds
2024-10-23 20:07 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-23 19:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
[-- Attachment #1: Type: text/plain, Size: 2591 bytes --]
On Wed, 23 Oct 2024 at 02:45, Borislav Petkov <bp@alien8.de> wrote:
>
> So I was able to get some info: CLAC/STAC in everything Zen4 and older is
> serializing so there's no issue there.
>
> Zen5 is a different story and AC is being renamed there and thus,
> non-serializing. So we'd need to do something there, perhaps something
> like Josh's patch...
>
> But the good thing is we can alternative it out only for those machines, apart
> from the rest.
Hmm. The problem with alternatives is that it gets increasingly ugly
when there are more than two of them.
And even doing a single alternative with some "don't do this unless X"
then means either extra no-ops or some horrid static branch.
Now, if we ignore LAM, we only have two cases (LA48 vs LA57), but we
know that even if we disable LAM until we do the LASS support, that
third case *will* happen.
Finally - the main reason to use the sign bit is that you don't need
some big constant - and the code is small, so the "shift right by 63
and or" is just two instructions and a couple of bytes.
But once you have an alternative *with* a big constant, that advantage
just goes away because you'll just replace the size the constant takes
up with no-ops instead.
So then you might as well just _always_ use a constant.
Which makes me wonder if we can't just use the runtime-const
infrastructure instead. That has the advantage that you can have any
number of different constants and it won't complicate the usage sites,
and you can easily just update the constant at init time with
runtime_const_init() to an arbitrary value.
So then LAM support becomes just updating the value for
runtime_const_init() - none of the code changes, and no different
alternatives cases.
Something like the attached, in other words. Note that this actually does that
+ if (cpu_feature_enabled(X86_FEATURE_LAM))
+ USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
partly to show it as docs, and partly because we should just clear
that X86_FEATURE_LAM bit if LASS isn't set, so I think it's right
anyway.
NOTE! This is obviously untested and I didn't check that it does the
cmp/sbb/or the right way around.
Also, it's worth noting that the cmp/sbb/or sequence (if I *did* get
it the right way around) will set all bits only if the original
address is *larger* than USER_PTR_MAX, but we already subtract
PAGE_SIZE from the actual hardware maximum due to other errata, so
that's fine.
IOW, it will turn USER_PTR_MAX and anything less into itself, but
USER_PTR_MAX+1 (and above) becomes all ones.
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3010 bytes --]
arch/x86/include/asm/uaccess_64.h | 17 ++++++++++++++++-
arch/x86/kernel/cpu/common.c | 6 ++++++
arch/x86/kernel/vmlinux.lds.S | 1 +
arch/x86/lib/getuser.S | 9 +++++++--
4 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index afce8ee5d7b7..df1468ada3ed 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -12,6 +12,13 @@
#include <asm/cpufeatures.h>
#include <asm/page.h>
#include <asm/percpu.h>
+#include <asm/runtime-const.h>
+
+/*
+ * Virtual variable: there's no actual backing store for this,
+ * it can purely be used as 'runtime_const_ptr(USER_PTR_MAX)'
+ */
+extern unsigned long USER_PTR_MAX;
#ifdef CONFIG_ADDRESS_MASKING
/*
@@ -58,7 +65,15 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
* user_access_begin that can avoid the fencing. This only works
* for dense accesses starting at the address.
*/
-#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63)))
+static inline void __user *mask_user_address(const void __user *ptr)
+{
+ void __user *ret;
+ asm("cmp %1,%0; sbb %k0,%k0; or %1,%0"
+ :"=r" (ret)
+ :"r" (ptr),
+ "0" (runtime_const_ptr(USER_PTR_MAX)));
+ return ret;
+}
#define masked_user_access_begin(x) ({ \
__auto_type __masked_ptr = (x); \
__masked_ptr = mask_user_address(__masked_ptr); \
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f1040cb64841..9d419dc1df12 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -69,6 +69,7 @@
#include <asm/sev.h>
#include <asm/tdx.h>
#include <asm/posted_intr.h>
+#include <asm/runtime-const.h>
#include "cpu.h"
@@ -2389,6 +2390,11 @@ void __init arch_cpu_finalize_init(void)
alternative_instructions();
if (IS_ENABLED(CONFIG_X86_64)) {
+ unsigned long USER_PTR_MAX = TASK_SIZE_MAX;
+
+ if (cpu_feature_enabled(X86_FEATURE_LAM))
+ USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
+ runtime_const_init(ptr, USER_PTR_MAX);
/*
* Make sure the first 2MB area is not mapped by huge pages
* There are typically fixed size MTRRs in there and overlapping
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 6726be89b7a6..b8c5741d2fb4 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -358,6 +358,7 @@ SECTIONS
#endif
RUNTIME_CONST_VARIABLES
+ RUNTIME_CONST(ptr, USER_PTR_MAX)
. = ALIGN(PAGE_SIZE);
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index d066aecf8aeb..202d68e3fea0 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -39,8 +39,13 @@
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
- mov %rax, %rdx
- sar $63, %rdx
+ movq $0x0123456789abcdef,%rdx
+ 1:
+ .pushsection runtime_ptr_USER_PTR_MAX,"a"
+ .long 1b - 8 - .
+ .popsection
+ cmp %rax, %rdx
+ sbb %edx, %edx
or %rdx, %rax
.else
cmp $TASK_SIZE_MAX-\size+1, %eax
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-23 19:17 ` Linus Torvalds
@ 2024-10-23 20:07 ` Linus Torvalds
2024-10-23 23:32 ` Linus Torvalds
2024-10-24 9:21 ` David Laight
0 siblings, 2 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-23 20:07 UTC (permalink / raw)
To: Borislav Petkov
Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, 23 Oct 2024 at 12:17, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> NOTE! This is obviously untested and I didn't check that it does the
> cmp/sbb/or the right way around.
Well, it boots. The code generation (from strncpy_from_user()) seems ok:
movabs $0x123456789abcdef,%rcx
cmp %rsi,%rcx
sbb %ecx,%ecx
or %rsi,%rcx
where obviously that constant is the bogus pre-initialized value, not
the actual runtime value.
Which is not to say that the patch might not be horribly buggy anyway.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-23 20:07 ` Linus Torvalds
@ 2024-10-23 23:32 ` Linus Torvalds
2024-10-24 2:00 ` Linus Torvalds
2024-10-24 9:21 ` David Laight
1 sibling, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-23 23:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
[-- Attachment #1: Type: text/plain, Size: 1092 bytes --]
On Wed, 23 Oct 2024 at 13:07, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Well, it boots. The code generation (from strncpy_from_user()) seems ok:
Actually, doing some more sanity checking, that patch is wrong.
Not *badly* wrong, but for some reason I did the "sbb" in 32-bit
(quite intentionally, but it's very wrong: I for some reason mentally
went "32-bit sign-extends to 64-bit")
I'd blame the fact that some of the very earliest x86-64 specs did
indeed do exactly that, but the reality is that it was just a
brainfart.
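(The underlying rule, for anyone puzzled: writing a 32-bit register on x86-64
zero-extends into the full 64-bit register, so 'sbb %ecx,%ecx' can only ever
produce 0 or 0x00000000ffffffff, never a full 64-bit all-ones mask; you need
'sbb %rcx,%rcx' for that. A tiny illustration in GNU C inline asm, purely for
demonstration:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t m32, m64;

	/* set CF, then sbb reg,reg == 0 - CF */
	asm("stc; sbb %k0, %k0" : "=r" (m32) : : "cc"); /* 32-bit op: upper half zeroed */
	asm("stc; sbb %q0, %q0" : "=r" (m64) : : "cc"); /* 64-bit op: all ones */

	printf("%016llx\n%016llx\n",
	       (unsigned long long)m32, (unsigned long long)m64);
	return 0;
}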
Anyway, the attached patch seems to actually _really_ work, and DTRT.
But considering that I created a 32-bit mask there for a while, maybe
somebody else should actually verify.
And I guess I should make "__put_user()" do the same thing, just so
that we only have one sequence. It probably doesn't matter for
put_user(), since there's no data leak coming out of it, but if
nothing else, avoiding non-canonical accesses from the kernel for any
non-LAM/LASS setup is probably just a good thing once we have this
logic.
Hmm?
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3011 bytes --]
arch/x86/include/asm/uaccess_64.h | 17 ++++++++++++++++-
arch/x86/kernel/cpu/common.c | 7 +++++++
arch/x86/kernel/vmlinux.lds.S | 1 +
arch/x86/lib/getuser.S | 9 +++++++--
4 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index afce8ee5d7b7..52e1a549b98a 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -12,6 +12,13 @@
#include <asm/cpufeatures.h>
#include <asm/page.h>
#include <asm/percpu.h>
+#include <asm/runtime-const.h>
+
+/*
+ * Virtual variable: there's no actual backing store for this,
+ * it can purely be used as 'runtime_const_ptr(USER_PTR_MAX)'
+ */
+extern unsigned long USER_PTR_MAX;
#ifdef CONFIG_ADDRESS_MASKING
/*
@@ -58,7 +65,15 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
* user_access_begin that can avoid the fencing. This only works
* for dense accesses starting at the address.
*/
-#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63)))
+static inline void __user *mask_user_address(const void __user *ptr)
+{
+ void __user *ret;
+ asm("cmp %1,%0; sbb %0,%0; or %1,%0"
+ :"=r" (ret)
+ :"r" (ptr),
+ "0" (runtime_const_ptr(USER_PTR_MAX)));
+ return ret;
+}
#define masked_user_access_begin(x) ({ \
__auto_type __masked_ptr = (x); \
__masked_ptr = mask_user_address(__masked_ptr); \
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f1040cb64841..8ee6579f694b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -69,6 +69,7 @@
#include <asm/sev.h>
#include <asm/tdx.h>
#include <asm/posted_intr.h>
+#include <asm/runtime-const.h>
#include "cpu.h"
@@ -2389,6 +2390,12 @@ void __init arch_cpu_finalize_init(void)
alternative_instructions();
if (IS_ENABLED(CONFIG_X86_64)) {
+ unsigned long USER_PTR_MAX = TASK_SIZE_MAX;
+
+ if (cpu_feature_enabled(X86_FEATURE_LAM))
+ USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
+ runtime_const_init(ptr, USER_PTR_MAX);
+
/*
* Make sure the first 2MB area is not mapped by huge pages
* There are typically fixed size MTRRs in there and overlapping
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 6726be89b7a6..b8c5741d2fb4 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -358,6 +358,7 @@ SECTIONS
#endif
RUNTIME_CONST_VARIABLES
+ RUNTIME_CONST(ptr, USER_PTR_MAX)
. = ALIGN(PAGE_SIZE);
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index d066aecf8aeb..4357ec2a0bfc 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -39,8 +39,13 @@
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
- mov %rax, %rdx
- sar $63, %rdx
+ movq $0x0123456789abcdef,%rdx
+ 1:
+ .pushsection runtime_ptr_USER_PTR_MAX,"a"
+ .long 1b - 8 - .
+ .popsection
+ cmp %rax, %rdx
+ sbb %rdx, %rdx
or %rdx, %rax
.else
cmp $TASK_SIZE_MAX-\size+1, %eax
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-23 23:32 ` Linus Torvalds
@ 2024-10-24 2:00 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-24 2:00 UTC (permalink / raw)
To: Borislav Petkov
Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, 23 Oct 2024 at 16:32, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I guess I should make "__put_user()" do the same thing, just so
> that we only have one sequence.
No, I decided it's not worth it. The put_user side already also
doesn't do any other speculation barriers, simply because it has no
speculative outputs that could then be used in some gadget to leak
anything.
I did extend the USER_ADDR_MAX logic to valid_user_address(), and I
wrote a commit log. And sent out what I *think* is a good patch to
lkml and the x86 maintainers:
https://lore.kernel.org/all/20241024013214.129639-1-torvalds@linux-foundation.org/
I'm not super-happy with the open-coded magic runtime section stuff in
getuser.S, but with no other asm users I also didn't want to randomly
pollute some header file with ugly asm-specific macros that only get
used in one place.
Also, I left the LAM case in, but disabled in a comment about how it
should be gated by LASS. So that code isn't actually enabled right
now.
Does anybody see any issues with that patch? It's not that many actual
lines of code, and I've been staring at it pretty much all day today
(in case anybody wondered why no pull requests), but I've been staring
at it so much that I'm patch-blind by now.
I've also looked at the generated code. You can look at the asm
output, of course, but that ends up being pretty messy due to the
fixup hackery. I've been doing
objdump --disassemble --no-addresses --no-show-raw-insn vmlinux
and you can see where this gets used by searching for
"0x123456789abcdef" in the objdumpo disassembly. That's the runtime
constant that gets rewritten. Obviously some of them are for another
runtime constant (ie dcache_hash), but it's pretty obvious.
The code generation seems ok, but like the patch, I'm getting
code-blind from having looked at the same thing too many times. Yes,
it looked better when it only used the sign bit, but oh well..
And yes, I'm running that code now, and I did a few tests with system
calls with invalid addresses and some debug output. Which is still not
saying "it has no bugs", but at least any bugs aren't obvious to me.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-23 20:07 ` Linus Torvalds
2024-10-23 23:32 ` Linus Torvalds
@ 2024-10-24 9:21 ` David Laight
2024-10-24 16:53 ` Linus Torvalds
1 sibling, 1 reply; 51+ messages in thread
From: David Laight @ 2024-10-24 9:21 UTC (permalink / raw)
To: 'Linus Torvalds', Borislav Petkov
Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86@kernel.org,
Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman,
linuxppc-dev@lists.ozlabs.org, Mark Rutland
From: Linus Torvalds
> Sent: 23 October 2024 21:08
>
> On Wed, 23 Oct 2024 at 12:17, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > NOTE! This is obviously untested and I didn't check that it does the
> > cmp/sbb/or the right way around.
>
> Well, it boots. The code generation (from strncpy_from_user()) seems ok:
>
> movabs $0x123456789abcdef,%rcx
> cmp %rsi,%rcx
> sbb %ecx,%ecx
> or %rsi,%rcx
>
> where obviously that constant is the bogus pre-initialized value, not
> the actual runtime value.
Would it be better to make the 'bogus' constant one that makes
all accesses fail?
So you soon find out if any code doesn't get patched.
I also wonder how big the table of addresses to patch is.
If that gets into inlined functions it could be big.
OTOH having a real function that does access_ok(), clac and address
masking may not be a problem.
Especially if there is always a (PAGE sized) gap between the highest
user address and the lowest kernel address so the 'size' argument
to access_ok() can be ignored on the assumption that the accesses
are (reasonably) linear.
David
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-24 9:21 ` David Laight
@ 2024-10-24 16:53 ` Linus Torvalds
2024-10-25 8:56 ` David Laight
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-24 16:53 UTC (permalink / raw)
To: David Laight
Cc: Borislav Petkov, Josh Poimboeuf, Andrew Cooper,
Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org,
Mark Rutland
On Thu, 24 Oct 2024 at 02:22, David Laight <David.Laight@aculab.com> wrote:
>
> Would it be better to make the 'bogus' constant one that makes
> all accesses fail?
Well, the bogus constant is actually used for other runtime cases too
(well, right now that's only the dentry pointer, but could be more),
it's not some kind of "just for the max user address".
And for things like profiling, where the tools use the original object
code, making the bogus constant really obvious is actually helpful.
So I do think that "standing out" is more important than having some
value that has semantic meaning, when that value is going to be
overwritten at boot time.
> So you soon find out if any code doesn't get patched.
Yes, that would be nice, but if the simple patching logic is broken,
you have worse problems. The patching logic really is just a couple of
lines of code, and the location where we set this particular value is
literally next to the place we do all our other alternatives, so if
there's something wrong there, you basically have a broken system.
> I also wonder how big the table of addresses to patch is.
> If that gets into inlined functions it could be big.
It's 4 bytes per location, so yes, it grows linearly by use - but not very much.
And the patch table is in the init section that gets discarded after
boot, along with all the other things like the altinstructions and
things like the return sites for retpolines. Which are *much* bigger
and more common.
So yes, there's a table, and yes it grows, but at least in my personal
config, the USER_PTR_MAX table is 424 bytes - and we free it after
boot. 90% of that comes from 'access_ok()' sprinkled around and
inlined.
Just as a comparison, the altinstruction tables (both the pointers and
the actual replacement instructions) is 25kB in that config. Also a
drop in the ocean, and also something that gets free'd after init.
> OTOH having a real function that does access_ok(), clac and address
> masking may not problem.
access_ok() itself is so rarely used these days that we could out-line
it. But the code cost of a function call is likely higher than
inlining the 8-byte constant and a couple of instructions: not because
the call instruction itself, but because of the code generation pain
around it (register save/restore etc).
IOW, the call instruction is just five bytes, but it will typically
cause spills and forced register choices for argument and return
value. It will obviously depend a lot on the context of the function
call, but to save 4 bytes for the table that gets freed, and have the
8-byte constant only once? That's a very false economy.
Eg an example code sequence (from a random place in the kernel that I
looked at with objdump is
movabs $0x123456789abcdef,%rax
cmp %rsi,%rax
jb <__x64_sys_rt_sigreturn+0x20e>
and that's 19 bytes. But I tried it with a function call, and that
same function grew from 1320 bytes to 1324 bytes.
So the function call version generated 4 bytes bigger code. I didn't
figure out why, because register allocation was so different, but I do
think it's exactly that: function calls cause limits on register use.
So even if we didn't free the 4 byte entry after init, making
access_ok() a function call would actually not be a win.
And with how slow 'ret' instructions can be, we really don't want
small function calls.
In fact, it really makes me wonder if we should inline 'get_user()' entirely.
> Especially if there is always a (PAGE sized) gap between the highest
> user address and the lowest kernel address so the 'size' argument
> to access_ok() can be ignored on the assumption that the accesses
> are (reasonably) linear.
Yes, that's what we do right now for the inline code generation anyway.
(Ok, we actually do take the 'size' value into account when it's not
constant, or when it's huge, but even that is just out of a silly
abundance of caution and not a very common case).
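(To make that assumption concrete, a toy user-space model - the names and the
constant are invented, and it only illustrates why the length can usually be
left out of the check:)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FAKE_USER_PTR_MAX ((1ull << 47) - 4096) /* highest allowed user address */

static bool fake_access_ok(uint64_t ptr)
{
	/*
	 * Only the start pointer is checked: an access that starts at or below
	 * the limit and overruns walks into the unmapped guard area (and then
	 * the non-canonical hole) and faults before it can reach kernel
	 * addresses, as long as the access is reasonably linear.
	 */
	return ptr <= FAKE_USER_PTR_MAX;
}

int main(void)
{
	printf("%d %d\n", fake_access_ok(0x00007f1234567000ull),
			  fake_access_ok(0xffff888012345000ull));
	return 0;
}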
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-24 16:53 ` Linus Torvalds
@ 2024-10-25 8:56 ` David Laight
2024-10-25 16:35 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: David Laight @ 2024-10-25 8:56 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: Borislav Petkov, Josh Poimboeuf, Andrew Cooper,
Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org,
Mark Rutland
...
> access_ok() itself is so rarely used these days that we could out-line
> it. But the code cost of a function call is likely higher than
> inlining the 8-byte constant and a couple of instructions: not because of
> the call instruction itself, but because of the code generation pain
> around it (register save/restore etc).
I was thinking it might need a non-standard call sequence so that
it could return yes/no and mask the address.
I did once wonder whether it could just mask the address, converting
kernel addresses into ones that are guaranteed to fault later.
Annoyingly I don't think that NULL can be used; maybe ~0 faults
for byte accesses - or at worst reads back a very boring byte.
Certainly loading the TLB/cache line for ~0 isn't going to be useful.
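(Branchless sketch of that, roughly what the existing cmp/sbb masking already
produces; 'user_max' here just stands in for the runtime constant:)

#include <stdint.h>

/* Anything above user_max becomes all-ones, which is non-canonical and
 * faults if ever dereferenced.  Illustrative only. */
static inline uint64_t mask_or_poison(uint64_t ptr, uint64_t user_max)
{
        uint64_t mask = 0 - (uint64_t)(ptr > user_max);   /* ~0 or 0 */

        return ptr | mask;
}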
...
> And with how slow 'ret' instructions can be, we really don't want
> small function calls.
Can that matter for leaf functions?
IIRC the issue is return stack underflow causing indirect branch
prediction to be used - probably worse than just using the wrapped
return stack value!
That won't happen for a leaf function unless you get an interrupt.
Static analysis could determine function returns that can never
underflow the return stack.
> In fact, it really makes me wonder if we should inline 'get_user()' entirely.
I think that would bloat some cold paths.
I know I've put a wrapper on copy_to/from_user() in the past.
(I don't think that is inlined at all these days?)
So you might want two copies.
An entirely horrid way is to have #define get_user(args) inline it;
then (get_user)(args) would call the real function (see the sketch below).
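(Sketch of that trick with made-up names - a function-like macro only expands
when the name is immediately followed by '(', so parenthesising the name
reaches the real function; the GNU statement expression stands in for the
inlined fast path:)

#include <stdio.h>

/* Inline fast path (illustrative). */
#define get_user(dst, src) ({ *(dst) = *(src); 0L; })

/* Out-of-line version: the parentheses around the name stop macro
 * expansion, both here and at call sites. */
long (get_user)(long *dst, const long *src)
{
        *dst = *src;
        return 0;
}

int main(void)
{
        long src = 42, a = 0, b = 0;

        get_user(&a, &src);       /* macro: expands inline            */
        (get_user)(&b, &src);     /* function: real out-of-line call  */
        printf("%ld %ld\n", a, b);
        return 0;
}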
Mind you there are a lot of get_user() and put_user() in the getsockopt()
functions - which could all have a kernel address passed and
that access done by the wrapper.
A simple change to a kernel address doesn't change the prototype,
so the compiler doesn't catch code that hasn't been converted.
I've tried changing it to:
new_size_or_errno = getsockopt(..., old_size)
but there are a few places that want to return an error and
update the size.
Plausibly returning (errno << n | size) from those places would
solve the problem - and the extra check would be in the error path.
Trouble is, it is a massive patch series :-(
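(For illustration, a hypothetical packing with n = 32 - not an existing kernel
interface, just the shape such a return value could take:)

#include <stdint.h>
#include <stdio.h>

/* Pack an errno-style error and a 32-bit size into one return value. */
#define PACK(err, size)   (((uint64_t)(uint32_t)(err) << 32) | (uint32_t)(size))
#define UNPACK_ERR(v)     ((int32_t)((v) >> 32))
#define UNPACK_SIZE(v)    ((uint32_t)(v))

int main(void)
{
        uint64_t v = PACK(-14 /* -EFAULT */, 128);

        printf("err=%d size=%u\n", UNPACK_ERR(v), UNPACK_SIZE(v));
        return 0;
}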
> > Especially if there is always a (PAGE sized) gap between the highest
> > user address and the lowest kernel address so the 'size' argument
> > to access_ok() can be ignored on the assumption that the accesses
> > are (reasonably) linear.
>
> Yes, that's what we do right now for the inline code generation anyway.
Is that gap there on all architectures?
> (Ok, we actually do take the 'size' value into account when it's not
> constant, or when it's huge, but even that is just out of a silly
> abundance of caution and not a very common case).
Hmm, the buffers for all read/write (etc.) calls are variable length,
so it is a moderately common case.
David
>
> Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-25 8:56 ` David Laight
@ 2024-10-25 16:35 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-25 16:35 UTC (permalink / raw)
To: David Laight
Cc: Borislav Petkov, Josh Poimboeuf, Andrew Cooper,
Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner,
Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen,
Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org,
Mark Rutland
On Fri, 25 Oct 2024 at 01:56, David Laight <David.Laight@aculab.com> wrote:
>
> > > Especially if there is always a (PAGE sized) gap between the highest
> > > user address and the lowest kernel address so the 'size' argument
> > > to access_ok() can be ignored on the assumption that the accesses
> > > are (reasonably) linear.
> >
> > Yes, that's what we do right now for the inline code generation anyway.
>
> Is that gap there on all architectures?
All of the address checks are architecture-specific, so this page-size
gap is purely about x86-64.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-16 22:34 ` Linus Torvalds
@ 2024-10-28 11:29 ` Kirill A. Shutemov
2024-10-28 18:44 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-28 11:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Wed, Oct 16, 2024 at 03:34:11PM -0700, Linus Torvalds wrote:
> On Wed, 16 Oct 2024 at 15:13, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > It is worse than that. If we get LAM_SUP enabled (there's KASAN patchset
> > in works) this check will allow arbitrary kernel addresses.
>
> Ugh. I haven't seen the LAM_SUP patches.
>
> But yeah, I assume any LAM_SUP model would basically then make the GP
> fault due to non-canonical addresses go away.
Actually, I've got this wrong.
I based my comment about incompatibility with LAM_SUP on the approach
description that stated zeroing bits 48/57, but the actual mask zeroes bits
47/56, which is compatible with all LAM modes. It will trigger #GP for
kernel addresses under LAM_SUP too: bit 63 has to be equal to bit 47/56.
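(Quick worked example of that, 4-level case, with an arbitrary example kernel
address:)

#include <stdio.h>

int main(void)
{
        /* Arbitrary example kernel address: bits 63..47 are all set. */
        unsigned long kaddr  = 0xffff888000000000UL;
        /* Clearing bit 47 makes it disagree with bits 63..48, so the result
         * is non-canonical and a dereference takes #GP - LAM_SUP or not. */
        unsigned long masked = kaddr & ~(1UL << 47);

        printf("%#lx -> %#lx\n", kaddr, masked);
        return 0;
}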
I think it is worth looking at this approach again as it gets better
code: removes two instructions from get_user() and doesn't get flags
involved.
Below is a patch on top of current mainline.
The mask only clears bit 47/56. It gets us a stronger canonicality check on
machines with LAM off.
I am not entirely sure about my take on valid_user_address() here.
Any comments?
diff --git a/arch/x86/include/asm/runtime-const.h b/arch/x86/include/asm/runtime-const.h
index 6652ebddfd02..8d983cfd06ea 100644
--- a/arch/x86/include/asm/runtime-const.h
+++ b/arch/x86/include/asm/runtime-const.h
@@ -2,6 +2,18 @@
#ifndef _ASM_RUNTIME_CONST_H
#define _ASM_RUNTIME_CONST_H
+#ifdef __ASSEMBLY__
+
+.macro RUNTIME_CONST_PTR sym reg
+ movq $0x0123456789abcdef, %\reg
+ 1:
+ .pushsection runtime_ptr_\sym, "a"
+ .long 1b - 8 - .
+ .popsection
+.endm
+
+#else /* __ASSEMBLY__ */
+
#define runtime_const_ptr(sym) ({ \
typeof(sym) __ret; \
asm_inline("mov %1,%0\n1:\n" \
@@ -58,4 +70,5 @@ static inline void runtime_const_fixup(void (*fn)(void *, unsigned long),
}
}
+#endif /* __ASSEMBLY__ */
#endif
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index b0a887209400..77acef272b0d 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -16,9 +16,9 @@
/*
* Virtual variable: there's no actual backing store for this,
- * it can purely be used as 'runtime_const_ptr(USER_PTR_MAX)'
+ * it can purely be used as 'runtime_const_ptr(USER_CANONICAL_MASK)'
*/
-extern unsigned long USER_PTR_MAX;
+extern unsigned long USER_CANONICAL_MASK;
#ifdef CONFIG_ADDRESS_MASKING
/*
@@ -54,7 +54,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
#endif
#define valid_user_address(x) \
- ((__force unsigned long)(x) <= runtime_const_ptr(USER_PTR_MAX))
+ (!((__force unsigned long)(x) & ~runtime_const_ptr(USER_CANONICAL_MASK)))
/*
* Masking the user address is an alternative to a conditional
@@ -63,13 +63,8 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
*/
static inline void __user *mask_user_address(const void __user *ptr)
{
- unsigned long mask;
- asm("cmp %1,%0\n\t"
- "sbb %0,%0"
- :"=r" (mask)
- :"r" (ptr),
- "0" (runtime_const_ptr(USER_PTR_MAX)));
- return (__force void __user *)(mask | (__force unsigned long)ptr);
+ unsigned long mask = runtime_const_ptr(USER_CANONICAL_MASK);
+ return (__force void __user *)(mask & (__force unsigned long)ptr);
}
#define masked_user_access_begin(x) ({ \
__auto_type __masked_ptr = (x); \
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5f221ea5688..9d25f4f146f4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2390,14 +2390,9 @@ void __init arch_cpu_finalize_init(void)
alternative_instructions();
if (IS_ENABLED(CONFIG_X86_64)) {
- unsigned long USER_PTR_MAX = TASK_SIZE_MAX-1;
+ unsigned long USER_CANONICAL_MASK = ~(1UL << __VIRTUAL_MASK_SHIFT);
- /*
- * Enable this when LAM is gated on LASS support
- if (cpu_feature_enabled(X86_FEATURE_LAM))
- USER_PTR_MAX = (1ul << 63) - PAGE_SIZE - 1;
- */
- runtime_const_init(ptr, USER_PTR_MAX);
+ runtime_const_init(ptr, USER_CANONICAL_MASK);
/*
* Make sure the first 2MB area is not mapped by huge pages
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b8c5741d2fb4..65d2f4cad0aa 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -358,7 +358,7 @@ SECTIONS
#endif
RUNTIME_CONST_VARIABLES
- RUNTIME_CONST(ptr, USER_PTR_MAX)
+ RUNTIME_CONST(ptr, USER_CANONICAL_MASK)
. = ALIGN(PAGE_SIZE);
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 4357ec2a0bfc..b93f0282c6c9 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -32,6 +32,7 @@
#include <asm/errno.h>
#include <asm/asm-offsets.h>
#include <asm/thread_info.h>
+#include <asm/runtime-const.h>
#include <asm/asm.h>
#include <asm/smap.h>
@@ -39,14 +40,8 @@
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
- movq $0x0123456789abcdef,%rdx
- 1:
- .pushsection runtime_ptr_USER_PTR_MAX,"a"
- .long 1b - 8 - .
- .popsection
- cmp %rax, %rdx
- sbb %rdx, %rdx
- or %rdx, %rax
+ RUNTIME_CONST_PTR sym=USER_CANONICAL_MASK reg=rdx
+ and %rdx, %rax
.else
cmp $TASK_SIZE_MAX-\size+1, %eax
jae .Lbad_get_user
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-28 11:29 ` Kirill A. Shutemov
@ 2024-10-28 18:44 ` Linus Torvalds
2024-10-28 20:31 ` Kirill A. Shutemov
0 siblings, 1 reply; 51+ messages in thread
From: Linus Torvalds @ 2024-10-28 18:44 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, 28 Oct 2024 at 01:29, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I think it is worth looking at this approach again as it gets better
> code: removes two instructions from get_user() and doesn't get flags
> involved.
The problem with the 'and' is that it's just re-introducing the same old bug.
> The mask only clears bit 47/56. It gets us a stronger canonicality check on
> machines with LAM off.
But it means that you can just pass in non-canonical addresses, and
cause the same old AMD data leak.
And yes, it so happens that the AMD microarchitectures use bit 47/56
for "user address", and now it's only leaking user data, but that's a
random microarchitectural choice - and a really horribly bad one at
that.
IOW, the WHOLE BUG was because some AMD engineer decided to use bits
[0..48] for indexing, when they SHOULD have used [0..47,63], since
bits 63 and 48 *should* be the same, and the latter is the one that is
a hell of a lot more convenient and obvious, and would have made ALL
of this completely irrelevant, and we'd have just used the sign bit
and everything would be fine.
But instead of picking the sane bit, they picked a random bit.
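(For reference, a standalone sketch of the canonicality rule being argued
about here - 4-level case only, relying on the usual arithmetic right shift of
signed values:)

#include <stdbool.h>
#include <stdint.h>

/* Canonical under 4-level paging: bits 63..48 are all copies of bit 47,
 * i.e. the address survives a sign-extension from bit 47. */
static bool is_canonical_47(uint64_t addr)
{
        return (uint64_t)((int64_t)(addr << 16) >> 16) == addr;
}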
And now I think your patch has two bugs in it:
- it depends on that random undocumented microarchitectural mistake
- it uses __VIRTUAL_MASK_SHIFT which probably doesn't even match what
hardware does
The *hardware* presumably uses either bit 47 or 56 because that's the
actual hardware width of the TLB / cache matching.
But __VIRTUAL_MASK_SHIFT is hardcoded to 47 if 5-level page tables are disabled.
So no. We're not replacing one simple microarchitectural assumption
(sign bit is sufficient for safety) with another more complicated one
(bit X is sufficient for safety).
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-28 18:44 ` Linus Torvalds
@ 2024-10-28 20:31 ` Kirill A. Shutemov
2024-10-28 20:49 ` Linus Torvalds
0 siblings, 1 reply; 51+ messages in thread
From: Kirill A. Shutemov @ 2024-10-28 20:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, Oct 28, 2024 at 08:44:10AM -1000, Linus Torvalds wrote:
> The *hardware* presumably uses either bit 47 or 56 because that's the
> actual hardware width of the TLB / cache matching.
It is a guess, right?
The bug makes more sense to me if the bit number depends on the active
paging mode: that bit is naturally used for indexing into the top-level
page table. The mistake is more understandable this way.
> So no. We're not replacing one simple microarchitectural assumption
> (sign bit is sufficient for safety) with another more complicated one
> (bit X is sufficient for safety).
Okay, fair enough.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
2024-10-28 20:31 ` Kirill A. Shutemov
@ 2024-10-28 20:49 ` Linus Torvalds
0 siblings, 0 replies; 51+ messages in thread
From: Linus Torvalds @ 2024-10-28 20:49 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner,
Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long,
Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev,
Mark Rutland
On Mon, 28 Oct 2024 at 10:31, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Mon, Oct 28, 2024 at 08:44:10AM -1000, Linus Torvalds wrote:
> > The *hardware* presumably uses either bit 47 or 56 because that's the
> > actual hardware width of the TLB / cache matching.
>
> It is a guess, right?
Yes, I have no idea what they actually do.
> The bug makes more sense to me if the bit number depends on the active
> paging mode: that bit is naturally used for indexing into the top-level
> page table. The mistake is more understandable this way.
If I were a hardware designer - which I'm obviously not - I would
*not* make the TLB lookup be based on the paging mode at all.
Only at actual TLB fill time would I care about things like 4-vs-5
level page tables (and the canonicality check would then obviously
cause the TLB fill to not happen, so you would never have a TLB entry
that could clash with the 4-vs-5 level page table model).
So I would assume that hardware that supports 5-level paging would
just unconditionally use bits 0-56, and older hardware would just use
bits 0-47. Why put extra pointless logic in what is arguably the most
critical path of the most critical piece on the CPU?
But I have zero inside knowledge, so yes, it's all a guess.
Linus
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-10-28 20:50 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-12 4:09 [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() Josh Poimboeuf
2024-10-12 8:48 ` Andrew Cooper
2024-10-12 14:09 ` Josh Poimboeuf
2024-10-12 14:21 ` Borislav Petkov
2024-10-12 15:58 ` Linus Torvalds
2024-10-12 14:26 ` Andrew Cooper
2024-10-12 15:44 ` Linus Torvalds
2024-10-12 17:23 ` Andrew Cooper
2024-10-12 17:44 ` Linus Torvalds
2024-10-13 0:53 ` Linus Torvalds
2024-10-13 1:21 ` Linus Torvalds
2024-10-14 3:54 ` Josh Poimboeuf
2024-10-14 6:50 ` Linus Torvalds
2024-10-14 9:59 ` David Laight
2024-10-14 11:20 ` Linus Torvalds
2024-10-14 14:40 ` David Laight
2024-10-14 11:12 ` Andrew Cooper
2024-10-14 12:30 ` Kirill A. Shutemov
2024-10-14 15:39 ` Andrew Cooper
2024-10-15 10:00 ` Borislav Petkov
2024-10-20 22:44 ` Josh Poimboeuf
2024-10-20 22:59 ` Linus Torvalds
2024-10-20 23:11 ` Josh Poimboeuf
2024-10-20 23:14 ` Josh Poimboeuf
2024-10-20 23:35 ` Linus Torvalds
2024-10-23 9:44 ` Borislav Petkov
2024-10-23 19:17 ` Linus Torvalds
2024-10-23 20:07 ` Linus Torvalds
2024-10-23 23:32 ` Linus Torvalds
2024-10-24 2:00 ` Linus Torvalds
2024-10-24 9:21 ` David Laight
2024-10-24 16:53 ` Linus Torvalds
2024-10-25 8:56 ` David Laight
2024-10-25 16:35 ` Linus Torvalds
2024-10-21 10:48 ` Kirill A. Shutemov
2024-10-22 2:36 ` Linus Torvalds
2024-10-22 10:33 ` Kirill A. Shutemov
2024-10-22 8:16 ` Pawan Gupta
2024-10-22 10:44 ` Kirill A. Shutemov
2024-10-14 16:55 ` Linus Torvalds
2024-10-16 16:10 ` Linus Torvalds
2024-10-16 22:02 ` Andrew Cooper
2024-10-16 22:13 ` Kirill A. Shutemov
2024-10-16 22:34 ` Linus Torvalds
2024-10-28 11:29 ` Kirill A. Shutemov
2024-10-28 18:44 ` Linus Torvalds
2024-10-28 20:31 ` Kirill A. Shutemov
2024-10-28 20:49 ` Linus Torvalds
2024-10-16 22:32 ` Linus Torvalds
2024-10-17 11:00 ` Borislav Petkov
2024-10-14 11:56 ` Kirill A. Shutemov