* [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() @ 2024-10-12 4:09 Josh Poimboeuf 2024-10-12 8:48 ` Andrew Cooper 0 siblings, 1 reply; 51+ messages in thread From: Josh Poimboeuf @ 2024-10-12 4:09 UTC (permalink / raw) To: x86 Cc: Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds, Michael Ellerman, linuxppc-dev, Andrew Cooper, Mark Rutland For x86-64, the barrier_nospec() in copy_from_user() is overkill and painfully slow. Instead, use pointer masking to force the user pointer to a non-kernel value even in speculative paths. While at it, harden the x86 implementations of raw_copy_to_user() and clear_user(): a write in a mispredicted access_ok() branch to a user-controlled kernel address can populate the rest of the affected cache line with kernel data. To avoid regressing powerpc, move the barrier_nospec() to the powerpc raw_copy_from_user() implementation so there's no functional change. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> --- arch/powerpc/include/asm/uaccess.h | 2 ++ arch/x86/include/asm/uaccess_64.h | 4 +++- arch/x86/lib/getuser.S | 2 +- arch/x86/lib/putuser.S | 2 +- include/linux/uaccess.h | 6 ------ 5 files changed, 7 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 4f5a46a77fa2..12abb8bf5eda 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -7,6 +7,7 @@ #include <asm/extable.h> #include <asm/kup.h> #include <asm/asm-compat.h> +#include <asm/barrier.h> #ifdef __powerpc64__ /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */ @@ -341,6 +342,7 @@ static inline unsigned long raw_copy_from_user(void *to, { unsigned long ret; + barrier_nospec(); allow_read_from_user(from, n); ret = __copy_tofrom_user((__force void __user *)to, from, n); prevent_read_from_user(from, n); diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index afce8ee5d7b7..39199eef26be 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -133,12 +133,14 @@ copy_user_generic(void *to, const void *from, unsigned long len) static __always_inline __must_check unsigned long raw_copy_from_user(void *dst, const void __user *src, unsigned long size) { + src = mask_user_address(src); return copy_user_generic(dst, (__force void *)src, size); } static __always_inline __must_check unsigned long raw_copy_to_user(void __user *dst, const void *src, unsigned long size) { + dst = mask_user_address(dst); return copy_user_generic((__force void *)dst, src, size); } @@ -197,7 +199,7 @@ static __always_inline __must_check unsigned long __clear_user(void __user *addr static __always_inline unsigned long clear_user(void __user *to, unsigned long n) { if (__access_ok(to, n)) - return __clear_user(to, n); + return __clear_user(mask_user_address(to), n); return n; } #endif /* _ASM_X86_UACCESS_64_H */ diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index d066aecf8aeb..094224ec9dca 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -39,7 +39,7 @@ .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) - mov %rax, %rdx + mov %rax, %rdx /* mask_user_address() */ sar $63, %rdx or %rdx, %rax .else diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S index 975c9c18263d..09b7e37934ab 100644 --- a/arch/x86/lib/putuser.S +++ b/arch/x86/lib/putuser.S @@ -34,7 +34,7 @@ .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) - mov %rcx, 
%rbx + mov %rcx, %rbx /* mask_user_address() */ sar $63, %rbx or %rbx, %rcx .else diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 39c7cf82b0c2..dda9725a9559 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -160,12 +160,6 @@ _inline_copy_from_user(void *to, const void __user *from, unsigned long n) unsigned long res = n; might_fault(); if (!should_fail_usercopy() && likely(access_ok(from, n))) { - /* - * Ensure that bad access_ok() speculation will not - * lead to nasty side effects *after* the copy is - * finished: - */ - barrier_nospec(); instrument_copy_from_user_before(to, from, n); res = raw_copy_from_user(to, from, n); instrument_copy_from_user_after(to, from, n, res); -- 2.46.2 ^ permalink raw reply related [flat|nested] 51+ messages in thread
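The asm sequence used by the patch above (mov / sar $63 / or) is the same sign-bit saturation that mask_user_address() performs in C. A small standalone sketch of that behaviour, with illustrative addresses that are not taken from the patch:

#include <stdint.h>
#include <stdio.h>

/* Mirror of the x86-64 mask_user_address() trick relied on by this patch:
 * arithmetic-shift the sign bit across the word and OR it back in, so a
 * kernel (high-half) pointer becomes all ones even on a mispredicted
 * access_ok() path, while a user (low-half) pointer passes through unchanged. */
static uint64_t mask_user_address(uint64_t addr)
{
	int64_t mask = (int64_t)addr >> 63;	/* 0 for user, -1 for kernel */
	return addr | (uint64_t)mask;
}

int main(void)
{
	/* Example user and kernel addresses, chosen only for illustration. */
	printf("%#llx\n", (unsigned long long)mask_user_address(0x00007f0012345678ULL)); /* unchanged */
	printf("%#llx\n", (unsigned long long)mask_user_address(0xffff888012345678ULL)); /* all ones  */
	return 0;
}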
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 4:09 [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() Josh Poimboeuf @ 2024-10-12 8:48 ` Andrew Cooper 2024-10-12 14:09 ` Josh Poimboeuf 2024-10-12 15:44 ` Linus Torvalds 0 siblings, 2 replies; 51+ messages in thread From: Andrew Cooper @ 2024-10-12 8:48 UTC (permalink / raw) To: Josh Poimboeuf, x86 Cc: Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds, Michael Ellerman, linuxppc-dev, Mark Rutland On 12/10/2024 5:09 am, Josh Poimboeuf wrote: > For x86-64, the barrier_nospec() in copy_from_user() is overkill and > painfully slow. Instead, use pointer masking to force the user pointer > to a non-kernel value even in speculative paths. > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> You do realise mask_user_address() is unsafe under speculation on AMD systems? Had the mask_user_address() patch been put for review, this feedback would have been given then. AMD needs to arrange for bit 47 (bit 58 with LA57) to be the one saturated by shifting, not bit 63. As it stands, you're reintroducing the very problem barrier_nospec() was introduced to mitigate. ~Andrew ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 8:48 ` Andrew Cooper @ 2024-10-12 14:09 ` Josh Poimboeuf 2024-10-12 14:21 ` Borislav Petkov 2024-10-12 14:26 ` Andrew Cooper 2024-10-12 15:44 ` Linus Torvalds 1 sibling, 2 replies; 51+ messages in thread From: Josh Poimboeuf @ 2024-10-12 14:09 UTC (permalink / raw) To: Andrew Cooper Cc: x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, Oct 12, 2024 at 09:48:57AM +0100, Andrew Cooper wrote: > On 12/10/2024 5:09 am, Josh Poimboeuf wrote: > > For x86-64, the barrier_nospec() in copy_from_user() is overkill and > > painfully slow. Instead, use pointer masking to force the user pointer > > to a non-kernel value even in speculative paths. > > > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> > > You do realise mask_user_address() is unsafe under speculation on AMD > systems? > > Had the mask_user_address() patch been put for review, this feedback > would have been given then. > > > AMD needs to arrange for bit 47 (bit 58 with LA57) to be the one > saturated by shifting, not bit 63. Ok... why? -- Josh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 14:09 ` Josh Poimboeuf @ 2024-10-12 14:21 ` Borislav Petkov 2024-10-12 15:58 ` Linus Torvalds 2024-10-12 14:26 ` Andrew Cooper 1 sibling, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2024-10-12 14:21 UTC (permalink / raw) To: Josh Poimboeuf Cc: Andrew Cooper, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, Oct 12, 2024 at 09:09:23AM -0500, Josh Poimboeuf wrote: > On Sat, Oct 12, 2024 at 09:48:57AM +0100, Andrew Cooper wrote: > > On 12/10/2024 5:09 am, Josh Poimboeuf wrote: > > > For x86-64, the barrier_nospec() in copy_from_user() is overkill and > > > painfully slow. Instead, use pointer masking to force the user pointer > > > to a non-kernel value even in speculative paths. > > > > > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> > > > > You do realise mask_user_address() is unsafe under speculation on AMD > > systems? > > > > Had the mask_user_address() patch been put for review, this feedback > > would have been given then. > > > > > > AMD needs to arrange for bit 47 (bit 58 with LA57) to be the one > > saturated by shifting, not bit 63. > > Ok... why? I've been working on a fix for this. Still WIP. Author: Borislav Petkov (AMD) <bp@alien8.de> Date: Sat Oct 12 00:18:10 2024 +0200 x86/uaccess: Fix user access masking on AMD Commit 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") started doing a data-dependent mask on the user address in order to speed up the fast unsafe user access patterns. However, AMD CPUs would use only 48, or 56 bits respectively when doing transient TLB accesses and a non-canonical kernel address supplied by user can still result in a TLB hit and access kernel data transiently, leading to a potential spectre v1 leak, see bulletin Link below. Therefore, use the most-significant address bit when doing the masking and prevent such transient TLB hits. Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index afce8ee5d7b7..73c7a6bf1468 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -58,7 +58,9 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, * user_access_begin that can avoid the fencing. This only works * for dense accesses starting at the address. */ -#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63))) +#define mask_user_address(x) ((typeof(x)) \ + ((long)(x) | ((long)(x) << (63 - __VIRTUAL_MASK_SHIFT) >> 63))) + #define masked_user_access_begin(x) ({ \ __auto_type __masked_ptr = (x); \ __masked_ptr = mask_user_address(__masked_ptr); \ -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 14:21 ` Borislav Petkov @ 2024-10-12 15:58 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-12 15:58 UTC (permalink / raw) To: Borislav Petkov Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, 12 Oct 2024 at 07:21, Borislav Petkov <bp@alien8.de> wrote: > > Commit > > 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") No. This started long before. Again, see commit b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") and the code it generates. get_user() results in plain calls to __get_user_X, where X is the size. No barriers. And __get_user_X() does that exact same thing. And no, your suggested patch looks very suspicious: +#define mask_user_address(x) ((typeof(x)) \ + ((long)(x) | ((long)(x) << (63 - __VIRTUAL_MASK_SHIFT) >> 63))) that does no masking at all on a kernel address, so you can feed it random kernel addresses and it will just access them, Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 14:09 ` Josh Poimboeuf 2024-10-12 14:21 ` Borislav Petkov @ 2024-10-12 14:26 ` Andrew Cooper 1 sibling, 0 replies; 51+ messages in thread From: Andrew Cooper @ 2024-10-12 14:26 UTC (permalink / raw) To: Josh Poimboeuf Cc: x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Linus Torvalds, Michael Ellerman, linuxppc-dev, Mark Rutland On 12/10/2024 3:09 pm, Josh Poimboeuf wrote: > On Sat, Oct 12, 2024 at 09:48:57AM +0100, Andrew Cooper wrote: >> On 12/10/2024 5:09 am, Josh Poimboeuf wrote: >>> For x86-64, the barrier_nospec() in copy_from_user() is overkill and >>> painfully slow. Instead, use pointer masking to force the user pointer >>> to a non-kernel value even in speculative paths. >>> >>> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> >> You do realise mask_user_address() is unsafe under speculation on AMD >> systems? >> >> Had the mask_user_address() patch been put for review, this feedback >> would have been given then. >> >> >> AMD needs to arrange for bit 47 (bit 58 with LA57) to be the one >> saturated by shifting, not bit 63. > Ok... why? CVE-2020-12965 https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html ~Andrew ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 8:48 ` Andrew Cooper 2024-10-12 14:09 ` Josh Poimboeuf @ 2024-10-12 15:44 ` Linus Torvalds 2024-10-12 17:23 ` Andrew Cooper 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-12 15:44 UTC (permalink / raw) To: Andrew Cooper Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, 12 Oct 2024 at 01:49, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > You do realise mask_user_address() is unsafe under speculation on AMD > systems? That had *better* not be true. > Had the mask_user_address() patch been put for review, this feedback > would have been given then. That's BS. Why? Look at commit b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()"). This mask_user_address() thing is how we've been doing a regular get/put_user() for the last 18 months. It's *exactly* the same pattern: mov %rax, %rdx sar $63, %rdx or %rdx, %rax ie we saturate the sign bit. > AMD needs to arrange for bit 47 (bit 58 with LA57) to be the one > saturated by shifting, not bit 63. > > As it stands, you're reintroducing the very problem barrier_nospec() was > introduced to mitigate. If that's true, we have much bigger issues. And it has nothing to do with the new address masking macro, that just exposed existing logic. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 15:44 ` Linus Torvalds @ 2024-10-12 17:23 ` Andrew Cooper 2024-10-12 17:44 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Andrew Cooper @ 2024-10-12 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On 12/10/2024 4:44 pm, Linus Torvalds wrote: > On Sat, 12 Oct 2024 at 01:49, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> You do realise mask_user_address() is unsafe under speculation on AMD >> systems? > That had *better* not be true. Yeah I'd prefer it wasn't true either. >> Had the mask_user_address() patch been put for review, this feedback >> would have been given then. > That's BS. > > Why? Look at commit b19b74bc99b1 ("x86/mm: Rework address range check > in get_user() and put_user()"). That looks like 3 Intel tags and 0 AMD tags. But ok, I didn't spot this one, and it looks unsafe too. It was not reviewed by anyone that had a reasonable expectation to know AMD's microarchitectural behaviour. Previously, the STAC protected against bad prediction of the JAE and prevented dereferencing the pointer if it was greater than TASK_SIZE. Importantly for the issue at hand, the calculation against TASK_SIZE excluded the whole non-canonical region. > This mask_user_address() thing is how we've been doing a regular > get/put_user() for the last 18 months. It's *exactly* the same > pattern: > > mov %rax, %rdx > sar $63, %rdx > or %rdx, %rax > > ie we saturate the sign bit. This logic is asymmetric. For an address in the upper half (canonical or non-canonical), it ORs with -1 and fully replaces the prior address. For an address in the lower half (canonical or non-canonical), it leaves the value intact, as either canonical or non-canonical. Then the pointer is architecturally dereferenced, relying on catching #PF/#GP for the slow path. Architecturally, this is safe. Micro-architecturally though, AMD CPUs use bit 47, not 63, in the TLB lookup. This behaviour dates from the K8, and is exposed somewhat in the virt extensions. When userspace passes in a non-canonical pointer in the low half of the address space but with bit 47 set, it will be considered a high-half pointer when sent for TLB lookup, and the pagetables say it's a supervisor mapping, so the memory access will be permitted to go ahead speculatively. Only later does the pipeline realise the address was non-canonical and raise #GP. This lets userspace directly target and load anything cacheable in the kernel mappings. It's not as easy to exploit as Meltdown on Intel, but it is known behaviour, and has been the subject of academic work for 4 years. ~Andrew ^ permalink raw reply [flat|nested] 51+ messages in thread
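A concrete instance of the pointer class described above, using a made-up value: bit 63 is clear, so the sign-bit saturation leaves it untouched, yet bit 47 is set, which is the bit the affected AMD parts key the speculative TLB lookup on:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Non-canonical "low half" pointer: bit 63 clear, bit 47 set. */
	uint64_t evil = 0x0000800012345678ULL;

	/* The existing sign-bit saturation (sar $63 / or) is a no-op here. */
	uint64_t masked = evil | (uint64_t)((int64_t)evil >> 63);

	printf("masked = %#llx (unchanged by mask_user_address)\n",
	       (unsigned long long)masked);
	printf("bit 63 = %llu -> architecturally a low-half address\n",
	       (unsigned long long)(evil >> 63));
	printf("bit 47 = %llu -> keyed as high-half in the speculative TLB lookup\n",
	       (unsigned long long)((evil >> 47) & 1));
	return 0;
}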
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 17:23 ` Andrew Cooper @ 2024-10-12 17:44 ` Linus Torvalds 2024-10-13 0:53 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-12 17:44 UTC (permalink / raw) To: Andrew Cooper Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, 12 Oct 2024 at 10:23, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> > This logic is asymmetric. > > For an address in the upper half (canonical or non-canonical), it ORs > with -1 and fully replaces the prior address. Right. The point is that non-canonical addresses will fault, and kernel addresses are guaranteed to fault. And the assumption was that any fault will be sufficient to hide the result, because otherwise you have meltdown all over again. > When userspace passes in a non-canonical pointer in the low half of the > address space but with bit 47 set, it will be considered a high-half > pointer when sent for TLB lookup, and the pagetables say it's a > supervisor mapping, so the memory access will be permitted to go ahead > speculatively. Only later does the pipeline realise the address was > non-canonical and raise #GP. > > This lets userspace directly target and load anything cacheable in the > kernel mappings. It's not as easy to exploit as Meltdown on Intel, but > it known behaviour, and been the subject of academic work for 4 years. It sure was never talked about in kernel circles. I checked my email archives, and neither CVE-2020-12965 nor that https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html is anywhere in my emails, nor does lore.kernel.org find them anywhere either. Anyway, what's the speculation window size like? Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-12 17:44 ` Linus Torvalds @ 2024-10-13 0:53 ` Linus Torvalds 2024-10-13 1:21 ` Linus Torvalds 2024-10-14 11:56 ` Kirill A. Shutemov 0 siblings, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-13 0:53 UTC (permalink / raw) To: Andrew Cooper Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, 12 Oct 2024 at 10:44, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Anyway, what's the speculation window size like? Note that this is important basically because we do *NOT* want to check the address against TASK_SIZE_MAX like we used to, because not only is TASK_SIZE_MAX not a compile-time constant, but with linear address masking, people actually *want* to use addresses that are in the non-canonical range. IOW, see also arch/x86/include/asm/uaccess_64.h and notice how the x86-64 __access_ok() check *also* does the whole "top bit set" thing (iow, see __access_ok()). IOW, this actually goes even further back than the commit I mentioned earlier - it goes back to commit 6014bc27561f ("x86-64: make access_ok() independent of LAM") because without the sign bit trick, LAM is a complete disaster. So no, the address masking can not depend on things like __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account too. Not that I know if there are any CPU's out there that actually have LAM enabled. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-13 0:53 ` Linus Torvalds @ 2024-10-13 1:21 ` Linus Torvalds 2024-10-14 3:54 ` Josh Poimboeuf 2024-10-14 11:56 ` Kirill A. Shutemov 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-13 1:21 UTC (permalink / raw) To: Andrew Cooper Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, 12 Oct 2024 at 17:53, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So no, the address masking can not depend on things like > __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account > too. Not that I know if there are any CPU's out there that actually > have LAM enabled. Lunar Lake and Arrow Lake, apparently. One thing that may make this all moot is that the value loaded from a possible non-canonical range won't actually be used until after we've done a "CLAC". And at least judging by the performance of STAC/CLAC on my machines, those instructions will likely have stopped any speculation cold. So maybe the "what is the actual cycle latency of detecting the faulting instruction" really is the core question here. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-13 1:21 ` Linus Torvalds @ 2024-10-14 3:54 ` Josh Poimboeuf 2024-10-14 6:50 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Josh Poimboeuf @ 2024-10-14 3:54 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, Oct 12, 2024 at 06:21:12PM -0700, Linus Torvalds wrote: > On Sat, 12 Oct 2024 at 17:53, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So no, the address masking can not depend on things like > > __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account > > too. Not that I know if there are any CPU's out there that actually > > have LAM enabled. If I understand correctly, LAM bits are for the benefit of SW and are ignored by HW? Can we just mask those bits off? > And at least judging by the performance of STAC/CLAC on my machines, > those instructions will likely have stopped any speculation cold. > > So maybe the "what is the actual cycle latency of detecting the > faulting instruction" really is the core question here. I think I remember reading that STAC/CLAC can't necessarily be relied on as a speculation barrier, depending on microarchitectural details. It might be safest to assume we can't rely on that. Masking is relatively cheap anyway. It sounds like we may need to go back to a more traditional access_ok() which masks off the LAM bits (right?) and then compares with TASK_SIZE_MAX. While TASK_SIZE_MAX is no longer necessarily a compile-time constant for the CONFIG_X86_5LEVEL case, it's no worse than a single "mov" instruction in task_size_max(). So far I have something like the below which is completely untested and probably actively wrong in some places. diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index 39199eef26be..92c9f2847217 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -50,57 +50,43 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, * The virtual address space space is logically divided into a kernel * half and a user half. When cast to a signed type, user pointers * are positive and kernel pointers are negative. + * + * TODO make sure users are taking AMD non-canonical bit speculation bug into + * account */ #define valid_user_address(x) ((__force long)(x) >= 0) +#ifndef untagged_addr +#define untagged_addr(addr) (addr) +#endif + +static inline bool __access_ok(const void __user *ptr, unsigned long size) +{ + unsigned long addr = (__force unsigned long)untagged_addr(ptr); + unsigned long limit = TASK_SIZE_MAX; + + return (size <= limit) && addr <= (limit - size); +} +#define __access_ok __access_ok + +/* + * Called after access_ok(), so architecturally this is a valid user pointer. + * Force it that way for the speculative case. + */ +#define mask_user_address(x) \ + ((typeof(x))((__force unsigned long)(x) & ((1UL << __VIRTUAL_MASK_SHIFT) - 1))) + /* * Masking the user address is an alternative to a conditional * user_access_begin that can avoid the fencing. This only works * for dense accesses starting at the address. 
*/ -#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63))) -#define masked_user_access_begin(x) ({ \ - __auto_type __masked_ptr = (x); \ - __masked_ptr = mask_user_address(__masked_ptr); \ - __uaccess_begin(); __masked_ptr; }) - -/* - * User pointers can have tag bits on x86-64. This scheme tolerates - * arbitrary values in those bits rather then masking them off. - * - * Enforce two rules: - * 1. 'ptr' must be in the user half of the address space - * 2. 'ptr+size' must not overflow into kernel addresses - * - * Note that addresses around the sign change are not valid addresses, - * and will GP-fault even with LAM enabled if the sign bit is set (see - * "CR3.LAM_SUP" that can narrow the canonicality check if we ever - * enable it, but not remove it entirely). - * - * So the "overflow into kernel addresses" does not imply some sudden - * exact boundary at the sign bit, and we can allow a lot of slop on the - * size check. - * - * In fact, we could probably remove the size check entirely, since - * any kernel accesses will be in increasing address order starting - * at 'ptr', and even if the end might be in kernel space, we'll - * hit the GP faults for non-canonical accesses before we ever get - * there. - * - * That's a separate optimization, for now just handle the small - * constant case. - */ -static inline bool __access_ok(const void __user *ptr, unsigned long size) +static __must_check __always_inline void __user *masked_user_access_begin(void __user *ptr, size_t len) { - if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) { - return valid_user_address(ptr); - } else { - unsigned long sum = size + (__force unsigned long)ptr; - - return valid_user_address(sum) && sum >= (__force unsigned long)ptr; - } + if (unlikely(!__access_ok(ptr,len))) + return (__force void __user *)NULL; + return mask_user_address(ptr); } -#define __access_ok __access_ok /* * Copy To/From Userspace diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index 094224ec9dca..7e09708642e2 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -35,14 +35,16 @@ #include <asm/asm.h> #include <asm/smap.h> -#define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC - -.macro check_range size:req +.macro check_range size:req check=1 .if IS_ENABLED(CONFIG_X86_64) - mov %rax, %rdx /* mask_user_address() */ - sar $63, %rdx - or %rdx, %rax -.else + // FIXME take size into account? + ALTERNATIVE "movq $0x7fffffffffff, %rdx", "movq $0xffffffffffffff, %rdx", X86_FEATURE_LA57 +.if \check == 1 + cmp %rdx, %rax /* access_ok() */ + ja .Lbad_get_user +.endif + and %rdx, %rax /* mask_user_address() */ +.else /* 32 bit */ cmp $TASK_SIZE_MAX-\size+1, %eax jae .Lbad_get_user sbb %edx, %edx /* array_index_mask_nospec() */ @@ -105,10 +107,10 @@ SYM_FUNC_START(__get_user_8) SYM_FUNC_END(__get_user_8) EXPORT_SYMBOL(__get_user_8) -/* .. and the same for __get_user, just without the range checks */ +/* .. 
and the same for __get_user, just without the access_ok() checks */ SYM_FUNC_START(__get_user_nocheck_1) + check_range size=1 check=0 ASM_STAC - ASM_BARRIER_NOSPEC UACCESS movzbl (%_ASM_AX),%edx xor %eax,%eax ASM_CLAC @@ -117,8 +119,8 @@ SYM_FUNC_END(__get_user_nocheck_1) EXPORT_SYMBOL(__get_user_nocheck_1) SYM_FUNC_START(__get_user_nocheck_2) + check_range size=2 check=0 ASM_STAC - ASM_BARRIER_NOSPEC UACCESS movzwl (%_ASM_AX),%edx xor %eax,%eax ASM_CLAC @@ -127,8 +129,8 @@ SYM_FUNC_END(__get_user_nocheck_2) EXPORT_SYMBOL(__get_user_nocheck_2) SYM_FUNC_START(__get_user_nocheck_4) + check_range size=4 check=0 ASM_STAC - ASM_BARRIER_NOSPEC UACCESS movl (%_ASM_AX),%edx xor %eax,%eax ASM_CLAC @@ -137,8 +139,8 @@ SYM_FUNC_END(__get_user_nocheck_4) EXPORT_SYMBOL(__get_user_nocheck_4) SYM_FUNC_START(__get_user_nocheck_8) + check_range size=8 check=0 ASM_STAC - ASM_BARRIER_NOSPEC #ifdef CONFIG_X86_64 UACCESS movq (%_ASM_AX),%rdx #else diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S index 09b7e37934ab..dde3b7825717 100644 --- a/arch/x86/lib/putuser.S +++ b/arch/x86/lib/putuser.S @@ -32,12 +32,15 @@ * as they get called from within inline assembly. */ -.macro check_range size:req +.macro check_range size:req check=1 .if IS_ENABLED(CONFIG_X86_64) - mov %rcx, %rbx /* mask_user_address() */ - sar $63, %rbx - or %rbx, %rcx -.else + ALTERNATIVE "movq $0x7fffffffffff, %rdx", "movq $0xffffffffffffff, %rdx", X86_FEATURE_LA57 +.if \check == 1 + cmp %rdx, %rax /* access_ok() */ + ja .Lbad_put_user +.endif + and %rdx, %rax /* mask_user_address() */ +.else /* 32 bit */ cmp $TASK_SIZE_MAX-\size+1, %ecx jae .Lbad_put_user .endif @@ -55,6 +58,7 @@ SYM_FUNC_END(__put_user_1) EXPORT_SYMBOL(__put_user_1) SYM_FUNC_START(__put_user_nocheck_1) + check_range size=1 check=0 ASM_STAC 2: movb %al,(%_ASM_CX) xor %ecx,%ecx @@ -74,6 +78,7 @@ SYM_FUNC_END(__put_user_2) EXPORT_SYMBOL(__put_user_2) SYM_FUNC_START(__put_user_nocheck_2) + check_range size=2 check=0 ASM_STAC 4: movw %ax,(%_ASM_CX) xor %ecx,%ecx @@ -93,6 +98,7 @@ SYM_FUNC_END(__put_user_4) EXPORT_SYMBOL(__put_user_4) SYM_FUNC_START(__put_user_nocheck_4) + check_range size=4 check=0 ASM_STAC 6: movl %eax,(%_ASM_CX) xor %ecx,%ecx @@ -115,6 +121,7 @@ SYM_FUNC_END(__put_user_8) EXPORT_SYMBOL(__put_user_8) SYM_FUNC_START(__put_user_nocheck_8) + check_range size=8 check=0 ASM_STAC 9: mov %_ASM_AX,(%_ASM_CX) #ifdef CONFIG_X86_32 diff --git a/fs/select.c b/fs/select.c index a77907faf2b4..2ca7ad555342 100644 --- a/fs/select.c +++ b/fs/select.c @@ -777,9 +777,11 @@ static inline int get_sigset_argpack(struct sigset_argpack *to, { // the path is hot enough for overhead of copy_from_user() to matter if (from) { - if (can_do_masked_user_access()) - from = masked_user_access_begin(from); - else if (!user_read_access_begin(from, sizeof(*from))) + if (can_do_masked_user_access()) { + from = masked_user_access_begin(from, sizeof(*from)); + if (!from) + return -EFAULT; + } else if (!user_read_access_begin(from, sizeof(*from))) return -EFAULT; unsafe_get_user(to->p, &from->p, Efault); unsafe_get_user(to->size, &from->size, Efault); diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index dda9725a9559..fe11b3fb755a 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -37,7 +37,7 @@ #define can_do_masked_user_access() 1 #else #define can_do_masked_user_access() 0 - #define masked_user_access_begin(src) NULL + #define masked_user_access_begin(ptr, len) NULL #endif /* diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c index 
989a12a67872..ce5eaaed2365 100644 --- a/lib/strncpy_from_user.c +++ b/lib/strncpy_from_user.c @@ -120,15 +120,6 @@ long strncpy_from_user(char *dst, const char __user *src, long count) if (unlikely(count <= 0)) return 0; - if (can_do_masked_user_access()) { - long retval; - - src = masked_user_access_begin(src); - retval = do_strncpy_from_user(dst, src, count, count); - user_read_access_end(); - return retval; - } - max_addr = TASK_SIZE_MAX; src_addr = (unsigned long)untagged_addr(src); if (likely(src_addr < max_addr)) { @@ -144,11 +135,17 @@ long strncpy_from_user(char *dst, const char __user *src, long count) kasan_check_write(dst, count); check_object_size(dst, count, false); - if (user_read_access_begin(src, max)) { - retval = do_strncpy_from_user(dst, src, count, max); - user_read_access_end(); - return retval; + if (can_do_masked_user_access()) { + src = masked_user_access_begin(src, max); + if (!src) + return -EFAULT; + } else if (!user_read_access_begin(src, max)) { + return -EFAULT; } + + retval = do_strncpy_from_user(dst, src, count, max); + user_read_access_end(); + return retval; } return -EFAULT; } diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c index 6e489f9e90f1..c6066906704a 100644 --- a/lib/strnlen_user.c +++ b/lib/strnlen_user.c @@ -96,15 +96,6 @@ long strnlen_user(const char __user *str, long count) if (unlikely(count <= 0)) return 0; - if (can_do_masked_user_access()) { - long retval; - - str = masked_user_access_begin(str); - retval = do_strnlen_user(str, count, count); - user_read_access_end(); - return retval; - } - max_addr = TASK_SIZE_MAX; src_addr = (unsigned long)untagged_addr(str); if (likely(src_addr < max_addr)) { @@ -118,11 +109,17 @@ long strnlen_user(const char __user *str, long count) if (max > count) max = count; - if (user_read_access_begin(str, max)) { - retval = do_strnlen_user(str, count, max); - user_read_access_end(); - return retval; + if (can_do_masked_user_access()) { + str = masked_user_access_begin(str, max); + if (!str) + return 0; + } else if (!user_read_access_begin(str, max)) { + return 0; } + + retval = do_strnlen_user(str, count, max); + user_read_access_end(); + return retval; } return 0; } ^ permalink raw reply related [flat|nested] 51+ messages in thread
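The hunks above replace the sign-bit trick with an explicit access_ok() limit check plus an AND against the user virtual-address mask. A rough user-space model of those two helpers for the 4-level (48-bit) case, untagged pointers only, just to make the intended semantics concrete (addresses are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VIRTUAL_MASK_SHIFT 47			/* 56 with LA57 */
#define TASK_SIZE_MAX ((1ULL << VIRTUAL_MASK_SHIFT) - 4096)

/* __access_ok(): plain range check against TASK_SIZE_MAX. */
static bool access_ok(uint64_t addr, uint64_t size)
{
	return size <= TASK_SIZE_MAX && addr <= TASK_SIZE_MAX - size;
}

/* mask_user_address(): clear everything above the user virtual-address bits,
 * so even a misspeculated pointer stays inside the user half. */
static uint64_t mask_user_address(uint64_t addr)
{
	return addr & ((1ULL << VIRTUAL_MASK_SHIFT) - 1);
}

int main(void)
{
	uint64_t kernel = 0xffff888012345678ULL;

	printf("access_ok(kernel) = %d\n", access_ok(kernel, 8));	/* 0 */
	printf("masked kernel     = %#llx\n",				/* forced into the user range */
	       (unsigned long long)mask_user_address(kernel));
	return 0;
}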
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 3:54 ` Josh Poimboeuf @ 2024-10-14 6:50 ` Linus Torvalds 2024-10-14 9:59 ` David Laight ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-14 6:50 UTC (permalink / raw) To: Josh Poimboeuf Cc: Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland [-- Attachment #1: Type: text/plain, Size: 7423 bytes --] On Sun, 13 Oct 2024 at 20:54, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > If I understand correctly, LAM bits are for the benefit of SW and are > ignored by HW? Can we just mask those bits off? Yes. But then you waste time on the masking, but particularly if it then causes silly extra overhead just to get the mask. That was why the whole original LAM patches were dismissed - not because an "and" instruction to remove the LAM bits would be expensive, but because when you do static branching depending on it and then load the mask from memory, the code goes from nice simple straight-line efficient code to a horrible mess. The early LAM patches admittedly made it even worse because it did a per-thread mask etc, so it really got quite nasty and wasn't just *one* static jump to a constant. But still.. > > So maybe the "what is the actual cycle latency of detecting the > > faulting instruction" really is the core question here. > > I think I remember reading that STAC/CLAC can't necessarily be relied on > as a speculation barrier, depending on microarchitectural details. It > might be safest to assume we can't rely on that. Masking is relatively > cheap anyway. The WHOLE BUG is all about "microarchitectural details". So arguing that STAC is a microarchitectural detail and not architecturally guaranteed isn't an argument. Because architecturally, this but simply does not exist, and so some architectural definition simply doesn't matter. So all that matters is whether the broken microarchitectures that used the wrong bit for testing also have a STAC that basically hides the bug. Christ. The hardware should have used a bit that is *MEANINGFUL*, namely bit #63, not some random meaningless bit that just happens to be one of the bits that is then checked for canonicality. And it's so annoying, because from a *hardware* perspective, bit #63 vs bit #48 is completely irrelevant. It's literally just a wire choice But from an architectural perspective, bit #63 is not only the *actual* bit that is the real difference ("kernel is at the top of the address space") but for software, bit #48 is fundamentally harder to test. IOW, it's completely the wrong effing bit to test. Honestly, the Intel meltdown thing I at least understood *why* it happened. This AMD "use meaningless bit #48" bug just strikes me as active malice. The paper I found for this bug doesn't go into any details of what the cycle count issues are until the full address is verified, and doesn't even mention which micro-architectures are affected. It says "Zen+" and "Zen 2", from a quick read. The AMD note doesn't say even that. Is this *all* Zen cores? If not, when did it get fixed? > So far I have something like the below which is completely untested and > probably actively wrong in some places. This is worse than actively wrong. It's just adding insult to injury. 
> +static inline bool __access_ok(const void __user *ptr, unsigned long size) > +{ > + unsigned long addr = (__force unsigned long)untagged_addr(ptr); > + unsigned long limit = TASK_SIZE_MAX; We're not encouraging this complete microarchitectural incompetence by using something expensive like TASK_SIZE_MAX, which actually expands to be conditional on pgtable_l5_enabled() and a shift (maybe the compiler ends up moving the shift into both sides of the conditional?). Which isn't even the right thing, because presumably as far as the HARDWARE is concerned, the actual width of the TLB is what gets tested, and isn't actually dependent on whether 5-level paging is actually *enabled* or not, but on whether the hardware *supports* 5-level paging. I guess as long as we aren't using the high bits, we can just clear all of them (so if we just always clear bit #57 when we also clear #48, we are ok), but it still seems wrong in addition to being pointlessly expensive. > +#define mask_user_address(x) \ > + ((typeof(x))((__force unsigned long)(x) & ((1UL << __VIRTUAL_MASK_SHIFT) - 1))) No. Again, that code makes it look like it's some nice simple constant. It's not. __VIRTUAL_MASK_SHIFT is that conditional thing based on pgtable_l5_enabled(). So now you have static branches etc craziness instead of having a nice constant. We can fix this, but no, we're not going "hardware people did something incredibly stupid, so now to balance things out, *we* will do something equally stupid". We are going to be better than that. So the way to fix this properly - if it even needs fixing - is to just have an actual assembler alternate that does something like this in get_user.S, and does the same for the C code for mask_user_address(). And it needs a *big* comment about the stupidity of hardware checking the wrong bit that has no semantic meaning except for it (too late!) being tested for being canonical with the actual bit that matters (ie bit #63). And again, the whole "if it even needs fixing" is a big thing. Maybe STAC is just taking long enough that the canonicality check *has* been done. We know the STAC isn't a serializing instruction, but we also *do* know that STAC sure as hell will synchronize at least to some degree with memory access permission testing, because that's literally the whole and only point of it. (Of course, if the AC bit was just passed along from the front end and tagged all the instructions, the actual CLAC/STAC instructions wouldn't need to serializing with actual instruction execution, but if that was the case they wouldn't be as expensive as I see them being in profiles, so we know the uarch isn't doing something that clever). Christ. I thought I was over being annoyed by hardware bugs. But this hardware bug is just annoying in how *stupid* it is. Anyway, the attached patch (a) only fixes get_user() - I assume put_user() doesn't even need this, because the store will go into the store buffer, and certainly be killed before it gets anywhere else? (b) only does the asm side, the mask_user_address() would need to do the same thing using the C version: alternative() (c) is entirely untested, and I might have gotten the constants wrong or some other logic wrong. but at least the code it generates doesn't look actively like garbage. It generates something like this: mov %rax,%rdx sar $0x3f,%rdx or %rdx,%rax movabs $0x80007fffffffffff,%rdx and %rdx,%rax which looks about as good as it gets (assuming I didn't screw up the constant). 
The "or" still sets all bits when it's one of the real kernel addresses), but the "and" will now guarantee that the end result is canonical for positive addresses, and guaranteed non-canonical - and for this AMD bug, a zero bit 48/57 - for the invalid kernel range. Yes, it basically means that as far as the kernel is concerned, "everything is LAM". The kernel would actually accept all positive addresses, and only fault (now with a non-canonical GP fault) when users try to use negative addresses. Which is arguably also quite horrendous, but it's effectively what having LAM enabled would do in hardware anyway, so hey, it's "forward looking". Bah. And we could make the masking constants be 0xfeff7fffffffffff and 0xfeffffffffffff, and *only* mask off bit 48/57. And we could even make the whole "and" be conditional on the AMD bug with a new X86_FEATURE_CANONICAL_LEAK thing. So there are certainly options in this area. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 673 bytes --] arch/x86/lib/getuser.S | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index d066aecf8aeb..7d5730aa18b8 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -37,11 +37,17 @@ #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC +#define X86_CANONICAL_MASK ALTERNATIVE \ + "movq $0x80007fffffffffff,%rdx", \ + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57 + .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) mov %rax, %rdx sar $63, %rdx or %rdx, %rax + X86_CANONICAL_MASK + and %rdx,%rax .else cmp $TASK_SIZE_MAX-\size+1, %eax jae .Lbad_get_user ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 6:50 ` Linus Torvalds @ 2024-10-14 9:59 ` David Laight 2024-10-14 11:20 ` Linus Torvalds 2024-10-14 11:12 ` Andrew Cooper 2024-10-14 12:30 ` Kirill A. Shutemov 2 siblings, 1 reply; 51+ messages in thread From: David Laight @ 2024-10-14 9:59 UTC (permalink / raw) To: 'Linus Torvalds', Josh Poimboeuf Cc: Andrew Cooper, x86@kernel.org, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland ... > > If I understand correctly, LAM bits are for the benefit of SW and are > > ignored by HW? Can we just mask those bits off? > > Yes. But then you waste time on the masking, but particularly if it > then causes silly extra overhead just to get the mask. Isn't LAM just plain stupid unless the hardware validates the bits against the TLB? You start with a nice big sparse address space (potentially 63 bits of it) where things can be spread out to make 'random' addresses likely to fault and then alias 32k addresses onto each memory location. Sounds brain-dead to me. If you could set the stack pointer to a 'high' address and have the hardware check that the TLB was for that alias then you'd get reasonable stack overflow checking. ... > namely bit #63, not some random meaningless bit that just happens to > be one of the bits that is then checked for canonicality. > > And it's so annoying, because from a *hardware* perspective, bit #63 > vs bit #48 is completely irrelevant. It's literally just a wire choice > > But from an architectural perspective, bit #63 is not only the > *actual* bit that is the real difference ("kernel is at the top of > the address space") but for software, bit #48 is fundamentally harder > to test. Doesn't ARM64 have the same issue? I'm sure I also remember some other architectural feature that extends the valid virtual addresses beyond 48 bits? David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 9:59 ` David Laight @ 2024-10-14 11:20 ` Linus Torvalds 2024-10-14 14:40 ` David Laight 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-14 11:20 UTC (permalink / raw) To: David Laight Cc: Josh Poimboeuf, Andrew Cooper, x86@kernel.org, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland On Mon, 14 Oct 2024 at 02:59, David Laight <David.Laight@aculab.com> wrote: > > Isn't LAM just plain stupid unless the hardware validates the bits > against the TLB? What? No. You can't do that. At some point, the TLB isn't filled, and then you have to do the access with lots of linear address bits masked. > Doesn't ARM64 have the same issue? Yes. They have different bits that they'd need to test for this. They have top-byte-ignore, so they can't use the sign bit. On arm64, the address masking might end up looking something like this https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d31e86ef6377 instead. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 11:20 ` Linus Torvalds @ 2024-10-14 14:40 ` David Laight 0 siblings, 0 replies; 51+ messages in thread From: David Laight @ 2024-10-14 14:40 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Josh Poimboeuf, Andrew Cooper, x86@kernel.org, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland From: Linus Torvalds > Sent: 14 October 2024 12:21 > > On Mon, 14 Oct 2024 at 02:59, David Laight <David.Laight@aculab.com> wrote: > > > > Isn't LAM just plain stupid unless the hardware validates the bits > > against the TLB? > > What? No. You can't do that. At some point, the TLB isn't filled, and > then you have to do the access with lots of linear address bits > masked. And the 'high' bits get reloaded from the PTE. (Although I suspect there isn't space for them.) Reloading the high bits from the VA that caused the fault would be 'interesting' - certainly better than just ignoring them. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 6:50 ` Linus Torvalds 2024-10-14 9:59 ` David Laight @ 2024-10-14 11:12 ` Andrew Cooper 2024-10-14 12:30 ` Kirill A. Shutemov 2 siblings, 0 replies; 51+ messages in thread From: Andrew Cooper @ 2024-10-14 11:12 UTC (permalink / raw) To: Linus Torvalds, Josh Poimboeuf Cc: x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On 14/10/2024 7:50 am, Linus Torvalds wrote: > On Sun, 13 Oct 2024 at 20:54, Josh Poimboeuf <jpoimboe@kernel.org> wrote: >> If I understand correctly, LAM bits are for the benefit of SW and are >> ignored by HW? That is what is written in the ISE today. But, I'll note that on ARM, MTE (Memory Tagging Extensions) use the same bits that BTI (Top Byte Ignore) made available to software. >>> So maybe the "what is the actual cycle latency of detecting the >>> faulting instruction" really is the core question here. I'm afraid I will have to defer that to AMD. I have not experimented personally. >> I think I remember reading that STAC/CLAC can't necessarily be relied on >> as a speculation barrier, depending on microarchitectural details. STAC and CLAC are not LFENCEes. They happen to be on implementations where AC is not a renamed flag (i.e. can't be updated speculatively), but both vendors want the flexibility in the future. Therefore, it is my recollection that the vendors agreed that STAC/CLAC would prevent memory accesses from occurring under the wrong AC, although I can't actually find this written down anywhere. I think we need to chase some architects until a statement or two fall out. >> It >> might be safest to assume we can't rely on that. Masking is relatively >> cheap anyway. > The WHOLE BUG is all about "microarchitectural details". > > So arguing that STAC is a microarchitectural detail and not > architecturally guaranteed isn't an argument. > > Because architecturally, this but simply does not exist, and so some > architectural definition simply doesn't matter. > > So all that matters is whether the broken microarchitectures that used > the wrong bit for testing also have a STAC that basically hides the > bug. > > Christ. The hardware should have used a bit that is *MEANINGFUL*, > namely bit #63, not some random meaningless bit that just happens to > be one of the bits that is then checked for canonicality. > > And it's so annoying, because from a *hardware* perspective, bit #63 > vs bit #48 is completely irrelevant. It's literally just a wire choice > > But from an architectural perspective, bit #63 is not only the > *actual* bit that is the real difference ("kernel is at the top of > the address space") but for software, bit #48 is fundamentally harder > to test. > > IOW, it's completely the wrong effing bit to test. > > Honestly, the Intel meltdown thing I at least understood *why* it happened. > > This AMD "use meaningless bit #48" bug just strikes me as active malice. I think this is unfair. This was an implementation decision taken probably around the millennium (given lead times to get the K8 out in 2003). There is a reason why - given they're architecturally equivalent - the inside bit is more useful than the outside bit. Some AMD CPUs really only have 48 bits of storage for pointers, and it really does simplify circuitry. Furthermore, since when have software concerns (one bit being harder to test than others) ever factored into hardware design. It's sad but true. 
From further up the email chain: > It sure was never talked about in kernel circles. Quite possibly not. At the time the Linux accessors were safe, so where ought it to have been discussed? It's not appropriate for an encrypted list, nor security@. AMD did deliver it to software vendors via their normal mechanism. ~Andrew ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 6:50 ` Linus Torvalds 2024-10-14 9:59 ` David Laight 2024-10-14 11:12 ` Andrew Cooper @ 2024-10-14 12:30 ` Kirill A. Shutemov 2024-10-14 15:39 ` Andrew Cooper 2024-10-14 16:55 ` Linus Torvalds 2 siblings, 2 replies; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-14 12:30 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sun, Oct 13, 2024 at 11:50:55PM -0700, Linus Torvalds wrote: > Anyway, the attached patch > > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S > index d066aecf8aeb..7d5730aa18b8 100644 > --- a/arch/x86/lib/getuser.S > +++ b/arch/x86/lib/getuser.S > @@ -37,11 +37,17 @@ > > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC > > +#define X86_CANONICAL_MASK ALTERNATIVE \ > + "movq $0x80007fffffffffff,%rdx", \ > + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57 > + > .macro check_range size:req > .if IS_ENABLED(CONFIG_X86_64) > mov %rax, %rdx > sar $63, %rdx > or %rdx, %rax > + X86_CANONICAL_MASK > + and %rdx,%rax > .else > cmp $TASK_SIZE_MAX-\size+1, %eax > jae .Lbad_get_user Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do this unconditionally instead of masking: diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index d066aecf8aeb..86d4511520b1 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -37,9 +37,14 @@ #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \ + "shl $(64 - 48), %rdx", \ + "shl $(64 - 57), %rdx", X86_FEATURE_LA57 + .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) mov %rax, %rdx + SHIFT_LEFT_TO_MSB sar $63, %rdx or %rdx, %rax .else -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply related [flat|nested] 51+ messages in thread
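As a sanity check on the idea above (illustrative values, not from the mail): shifting bit 47 up to the MSB before the arithmetic shift makes bit 47, rather than bit 63, the bit that gets saturated, so a forged low-half pointer with bit 47 set is turned into all ones just like a kernel pointer:

#include <stdint.h>
#include <stdio.h>

/* The LA48 variant of the suggestion above: shl $(64-48) ; sar $63 ; or,
 * which saturates on bit 47 instead of bit 63. */
static uint64_t mask47(uint64_t a)
{
	int64_t m = (int64_t)(a << (64 - 48)) >> 63;	/* -1 if bit 47 set */
	return a | (uint64_t)m;
}

int main(void)
{
	printf("user   -> %#llx\n", (unsigned long long)mask47(0x00007f0012345678ULL)); /* unchanged */
	printf("kernel -> %#llx\n", (unsigned long long)mask47(0xffff888012345678ULL)); /* all ones  */
	printf("forged -> %#llx\n", (unsigned long long)mask47(0x0000800012345678ULL)); /* all ones  */
	return 0;
}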
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 12:30 ` Kirill A. Shutemov @ 2024-10-14 15:39 ` Andrew Cooper 2024-10-15 10:00 ` Borislav Petkov 2024-10-20 22:44 ` Josh Poimboeuf 2024-10-14 16:55 ` Linus Torvalds 1 sibling, 2 replies; 51+ messages in thread From: Andrew Cooper @ 2024-10-14 15:39 UTC (permalink / raw) To: Kirill A. Shutemov, Linus Torvalds Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On 14/10/2024 1:30 pm, Kirill A. Shutemov wrote: > On Sun, Oct 13, 2024 at 11:50:55PM -0700, Linus Torvalds wrote: >> Anyway, the attached patch >> >> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S >> index d066aecf8aeb..7d5730aa18b8 100644 >> --- a/arch/x86/lib/getuser.S >> +++ b/arch/x86/lib/getuser.S >> @@ -37,11 +37,17 @@ >> >> #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC >> >> +#define X86_CANONICAL_MASK ALTERNATIVE \ >> + "movq $0x80007fffffffffff,%rdx", \ >> + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57 >> + >> .macro check_range size:req >> .if IS_ENABLED(CONFIG_X86_64) >> mov %rax, %rdx >> sar $63, %rdx >> or %rdx, %rax >> + X86_CANONICAL_MASK >> + and %rdx,%rax >> .else >> cmp $TASK_SIZE_MAX-\size+1, %eax >> jae .Lbad_get_user > Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do > this unconditionally instead of masking: > > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S > index d066aecf8aeb..86d4511520b1 100644 > --- a/arch/x86/lib/getuser.S > +++ b/arch/x86/lib/getuser.S > @@ -37,9 +37,14 @@ > > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC > > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \ > + "shl $(64 - 48), %rdx", \ > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57 > + > .macro check_range size:req > .if IS_ENABLED(CONFIG_X86_64) > mov %rax, %rdx > + SHIFT_LEFT_TO_MSB > sar $63, %rdx > or %rdx, %rax > .else That looks like it ought to DTRT in some cases, but I'll definitely ask AMD for confirmation. But, I expect it will malfunction on newer hardware when CONFIG_X86_5LEVEL=n, because it causes Linux to explicitly ignore the LA57 bit. That can be fixed by changing how CONFIG_X86_5LEVEL works. I also expect it will malfunction under virt on an LA57-capable system running a VM in LA48 mode (this time, Linux doesn't get to see the relevant uarch detail), and I have no good suggestion here. ~Andrew ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 15:39 ` Andrew Cooper @ 2024-10-15 10:00 ` Borislav Petkov 2024-10-20 22:44 ` Josh Poimboeuf 1 sibling, 0 replies; 51+ messages in thread From: Borislav Petkov @ 2024-10-15 10:00 UTC (permalink / raw) To: Andrew Cooper Cc: Kirill A. Shutemov, Linus Torvalds, Josh Poimboeuf, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, Oct 14, 2024 at 04:39:26PM +0100, Andrew Cooper wrote: > But, I expect it will malfunction on newer hardware when > CONFIG_X86_5LEVEL=n, because it causes Linux to explicitly ignore the > LA57 bit. That can be fixed by changing how CONFIG_X86_5LEVEL works. https://lore.kernel.org/all/80734605-1926-4ac7-9c63-006fe3ea6b6a@amd.com/ But there was some lmbench regression... -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 15:39 ` Andrew Cooper 2024-10-15 10:00 ` Borislav Petkov @ 2024-10-20 22:44 ` Josh Poimboeuf 2024-10-20 22:59 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Josh Poimboeuf @ 2024-10-20 22:44 UTC (permalink / raw) To: Andrew Cooper Cc: Kirill A. Shutemov, Linus Torvalds, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, Oct 14, 2024 at 04:39:26PM +0100, Andrew Cooper wrote: > On 14/10/2024 1:30 pm, Kirill A. Shutemov wrote: > > +++ b/arch/x86/lib/getuser.S > > @@ -37,9 +37,14 @@ > > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \ > > + "shl $(64 - 48), %rdx", \ > > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57 > > + > > .macro check_range size:req > > .if IS_ENABLED(CONFIG_X86_64) > > mov %rax, %rdx > > + SHIFT_LEFT_TO_MSB > > sar $63, %rdx > > or %rdx, %rax > > .else > > That looks like it ought to DTRT in some cases, but I'll definitely ask > AMD for confirmation. > > But, I expect it will malfunction on newer hardware when > CONFIG_X86_5LEVEL=n, because it causes Linux to explicitly ignore the > LA57 bit. That can be fixed by changing how CONFIG_X86_5LEVEL works. > > I also expect it will malfunction under virt on an LA57-capable system > running a VM in LA48 mode (this time, Linux doesn't get to see the > relevant uarch detail), and I have no good suggestion here. BTW the paper [1] nonchalantly mentions: "All Intel CPUs that are vulnerable to MDS attacks inherently have the same flaw described here." Anyway, I'd really like to make forward progress on getting rid of the LFENCEs in copy_from_user() and __get_user(), so until if/when we hear back from both vendors, how about we avoid noncanonical exceptions altogether (along with the edge cases mentioned above) and do something like the below? Sure, it could maybe be optimized by a few bytes if we were given more concrete recommendations, but that can be done later if/when that happens. In the meantime we'd have no canonical exception worries and can use a similar strategy to get rid of the LFENCEs. [1] https://arxiv.org/ftp/arxiv/papers/2108/2108.10771.pdf diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index 112e88ebd07d..dfc6881eb785 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -41,12 +41,21 @@ "shl $(64 - 57), %rdx", X86_FEATURE_LA57, \ "", ALT_NOT(X86_BUG_CANONICAL) #ifdef CONFIG_X86_5LEVEL #define LOAD_TASK_SIZE_MINUS_N(n) \ ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \ __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57 #else #define LOAD_TASK_SIZE_MINUS_N(n) \ mov $(TASK_SIZE_MAX - (n)),%_ASM_DX #endif .macro check_range size .if IS_ENABLED(CONFIG_X86_64) /* If above TASK_SIZE_MAX, convert to all 1's */ LOAD_TASK_SIZE_MINUS_N(size - 1) cmp %rax, %rdx sbb %rdx, %rdx or %rdx, %rax ^ permalink raw reply related [flat|nested] 51+ messages in thread
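A quick user-space model of the cmp/sbb/or clamp in the hunk above, for the 4-level (48-bit) limit and with made-up addresses: anything above TASK_SIZE_MAX minus the access size is converted to all ones before the dereference, so no non-canonical address is ever issued:

#include <stdint.h>
#include <stdio.h>

/* Models "cmp %rax, %rdx ; sbb %rdx, %rdx ; or %rdx, %rax" with
 * %rdx = TASK_SIZE_MAX - (size - 1): addresses above the limit become all ones. */
static uint64_t clamp(uint64_t addr, uint64_t size)
{
	uint64_t limit = (1ULL << 47) - 4096 - (size - 1);
	uint64_t m = (limit < addr) ? ~0ULL : 0;	/* sbb result */
	return addr | m;
}

int main(void)
{
	printf("user   -> %#llx\n", (unsigned long long)clamp(0x00007f0012345678ULL, 8)); /* unchanged */
	printf("forged -> %#llx\n", (unsigned long long)clamp(0x0000800012345678ULL, 8)); /* all ones  */
	printf("kernel -> %#llx\n", (unsigned long long)clamp(0xffff888012345678ULL, 8)); /* all ones  */
	return 0;
}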
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-20 22:44 ` Josh Poimboeuf @ 2024-10-20 22:59 ` Linus Torvalds 2024-10-20 23:11 ` Josh Poimboeuf 2024-10-21 10:48 ` Kirill A. Shutemov 0 siblings, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-20 22:59 UTC (permalink / raw) To: Josh Poimboeuf Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > Anyway, I'd really like to make forward progress on getting rid of the > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear > back from both vendors, how about we avoid noncanonical exceptions > altogether (along with the edge cases mentioned above) and do something > like the below? That doesn't work for LAM at _all_. So at a minimum, you need to then say "for LAM enabled CPU's we do the 'shift sign bit' trick". Hopefully any LAM-capable CPU doesn't have this issue? And I still think that clac/stac has to serialize with surrounding memory operations, making this all moot. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-20 22:59 ` Linus Torvalds @ 2024-10-20 23:11 ` Josh Poimboeuf 2024-10-20 23:14 ` Josh Poimboeuf 2024-10-20 23:35 ` Linus Torvalds 2024-10-21 10:48 ` Kirill A. Shutemov 1 sibling, 2 replies; 51+ messages in thread From: Josh Poimboeuf @ 2024-10-20 23:11 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > > Anyway, I'd really like to make forward progress on getting rid of the > > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear > > back from both vendors, how about we avoid noncanonical exceptions > > altogether (along with the edge cases mentioned above) and do something > > like the below? > > That doesn't work for LAM at _all_. Argh, ok. > So at a minimum, you need to then say "for LAM enabled CPU's we do the > 'shift sign bit' trick". Something like below to wipe out the LAM bits beforehand? I'm probably overlooking something else as there are a lot of annoying details here... > Hopefully any LAM-capable CPU doesn't have this issue? > > And I still think that clac/stac has to serialize with surrounding > memory operations, making this all moot. Until it's s/think/know/ can we please put something in place? #define FORCE_CANONICAL \ ALTERNATIVE_2 \ "shl $(64 - 48), %rdx", \ "shl $(64 - 57), %rdx", X86_FEATURE_LA57, \ "", ALT_NOT(X86_FEATURE_LAM) #ifdef CONFIG_X86_5LEVEL #define LOAD_TASK_SIZE_MINUS_N(n) \ ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \ __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57 #else #define LOAD_TASK_SIZE_MINUS_N(n) \ mov $(TASK_SIZE_MAX - (n)),%_ASM_DX #endif .macro check_range size .if IS_ENABLED(CONFIG_X86_64) FORCE_CANONICAL /* If above TASK_SIZE_MAX, convert to all 1's */ LOAD_TASK_SIZE_MINUS_N(size-1) cmp %rax, %rdx sbb %rdx, %rdx or %rdx, %rax ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-20 23:11 ` Josh Poimboeuf @ 2024-10-20 23:14 ` Josh Poimboeuf 2024-10-20 23:35 ` Linus Torvalds 1 sibling, 0 replies; 51+ messages in thread From: Josh Poimboeuf @ 2024-10-20 23:14 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sun, Oct 20, 2024 at 04:11:25PM -0700, Josh Poimboeuf wrote: > #define FORCE_CANONICAL \ > ALTERNATIVE_2 \ > "shl $(64 - 48), %rdx", \ > "shl $(64 - 57), %rdx", X86_FEATURE_LA57, \ ^^^^ these should be rax > "", ALT_NOT(X86_FEATURE_LAM) > > #ifdef CONFIG_X86_5LEVEL > #define LOAD_TASK_SIZE_MINUS_N(n) \ > ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \ > __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57 > #else > #define LOAD_TASK_SIZE_MINUS_N(n) \ > mov $(TASK_SIZE_MAX - (n)),%_ASM_DX > #endif > > .macro check_range size > .if IS_ENABLED(CONFIG_X86_64) > FORCE_CANONICAL > /* If above TASK_SIZE_MAX, convert to all 1's */ > LOAD_TASK_SIZE_MINUS_N(size-1) > cmp %rax, %rdx > sbb %rdx, %rdx > or %rdx, %rax -- Josh ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-20 23:11 ` Josh Poimboeuf 2024-10-20 23:14 ` Josh Poimboeuf @ 2024-10-20 23:35 ` Linus Torvalds 2024-10-23 9:44 ` Borislav Petkov 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-20 23:35 UTC (permalink / raw) To: Josh Poimboeuf Cc: Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sun, 20 Oct 2024 at 16:11, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > Until it's s/think/know/ can we please put something in place? Honestly, I'm pretty damn fed up with buggy hardware and completely theoretical attacks that have never actually shown themselves to be used in practice. So I think this time we push back on the hardware people and tell them it's *THEIR* damn problem, and if they can't even be bothered to say yay-or-nay, we just sit tight. Because dammit, let's put the onus on where the blame lies, and not just take any random shit from bad hardware and say "oh, but it *might* be a problem". Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-20 23:35 ` Linus Torvalds @ 2024-10-23 9:44 ` Borislav Petkov 2024-10-23 19:17 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Borislav Petkov @ 2024-10-23 9:44 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sun, Oct 20, 2024 at 04:35:21PM -0700, Linus Torvalds wrote: > So I think this time we push back on the hardware people and tell them > it's *THEIR* damn problem, and if they can't even be bothered to say > yay-or-nay, we just sit tight. So I was able to get some info: CLAC/STAC in everything Zen4 and older is serializing so there's no issue there. Zen5 is a different story and AC is being renamed there and thus, non-serializing. So we'd need to do something there, perhaps something like Josh's patch... But the good thing is we can alternative it out only for those machines, apart from the rest. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
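A minimal sketch of what "alternative it out only for those machines" could look like, assuming a hypothetical bug bit (the X86_BUG_NONSERIALIZING_CLAC name below is made up for illustration) that would be set only on parts where CLAC/STAC does not serialize:

/* Sketch only: patch the LFENCE in on affected parts, leave it out elsewhere. */
#define ASM_BARRIER_NOSPEC \
	ALTERNATIVE "", "lfence", X86_BUG_NONSERIALIZING_CLAC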
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-23 9:44 ` Borislav Petkov @ 2024-10-23 19:17 ` Linus Torvalds 2024-10-23 20:07 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-23 19:17 UTC (permalink / raw) To: Borislav Petkov Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland [-- Attachment #1: Type: text/plain, Size: 2591 bytes --] On Wed, 23 Oct 2024 at 02:45, Borislav Petkov <bp@alien8.de> wrote: > > So I was able to get some info: CLAC/STAC in everything Zen4 and older is > serializing so there's no issue there. > > Zen5 is a different story and AC is being renamed there and thus, > non-serializing. So we'd need to do something there, perhaps something > like Josh's patch... > > But the good thing is we can alternative it out only for those machines, apart > from the rest. Hmm. The problem with alternatives is that it gets increasingly ugly when there are more than two of them. And even doing a single alternative with some "don't do this unless X" then means either extra no-ops or some horrid static branch. Now, if we ignore LAM, we only have two cases (LA48 vs LA57), but we know that even if we disable LAM until we do the LASS support, that third case *will* happen. Finally - the main reason to use the sign bit is that you don't need some big constant - and the code is small, so the "shift right by 63 and or" is just two instructions and a couple of bytes. But once you have an alternative *with* a big constant, that advantage just goes away because you'll just replace the size the constant takes up with no-ops instead. So then you might as well just _always_ use a constant. Which makes me wonder if we can't just use the runtime-const infrastructure instead. That has the advantage that you can have any number of different constants and it won't complicate the usage sites, and you can easily just update the constant at init time with runtime_const_init() to an arbitrary value. So then LAM support becomes just updating the value for runtime_const_init() - none of the code changes, and no different alternatives cases. Something like the attached, in other words. Note that this actually does that + if (cpu_feature_enabled(X86_FEATURE_LAM)) + USER_PTR_MAX = (1ul << 63) - PAGE_SIZE; partly to show it as docs, and partly because we should just clear that X86_FEATURE_LAM bit if LASS isn't set, so I think it's right anyway. NOTE! This is obviously untested and I didn't check that it does the cmp/sbb/or the right way around. Also, it's worth noting that the cmp/sbb/or sequence (if I *did* get it the right way around) will set all bits only if the original address is *larger* than USER_PTR_MAX, but we already subtract PAGE_SIZE from the actual hardware maximum due to other errata, so that's fine. IOW, it will turn USER_PTR_MAX and anything less into itself, but USER_PTR_MAX+1 (and above) becomes all ones. 
Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 3010 bytes --] arch/x86/include/asm/uaccess_64.h | 17 ++++++++++++++++- arch/x86/kernel/cpu/common.c | 6 ++++++ arch/x86/kernel/vmlinux.lds.S | 1 + arch/x86/lib/getuser.S | 9 +++++++-- 4 files changed, 30 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index afce8ee5d7b7..df1468ada3ed 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -12,6 +12,13 @@ #include <asm/cpufeatures.h> #include <asm/page.h> #include <asm/percpu.h> +#include <asm/runtime-const.h> + +/* + * Virtual variable: there's no actual backing store for this, + * it can purely be used as 'runtime_const_ptr(USER_PTR_MAX)' + */ +extern unsigned long USER_PTR_MAX; #ifdef CONFIG_ADDRESS_MASKING /* @@ -58,7 +65,15 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, * user_access_begin that can avoid the fencing. This only works * for dense accesses starting at the address. */ -#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63))) +static inline void __user *mask_user_address(const void __user *ptr) +{ + void __user *ret; + asm("cmp %1,%0; sbb %k0,%k0; or %1,%0" + :"=r" (ret) + :"r" (ptr), + "0" (runtime_const_ptr(USER_PTR_MAX))); + return ret; +} #define masked_user_access_begin(x) ({ \ __auto_type __masked_ptr = (x); \ __masked_ptr = mask_user_address(__masked_ptr); \ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index f1040cb64841..9d419dc1df12 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -69,6 +69,7 @@ #include <asm/sev.h> #include <asm/tdx.h> #include <asm/posted_intr.h> +#include <asm/runtime-const.h> #include "cpu.h" @@ -2389,6 +2390,11 @@ void __init arch_cpu_finalize_init(void) alternative_instructions(); if (IS_ENABLED(CONFIG_X86_64)) { + unsigned long USER_PTR_MAX = TASK_SIZE_MAX; + + if (cpu_feature_enabled(X86_FEATURE_LAM)) + USER_PTR_MAX = (1ul << 63) - PAGE_SIZE; + runtime_const_init(ptr, USER_PTR_MAX); /* * Make sure the first 2MB area is not mapped by huge pages * There are typically fixed size MTRRs in there and overlapping diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index 6726be89b7a6..b8c5741d2fb4 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -358,6 +358,7 @@ SECTIONS #endif RUNTIME_CONST_VARIABLES + RUNTIME_CONST(ptr, USER_PTR_MAX) . = ALIGN(PAGE_SIZE); diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index d066aecf8aeb..202d68e3fea0 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -39,8 +39,13 @@ .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) - mov %rax, %rdx - sar $63, %rdx + movq $0x0123456789abcdef,%rdx + 1: + .pushsection runtime_ptr_USER_PTR_MAX,"a" + .long 1b - 8 - . + .popsection + cmp %rax, %rdx + sbb %edx, %edx or %rdx, %rax .else cmp $TASK_SIZE_MAX-\size+1, %eax ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-23 19:17 ` Linus Torvalds @ 2024-10-23 20:07 ` Linus Torvalds 2024-10-23 23:32 ` Linus Torvalds 2024-10-24 9:21 ` David Laight 0 siblings, 2 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-23 20:07 UTC (permalink / raw) To: Borislav Petkov Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, 23 Oct 2024 at 12:17, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > NOTE! This is obviously untested and I didn't check that it does the > cmp/sbb/or the right way around. Well, it boots. The code generation (from strncpy_from_user()) seems ok: movabs $0x123456789abcdef,%rcx cmp %rsi,%rcx sbb %ecx,%ecx or %rsi,%rcx where obviously that constant is the bogus pre-initialized value, not the actual runtime value. Which is not to say that the patch might not be horribly buggy anyway. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-23 20:07 ` Linus Torvalds @ 2024-10-23 23:32 ` Linus Torvalds 2024-10-24 2:00 ` Linus Torvalds 2024-10-24 9:21 ` David Laight 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-23 23:32 UTC (permalink / raw) To: Borislav Petkov Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland [-- Attachment #1: Type: text/plain, Size: 1092 bytes --] On Wed, 23 Oct 2024 at 13:07, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Well, it boots. The code generation (from strncpy_from_user()) seems ok: Actually, doing some more sanity checking, that patch is wrong. Not *badly* wrong, but for some reason I did the "sbb" in 32-bit (quite intentionally, but it's very wrong: I for some reason mentally went "32-bit sign-extends to 64-bit") I'd blame the fact that some of the very earliest x86-64 specs did indeed do exactly that, but the reality is that it was just a brainfart. Anyway, the attached patch seems to actually _really_ work, and DTRT. But considering that I created a 32-bit mask there for a while, maybe somebody else should actually verify. And I guess I should make "__put_user()" do the same thing, just so that we only have one sequence. It probably doesn't matter for put_user(), since there's no data leak coming out of it, but if nothing else, avoiding non-canonical accesses from the kernel for any non-LAM/LASS setup is probably just a good thing once we have this logic. Hmm? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 3011 bytes --] arch/x86/include/asm/uaccess_64.h | 17 ++++++++++++++++- arch/x86/kernel/cpu/common.c | 7 +++++++ arch/x86/kernel/vmlinux.lds.S | 1 + arch/x86/lib/getuser.S | 9 +++++++-- 4 files changed, 31 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index afce8ee5d7b7..52e1a549b98a 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -12,6 +12,13 @@ #include <asm/cpufeatures.h> #include <asm/page.h> #include <asm/percpu.h> +#include <asm/runtime-const.h> + +/* + * Virtual variable: there's no actual backing store for this, + * it can purely be used as 'runtime_const_ptr(USER_PTR_MAX)' + */ +extern unsigned long USER_PTR_MAX; #ifdef CONFIG_ADDRESS_MASKING /* @@ -58,7 +65,15 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, * user_access_begin that can avoid the fencing. This only works * for dense accesses starting at the address. 
*/ -#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63))) +static inline void __user *mask_user_address(const void __user *ptr) +{ + void __user *ret; + asm("cmp %1,%0; sbb %0,%0; or %1,%0" + :"=r" (ret) + :"r" (ptr), + "0" (runtime_const_ptr(USER_PTR_MAX))); + return ret; +} #define masked_user_access_begin(x) ({ \ __auto_type __masked_ptr = (x); \ __masked_ptr = mask_user_address(__masked_ptr); \ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index f1040cb64841..8ee6579f694b 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -69,6 +69,7 @@ #include <asm/sev.h> #include <asm/tdx.h> #include <asm/posted_intr.h> +#include <asm/runtime-const.h> #include "cpu.h" @@ -2389,6 +2390,12 @@ void __init arch_cpu_finalize_init(void) alternative_instructions(); if (IS_ENABLED(CONFIG_X86_64)) { + unsigned long USER_PTR_MAX = TASK_SIZE_MAX; + + if (cpu_feature_enabled(X86_FEATURE_LAM)) + USER_PTR_MAX = (1ul << 63) - PAGE_SIZE; + runtime_const_init(ptr, USER_PTR_MAX); + /* * Make sure the first 2MB area is not mapped by huge pages * There are typically fixed size MTRRs in there and overlapping diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index 6726be89b7a6..b8c5741d2fb4 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -358,6 +358,7 @@ SECTIONS #endif RUNTIME_CONST_VARIABLES + RUNTIME_CONST(ptr, USER_PTR_MAX) . = ALIGN(PAGE_SIZE); diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index d066aecf8aeb..4357ec2a0bfc 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -39,8 +39,13 @@ .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) - mov %rax, %rdx - sar $63, %rdx + movq $0x0123456789abcdef,%rdx + 1: + .pushsection runtime_ptr_USER_PTR_MAX,"a" + .long 1b - 8 - . + .popsection + cmp %rax, %rdx + sbb %rdx, %rdx or %rdx, %rax .else cmp $TASK_SIZE_MAX-\size+1, %eax ^ permalink raw reply related [flat|nested] 51+ messages in thread
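The 32-bit vs 64-bit sbb distinction behind the fix in this version can be seen in plain C: on x86-64, a write to a 32-bit register zero-extends into the upper half rather than sign-extending, so a 32-bit "all ones" is not a 64-bit "all ones". A small standalone illustration (the example address is made up):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t mask64 = ~0ULL;	 /* what "sbb %rdx,%rdx" yields when CF=1 */
	uint64_t mask32 = (uint32_t)~0U; /* what "sbb %edx,%edx" leaves in %rdx */
	uint64_t bad = 0x0000800000000001ULL;	/* just above a 47-bit user max */

	printf("64-bit mask: %016llx -> or: %016llx\n",
	       (unsigned long long)mask64, (unsigned long long)(bad | mask64));
	printf("32-bit mask: %016llx -> or: %016llx\n",
	       (unsigned long long)mask32, (unsigned long long)(bad | mask32));
	return 0;
}

With the 32-bit mask the clamped result keeps its user-controlled upper bits instead of becoming all-ones, which defeats the point of the clamp.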
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-23 23:32 ` Linus Torvalds @ 2024-10-24 2:00 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-24 2:00 UTC (permalink / raw) To: Borislav Petkov Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, 23 Oct 2024 at 16:32, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And I guess I should make "__put_user()" do the same thing, just so > that we only have one sequence. No, I decided it's not worth it. The put_user side already also doesn't do any other speculation barriers, simply because it has no speculative outputs that could then be used in some gadget to leak anything. I did extend the USER_ADDR_MAX logic to valid_user_address(), and I wrote a commit log. And sent out what I *think* is a good patch to lkml and the x86 maintainers: https://lore.kernel.org/all/20241024013214.129639-1-torvalds@linux-foundation.org/ I'm not super-happy with the open-coded magic runtime section stuff in getuser.S, but with no other asm users I also didn't want to randomly pollute some header file with ugly asm-specific macros that only get used in one place. Also, I left the LAM case in, but disabled in a comment about how it should be gated by LASS. So that code isn't actually enabled right now. Does anybody see any issues with that patch? It's not that many actual lines of code, and I've been staring at it pretty much all day today (in case anybody wondered why no pull requests), but I've been staring at it so much that I'm patch-blind by now. I've also looked at the generated code. You can look at the asm output, of course, but that ends up being pretty messy due to the fixup hackery. I've been doing objdump --disassemble --no-addresses --no-show-raw-insn vmlinux and you can see where this gets used by searching for "0x123456789abcdef" in the objdumpo disassembly. That's the runtime constant that gets rewritten. Obviously some of them are for another runtime constant (ie dcache_hash), but it's pretty obvious. The code generation seems ok, but like the patch, I'm getting code-blind from having looked at the same thing too many times. Yes, it looked better when it only used the sign bit, but oh well.. And yes, I'm running that code now, and I did a few tests with system calls with invalid addresses and some debug output. Which is still not saying "it has no bugs", but at least any bugs aren't obvious to me. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
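For anyone unfamiliar with the runtime-const machinery being open-coded in getuser.S: each entry in the runtime_ptr_USER_PTR_MAX section is a 32-bit offset from the entry itself to the movq's 8-byte immediate (that is what ".long 1b - 8 - ." records), and at boot the placeholder 0x0123456789abcdef is overwritten in place. A conceptual sketch of the fixup walk, for illustration only - the real kernel goes through its instruction-patching helpers rather than a plain store:

#include <stdint.h>

static void sketch_runtime_const_fixup(int32_t *start, int32_t *end, uint64_t val)
{
	for (int32_t *p = start; p < end; p++) {
		/* entry address + stored offset = address of the immediate */
		uint64_t *imm = (uint64_t *)((char *)p + *p);

		*imm = val;	/* replace the placeholder with the real value */
	}
}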
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-23 20:07 ` Linus Torvalds 2024-10-23 23:32 ` Linus Torvalds @ 2024-10-24 9:21 ` David Laight 2024-10-24 16:53 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: David Laight @ 2024-10-24 9:21 UTC (permalink / raw) To: 'Linus Torvalds', Borislav Petkov Cc: Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland From: Linus Torvalds > Sent: 23 October 2024 21:08 > > On Wed, 23 Oct 2024 at 12:17, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > NOTE! This is obviously untested and I didn't check that it does the > > cmp/sbb/or the right way around. > > Well, it boots. The code generation (from strncpy_from_user()) seems ok: > > movabs $0x123456789abcdef,%rcx > cmp %rsi,%rcx > sbb %ecx,%ecx > or %rsi,%rcx > > where obviously that constant is the bogus pre-initialized value, not > the actual runtime value. Would it be better to make the 'bogus' constant one that makes all accesses fail? So you soon find out if any code doesn't get patched. I also wonder how big the table of addresses to patch is. If that gets into inlined functions it could be big. OTOH having a real function that does access_ok(), clac and address masking may not be a problem. Especially if there is always a (PAGE sized) gap between the highest user address and the lowest kernel address so the 'size' argument to access_ok() can be ignored on the assumption that the accesses are (reasonably) linear. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-24 9:21 ` David Laight @ 2024-10-24 16:53 ` Linus Torvalds 2024-10-25 8:56 ` David Laight 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-24 16:53 UTC (permalink / raw) To: David Laight Cc: Borislav Petkov, Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland On Thu, 24 Oct 2024 at 02:22, David Laight <David.Laight@aculab.com> wrote: > > Would it be better to make the 'bogus' constant one that makes > all accesses fail? Well, the bogus constant is actually used for other runtime cases too (well, right now that's only the dentry pointer, but could be more), it's not some kind of "just for the max user address". And for things like profiling, where the tools use the original object code, making the bogus constant really obvious is actually helpful. So I do think that "standing out" is more important than having some value that has semantic meaning, when that value is going to be overwritten at boot time. > So you soon find out it any code doesn't get patched. Yes, that would be nice, but if the simple patching logic is broken, you have worse problems. The patching logic really is just a couple of lines of code, and the location where we set this particular value is literally next to the place we do all our other alternatives, so if there's something wrong there, you basically have a broken system. > I also wonder how big the table of addresses to patch is. > If that gets into inlined functions it could be big. It's 4 bytes per location, so yes, it grows linearly by use - but not very much. And the patch table is in the init section that gets discarded after boot, along with all the other things like the altinstructions and things like the return sites for retpolines. Which are *much* bigger and more common. So yes, there's a table, and yes it grows, but at least in my personal config, the USER_PTR_MAX table is 424 bytes - and we free it after boot. 90% of that comes from 'access_ok()' sprinkled around and inlined. Just as a comparison, the altinstruction tables (both the pointers and the actual replacement instructions) is 25kB in that config. Also a drop in the ocean, and also something that gets free'd after init. > OTOH having a real function that does access_ok(), clac and address > masking may not problem. access_ok() itself is so rarely used these days that we could out-line it. But the code cost of a function call is likely higher than inlining the 8-byte constant and a couple of instructions: not because the call instruction itself, but because of the code generation pain around it (register save/restore etc). IOW, the call instruction is just five bytes, but it will typically cause spills and forced register choices for argument and return value. It will obviously depend a lot on the context of the function call, but to save 4 bytes for the table that gets freed, and have the 8-byte constant only once? That's a very false economy. Eg an example code sequence (from a random place in the kernel that I looked at with objdump is movabs $0x123456789abcdef,%rax cmp %rsi,%rax jb <__x64_sys_rt_sigreturn+0x20e> and that's 19 bytes. But I tried it with a function call, and that same function grew from 1320 bytes to 1324 bytes. So the function call version generated 4 bytes bigger code. 
I didn't figure out why, because register allocation was so different, but I do think it's exactly that: function calls cause limits on register use. So even if we didn't free the 4 byte entry after init, making access_ok() a function call would actually not be a win. And with how slow 'ret' instructions can be, we really don't want small function calls. In fact, it really makes me wonder if we should inline 'get_user()' entirely. > Especially if there is always a (PAGE sized) gap between the highest > user address and the lowest kernel address so the 'size' argument > to access_ok() can be ignored on the assumption that the accesses > are (reasonably) linear. Yes, that's what we do right now for the inline code generation anyway. (Ok, we actually do take the 'size' value into account when it's not constant, or when it's huge, but even that is just out of a silly abundance of caution and not a very common case). Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
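A sketch of the "size can be ignored" point for x86-64, for illustration only (the limit constant below is an assumption, not the kernel's actual value): because copies proceed forward from the checked pointer and there is an unmapped gap above the highest user address, an oversized copy faults in the gap before it can ever reach a kernel address.

#include <stdint.h>
#include <stdbool.h>

#define SKETCH_USER_PTR_MAX	0x00007ffffffff000ULL	/* assumed, not the real value */

static bool sketch_access_ok(uint64_t ptr, uint64_t size)
{
	/*
	 * Only the start of the range is checked.  A copy that starts just
	 * below the limit and runs long walks into the unmapped guard area
	 * and faults there, so 'size' never needs to enter the check for
	 * (reasonably) linear accesses.
	 */
	(void)size;
	return ptr <= SKETCH_USER_PTR_MAX;
}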
* RE: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-24 16:53 ` Linus Torvalds @ 2024-10-25 8:56 ` David Laight 2024-10-25 16:35 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: David Laight @ 2024-10-25 8:56 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Borislav Petkov, Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland ... > access_ok() itself is so rarely used these days that we could out-line > it. But the code cost of a function call is likely higher than > inlining the 8-byte constant and a couple of instructions: not because > the call instruction itself, but because of the code generation pain > around it (register save/restore etc). I was thinking it might need a non-standard call sequence so that it could return yes/no and mask the address. I did once wonder whether it could just mask the address. Converting kernel addresses into ones that were guaranteed to fault later. Annoyingly I don't think that NULL can be used, maybe ~0 faults for byte accesses - or is a very boring byte. Certainly loading the TLB/cache line for ~0 isn't going to be useful. ... > And with how slow 'ret' instructions can be, we really don't want > small function calls. Can that matter for leaf functions? IIRC the issue is return stack underflow causing indirect branch prediction to be used - probably worse than just using the wrapped return stack value! That won't happen for a leaf function unless you get an interrupt. Static analysis could determine function returns that can never underflow the return stack. > In fact, it really makes me wonder if we should inline 'get_user()' entirely. I think that would bloat some cold paths. I know I've put a wrapper on copy_to/from_user() in the past. (I don't think that is inlined at all these days?) So you might want two copies. An entirely horrid way is to have #define get_user(args) inline it; then (get_user)(args) would call the real function. Mind you there are a lot of get_user() and put_user() in the getsockopt() functions - which could all have a kernel address passed and that access done by the wrapper. A simple change to a kernel address doesn't change the prototype so the compiler doesn't detect unchanged code. I've tried changing it to: new_size_or_errno = getsockopt(..., old_size) but there are a few places that want to return an error and update the size. Plausibly returning (errno << n | size) from those places would solve the problem - and the extra check would be in the error path. Trouble is it is a massive patch series :-( > > Especially if there is always a (PAGE sized) gap between the highest > > user address and the lowest kernel address so the 'size' argument > > to access_ok() can be ignored on the assumption that the accesses > > are (reasonably) linear. > > Yes, that's what we do right now for the inline code generation anyway. Is that gap there on all architectures? > (Ok, we actually do take the 'size' value into account when it's not > constant, or when it's huge, but even that is just out of a silly > abundance of caution and not a very common case). Hmm, the buffers for all read/write (etc) calls are variable length. So it is a moderately common case. David > > Linus - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-25 8:56 ` David Laight @ 2024-10-25 16:35 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-25 16:35 UTC (permalink / raw) To: David Laight Cc: Borislav Petkov, Josh Poimboeuf, Andrew Cooper, Kirill A. Shutemov, x86@kernel.org, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev@lists.ozlabs.org, Mark Rutland On Fri, 25 Oct 2024 at 01:56, David Laight <David.Laight@aculab.com> wrote: > > > > Especially if there is always a (PAGE sized) gap between the highest > > > user address and the lowest kernel address so the 'size' argument > > > to access_ok() can be ignored on the assumption that the accesses > > > are (reasonably) linear. > > > > Yes, that's what we do right now for the inline code generation anyway. > > Is that gap there on all architectures? All of the address checks are architecture-specific, so this page-size gap is purely about x86-64. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-20 22:59 ` Linus Torvalds 2024-10-20 23:11 ` Josh Poimboeuf @ 2024-10-21 10:48 ` Kirill A. Shutemov 2024-10-22 2:36 ` Linus Torvalds 2024-10-22 8:16 ` Pawan Gupta 1 sibling, 2 replies; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-21 10:48 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland, Alexander Shishkin On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > > Anyway, I'd really like to make forward progress on getting rid of the > > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear > > back from both vendors, how about we avoid noncanonical exceptions > > altogether (along with the edge cases mentioned above) and do something > > like the below? > > That doesn't work for LAM at _all_. > > So at a minimum, you need to then say "for LAM enabled CPU's we do the > 'shift sign bit' trick". > > Hopefully any LAM-capable CPU doesn't have this issue? LAM brings own speculation issues[1] that is going to be addressed by LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it never got applied for some reason. [1] https://download.vusec.net/papers/slam_sp24.pdf [2] https://lore.kernel.org/all/20240710160655.3402786-1-alexander.shishkin@linux.intel.com [3] https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gupta@linux.intel.com -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-21 10:48 ` Kirill A. Shutemov @ 2024-10-22 2:36 ` Linus Torvalds 2024-10-22 10:33 ` Kirill A. Shutemov 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-22 2:36 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland, Alexander Shishkin On Mon, 21 Oct 2024 at 03:48, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > LAM brings own speculation issues[1] that is going to be addressed by > LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it > never got applied for some reason. Bah. I think the reason was that nobody knew what hardware - if any - actually has it. I did find an Intel "Product Brief" for "Intel® Xeon® 6 processors with E-cores" that claims both LASS and LAM, but that one puts LAM as a "security feature", so that seems a bit confused. That's Redwood Cove cores, I think. I guess that should mean that Meteor Lake ("Core Ultra") should also have it? Or is it then turned off because the E cores don't do it? I do find LASS - but not LAM - mentioned in the "Intel Core Ultra (PS series)" datasheet. And there's an Intel Core Ultra 200V datasheet (1 of 2) that does mention both LAM and LASS. So I guess now it *really* exists. Can somebody confirm? But yeah, I guess we should just do this: > [3] https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gupta@linux.intel.com for now. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-22 2:36 ` Linus Torvalds @ 2024-10-22 10:33 ` Kirill A. Shutemov 0 siblings, 0 replies; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-22 10:33 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland, Alexander Shishkin On Mon, Oct 21, 2024 at 07:36:50PM -0700, Linus Torvalds wrote: > On Mon, 21 Oct 2024 at 03:48, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > LAM brings own speculation issues[1] that is going to be addressed by > > LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it > > never got applied for some reason. > > Bah. I think the reason was that nobody knew what hardware - if any - actually has it. > > I did find an Intel "Product Brief" for "Intel® Xeon® 6 processors > with E-cores" that claims both LASS and LAM, but that one puts LAM as a > "security feature", so that seems a bit confused. That's Redwood Cove > cores, I think. > > I guess that should mean that Meteor Lake ("Core Ultra") should also > have it? Or is it then turned off because the E cores don't do it? No, Meteor Lake supports neither LAM nor LASS. So far, according to the SDM, it is supported by Sierra Forest (E-core based Xeons) and Lunar Lake. -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-21 10:48 ` Kirill A. Shutemov 2024-10-22 2:36 ` Linus Torvalds @ 2024-10-22 8:16 ` Pawan Gupta 2024-10-22 10:44 ` Kirill A. Shutemov 1 sibling, 1 reply; 51+ messages in thread From: Pawan Gupta @ 2024-10-22 8:16 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linus Torvalds, Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland, Alexander Shishkin On Mon, Oct 21, 2024 at 01:48:15PM +0300, Kirill A. Shutemov wrote: > On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote: > > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > > > > Anyway, I'd really like to make forward progress on getting rid of the > > > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear > > > back from both vendors, how about we avoid noncanonical exceptions > > > altogether (along with the edge cases mentioned above) and do something > > > like the below? > > > > That doesn't work for LAM at _all_. > > > > So at a minimum, you need to then say "for LAM enabled CPU's we do the > > 'shift sign bit' trick". > > > > Hopefully any LAM-capable CPU doesn't have this issue? > > LAM brings own speculation issues[1] that is going to be addressed by > LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it > never got applied for some reason. AIU, issue with LAM is it allows speculative translation of non-canonical user address. It does not allow data fetches from those addresses. SLAM attack uses LAM to form a TLB-based disclosure gadget. In essence, SLAM needs to chain another speculative vulnerability to form a successful attack. If the data accessed by a chained speculative vulnerability is interpreted as a pointer, LAM may allow the translation of non-canonical user address. This is specially true for ascii characters that have MSB masked off. I don't really know how this could be a problem for something like copy_from_user() that ensures the canonicality of bit 63. AFAIK, Sierra Forest, Lunar Lake and Arrow Lake support LAM. To be on safe side, it is better to keep LAM disabled until LASS arrives. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-22 8:16 ` Pawan Gupta @ 2024-10-22 10:44 ` Kirill A. Shutemov 0 siblings, 0 replies; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-22 10:44 UTC (permalink / raw) To: Pawan Gupta Cc: Linus Torvalds, Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland, Alexander Shishkin On Tue, Oct 22, 2024 at 01:16:58AM -0700, Pawan Gupta wrote: > On Mon, Oct 21, 2024 at 01:48:15PM +0300, Kirill A. Shutemov wrote: > > On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote: > > > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > > > > > > Anyway, I'd really like to make forward progress on getting rid of the > > > > LFENCEs in copy_from_user() and __get_user(), so until if/when we hear > > > > back from both vendors, how about we avoid noncanonical exceptions > > > > altogether (along with the edge cases mentioned above) and do something > > > > like the below? > > > > > > That doesn't work for LAM at _all_. > > > > > > So at a minimum, you need to then say "for LAM enabled CPU's we do the > > > 'shift sign bit' trick". > > > > > > Hopefully any LAM-capable CPU doesn't have this issue? > > > > LAM brings own speculation issues[1] that is going to be addressed by > > LASS[2]. There was a patch[3] to disable LAM until LASS is landed, but it > > never got applied for some reason. > > AIU, issue with LAM is it allows speculative translation of non-canonical > user address. It does not allow data fetches from those addresses. SLAM > attack uses LAM to form a TLB-based disclosure gadget. In essence, SLAM > needs to chain another speculative vulnerability to form a successful > attack. If the data accessed by a chained speculative vulnerability is > interpreted as a pointer, LAM may allow the translation of non-canonical > user address. This is specially true for ascii characters that have MSB > masked off. Right. > I don't really know how this could be a problem for something like > copy_from_user() that ensures the canonicality of bit 63. It doesn't affect copy_from/to_user() and put/get_user(), but allows to make other kernel code (outside STAC/CLAC block) to speculatively translate addresses and reveal secrets this way. > AFAIK, Sierra Forest, Lunar Lake and Arrow Lake support LAM. To be on safe > side, it is better to keep LAM disabled until LASS arrives. Dave has applied it to tip/urgent. I expect it to be merged soon. -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 12:30 ` Kirill A. Shutemov 2024-10-14 15:39 ` Andrew Cooper @ 2024-10-14 16:55 ` Linus Torvalds 2024-10-16 16:10 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-14 16:55 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, 14 Oct 2024 at 05:30, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do > this unconditionally instead of masking: > > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S > index d066aecf8aeb..86d4511520b1 100644 > --- a/arch/x86/lib/getuser.S > +++ b/arch/x86/lib/getuser.S > @@ -37,9 +37,14 @@ > > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC > > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \ > + "shl $(64 - 48), %rdx", \ > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57 That's certainly a lot smaller than the big constant, which is good. But I would want some AMD architect to actually tell us *which* cores this affects, and just what the cycle latency of the canonicality check really is. For example, the whole point of the address masking is that the access itself isn't actually conditional, and we want (and expect) the exception in the invalid address case (which very much includes the non-canonicality). So either the attacker has to find an unrelated conditional that precedes the whole user access sequence (ie not just the CLAC afterwards, but the STAC and the address masking that precedes the access), *or* the attacker has to be able to deal with whatever signal noise that comes from taking the GP exception and signal handling and return to user space. The exception handling will almost certainly have destroyed any signal in L1 cache, so some flush+reload attack is going to be very very inconvenient. And we could easily make it much worse by just adding extra work to the GP exception handling. And that is all in *addition* to having the sequence of the non-canonical access followed by the "clac" and then you have to find a dependent load to leak the data you loaded through the non-canonical address. And all *that* needs to happen before the non-canonical address actually being checked. End result: I don't think this is actually a real scenario in the first place. I would want some AMD person to really check. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
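In C terms, the quoted SHIFT_LEFT_TO_MSB sequence (mov/shl/sar/or) amounts to replicating bit 47 (or bit 56 with LA57) across the whole pointer, so any address with that bit set collapses to all-ones before the access. A sketch, with the bit position passed as a parameter purely for illustration:

#include <stdint.h>

static inline uint64_t force_all_ones_by_bit(uint64_t ptr, unsigned int bit)
{
	/* shl moves 'bit' into the sign position, sar $63 replicates it */
	uint64_t mask = (uint64_t)((int64_t)(ptr << (63 - bit)) >> 63);

	return ptr | mask;	/* bit clear: unchanged; bit set: all-ones */
}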
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-14 16:55 ` Linus Torvalds @ 2024-10-16 16:10 ` Linus Torvalds 2024-10-16 22:02 ` Andrew Cooper 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-16 16:10 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Josh Poimboeuf, Andrew Cooper, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, 14 Oct 2024 at 09:55, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 14 Oct 2024 at 05:30, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do > > this unconditionally instead of masking: > > > > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S > > index d066aecf8aeb..86d4511520b1 100644 > > --- a/arch/x86/lib/getuser.S > > +++ b/arch/x86/lib/getuser.S > > @@ -37,9 +37,14 @@ > > > > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC > > > > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \ > > + "shl $(64 - 48), %rdx", \ > > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57 > > That's certainly a lot smaller than the big constant, which is good. Actually, having thought this over, I think we can simplify things more, and not do any shifts AT ALL. IOW, the whole "mask non-canonical bits" model I think works on its own even without the "shift sign bit and or" logic. So it can replace all the shifting entirely (much less the double shifting), and we can have something like this: --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -37,11 +37,14 @@ #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC +#define X86_CANONICAL_MASK ALTERNATIVE \ + "movq $0x80007fffffffffff,%rdx", \ + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57 + .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) - mov %rax, %rdx - sar $63, %rdx - or %rdx, %rax + X86_CANONICAL_MASK /* mask into %rdx */ + and %rdx,%rax .else cmp $TASK_SIZE_MAX-\size+1, %eax jae .Lbad_get_user Yes, it does end up having that big constant (not great), but makes up for it by avoiding the register move instruction (3 bytes) and two shift instructions (4 bytes each), so while the "load big constant" instruction is a big 10-byte instruction, it's actually one byte shorter than the old code was. It also avoids a couple of data dependencies, so it's both smaller and simpler. An invalid address with the high bit set will become non-canonical, so it will cause an exception as expected. But it avoids the AMD issue, because while it's non-canonical, it always clears that bit 48 (or 57), so it will never hit a kernel TLB/cache tag. Of course, if some other core then does the sane thing, and uses bit #63 (instead of #48/57), and still has the pseudo-meltdown behavior, then the non-canonical address generated by the plain 'and' will end up having the same old issue. But in (related) news, the address masking trick really does matter: https://lore.kernel.org/all/202410161557.5b87225e-oliver.sang@intel.com/ which reports that ridiculous 6.8% improvement just from avoiding the barrier in user_access_begin(). So that barrier really *is* very expensive. Surprisingly so. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-16 16:10 ` Linus Torvalds @ 2024-10-16 22:02 ` Andrew Cooper 2024-10-16 22:13 ` Kirill A. Shutemov 2024-10-16 22:32 ` Linus Torvalds 0 siblings, 2 replies; 51+ messages in thread From: Andrew Cooper @ 2024-10-16 22:02 UTC (permalink / raw) To: Linus Torvalds, Kirill A. Shutemov Cc: Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On 16/10/2024 5:10 pm, Linus Torvalds wrote: > --- a/arch/x86/lib/getuser.S > +++ b/arch/x86/lib/getuser.S > @@ -37,11 +37,14 @@ > > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC > > +#define X86_CANONICAL_MASK ALTERNATIVE \ > + "movq $0x80007fffffffffff,%rdx", \ > + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57 > + > .macro check_range size:req > .if IS_ENABLED(CONFIG_X86_64) > - mov %rax, %rdx > - sar $63, %rdx > - or %rdx, %rax > + X86_CANONICAL_MASK /* mask into %rdx */ > + and %rdx,%rax That doesn't have the same semantics, does it? Consider userspace passing an otherwise-good pointer with bit 60 set. Previously that would have resulted in a failure, whereas now it will succeed. If AMD think it's appropriate, then what you probably want is the real branch as per before (to maintain architectural user behaviour), and then use a trick such as this one in place of the LFENCE for speed in the common case. > But in (related) news, the address masking trick really does matter: > > https://lore.kernel.org/all/202410161557.5b87225e-oliver.sang@intel.com/ > > which reports that ridiculous 6.8% improvement just from avoiding the > barrier in user_access_begin(). > > So that barrier really *is* very expensive. Surprisingly so. 7% performance is what it costs to maintain the security barrier we were sold originally. Then it's a lot of time, arguing in public, and wild guesswork over explicitly undocumented properties that the vendors would prefer not to discuss, to try and claw some of that 7% back while keeping the security barrier intact for the benefit of users who want it both secure and fast. Forgive me if I think that we (the SW people) are getting the raw end of the deal here, while vendors keep selling more and more expensive chips that don't work safely. ~Andrew ^ permalink raw reply [flat|nested] 51+ messages in thread
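Andrew's example, worked through with concrete numbers (a sketch; the user address below is made up). The existing sign-extend-and-OR leaves the bogus pointer untouched, so the access faults and get_user() returns -EFAULT; the proposed AND mask quietly strips bit 60 and the access goes through to a different, valid-looking user address:

#include <stdio.h>
#include <stdint.h>

static uint64_t sign_or_mask(uint64_t ptr)		/* current sign-bit trick */
{
	return ptr | (uint64_t)((int64_t)ptr >> 63);
}

static uint64_t and_canonical_mask(uint64_t ptr)	/* proposed AND mask, LA48 */
{
	return ptr & 0x80007fffffffffffULL;
}

int main(void)
{
	/* an "otherwise-good" user pointer with bit 60 set */
	uint64_t ptr = 0x00007f0000001000ULL | (1ULL << 60);

	printf("input:    %016llx\n", (unsigned long long)ptr);
	printf("sign/or:  %016llx  (still non-canonical -> faults, -EFAULT)\n",
	       (unsigned long long)sign_or_mask(ptr));
	printf("and mask: %016llx  (bit 60 silently dropped -> access succeeds)\n",
	       (unsigned long long)and_canonical_mask(ptr));
	return 0;
}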
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-16 22:02 ` Andrew Cooper @ 2024-10-16 22:13 ` Kirill A. Shutemov 2024-10-16 22:34 ` Linus Torvalds 2024-10-16 22:32 ` Linus Torvalds 1 sibling, 1 reply; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-16 22:13 UTC (permalink / raw) To: Andrew Cooper Cc: Linus Torvalds, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, Oct 16, 2024 at 11:02:56PM +0100, Andrew Cooper wrote: > On 16/10/2024 5:10 pm, Linus Torvalds wrote: > > --- a/arch/x86/lib/getuser.S > > +++ b/arch/x86/lib/getuser.S > > @@ -37,11 +37,14 @@ > > > > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC > > > > +#define X86_CANONICAL_MASK ALTERNATIVE \ > > + "movq $0x80007fffffffffff,%rdx", \ > > + "movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57 > > + > > .macro check_range size:req > > .if IS_ENABLED(CONFIG_X86_64) > > - mov %rax, %rdx > > - sar $63, %rdx > > - or %rdx, %rax > > + X86_CANONICAL_MASK /* mask into %rdx */ > > + and %rdx,%rax > > That doesn't have the same semantics, does it? > > Consider userspace passing an otherwise-good pointer with bit 60 set. > Previously that would have resulted in a failure, whereas now it will > succeed. It is worse than that. If we get LAM_SUP enabled (there's KASAN patchset in works) this check will allow arbitrary kernel addresses. -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-16 22:13 ` Kirill A. Shutemov @ 2024-10-16 22:34 ` Linus Torvalds 2024-10-28 11:29 ` Kirill A. Shutemov 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-16 22:34 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, 16 Oct 2024 at 15:13, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > It is worse than that. If we get LAM_SUP enabled (there's KASAN patchset > in works) this check will allow arbitrary kernel addresses. Ugh. I haven't seen the LAM_SUP patches. But yeah, I assume any LAM_SUP model would basically then make the GP fault due to non-canonical addresses go away. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-16 22:34 ` Linus Torvalds @ 2024-10-28 11:29 ` Kirill A. Shutemov 2024-10-28 18:44 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-28 11:29 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, Oct 16, 2024 at 03:34:11PM -0700, Linus Torvalds wrote: > On Wed, 16 Oct 2024 at 15:13, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > It is worse than that. If we get LAM_SUP enabled (there's KASAN patchset > > in works) this check will allow arbitrary kernel addresses. > > Ugh. I haven't seen the LAM_SUP patches. > > But yeah, I assume any LAM_SUP model would basically then make the GP > fault due to non-canonical addresses go away. Actually, I've got this wrong. I based my comment about incompatibility with LAM_SUP on the approach description that stated zeroing bits 48/57, but actual masks zeroing bits 47/56 which is compatible with all LAM modes. It will trigger #GP for LAM_SUP too for kernel addresses: bit 63 has to be equal to bit 47/56. I think it is worth looking at this approach again as it gets better code: removes two instructions from get_user() and doesn't get flags involved. Below is patch on top of current mainline. The mask only clears bit 47/56. It gets us stronger canonicality check on machines with LAM off. I am not entirely sure about my take on valid_user_address() here. Any comments? diff --git a/arch/x86/include/asm/runtime-const.h b/arch/x86/include/asm/runtime-const.h index 6652ebddfd02..8d983cfd06ea 100644 --- a/arch/x86/include/asm/runtime-const.h +++ b/arch/x86/include/asm/runtime-const.h @@ -2,6 +2,18 @@ #ifndef _ASM_RUNTIME_CONST_H #define _ASM_RUNTIME_CONST_H +#ifdef __ASSEMBLY__ + +.macro RUNTIME_CONST_PTR sym reg + movq $0x0123456789abcdef, %\reg + 1: + .pushsection runtime_ptr_\sym, "a" + .long 1b - 8 - . 
+ .popsection +.endm + +#else /* __ASSEMBLY__ */ + #define runtime_const_ptr(sym) ({ \ typeof(sym) __ret; \ asm_inline("mov %1,%0\n1:\n" \ @@ -58,4 +70,5 @@ static inline void runtime_const_fixup(void (*fn)(void *, unsigned long), } } +#endif /* __ASSEMBLY__ */ #endif diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index b0a887209400..77acef272b0d 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -16,9 +16,9 @@ /* * Virtual variable: there's no actual backing store for this, - * it can purely be used as 'runtime_const_ptr(USER_PTR_MAX)' + * it can purely be used as 'runtime_const_ptr(USER_CANONICAL_MASK)' */ -extern unsigned long USER_PTR_MAX; +extern unsigned long USER_CANONICAL_MASK; #ifdef CONFIG_ADDRESS_MASKING /* @@ -54,7 +54,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, #endif #define valid_user_address(x) \ - ((__force unsigned long)(x) <= runtime_const_ptr(USER_PTR_MAX)) + (!((__force unsigned long)(x) & ~runtime_const_ptr(USER_CANONICAL_MASK))) /* * Masking the user address is an alternative to a conditional @@ -63,13 +63,8 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm, */ static inline void __user *mask_user_address(const void __user *ptr) { - unsigned long mask; - asm("cmp %1,%0\n\t" - "sbb %0,%0" - :"=r" (mask) - :"r" (ptr), - "0" (runtime_const_ptr(USER_PTR_MAX))); - return (__force void __user *)(mask | (__force unsigned long)ptr); + unsigned long mask = runtime_const_ptr(USER_CANONICAL_MASK); + return (__force void __user *)(mask & (__force unsigned long)ptr); } #define masked_user_access_begin(x) ({ \ __auto_type __masked_ptr = (x); \ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index a5f221ea5688..9d25f4f146f4 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2390,14 +2390,9 @@ void __init arch_cpu_finalize_init(void) alternative_instructions(); if (IS_ENABLED(CONFIG_X86_64)) { - unsigned long USER_PTR_MAX = TASK_SIZE_MAX-1; + unsigned long USER_CANONICAL_MASK = ~(1UL << __VIRTUAL_MASK_SHIFT); - /* - * Enable this when LAM is gated on LASS support - if (cpu_feature_enabled(X86_FEATURE_LAM)) - USER_PTR_MAX = (1ul << 63) - PAGE_SIZE - 1; - */ - runtime_const_init(ptr, USER_PTR_MAX); + runtime_const_init(ptr, USER_CANONICAL_MASK); /* * Make sure the first 2MB area is not mapped by huge pages diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index b8c5741d2fb4..65d2f4cad0aa 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -358,7 +358,7 @@ SECTIONS #endif RUNTIME_CONST_VARIABLES - RUNTIME_CONST(ptr, USER_PTR_MAX) + RUNTIME_CONST(ptr, USER_CANONICAL_MASK) . = ALIGN(PAGE_SIZE); diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S index 4357ec2a0bfc..b93f0282c6c9 100644 --- a/arch/x86/lib/getuser.S +++ b/arch/x86/lib/getuser.S @@ -32,6 +32,7 @@ #include <asm/errno.h> #include <asm/asm-offsets.h> #include <asm/thread_info.h> +#include <asm/runtime-const.h> #include <asm/asm.h> #include <asm/smap.h> @@ -39,14 +40,8 @@ .macro check_range size:req .if IS_ENABLED(CONFIG_X86_64) - movq $0x0123456789abcdef,%rdx - 1: - .pushsection runtime_ptr_USER_PTR_MAX,"a" - .long 1b - 8 - . - .popsection - cmp %rax, %rdx - sbb %rdx, %rdx - or %rdx, %rax + RUNTIME_CONST_PTR sym=USER_CANONICAL_MASK reg=rdx + and %rdx, %rax .else cmp $TASK_SIZE_MAX-\size+1, %eax jae .Lbad_get_user -- Kiryl Shutsemau / Kirill A. 
Shutemov ^ permalink raw reply related [flat|nested] 51+ messages in thread
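To make the effect of the proposed mask concrete, here is a minimal user-space sketch, assuming 4-level paging so that __VIRTUAL_MASK_SHIFT is 47 and the mask clears only bit 47 (the standalone helper and the constants are illustrative, not kernel code). A user pointer passes through unchanged, while a kernel pointer becomes non-canonical and faults when dereferenced:

#include <stdio.h>

/* Illustrative model of the proposed mask for 4-level paging:
 * __VIRTUAL_MASK_SHIFT is 47, so the mask clears bit 47 only.
 */
#define VIRTUAL_MASK_SHIFT      47
#define USER_CANONICAL_MASK     (~(1UL << VIRTUAL_MASK_SHIFT))

static unsigned long mask_user_address(unsigned long ptr)
{
        return ptr & USER_CANONICAL_MASK;
}

int main(void)
{
        unsigned long user   = 0x00007f1234560000UL;  /* bit 47 clear: unchanged */
        unsigned long kernel = 0xffff888000000000UL;  /* bit 47 set: gets cleared */

        printf("%016lx -> %016lx\n", user,   mask_user_address(user));
        printf("%016lx -> %016lx\n", kernel, mask_user_address(kernel));
        /*
         * The kernel pointer becomes 0xffff088000000000: bits 63..48 are
         * still set but bit 47 is clear, so the address is non-canonical
         * and any dereference takes #GP, also under LAM_SUP, which still
         * requires bit 47 to match bit 63.
         */
        return 0;
}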
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-28 11:29 ` Kirill A. Shutemov @ 2024-10-28 18:44 ` Linus Torvalds 2024-10-28 20:31 ` Kirill A. Shutemov 0 siblings, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-28 18:44 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, 28 Oct 2024 at 01:29, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > I think it is worth looking at this approach again as it generates better > code: it removes two instructions from get_user() and doesn't involve the > flags. The problem with the 'and' is that it's just re-introducing the same old bug. > The mask only clears bit 47/56. It gets us a stronger canonicality check on > machines with LAM off. But it means that you can just pass in non-canonical addresses, and cause the same old AMD data leak. And yes, it so happens that the AMD microarchitectures use bit 47/56 for "user address", and now it's only leaking user data, but that's a random microarchitectural choice - and a really horribly bad one at that. IOW, the WHOLE BUG was because some AMD engineer decided to use bits [0..48] for indexing, when they SHOULD have used [0..47,63], since bits 63 and 48 *should* be the same, and the latter is the one that is a hell of a lot more convenient and obvious, and would have made ALL of this completely irrelevant, and we'd have just used the sign bit and everything would be fine. But instead of picking the sane bit, they picked a random bit. And now I think your patch has two bugs in it: - it depends on that random undocumented microarchitectural mistake - it uses __VIRTUAL_MASK_SHIFT which probably doesn't even match what hardware does The *hardware* presumably uses bit 47/56 because that's the actual hardware width of the TLB / cache matching. But __VIRTUAL_MASK_SHIFT is hardcoded to 47 if 5-level page tables are disabled. So no. We're not replacing one simple microarchitectural assumption (sign bit is sufficient for safety) with another more complicated one (bit X is sufficient for safety). Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
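For readers following along, a rough user-space model of the two maskings being compared (the USER_PTR_MAX value is illustrative and neither helper is the kernel's implementation): the cmp/sbb clamp currently in mainline collapses every address above USER_PTR_MAX to all-ones, whereas the proposed AND only clears bit 47, so the remaining attacker-chosen bits of a non-canonical address survive into any speculative access, which is the dependence on microarchitectural behaviour objected to above:

#include <stdio.h>

/* Rough model only; USER_PTR_MAX is illustrative (4-level paging, LAM off). */
#define USER_PTR_MAX            0x00007ffffffff000UL
#define USER_CANONICAL_MASK     (~(1UL << 47))

/* cmp/sbb clamp in current mainline: anything above USER_PTR_MAX
 * collapses to all-ones, a single address at the very top of the
 * address space. */
static unsigned long mask_clamp(unsigned long ptr)
{
        return ptr | ((ptr > USER_PTR_MAX) ? ~0UL : 0UL);
}

/* proposed AND: only bit 47 is cleared, every other attacker-chosen
 * bit survives into the speculative access */
static unsigned long mask_and(unsigned long ptr)
{
        return ptr & USER_CANONICAL_MASK;
}

int main(void)
{
        /* non-canonical input, low 47 bits chosen by the attacker */
        unsigned long evil = 0x7fff7f1234561000UL;

        printf("clamp: %016lx\n", mask_clamp(evil));  /* ffffffffffffffff */
        printf("and:   %016lx\n", mask_and(evil));    /* 7fff7f1234561000 */
        return 0;
}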
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-28 18:44 ` Linus Torvalds @ 2024-10-28 20:31 ` Kirill A. Shutemov 2024-10-28 20:49 ` Linus Torvalds 0 siblings, 1 reply; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-28 20:31 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, Oct 28, 2024 at 08:44:10AM -1000, Linus Torvalds wrote: > The *hardware* presumably uses bit 47/56 because that's the > actual hardware width of the TLB / cache matching. It is a guess, right? The bug makes more sense to me if the bit number depends on the active paging mode: this bit is naturally used for indexing in the top-level page table. The mistake is more understandable this way. > So no. We're not replacing one simple microarchitectural assumption > (sign bit is sufficient for safety) with another more complicated one > (bit X is sufficient for safety). Okay, fair enough. -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 51+ messages in thread
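For reference, a small sketch of why bit 47/56 is the natural paging-mode-dependent bit: it is the most significant bit of the index into the top-level page table (the helpers are illustrative, not kernel API):

#include <stdio.h>

static unsigned int top_level_index_4l(unsigned long addr)
{
        return (addr >> 39) & 0x1ff;    /* PGD index: bits 47:39 */
}

static unsigned int top_level_index_5l(unsigned long addr)
{
        return (addr >> 48) & 0x1ff;    /* PGD index: bits 56:48 */
}

int main(void)
{
        unsigned long kaddr = 0xffff888000000000UL;

        /* bit 47 (or 56) is the top bit of the index printed here */
        printf("4-level top-level index: %03x\n", top_level_index_4l(kaddr));
        printf("5-level top-level index: %03x\n", top_level_index_5l(kaddr));
        return 0;
}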
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-28 20:31 ` Kirill A. Shutemov @ 2024-10-28 20:49 ` Linus Torvalds 0 siblings, 0 replies; 51+ messages in thread From: Linus Torvalds @ 2024-10-28 20:49 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Mon, 28 Oct 2024 at 10:31, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Mon, Oct 28, 2024 at 08:44:10AM -1000, Linus Torvalds wrote: > > The *hardware* presumably uses bit 47/56 because that's the > > actual hardware width of the TLB / cache matching. > > It is a guess, right? Yes, I have no idea what they actually do. > The bug makes more sense to me if the bit number depends on the active > paging mode: this bit is naturally used for indexing in the top-level page table. > The mistake is more understandable this way. If I were a hardware designer - which I'm obviously not - I would *not* make the TLB lookup be based on the paging mode at all. Only at actual TLB fill time would I care about things like 4-vs-5 level page tables (and the canonicality check would then obviously cause the TLB fill to not happen, so you would never have a TLB entry that could clash with the 4-vs-5 level page table model). So I would assume that hardware that supports 5-level paging would just unconditionally use bits 0-56, and older hardware would just use bits 0-47. Why put extra pointless logic in what is arguably the most critical path of the most critical piece on the CPU? But I have zero inside knowledge, so yes, it's all a guess. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-16 22:02 ` Andrew Cooper 2024-10-16 22:13 ` Kirill A. Shutemov @ 2024-10-16 22:32 ` Linus Torvalds 2024-10-17 11:00 ` Borislav Petkov 1 sibling, 1 reply; 51+ messages in thread From: Linus Torvalds @ 2024-10-16 22:32 UTC (permalink / raw) To: Andrew Cooper Cc: Kirill A. Shutemov, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, 16 Oct 2024 at 15:03, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > That doesn't have the same semantics, does it? Correct. It just basically makes all positive addresses be force-canonicalized. > If AMD think it's appropriate, then what you probably want is the real > branch as per before (to maintain architectural user behaviour), and > then use a trick such as this one in place of the LFENCE for speed in > the common case. The problem with the branch is that it really can only branch on the sign bit - because of LAM. So with LAM, those bits are pretty much ignored anyway. > > So that barrier really *is* very expensive. Surprisingly so. > > 7% performance is what it costs to maintain the security barrier we were > sold originally. Absolutely. And the masking was something that basically says "we get it all back" (with "all" being just this part, of course - never mind all the other workarounds). > Forgive me if I think that we (the SW people) are getting the raw end of > the deal here, while vendors keep selling more and more expensive chips > that don't work safely. I'm 100% with you. My preference would actually be to do nothing, on the assumption that the AMD issue is actually impossible to trigger (due to CLAC/STAC serializing memory address checks - which the timings certainly imply they do). But if we have to do magic bit masking, I'd rather it be *fast* magic bit masking. Linus ^ permalink raw reply [flat|nested] 51+ messages in thread
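As a rough illustration of why the branch can only test the sign bit once LAM is in the picture (the TASK_SIZE_MAX value here is illustrative): a tagged user pointer compares above any fixed limit even though the hardware strips the tag on access, so only bit 63 is a reliable discriminator:

#include <stdio.h>

/* Illustrative limit; the real TASK_SIZE_MAX is not a compile-time
 * constant, which is part of the point. */
#define TASK_SIZE_MAX   0x00007ffffffff000UL

int main(void)
{
        /* plausible LAM_U57 pointer: metadata in bits 62:57, bit 63 clear */
        unsigned long tagged = 0x3e00007f12345678UL;

        printf("limit check passes: %d\n", tagged <= TASK_SIZE_MAX);  /* 0: falsely rejected */
        printf("sign check passes:  %d\n", (long)tagged >= 0);        /* 1: accepted */
        return 0;
}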
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-16 22:32 ` Linus Torvalds @ 2024-10-17 11:00 ` Borislav Petkov 0 siblings, 0 replies; 51+ messages in thread From: Borislav Petkov @ 2024-10-17 11:00 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, Kirill A. Shutemov, Josh Poimboeuf, x86, Thomas Gleixner, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Wed, Oct 16, 2024 at 03:32:32PM -0700, Linus Torvalds wrote: > My preference would actually be to do nothing, on the assumption that > the AMD issue is actually impossible to trigger (due to CLAC/STAC > serializing memory address checks - which the timings certainly imply > they do). I'm verifying that assumption... -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() 2024-10-13 0:53 ` Linus Torvalds 2024-10-13 1:21 ` Linus Torvalds @ 2024-10-14 11:56 ` Kirill A. Shutemov 1 sibling, 0 replies; 51+ messages in thread From: Kirill A. Shutemov @ 2024-10-14 11:56 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Cooper, Josh Poimboeuf, x86, Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Pawan Gupta, Waiman Long, Dave Hansen, Ingo Molnar, Michael Ellerman, linuxppc-dev, Mark Rutland On Sat, Oct 12, 2024 at 05:53:19PM -0700, Linus Torvalds wrote: > On Sat, 12 Oct 2024 at 10:44, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Anyway, what's the speculation window size like? > > Note that this is important basically because we do *NOT* want to > check the address against TASK_SIZE_MAX like we used to, because not > only is TASK_SIZE_MAX not a compile-time constant, but with linear > address masking, people actually *want* to use addresses that are in > the non-canonical range. > > IOW, see also > > arch/x86/include/asm/uaccess_64.h > > and notice how the x86-64 __access_ok() check *also_ does the whole > "top bit set" thing (iow, see __access_ok()). > > IOW, this actually goes even further back than the commit I mentioned > earlier - it goes back to commit 6014bc27561f ("x86-64: make > access_ok() independent of LAM") because without the sign bit trick, > LAM is a complete disaster. > > So no, the address masking can not depend on things like > __VIRTUAL_MASK_SHIFT, it would need to at least take LAM into account > too. Not that I know if there are any CPU's out there that actually > have LAM enabled. Actually LAM is fine with the __VIRTUAL_MASK_SHIFT check. LAM requires bit 47 (or 56 for 5-level paging) to be equal to bit 63. Otherwise it is a canonicality violation. -- Kiryl Shutsemau / Kirill A. Shutemov ^ permalink raw reply [flat|nested] 51+ messages in thread
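A minimal sketch of the canonicality rule described above, assuming the bit numbers given here (47 for 4-level paging, 56 for 5-level); the helper is illustrative, not kernel code:

#include <stdbool.h>
#include <stdio.h>

/* bit 47 for 4-level paging, bit 56 for 5-level */
static bool lam_canonical(unsigned long addr, int sig_bit)
{
        return ((addr >> sig_bit) & 1) == (addr >> 63);
}

int main(void)
{
        printf("%d\n", lam_canonical(0x00007f0000000000UL, 47));  /* 1: user pointer */
        printf("%d\n", lam_canonical(0xffff888000000000UL, 47));  /* 1: kernel pointer */
        printf("%d\n", lam_canonical(0xffff088000000000UL, 47));  /* 0: bit 47 cleared, #GP */
        return 0;
}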
end of thread, other threads:[~2024-10-28 20:50 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-10-12 4:09 [PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user() Josh Poimboeuf 2024-10-12 8:48 ` Andrew Cooper 2024-10-12 14:09 ` Josh Poimboeuf 2024-10-12 14:21 ` Borislav Petkov 2024-10-12 15:58 ` Linus Torvalds 2024-10-12 14:26 ` Andrew Cooper 2024-10-12 15:44 ` Linus Torvalds 2024-10-12 17:23 ` Andrew Cooper 2024-10-12 17:44 ` Linus Torvalds 2024-10-13 0:53 ` Linus Torvalds 2024-10-13 1:21 ` Linus Torvalds 2024-10-14 3:54 ` Josh Poimboeuf 2024-10-14 6:50 ` Linus Torvalds 2024-10-14 9:59 ` David Laight 2024-10-14 11:20 ` Linus Torvalds 2024-10-14 14:40 ` David Laight 2024-10-14 11:12 ` Andrew Cooper 2024-10-14 12:30 ` Kirill A. Shutemov 2024-10-14 15:39 ` Andrew Cooper 2024-10-15 10:00 ` Borislav Petkov 2024-10-20 22:44 ` Josh Poimboeuf 2024-10-20 22:59 ` Linus Torvalds 2024-10-20 23:11 ` Josh Poimboeuf 2024-10-20 23:14 ` Josh Poimboeuf 2024-10-20 23:35 ` Linus Torvalds 2024-10-23 9:44 ` Borislav Petkov 2024-10-23 19:17 ` Linus Torvalds 2024-10-23 20:07 ` Linus Torvalds 2024-10-23 23:32 ` Linus Torvalds 2024-10-24 2:00 ` Linus Torvalds 2024-10-24 9:21 ` David Laight 2024-10-24 16:53 ` Linus Torvalds 2024-10-25 8:56 ` David Laight 2024-10-25 16:35 ` Linus Torvalds 2024-10-21 10:48 ` Kirill A. Shutemov 2024-10-22 2:36 ` Linus Torvalds 2024-10-22 10:33 ` Kirill A. Shutemov 2024-10-22 8:16 ` Pawan Gupta 2024-10-22 10:44 ` Kirill A. Shutemov 2024-10-14 16:55 ` Linus Torvalds 2024-10-16 16:10 ` Linus Torvalds 2024-10-16 22:02 ` Andrew Cooper 2024-10-16 22:13 ` Kirill A. Shutemov 2024-10-16 22:34 ` Linus Torvalds 2024-10-28 11:29 ` Kirill A. Shutemov 2024-10-28 18:44 ` Linus Torvalds 2024-10-28 20:31 ` Kirill A. Shutemov 2024-10-28 20:49 ` Linus Torvalds 2024-10-16 22:32 ` Linus Torvalds 2024-10-17 11:00 ` Borislav Petkov 2024-10-14 11:56 ` Kirill A. Shutemov