* Re: [PATCH 0/4] riscv: uaccess: optimizations
  [not found] <20240625040500.1788-1-jszhang@kernel.org>
@ 2024-06-25  7:21 ` Arnd Bergmann
  2024-06-25 18:12   ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread

From: Arnd Bergmann @ 2024-06-25  7:21 UTC (permalink / raw)
To: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, Linus Torvalds
Cc: linux-riscv, linux-kernel

On Tue, Jun 25, 2024, at 06:04, Jisheng Zhang wrote:
> This series tries to optimize riscv uaccess in the following way:
>
> patch1 implements user_access_begin() and friends, i.e. the unsafe
> accessors. Currently, when a function like strncpy_from_user() is
> called, the user-space access protection is disabled and re-enabled
> for every word read. After patch1, the protection is disabled at the
> beginning of the copy and enabled at the end.
>
> patch2 is a simple cleanup
>
> patch3 uses 'asm goto' for put_user()
> patch4 uses 'asm goto output' for get_user()
>
> Both patch3 and patch4 are based on the fact that 'asm goto' and
> 'asm goto output' generate noticeably better code: we don't need to
> test for the error etc., the exception just jumps to the error
> handling directly.

Nice! I hope that we can one day make this more straightforward and
share more of the implementation across architectures, in particular
the bits that are just wrappers around the low-level asm.

We have something like this already on powerpc and x86, and Linus just
did the version for arm64, which I assume you are using as a template
for this. Four architectures is probably the point at which we should
try to consolidate the code again and move as much as we can into
asm-generic.
I think the only bits we really need from an architecture here are:

- __enable_user_access()/__disable_user_access() in
  the label version

- __raw_get_mem_{1,2,4,8}() and __raw_put_mem_{1,2,4,8}()

and then we can build all the normal interface functions on those in a
way that assumes we have asm goto available, with the appropriate
fallback in __raw_get_mem_*() as long as we need to support gcc-10 and
older.

     Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations
  2024-06-25  7:21 ` [PATCH 0/4] riscv: uaccess: optimizations Arnd Bergmann
@ 2024-06-25 18:12   ` Linus Torvalds
  2024-06-26 13:04     ` Jisheng Zhang
  2024-06-30 16:59     ` Linus Torvalds
  0 siblings, 2 replies; 10+ messages in thread

From: Linus Torvalds @ 2024-06-25 18:12 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel

On Tue, 25 Jun 2024 at 00:22, Arnd Bergmann <arnd@arndb.de> wrote:
>
> I think the only bits we really need from an architecture
> here are:
>
> - __enable_user_access()/__disable_user_access() in
>   the label version

KCSAN wants user_access_save/restore() too, but yes.

> - __raw_get_mem_{1,2,4,8}() and __raw_put_mem_{1,2,4,8}()

You still need to split it into user/kernel.

Yes, *often* there is just one address space and they can be one "mem"
accessor thing, but we do have architectures with actually separate
user and kernel address spaces.

But yes, it would be lovely if we did things as "implement the
low-level accessor functions and we'll wrap them for the generic case"
rather than have every architecture have to do the wrapping..

The wrapping isn't just the label-based "unsafe" accesses and the
traditional error-number-return case, it's also the size-based case
statements ("poor man's generics").

The problem, of course, is that the majority of architectures are
based on the legacy interfaces, so it's a lot of annoying low-level
small details. I think there's a reason why nobody did the arm64
"unsafe" ones - the patch didn't end up being that bad, but it's just
fairly grotty code.

But apparently the arm64 patch was simple enough that at least RISC-V
people went "that doesn't look so bad".

Maybe other architectures will also be fairly straightforward when
there's a fairly clear example of how it should be done.

             Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-06-25 18:12 ` Linus Torvalds @ 2024-06-26 13:04 ` Jisheng Zhang 2024-06-30 16:59 ` Linus Torvalds 1 sibling, 0 replies; 10+ messages in thread From: Jisheng Zhang @ 2024-06-26 13:04 UTC (permalink / raw) To: Linus Torvalds Cc: Arnd Bergmann, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel On Tue, Jun 25, 2024 at 11:12:16AM -0700, Linus Torvalds wrote: > On Tue, 25 Jun 2024 at 00:22, Arnd Bergmann <arnd@arndb.de> wrote: > > > > I think the only bets we really need from an architecture > > here are: > > > > - __enable_user_access()/__disable_user_access() in > > the label version > > KCSAN wants user_access_save/restore() too, but yes. > > > - __raw_get_mem_{1,2,4,8}() and __raw_put_mem_{1,2,4,8}() > > You still need to split it into user/kernel. > > Yes, *often* there is just one address space and they can be one "mem" > accessor thing, but we do have architectures with actually separate > user and kernel address spaces. > > But yes, it would be lovely if we did things as "implement the > low-level accessor functions and we'll wrap them for the generic case" > rather than have every architecture have to do the wrapping.. > > The wrapping isn't just the label-based "unsafe" accesses and the > traditional error number return case, it's also the size-based case > statements ("poor man's generics"). > > The problem, of course, is that the majority of architectures are > based on the legacy interfaces, so it's a lot of annoying low-level > small details. I think there's a reason why nobody did the arm64 > "unsafe" ones - the patch didn't end up being that bad, but it's just > fairly grotty code. > > But apparently the arm64 patch was simple enough that at least RISC-V > people went "that doesn't look so bad". > > Maybe other architectures will also be fairly straightforward when > there's a fairly clear example of how it should be done. 
>> We have something like this already on powerpc and x86,
>> and Linus just did the version for arm64, which I assume
>> you are using as a template for this. Four architectures

Yep, this series is inspired by Linus's patch series, and to be
honest, some code is borrowed from his arm64 version ;) When I saw
that he had improved arm64, I thought it was a nice improvement we
would want for riscv too, so I tried to do the same for riscv.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations
  2024-06-25 18:12   ` Linus Torvalds
  2024-06-26 13:04     ` Jisheng Zhang
@ 2024-06-30 16:59     ` Linus Torvalds
  2024-07-05 11:25       ` Will Deacon
  1 sibling, 1 reply; 10+ messages in thread

From: Linus Torvalds @ 2024-06-30 16:59 UTC (permalink / raw)
To: Arnd Bergmann, Will Deacon, Catalin Marinas
Cc: Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3794 bytes --]

On Tue, 25 Jun 2024 at 11:12, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But yes, it would be lovely if we did things as "implement the
> low-level accessor functions and we'll wrap them for the generic case"
> rather than have every architecture have to do the wrapping..

Btw, to do that _well_, we need to expand on the user access functions
a bit more. In particular, we can already implement "get_user()" on
top of user_access_begin() and friends something like this:

  #define get_user(x,ptr) ({					\
	__label__ Efault_read;					\
	long __err;						\
	__typeof__(ptr) __ptr = (ptr);				\
	if (likely(user_access_begin(__ptr, sizeof(x)))) {	\
		unsafe_get_user(x, __ptr, Efault_read);		\
		user_access_end();				\
		__err = 0;					\
	} else {						\
		if (0) {					\
	Efault_read: user_access_end();				\
		}						\
		x = (__typeof__(x))0;				\
		__err = -EFAULT;				\
	}							\
	__err; })

and it doesn't generate _horrible_ code. It looks pretty bad, but all
the error handling goes out-of-line, so on x86-64 (without debug
options) it generates code something like this:

        test   %rdi,%rdi
        js     <cap_validate_magic+0x98>
        stac
        lfence
        mov    (%rdi),%ecx
        clac

which is certainly not horrid. But it's not great, because that lfence
ends up really screwing up the pipeline.
The manually coded out-of-line code generates this instead:

        mov    %rax,%rdx
        sar    $0x3f,%rdx
        or     %rdx,%rax
        stac
        movzbl (%rax),%edx
        xor    %eax,%eax
        clac
        ret

because it doesn't do a conditional branch (with the required spectre
thing), but instead does the address as a data dependency and knows
that "all bits set" if the address was negative will cause a page
fault.

But we *can* get the user accesses to do the same with a slight
expansion of user_access_begin():

        stac
        mov    %rdi,%rax
        sar    $0x3f,%rax
        or     %rdi,%rax
        mov    (%rax),%eax
        clac

by just introducing a notion of "masked_user_access". The get_user()
macro part would look like this:

	__typeof__(ptr) __ptr;				\
	__ptr = masked_user_access_begin(ptr);		\
	if (1) {					\
		unsafe_get_user(x, __ptr, Efault_read);	\
		user_access_end();			\

and the patch to implement this is attached. I've had it around for a
while, but I don't know how many architectures can do this.

Note this part of the commit message:

    This model only works for dense accesses that start at 'src' and on
    architectures that have a guard region that is guaranteed to fault in
    between the user space and the kernel space area.

which is true on x86-64, but that "guard region" thing might not be
true everywhere.

Will/Catalin - would that

        src = masked_user_access_begin(src);

work on arm64? The code does do something like that with
__uaccess_mask_ptr() already, but at least currently it doesn't do the
"avoid conditional entirely", the masking is just in _addition_ to the
access_ok().

             Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 5321 bytes --]

From 6b2c9a69efc21b9e6e0497a5661273f6fbe204b2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 8 Apr 2024 20:04:58 -0700
Subject: [PATCH] x86: support user address masking instead of non-speculative
 conditional

The Spectre-v1 mitigations made "access_ok()" much more expensive,
since it has to serialize execution with the test for a valid user
address.
All the normal user copy routines avoid this by just masking the user
address with a data-dependent mask instead, but the fast
"unsafe_user_read()" kind of patterns that were supposed to be a fast
case got slowed down.

This introduces a notion of using

    src = masked_user_access_begin(src);

to do the user address sanity check using a data-dependent mask
instead of the more traditional conditional

    if (user_read_access_begin(src, len)) {

model.

This model only works for dense accesses that start at 'src' and on
architectures that have a guard region that is guaranteed to fault in
between the user space and the kernel space area.

With this, the user access doesn't need to be manually checked,
because a bad address is guaranteed to fault (by some architecture
masking trick: on x86-64 this involves just turning an invalid user
address into all ones, since we don't map the top of address space).

This only converts a couple of examples for now. Example x86-64 code
generation for loading two words from user space:

        stac
        mov    %rax,%rcx
        sar    $0x3f,%rcx
        or     %rax,%rcx
        mov    (%rcx),%r13
        mov    0x8(%rcx),%r14
        clac

where all the error handling and -EFAULT is now purely handled out of
line by the exception path.

Of course, if the micro-architecture does badly at 'clac' and 'stac',
the above is still pitifully slow. But at least we did as well as we
could.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/include/asm/uaccess_64.h | 8 ++++++++
 fs/select.c                       | 4 +++-
 include/linux/uaccess.h           | 7 +++++++
 lib/strncpy_from_user.c           | 9 +++++++++
 lib/strnlen_user.c                | 9 +++++++++
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 04789f45ab2b..a10149a96d9e 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -53,6 +53,14 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
  */
 #define valid_user_address(x) ((__force long)(x) >= 0)
 
+/*
+ * Masking the user address is an alternative to a conditional
+ * user_access_begin that can avoid the fencing. This only works
+ * for dense accesses starting at the address.
+ */
+#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63)))
+#define masked_user_access_begin(x) ({ __uaccess_begin(); mask_user_address(x); })
+
 /*
  * User pointers can have tag bits on x86-64. This scheme tolerates
  * arbitrary values in those bits rather then masking them off.
diff --git a/fs/select.c b/fs/select.c
index 9515c3fa1a03..bc185d111436 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -780,7 +780,9 @@ static inline int get_sigset_argpack(struct sigset_argpack *to,
 {
 	// the path is hot enough for overhead of copy_from_user() to matter
 	if (from) {
-		if (!user_read_access_begin(from, sizeof(*from)))
+		if (can_do_masked_user_access())
+			from = masked_user_access_begin(from);
+		else if (!user_read_access_begin(from, sizeof(*from)))
 			return -EFAULT;
 		unsafe_get_user(to->p, &from->p, Efault);
 		unsafe_get_user(to->size, &from->size, Efault);
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 3064314f4832..f18371f6cf36 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -32,6 +32,13 @@
 })
 #endif
 
+#ifdef masked_user_access_begin
+ #define can_do_masked_user_access() 1
+#else
+ #define can_do_masked_user_access() 0
+ #define masked_user_access_begin(src) NULL
+#endif
+
 /*
  * Architectures should provide two primitives (raw_copy_{to,from}_user())
  * and get rid of their private instances of copy_{to,from}_user() and
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 6432b8c3e431..989a12a67872 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -120,6 +120,15 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
 	if (unlikely(count <= 0))
 		return 0;
 
+	if (can_do_masked_user_access()) {
+		long retval;
+
+		src = masked_user_access_begin(src);
+		retval = do_strncpy_from_user(dst, src, count, count);
+		user_read_access_end();
+		return retval;
+	}
+
 	max_addr = TASK_SIZE_MAX;
 	src_addr = (unsigned long)untagged_addr(src);
 	if (likely(src_addr < max_addr)) {
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index feeb935a2299..6e489f9e90f1 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -96,6 +96,15 @@ long strnlen_user(const char __user *str, long count)
 	if (unlikely(count <= 0))
 		return 0;
 
+	if (can_do_masked_user_access()) {
+		long retval;
+
+		str = masked_user_access_begin(str);
+		retval = do_strnlen_user(str, count, count);
+		user_read_access_end();
+		return retval;
+	}
+
 	max_addr = TASK_SIZE_MAX;
 	src_addr = (unsigned long)untagged_addr(str);
 	if (likely(src_addr < max_addr)) {

^ permalink raw reply related	[flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-06-30 16:59 ` Linus Torvalds @ 2024-07-05 11:25 ` Will Deacon 2024-07-05 17:58 ` Linus Torvalds 0 siblings, 1 reply; 10+ messages in thread From: Will Deacon @ 2024-07-05 11:25 UTC (permalink / raw) To: Linus Torvalds Cc: Arnd Bergmann, Catalin Marinas, Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel, mark.rutland On Sun, Jun 30, 2024 at 09:59:36AM -0700, Linus Torvalds wrote: > On Tue, 25 Jun 2024 at 11:12, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > But yes, it would be lovely if we did things as "implement the > > low-level accessor functions and we'll wrap them for the generic case" > > rather than have every architecture have to do the wrapping.. > > Btw, to do that _well_, we need to expand on the user access functions > a bit more. [...] > Will/Catalin - would that > > src = masked_user_access_begin(src); > > work on arm64? The code does do something like that with > __uaccess_mask_ptr() already, but at least currently it doesn't do the > "avoid conditional entirely", the masking is just in _addition_ to the > access_ok(). I think we'd need to go back to our old __uaccess_mask_ptr() implementation, where kernel addresses end up being forced to NULL. In other words, revert 2305b809be93 ("arm64: uaccess: simplify uaccess_mask_ptr()"). If we then want to drop the access_ok() entirely, we'd probably want to use an address that lives between the two TTBRs (i.e. in the "guard region" you mentioned above), just in case somebody has fscked around with /proc/sys/vm/mmap_min_addr. Will ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-07-05 11:25 ` Will Deacon @ 2024-07-05 17:58 ` Linus Torvalds 2024-07-08 13:52 ` Will Deacon 2024-07-08 15:21 ` Mark Rutland 0 siblings, 2 replies; 10+ messages in thread From: Linus Torvalds @ 2024-07-05 17:58 UTC (permalink / raw) To: Will Deacon Cc: Arnd Bergmann, Catalin Marinas, Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel, mark.rutland On Fri, 5 Jul 2024 at 04:25, Will Deacon <will@kernel.org> wrote: > > we'd probably want to use an address that lives between the two TTBRs > (i.e. in the "guard region" you mentioned above), just in case somebody > has fscked around with /proc/sys/vm/mmap_min_addr. Yes, I don't want to use a NULL pointer and rely on mmap_min_addr. For x86-64, we have two "guard regions" that can be used to generate an address that is guaranteed to fault: - the kernel always lives in the "top bit set" part of the address space (and any address tagging bits don't touch that part), and does not map the highest virtual address because that's used for error pointers, so the "all bits set" address always faults - the region between valid user addresses and kernel addresses is also always going to fault, and we don't have them adjacent to each other (unlike, for example, 32-bit i386, where the kernel address space is directly adjacent to the top of user addresses) So on x86-64, the simple solution is to just say "we know if the top bit is clear, it cannot ever touch kernel code, and if the top bit is set we have to make the address fault". So just duplicating the top bit (with an arithmetic shift) and or'ing it with the low bits, we get exactly what we want. But my knowledge of arm64 is weak enough that while I am reading assembly language and I know that instead of the top bit, it's bit55, I don't know what the actual rules for the translation table registers are. 
If the all-bits-set address is guaranteed to always trap, then arm64 could just use the same thing x86 does (just duplicating bit 55 instead of the sign bit)? Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-07-05 17:58 ` Linus Torvalds @ 2024-07-08 13:52 ` Will Deacon 2024-07-08 15:30 ` Mark Rutland 2024-07-08 15:21 ` Mark Rutland 1 sibling, 1 reply; 10+ messages in thread From: Will Deacon @ 2024-07-08 13:52 UTC (permalink / raw) To: Linus Torvalds Cc: Arnd Bergmann, Catalin Marinas, Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel, mark.rutland On Fri, Jul 05, 2024 at 10:58:29AM -0700, Linus Torvalds wrote: > On Fri, 5 Jul 2024 at 04:25, Will Deacon <will@kernel.org> wrote: > > > > we'd probably want to use an address that lives between the two TTBRs > > (i.e. in the "guard region" you mentioned above), just in case somebody > > has fscked around with /proc/sys/vm/mmap_min_addr. > > Yes, I don't want to use a NULL pointer and rely on mmap_min_addr. > > For x86-64, we have two "guard regions" that can be used to generate > an address that is guaranteed to fault: > > - the kernel always lives in the "top bit set" part of the address > space (and any address tagging bits don't touch that part), and does > not map the highest virtual address because that's used for error > pointers, so the "all bits set" address always faults > > - the region between valid user addresses and kernel addresses is > also always going to fault, and we don't have them adjacent to each > other (unlike, for example, 32-bit i386, where the kernel address > space is directly adjacent to the top of user addresses) I think we're very similar on arm64. The kernel lives at the top (i.e. virtual address space descends below 0) and is mapped by TTBR1. Userspace lives at the bottom (i.e. virtual address space ascends from 0) and is mapped by TTBR0. There's an unaddressable gap in the middle and it's bit 55 of the address which determines user vs kernel (well, it selects the TTBR to be precise). 
The kernel doesn't map the top 8MiB of its VA space, although talking to Mark, it sounds like we might not be as robust as x86 if there's a stray dereference of an unaligned error pointer that straddles 0. He can elaborate, but we probably can't rely on a pointer of all-ones faulting safely for the same reason. > So on x86-64, the simple solution is to just say "we know if the top > bit is clear, it cannot ever touch kernel code, and if the top bit is > set we have to make the address fault". So just duplicating the top > bit (with an arithmetic shift) and or'ing it with the low bits, we get > exactly what we want. > > But my knowledge of arm64 is weak enough that while I am reading > assembly language and I know that instead of the top bit, it's bit55, > I don't know what the actual rules for the translation table registers > are. > > If the all-bits-set address is guaranteed to always trap, then arm64 > could just use the same thing x86 does (just duplicating bit 55 > instead of the sign bit)? Perhaps we could just force accesses with bit 55 set to the address '1UL << 55'? That should sit slap bang in the middle of the guard region between the TTBRs and I think it would resolve any issues we may have with wrapping. It still means effectively reverting 2305b809be93 ("arm64: uaccess: simplify uaccess_mask_ptr()"), though. Dunno. Mark, Catalin, what do you guys reckon? Will ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-07-08 13:52 ` Will Deacon @ 2024-07-08 15:30 ` Mark Rutland 2024-07-23 14:16 ` Will Deacon 0 siblings, 1 reply; 10+ messages in thread From: Mark Rutland @ 2024-07-08 15:30 UTC (permalink / raw) To: Will Deacon Cc: Linus Torvalds, Arnd Bergmann, Catalin Marinas, Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel On Mon, Jul 08, 2024 at 02:52:43PM +0100, Will Deacon wrote: > On Fri, Jul 05, 2024 at 10:58:29AM -0700, Linus Torvalds wrote: > > On Fri, 5 Jul 2024 at 04:25, Will Deacon <will@kernel.org> wrote: > > So on x86-64, the simple solution is to just say "we know if the top > > bit is clear, it cannot ever touch kernel code, and if the top bit is > > set we have to make the address fault". So just duplicating the top > > bit (with an arithmetic shift) and or'ing it with the low bits, we get > > exactly what we want. > > > > But my knowledge of arm64 is weak enough that while I am reading > > assembly language and I know that instead of the top bit, it's bit55, > > I don't know what the actual rules for the translation table registers > > are. > > > > If the all-bits-set address is guaranteed to always trap, then arm64 > > could just use the same thing x86 does (just duplicating bit 55 > > instead of the sign bit)? > > Perhaps we could just force accesses with bit 55 set to the address > '1UL << 55'? That should sit slap bang in the middle of the guard > region between the TTBRs Yep, that'll work until we handle FEAT_D128 where (1UL << 55) will be the start of the TTBR1 range in some configurations. > and I think it would resolve any issues we may have with wrapping. It > still means effectively reverting 2305b809be93 ("arm64: uaccess: > simplify uaccess_mask_ptr()"), though. If we do bring that back, it'd be nice if we could do that without the CSEL+CSDB, as the CSDB is liable to be expensive on some parts (e.g. it's an alias of DSB on older designs). > Dunno. 
Mark, Catalin, what do you guys reckon? I think there's a slight variant of the x86 approach that might work, I've posted in my reply at https://lore.kernel.org/lkml/ZowD3LQT_KTz2g4X@J2N7QTR9R3/ Mark. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-07-08 15:30 ` Mark Rutland @ 2024-07-23 14:16 ` Will Deacon 0 siblings, 0 replies; 10+ messages in thread From: Will Deacon @ 2024-07-23 14:16 UTC (permalink / raw) To: Mark Rutland Cc: Linus Torvalds, Arnd Bergmann, Catalin Marinas, Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel On Mon, Jul 08, 2024 at 04:30:52PM +0100, Mark Rutland wrote: > On Mon, Jul 08, 2024 at 02:52:43PM +0100, Will Deacon wrote: > > On Fri, Jul 05, 2024 at 10:58:29AM -0700, Linus Torvalds wrote: > > > On Fri, 5 Jul 2024 at 04:25, Will Deacon <will@kernel.org> wrote: > > > So on x86-64, the simple solution is to just say "we know if the top > > > bit is clear, it cannot ever touch kernel code, and if the top bit is > > > set we have to make the address fault". So just duplicating the top > > > bit (with an arithmetic shift) and or'ing it with the low bits, we get > > > exactly what we want. > > > > > > But my knowledge of arm64 is weak enough that while I am reading > > > assembly language and I know that instead of the top bit, it's bit55, > > > I don't know what the actual rules for the translation table registers > > > are. > > > > > > If the all-bits-set address is guaranteed to always trap, then arm64 > > > could just use the same thing x86 does (just duplicating bit 55 > > > instead of the sign bit)? > > > > Perhaps we could just force accesses with bit 55 set to the address > > '1UL << 55'? That should sit slap bang in the middle of the guard > > region between the TTBRs > > Yep, that'll work until we handle FEAT_D128 where (1UL << 55) will be > the start of the TTBR1 range in some configurations. > > > and I think it would resolve any issues we may have with wrapping. It > > still means effectively reverting 2305b809be93 ("arm64: uaccess: > > simplify uaccess_mask_ptr()"), though. 
> > If we do bring that back, it'd be nice if we could do that without the > CSEL+CSDB, as the CSDB is liable to be expensive on some parts (e.g. > it's an alias of DSB on older designs). DSB?! Are you sure? I thought it was basically a NOP for everybody apart from a small subset of implementations. Will ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 0/4] riscv: uaccess: optimizations 2024-07-05 17:58 ` Linus Torvalds 2024-07-08 13:52 ` Will Deacon @ 2024-07-08 15:21 ` Mark Rutland 1 sibling, 0 replies; 10+ messages in thread From: Mark Rutland @ 2024-07-08 15:21 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Arnd Bergmann, Catalin Marinas, Jisheng Zhang, Paul Walmsley, Palmer Dabbelt, Albert Ou, Linux-Arch, linux-riscv, linux-kernel On Fri, Jul 05, 2024 at 10:58:29AM -0700, Linus Torvalds wrote: > On Fri, 5 Jul 2024 at 04:25, Will Deacon <will@kernel.org> wrote: > > > > we'd probably want to use an address that lives between the two TTBRs > > (i.e. in the "guard region" you mentioned above), just in case somebody > > has fscked around with /proc/sys/vm/mmap_min_addr. > > Yes, I don't want to use a NULL pointer and rely on mmap_min_addr. > > For x86-64, we have two "guard regions" that can be used to generate > an address that is guaranteed to fault: > > - the kernel always lives in the "top bit set" part of the address > space (and any address tagging bits don't touch that part), and does > not map the highest virtual address because that's used for error > pointers, so the "all bits set" address always faults The same should be true on arm64, though I'm not immediately sure if we explicitly reserve that VA region -- if we don't, then we should. > - the region between valid user addresses and kernel addresses is > also always going to fault, and we don't have them adjacent to each > other (unlike, for example, 32-bit i386, where the kernel address > space is directly adjacent to the top of user addresses) Today we have a gap between the TTBR0 and TTBR1 VA ranges in all configurations, but in future (with the new FEAT_D128 page table format) we will have configurations where there's no gap between the two ranges. 
> So on x86-64, the simple solution is to just say "we know if the top > bit is clear, it cannot ever touch kernel code, and if the top bit is > set we have to make the address fault". So just duplicating the top > bit (with an arithmetic shift) and or'ing it with the low bits, we get > exactly what we want. > > But my knowledge of arm64 is weak enough that while I am reading > assembly language and I know that instead of the top bit, it's bit55, > I don't know what the actual rules for the translation table registers > are. > > If the all-bits-set address is guaranteed to always trap, then arm64 > could just use the same thing x86 does (just duplicating bit 55 > instead of the sign bit)? I think something of that shape can work (see below). There are a couple of things that make using all-ones unsafe: 1) Non-faulting parts of a misaligned load/store can occur *before* the fault is raised. If you have two pages where one of which is writable and the other of which is not writeable (in either order), a store which straddles those pages can write to the writeable page before raising a fault on the non-writeable page. I've seen this behaviour on real HW, and IIUC this is fairly common. 2) Loads/stores which wrap past 0xFFFF_FFFF_FFFF_FFFF access bytes at UNKNOWN addresses. An N-byte store at 0xFFFF_FFFF_FFFF_FFFF may write to N-1 bytes at an arbitrary address which is not 0x0000_0000_0000_0000. In the latest ARM ARM (K.a), this is described tersely in section K1.2.9 "Out of range virtual address". That can be found at: https://developer.arm.com/documentation/ddi0487/ka/?lang=en I'm aware of implementation styles where that address is not zero and can be a TTBR1 (kernel) address. Given that, we'd need to avoid all-ones, but provided we know that the first access using the pointer will be limited to PAGE_SIZE bytes past the pointer, we could round down the bad pointer to be somewhere within the error pointer page, e.g. 
	SBFX	<mask>, <ptr>, #55, #1
	ORR	<ptr>, <ptr>, <mask>
	BIC	<ptr>, <ptr>, <mask>, lsr #(64 - PAGE_SHIFT)

That last `BIC` instruction is "BIt Clear" AKA "AND NOT". When bit 55
is one, that will clear the lower bits to round down to a page
boundary, and when bit 55 is zero it will have no effect (as it'll be
an AND with all-ones).

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 10+ messages in thread
Thread overview: 10+ messages
[not found] <20240625040500.1788-1-jszhang@kernel.org>
2024-06-25 7:21 ` [PATCH 0/4] riscv: uaccess: optimizations Arnd Bergmann
2024-06-25 18:12 ` Linus Torvalds
2024-06-26 13:04 ` Jisheng Zhang
2024-06-30 16:59 ` Linus Torvalds
2024-07-05 11:25 ` Will Deacon
2024-07-05 17:58 ` Linus Torvalds
2024-07-08 13:52 ` Will Deacon
2024-07-08 15:30 ` Mark Rutland
2024-07-23 14:16 ` Will Deacon
2024-07-08 15:21 ` Mark Rutland