* [PATCH 0/5] powerpc: Implement masked user access
@ 2025-06-22 9:52 Christophe Leroy
2025-06-22 9:52 ` [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin Christophe Leroy
` (5 more replies)
0 siblings, 6 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-22 9:52 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, Linus Torvalds
Cc: Christophe Leroy, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Masked user access avoids the address/size verification by access_ok().
Although its main purpose is to skip the speculation in the
verification of user address and size, hence avoiding the need for
speculation mitigation, it also has the advantage of reducing the
number of instructions needed, so it also benefits platforms that
don't need speculation mitigation, especially when the size of the
copy is not known at build time.
Unlike x86_64, which masks the address to 'all bits set' when the
user address is invalid, here the address is set to an address in
the gap. It avoids relying on the zero page to catch offset
accesses. On book3s/32 it makes sure the opening remains within the
user segment. The extra cost is a single instruction in the masking.
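As a minimal illustration of the idea (this mirrors the simplest
variant, mask_user_address_fallback(), added by the last patch; the
optimised variants described there avoid the conditional):

static inline void __user *mask_user_address_fallback(const void __user *ptr)
{
	unsigned long addr = (unsigned long)ptr;

	/*
	 * Out-of-range pointers are redirected to TASK_SIZE, the bottom
	 * of the guard gap, which always faults, so offset accesses
	 * fault too without relying on the zero page.
	 */
	return (void __user *)(addr < TASK_SIZE ? addr : TASK_SIZE);
}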
First patch adds masked_user_read_access_begin() and
masked_user_write_access_begin() to match with user_read_access_end()
and user_write_access_end().
Second patch adds speculation barrier to copy_from_user_iter() so that
the barrier in powerpc raw_copy_from_user() which is redundant with
the one in copy_from_user() can be removed.
Third patch removes the redundant barrier_nospec() in
raw_copy_from_user().
Fourth patch removes the unused size parameter when enabling/disabling
user access.
Last patch implements masked user access.
Christophe Leroy (5):
uaccess: Add masked_user_{read/write}_access_begin
uaccess: Add speculation barrier to copy_from_user_iter()
powerpc: Remove unused size parameter to KUAP enabling/disabling
functions
powerpc: Move barrier_nospec() out of allow_read_{from/write}_user()
powerpc: Implement masked user access
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/book3s/32/kup.h | 2 +-
arch/powerpc/include/asm/book3s/64/kup.h | 4 +-
arch/powerpc/include/asm/kup.h | 24 ++--
arch/powerpc/include/asm/nohash/32/kup-8xx.h | 2 +-
arch/powerpc/include/asm/nohash/kup-booke.h | 2 +-
arch/powerpc/include/asm/uaccess.h | 140 ++++++++++++++++---
fs/select.c | 2 +-
include/linux/uaccess.h | 8 ++
kernel/futex/futex.h | 4 +-
lib/iov_iter.c | 7 +
lib/strncpy_from_user.c | 2 +-
lib/strnlen_user.c | 2 +-
13 files changed, 158 insertions(+), 43 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
@ 2025-06-22 9:52 ` Christophe Leroy
2025-06-22 16:35 ` David Laight
2025-06-22 9:52 ` [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter() Christophe Leroy
` (4 subsequent siblings)
5 siblings, 1 reply; 36+ messages in thread
From: Christophe Leroy @ 2025-06-22 9:52 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, Linus Torvalds
Cc: Christophe Leroy, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Although masked_user_access_begin() seems to only be used when reading
data from user at the moment, introduce masked_user_read_access_begin()
and masked_user_write_access_begin() in order to match
user_read_access_begin() and user_write_access_begin().
Have them default to masked_user_access_begin() when they are
not defined.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
fs/select.c | 2 +-
include/linux/uaccess.h | 8 ++++++++
kernel/futex/futex.h | 4 ++--
lib/strncpy_from_user.c | 2 +-
lib/strnlen_user.c | 2 +-
5 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 9fb650d03d52..d8547bedf5eb 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -777,7 +777,7 @@ static inline int get_sigset_argpack(struct sigset_argpack *to,
// the path is hot enough for overhead of copy_from_user() to matter
if (from) {
if (can_do_masked_user_access())
- from = masked_user_access_begin(from);
+ from = masked_user_read_access_begin(from);
else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(to->p, &from->p, Efault);
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 7c06f4795670..682a0cd2fe51 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -41,6 +41,14 @@
#define mask_user_address(src) (src)
#endif
+#ifndef masked_user_write_access_begin
+#define masked_user_write_access_begin masked_user_access_begin
+#endif
+#ifndef masked_user_read_access_begin
+#define masked_user_read_access_begin masked_user_access_begin
+#endif
+
+
/*
* Architectures should provide two primitives (raw_copy_{to,from}_user())
* and get rid of their private instances of copy_{to,from}_user() and
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index fcd1617212ee..6cfcafa00736 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -305,7 +305,7 @@ static __always_inline int futex_get_value(u32 *dest, u32 __user *from)
u32 val;
if (can_do_masked_user_access())
- from = masked_user_access_begin(from);
+ from = masked_user_read_access_begin(from);
else if (!user_read_access_begin(from, sizeof(*from)))
return -EFAULT;
unsafe_get_user(val, from, Efault);
@@ -320,7 +320,7 @@ static __always_inline int futex_get_value(u32 *dest, u32 __user *from)
static __always_inline int futex_put_value(u32 val, u32 __user *to)
{
if (can_do_masked_user_access())
- to = masked_user_access_begin(to);
+ to = masked_user_read_access_begin(to);
else if (!user_read_access_begin(to, sizeof(*to)))
return -EFAULT;
unsafe_put_user(val, to, Efault);
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 6dc234913dd5..5bb752ff7c61 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -126,7 +126,7 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
if (can_do_masked_user_access()) {
long retval;
- src = masked_user_access_begin(src);
+ src = masked_user_read_access_begin(src);
retval = do_strncpy_from_user(dst, src, count, count);
user_read_access_end();
return retval;
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 6e489f9e90f1..4a6574b67f82 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -99,7 +99,7 @@ long strnlen_user(const char __user *str, long count)
if (can_do_masked_user_access()) {
long retval;
- str = masked_user_access_begin(str);
+ str = masked_user_read_access_begin(str);
retval = do_strnlen_user(str, count, count);
user_read_access_end();
return retval;
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
2025-06-22 9:52 ` [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin Christophe Leroy
@ 2025-06-22 9:52 ` Christophe Leroy
2025-06-22 16:52 ` David Laight
2025-06-22 16:57 ` Linus Torvalds
2025-06-22 9:52 ` [PATCH 3/5] powerpc: Remove unused size parameter to KUAP enabling/disabling functions Christophe Leroy
` (3 subsequent siblings)
5 siblings, 2 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-22 9:52 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, Linus Torvalds
Cc: Christophe Leroy, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
The results of "access_ok()" can be mis-speculated. The result is that
you can end up speculatively:
if (access_ok(from, size))
// Right here
For the same reason as done in copy_from_user() by
commit 74e19ef0ff80 ("uaccess: Add speculation barrier to
copy_from_user()"), add a speculation barrier to copy_from_user_iter().
See commit 74e19ef0ff80 ("uaccess: Add speculation barrier to
copy_from_user()") for more details.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
lib/iov_iter.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9193f952f49..ebf524a37907 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -50,6 +50,13 @@ size_t copy_from_user_iter(void __user *iter_from, size_t progress,
if (should_fail_usercopy())
return len;
if (access_ok(iter_from, len)) {
+ /*
+ * Ensure that bad access_ok() speculation will not
+ * lead to nasty side effects *after* the copy is
+ * finished:
+ */
+ barrier_nospec();
+
to += progress;
instrument_copy_from_user_before(to, iter_from, len);
res = raw_copy_from_user(to, iter_from, len);
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 3/5] powerpc: Remove unused size parameter to KUAP enabling/disabling functions
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
2025-06-22 9:52 ` [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin Christophe Leroy
2025-06-22 9:52 ` [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter() Christophe Leroy
@ 2025-06-22 9:52 ` Christophe Leroy
2025-06-22 9:52 ` [PATCH 4/5] powerpc: Move barrier_nospec() out of allow_read_{from/write}_user() Christophe Leroy
` (2 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-22 9:52 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, Linus Torvalds
Cc: Christophe Leroy, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Since commit 16132529cee5 ("powerpc/32s: Rework Kernel Userspace
Access Protection") the size parameter is unused on all platforms.
Remove it.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/include/asm/book3s/32/kup.h | 2 +-
arch/powerpc/include/asm/book3s/64/kup.h | 4 +--
arch/powerpc/include/asm/kup.h | 22 ++++++------
arch/powerpc/include/asm/nohash/32/kup-8xx.h | 2 +-
arch/powerpc/include/asm/nohash/kup-booke.h | 2 +-
arch/powerpc/include/asm/uaccess.h | 36 ++++++++++----------
6 files changed, 33 insertions(+), 35 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h
index 4e14a5427a63..8ea68d136152 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -98,7 +98,7 @@ static __always_inline unsigned long __kuap_get_and_assert_locked(void)
#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked
static __always_inline void allow_user_access(void __user *to, const void __user *from,
- u32 size, unsigned long dir)
+ unsigned long dir)
{
BUILD_BUG_ON(!__builtin_constant_p(dir));
diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h
index 497a7bd31ecc..853fa2fb12be 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -354,7 +354,7 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
}
static __always_inline void allow_user_access(void __user *to, const void __user *from,
- unsigned long size, unsigned long dir)
+ unsigned long dir)
{
unsigned long thread_amr = 0;
@@ -384,7 +384,7 @@ static __always_inline unsigned long get_kuap(void)
static __always_inline void set_kuap(unsigned long value) { }
static __always_inline void allow_user_access(void __user *to, const void __user *from,
- unsigned long size, unsigned long dir)
+ unsigned long dir)
{ }
#endif /* !CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 2bb03d941e3e..4c70be11b99a 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -73,7 +73,7 @@ static __always_inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned
*/
#ifndef CONFIG_PPC_BOOK3S_64
static __always_inline void allow_user_access(void __user *to, const void __user *from,
- unsigned long size, unsigned long dir) { }
+ unsigned long dir) { }
static __always_inline void prevent_user_access(unsigned long dir) { }
static __always_inline unsigned long prevent_user_access_return(void) { return 0UL; }
static __always_inline void restore_user_access(unsigned long flags) { }
@@ -132,36 +132,34 @@ static __always_inline void kuap_assert_locked(void)
kuap_get_and_assert_locked();
}
-static __always_inline void allow_read_from_user(const void __user *from, unsigned long size)
+static __always_inline void allow_read_from_user(const void __user *from)
{
barrier_nospec();
- allow_user_access(NULL, from, size, KUAP_READ);
+ allow_user_access(NULL, from, KUAP_READ);
}
-static __always_inline void allow_write_to_user(void __user *to, unsigned long size)
+static __always_inline void allow_write_to_user(void __user *to)
{
- allow_user_access(to, NULL, size, KUAP_WRITE);
+ allow_user_access(to, NULL, KUAP_WRITE);
}
-static __always_inline void allow_read_write_user(void __user *to, const void __user *from,
- unsigned long size)
+static __always_inline void allow_read_write_user(void __user *to, const void __user *from)
{
barrier_nospec();
- allow_user_access(to, from, size, KUAP_READ_WRITE);
+ allow_user_access(to, from, KUAP_READ_WRITE);
}
-static __always_inline void prevent_read_from_user(const void __user *from, unsigned long size)
+static __always_inline void prevent_read_from_user(const void __user *from)
{
prevent_user_access(KUAP_READ);
}
-static __always_inline void prevent_write_to_user(void __user *to, unsigned long size)
+static __always_inline void prevent_write_to_user(void __user *to)
{
prevent_user_access(KUAP_WRITE);
}
-static __always_inline void prevent_read_write_user(void __user *to, const void __user *from,
- unsigned long size)
+static __always_inline void prevent_read_write_user(void __user *to, const void __user *from)
{
prevent_user_access(KUAP_READ_WRITE);
}
diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index 46bc5925e5fd..c2b32b392d41 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -50,7 +50,7 @@ static __always_inline void uaccess_end_8xx(void)
}
static __always_inline void allow_user_access(void __user *to, const void __user *from,
- unsigned long size, unsigned long dir)
+ unsigned long dir)
{
uaccess_begin_8xx(MD_APG_INIT);
}
diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h b/arch/powerpc/include/asm/nohash/kup-booke.h
index 0c7c3258134c..6035d51af3cd 100644
--- a/arch/powerpc/include/asm/nohash/kup-booke.h
+++ b/arch/powerpc/include/asm/nohash/kup-booke.h
@@ -74,7 +74,7 @@ static __always_inline void uaccess_end_booke(void)
}
static __always_inline void allow_user_access(void __user *to, const void __user *from,
- unsigned long size, unsigned long dir)
+ unsigned long dir)
{
uaccess_begin_booke(current->thread.pid);
}
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 4f5a46a77fa2..dd5cf325ecde 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -45,14 +45,14 @@
do { \
__label__ __pu_failed; \
\
- allow_write_to_user(__pu_addr, __pu_size); \
+ allow_write_to_user(__pu_addr); \
__put_user_size_goto(__pu_val, __pu_addr, __pu_size, __pu_failed); \
- prevent_write_to_user(__pu_addr, __pu_size); \
+ prevent_write_to_user(__pu_addr); \
__pu_err = 0; \
break; \
\
__pu_failed: \
- prevent_write_to_user(__pu_addr, __pu_size); \
+ prevent_write_to_user(__pu_addr); \
__pu_err = -EFAULT; \
} while (0); \
\
@@ -301,9 +301,9 @@ do { \
__typeof__(sizeof(*(ptr))) __gu_size = sizeof(*(ptr)); \
\
might_fault(); \
- allow_read_from_user(__gu_addr, __gu_size); \
+ allow_read_from_user(__gu_addr); \
__get_user_size_allowed(__gu_val, __gu_addr, __gu_size, __gu_err); \
- prevent_read_from_user(__gu_addr, __gu_size); \
+ prevent_read_from_user(__gu_addr); \
(x) = (__typeof__(*(ptr)))__gu_val; \
\
__gu_err; \
@@ -329,9 +329,9 @@ raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
{
unsigned long ret;
- allow_read_write_user(to, from, n);
+ allow_read_write_user(to, from);
ret = __copy_tofrom_user(to, from, n);
- prevent_read_write_user(to, from, n);
+ prevent_read_write_user(to, from);
return ret;
}
#endif /* __powerpc64__ */
@@ -341,9 +341,9 @@ static inline unsigned long raw_copy_from_user(void *to,
{
unsigned long ret;
- allow_read_from_user(from, n);
+ allow_read_from_user(from);
ret = __copy_tofrom_user((__force void __user *)to, from, n);
- prevent_read_from_user(from, n);
+ prevent_read_from_user(from);
return ret;
}
@@ -352,9 +352,9 @@ raw_copy_to_user(void __user *to, const void *from, unsigned long n)
{
unsigned long ret;
- allow_write_to_user(to, n);
+ allow_write_to_user(to);
ret = __copy_tofrom_user(to, (__force const void __user *)from, n);
- prevent_write_to_user(to, n);
+ prevent_write_to_user(to);
return ret;
}
@@ -365,9 +365,9 @@ static inline unsigned long __clear_user(void __user *addr, unsigned long size)
unsigned long ret;
might_fault();
- allow_write_to_user(addr, size);
+ allow_write_to_user(addr);
ret = __arch_clear_user(addr, size);
- prevent_write_to_user(addr, size);
+ prevent_write_to_user(addr);
return ret;
}
@@ -395,9 +395,9 @@ copy_mc_to_user(void __user *to, const void *from, unsigned long n)
{
if (check_copy_size(from, n, true)) {
if (access_ok(to, n)) {
- allow_write_to_user(to, n);
+ allow_write_to_user(to);
n = copy_mc_generic((void __force *)to, from, n);
- prevent_write_to_user(to, n);
+ prevent_write_to_user(to);
}
}
@@ -415,7 +415,7 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
might_fault();
- allow_read_write_user((void __user *)ptr, ptr, len);
+ allow_read_write_user((void __user *)ptr, ptr);
return true;
}
#define user_access_begin user_access_begin
@@ -431,7 +431,7 @@ user_read_access_begin(const void __user *ptr, size_t len)
might_fault();
- allow_read_from_user(ptr, len);
+ allow_read_from_user(ptr);
return true;
}
#define user_read_access_begin user_read_access_begin
@@ -445,7 +445,7 @@ user_write_access_begin(const void __user *ptr, size_t len)
might_fault();
- allow_write_to_user((void __user *)ptr, len);
+ allow_write_to_user((void __user *)ptr);
return true;
}
#define user_write_access_begin user_write_access_begin
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 4/5] powerpc: Move barrier_nospec() out of allow_read_{from/write}_user()
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
` (2 preceding siblings ...)
2025-06-22 9:52 ` [PATCH 3/5] powerpc: Remove unused size parameter to KUAP enabling/disabling functions Christophe Leroy
@ 2025-06-22 9:52 ` Christophe Leroy
2025-06-22 9:52 ` [PATCH 5/5] powerpc: Implement masked user access Christophe Leroy
2025-06-22 16:20 ` [PATCH 0/5] " David Laight
5 siblings, 0 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-22 9:52 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, Linus Torvalds
Cc: Christophe Leroy, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Move barrier_nospec() out of allow_read_from_user() and
allow_read_write_user() in order to allow reuse of those
functions when implementing masked user access.
Don't add it back in raw_copy_from_user() as the barrier is already
done by the callers of raw_copy_from_user().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/include/asm/kup.h | 2 --
arch/powerpc/include/asm/uaccess.h | 4 ++++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 4c70be11b99a..4e2c79df4cdb 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -134,7 +134,6 @@ static __always_inline void kuap_assert_locked(void)
static __always_inline void allow_read_from_user(const void __user *from)
{
- barrier_nospec();
allow_user_access(NULL, from, KUAP_READ);
}
@@ -145,7 +144,6 @@ static __always_inline void allow_write_to_user(void __user *to)
static __always_inline void allow_read_write_user(void __user *to, const void __user *from)
{
- barrier_nospec();
allow_user_access(to, from, KUAP_READ_WRITE);
}
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index dd5cf325ecde..89d53d4c2236 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -301,6 +301,7 @@ do { \
__typeof__(sizeof(*(ptr))) __gu_size = sizeof(*(ptr)); \
\
might_fault(); \
+ barrier_nospec(); \
allow_read_from_user(__gu_addr); \
__get_user_size_allowed(__gu_val, __gu_addr, __gu_size, __gu_err); \
prevent_read_from_user(__gu_addr); \
@@ -329,6 +330,7 @@ raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
{
unsigned long ret;
+ barrier_nospec();
allow_read_write_user(to, from);
ret = __copy_tofrom_user(to, from, n);
prevent_read_write_user(to, from);
@@ -415,6 +417,7 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
might_fault();
+ barrier_nospec();
allow_read_write_user((void __user *)ptr, ptr);
return true;
}
@@ -431,6 +434,7 @@ user_read_access_begin(const void __user *ptr, size_t len)
might_fault();
+ barrier_nospec();
allow_read_from_user(ptr);
return true;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 5/5] powerpc: Implement masked user access
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
` (3 preceding siblings ...)
2025-06-22 9:52 ` [PATCH 4/5] powerpc: Move barrier_nospec() out of allow_read_{from/write}_user() Christophe Leroy
@ 2025-06-22 9:52 ` Christophe Leroy
2025-06-22 17:13 ` David Laight
2025-06-22 16:20 ` [PATCH 0/5] " David Laight
5 siblings, 1 reply; 36+ messages in thread
From: Christophe Leroy @ 2025-06-22 9:52 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, Linus Torvalds
Cc: Christophe Leroy, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Masked user access avoids the address/size verification by access_ok().
Although its main purpose is to skip the speculation in the
verification of user address and size, hence avoiding the need for
speculation mitigation, it also has the advantage of reducing the
number of instructions needed, so it also benefits platforms that
don't need speculation mitigation, especially when the size of the
copy is not known at build time.
So implement masked user access on powerpc. The only requirement is
to have a memory gap that faults between the top of user space and the
real start of the kernel area. On 64-bit platforms it is easy: bit 0 is
always 0 for user addresses and always 1 for kernel addresses, and
user addresses stop long before the end of the area. On 32 bits it
is more tricky. In theory user space can go up to 0xbfffffff while
the kernel will usually start at 0xc0000000, so a gap needs to be added
in between. Although in theory a single 4k page would suffice, it
is easier and more efficient to enforce a 128k gap below the kernel,
as it simplifies the masking.
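For the 32-bit case the resulting layout is, as a sketch (using the new
0xbffe0000 TASK_SIZE default from the Kconfig change below and the
usual 0xc0000000 PAGE_OFFSET):

0x00000000 - 0xbffdffff   user space (TASK_SIZE = 0xbffe0000)
0xbffe0000 - 0xbfffffff   128k guard gap, always faults
0xc0000000 - ...          kernel (PAGE_OFFSET)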
Unlike x86_64, which masks the address to 'all bits set' when the
user address is invalid, here the address is set to an address in
the gap. It avoids relying on the zero page to catch offset
accesses.
e500 has the isel instruction, which allows selecting one value or
the other without a branch, and that instruction is not speculative, so
use it. Although GCC usually generates code using that instruction,
it is safer to use inline assembly to be sure. The result is:
14: 3d 20 bf fe lis r9,-16386
18: 7c 03 48 40 cmplw r3,r9
1c: 7c 69 18 5e iselgt r3,r9,r3
On the other platforms, when kernel space is above 0x80000000 and user
space is below it, the logic in mask_user_address_simple() leads to a
3-instruction sequence:
14: 7c 69 fe 70 srawi r9,r3,31
18: 7c 63 48 78 andc r3,r3,r9
1c: 51 23 00 00 rlwimi r3,r9,0,0,0
This is the default on powerpc 8xx.
When the limit between user space and kernel space is not 0x80000000,
mask_user_address_32() is used and a 6-instruction sequence is
generated:
24: 54 69 7c 7e srwi r9,r3,17
28: 21 29 57 ff subfic r9,r9,22527
2c: 7d 29 fe 70 srawi r9,r9,31
30: 75 2a b0 00 andis. r10,r9,45056
34: 7c 63 48 78 andc r3,r3,r9
38: 7c 63 53 78 or r3,r3,r10
The constraint is that TASK_SIZE must be aligned to 128K in order to
get the optimal number of instructions.
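As an illustration of the arithmetic (a sketch only of what
mask_user_address_32() below does, assuming TASK_SIZE = 0xb0000000,
which matches the constants 22527 and 0xb000 in the disassembly above):

/*
 * addr = 0x10000000 (valid user address):
 *   (TASK_SIZE >> 17) - 1 - (addr >> 17) = 22527 - 2048 >= 0
 *   so mask = 0 and addr is left unchanged.
 * addr = 0xc0000000 (kernel address):
 *   22527 - 24576 < 0, so mask = 0xffffffff and addr becomes
 *   TASK_SIZE, which lies in the guard gap and faults.
 */
mask = (long)((TASK_SIZE >> 17) - 1 - (addr >> 17)) >> 31;
addr = (addr & ~mask) | (TASK_SIZE & mask);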
When CONFIG_PPC_BARRIER_NOSPEC is not defined, fall back on the
test-based masking as it is quicker than the 6-instruction sequence
but not necessarily quicker than the 3-instruction sequences above.
On 64 bits, the kernel is always above 0x8000000000000000 and user
space is always below, which leads to a 4-instruction sequence:
80: 7c 69 1b 78 mr r9,r3
84: 7c 63 fe 76 sradi r3,r3,63
88: 7d 29 18 78 andc r9,r9,r3
8c: 79 23 00 4c rldimi r3,r9,0,1
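For illustration (again only a sketch of what mask_user_address_simple()
below does on 64 bits; the example addresses are arbitrary):

/*
 * User address, e.g. 0x00007fffdeadbeef: top bit clear, so mask = 0
 * and the address is left unchanged.
 * Kernel address, e.g. 0xc000000000000000: top bit set, so mask = ~0UL
 * and the result is 0x8000000000000000, which lies in the gap between
 * the top of user space and PAGE_OFFSET and therefore faults.
 */
mask = (long)addr >> 63;
addr = ((addr & ~mask) & (~0UL >> 1)) | (mask & (1UL << 63));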
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/uaccess.h | 100 +++++++++++++++++++++++++++++
2 files changed, 101 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c3e0cc83f120..c26a39b4504a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1303,7 +1303,7 @@ config TASK_SIZE
hex "Size of user task space" if TASK_SIZE_BOOL
default "0x80000000" if PPC_8xx
default "0xb0000000" if PPC_BOOK3S_32 && EXECMEM
- default "0xc0000000"
+ default "0xbffe0000"
config MODULES_SIZE_BOOL
bool "Set custom size for modules/execmem area"
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 89d53d4c2236..19743ee80523 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -2,6 +2,8 @@
#ifndef _ARCH_POWERPC_UACCESS_H
#define _ARCH_POWERPC_UACCESS_H
+#include <linux/sizes.h>
+
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/extable.h>
@@ -455,6 +457,104 @@ user_write_access_begin(const void __user *ptr, size_t len)
#define user_write_access_begin user_write_access_begin
#define user_write_access_end prevent_current_write_to_user
+/*
+ * Masking the user address is an alternative to a conditional
+ * user_access_begin that can avoid the fencing. This only works
+ * for dense accesses starting at the address.
+ */
+static inline void __user *mask_user_address_simple(const void __user *ptr)
+{
+ unsigned long addr = (unsigned long)ptr;
+ unsigned long mask = (unsigned long)((long)addr >> (BITS_PER_LONG - 1));
+
+ addr = ((addr & ~mask) & (~0UL >> 1)) | (mask & (1UL << (BITS_PER_LONG - 1)));
+
+ return (void __user *)addr;
+}
+
+static inline void __user *mask_user_address_e500(const void __user *ptr)
+{
+ unsigned long addr;
+
+ asm("cmplw %1, %2; iselgt %0, %2, %1" : "=r"(addr) : "r"(ptr), "r"(TASK_SIZE): "cr0");
+
+ return (void __user *)addr;
+}
+
+/* Make sure TASK_SIZE is a multiple of 128K for shifting by 17 to the right */
+static inline void __user *mask_user_address_32(const void __user *ptr)
+{
+ unsigned long addr = (unsigned long)ptr;
+ unsigned long mask = (unsigned long)((long)((TASK_SIZE >> 17) - 1 - (addr >> 17)) >> 31);
+
+ addr = (addr & ~mask) | (TASK_SIZE & mask);
+
+ return (void __user *)addr;
+}
+
+static inline void __user *mask_user_address_fallback(const void __user *ptr)
+{
+ unsigned long addr = (unsigned long)ptr;
+
+ return (void __user *)(addr < TASK_SIZE ? addr : TASK_SIZE);
+}
+
+static inline void __user *mask_user_address(const void __user *ptr)
+{
+#ifdef MODULES_VADDR
+ const unsigned long border = MODULES_VADDR;
+#else
+ const unsigned long border = PAGE_OFFSET;
+#endif
+ BUILD_BUG_ON(TASK_SIZE_MAX & (SZ_128K - 1));
+ BUILD_BUG_ON(TASK_SIZE_MAX + SZ_128K > border);
+ BUILD_BUG_ON(TASK_SIZE_MAX & 0x8000000000000000ULL);
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC64) && !(PAGE_OFFSET & 0x8000000000000000ULL));
+
+ if (IS_ENABLED(CONFIG_PPC64))
+ return mask_user_address_simple(ptr);
+ if (IS_ENABLED(CONFIG_E500))
+ return mask_user_address_e500(ptr);
+ if (TASK_SIZE <= SZ_2G && border >= SZ_2G)
+ return mask_user_address_simple(ptr);
+ if (IS_ENABLED(CONFIG_PPC_BARRIER_NOSPEC))
+ return mask_user_address_32(ptr);
+ return mask_user_address_fallback(ptr);
+}
+
+static inline void __user *masked_user_access_begin(const void __user *p)
+{
+ void __user *ptr = mask_user_address(p);
+
+ might_fault();
+ allow_read_write_user(ptr, ptr);
+
+ return ptr;
+}
+#define masked_user_access_begin masked_user_access_begin
+
+static inline void __user *masked_user_read_access_begin(const void __user *p)
+{
+ void __user *ptr = mask_user_address(p);
+
+ might_fault();
+ allow_read_from_user(ptr);
+
+ return ptr;
+}
+#define masked_user_read_access_begin masked_user_read_access_begin
+
+static inline void __user *masked_user_write_access_begin(const void __user *p)
+{
+ void __user *ptr = mask_user_address(p);
+
+ might_fault();
+ allow_write_to_user(ptr);
+
+ return ptr;
+}
+#define masked_user_write_access_begin masked_user_write_access_begin
+
#define unsafe_get_user(x, p, e) do { \
__long_type(*(p)) __gu_val; \
__typeof__(*(p)) __user *__gu_addr = (p); \
--
2.49.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
` (4 preceding siblings ...)
2025-06-22 9:52 ` [PATCH 5/5] powerpc: Implement masked user access Christophe Leroy
@ 2025-06-22 16:20 ` David Laight
2025-06-24 5:27 ` Christophe Leroy
5 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2025-06-22 16:20 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Sun, 22 Jun 2025 11:52:38 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> Masked user access avoids the address/size verification by access_ok().
> Although its main purpose is to skip the speculation in the
> verification of user address and size, hence avoiding the need for
> speculation mitigation, it also has the advantage of reducing the
> number of instructions needed, so it also benefits platforms that
> don't need speculation mitigation, especially when the size of the
> copy is not known at build time.
It also removes a conditional branch that is quite likely to be
statically predicted 'the wrong way'.
> Unlike x86_64, which masks the address to 'all bits set' when the
> user address is invalid, here the address is set to an address in
> the gap. It avoids relying on the zero page to catch offset
> accesses. On book3s/32 it makes sure the opening remains within the
> user segment. The extra cost is a single instruction in the masking.
That isn't true (any more).
Linus changed the check to (approx):
if (uaddr > TASK_SIZE)
uaddr = TASK_SIZE;
(Implemented with a conditional move)
Replacing the original version that used cmp, sbb, or to get 'all bits set'.
Quite likely the comments are wrong!
I thought there was a second architecture that implemented it - and might
still set ~0u?
As you noted returning 'TASK_SIZE' (or, at least, the base of a page that
is guaranteed to fault) means that the caller only has to do 'reasonably
sequential' accesses, and not guarantee to read offset zero first.
As a separate patch, provided there is a guard page between user and kernel,
and user accesses are 'reasonably sequential' even access_ok() need not
check the transfer length. Linus wasn't that brave :-)
I think some of the 'API' is still based on the original 386 code where
the page tables had to be checked by hand for CoW.
David
>
> First patch adds masked_user_read_access_begin() and
> masked_user_write_access_begin() to match with user_read_access_end()
> and user_write_access_end().
>
> Second patch adds speculation barrier to copy_from_user_iter() so that
> the barrier in powerpc raw_copy_from_user() which is redundant with
> the one in copy_from_user() can be removed.
>
> Third patch removes the redundant barrier_nospec() in
> raw_copy_from_user().
>
> Fourth patch removes the unused size parameter when enabling/disabling
> user access.
>
> Last patch implements masked user access.
>
> Christophe Leroy (5):
> uaccess: Add masked_user_{read/write}_access_begin
> uaccess: Add speculation barrier to copy_from_user_iter()
> powerpc: Remove unused size parameter to KUAP enabling/disabling
> functions
> powerpc: Move barrier_nospec() out of allow_read_{from/write}_user()
> powerpc: Implement masked user access
>
> arch/powerpc/Kconfig | 2 +-
> arch/powerpc/include/asm/book3s/32/kup.h | 2 +-
> arch/powerpc/include/asm/book3s/64/kup.h | 4 +-
> arch/powerpc/include/asm/kup.h | 24 ++--
> arch/powerpc/include/asm/nohash/32/kup-8xx.h | 2 +-
> arch/powerpc/include/asm/nohash/kup-booke.h | 2 +-
> arch/powerpc/include/asm/uaccess.h | 140 ++++++++++++++++---
> fs/select.c | 2 +-
> include/linux/uaccess.h | 8 ++
> kernel/futex/futex.h | 4 +-
> lib/iov_iter.c | 7 +
> lib/strncpy_from_user.c | 2 +-
> lib/strnlen_user.c | 2 +-
> 13 files changed, 158 insertions(+), 43 deletions(-)
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin
2025-06-22 9:52 ` [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin Christophe Leroy
@ 2025-06-22 16:35 ` David Laight
2025-06-24 5:34 ` Christophe Leroy
0 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2025-06-22 16:35 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Sun, 22 Jun 2025 11:52:39 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> Although masked_user_access_begin() seems to only be used when reading
> data from user at the moment, introduce masked_user_read_access_begin()
> and masked_user_write_access_begin() in order to match
> user_read_access_begin() and user_write_access_begin().
>
> Have them default to masked_user_access_begin() when they are
> not defined.
>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> fs/select.c | 2 +-
> include/linux/uaccess.h | 8 ++++++++
> kernel/futex/futex.h | 4 ++--
> lib/strncpy_from_user.c | 2 +-
> lib/strnlen_user.c | 2 +-
> 5 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/select.c b/fs/select.c
> index 9fb650d03d52..d8547bedf5eb 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -777,7 +777,7 @@ static inline int get_sigset_argpack(struct sigset_argpack *to,
> // the path is hot enough for overhead of copy_from_user() to matter
> if (from) {
> if (can_do_masked_user_access())
> - from = masked_user_access_begin(from);
> + from = masked_user_read_access_begin(from);
> else if (!user_read_access_begin(from, sizeof(*from)))
> return -EFAULT;
> unsafe_get_user(to->p, &from->p, Efault);
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 7c06f4795670..682a0cd2fe51 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -41,6 +41,14 @@
> #ifdef masked_user_access_begin
> #define can_do_masked_user_access() 1
> #else
> #define can_do_masked_user_access() 0
> #define masked_user_access_begin(src) NULL
> #define mask_user_address(src) (src)
> #endif
>
> +#ifndef masked_user_write_access_begin
> +#define masked_user_write_access_begin masked_user_access_begin
> +#endif
> +#ifndef masked_user_read_access_begin
> +#define masked_user_read_access_begin masked_user_access_begin
> +#endif
I think that needs merging with the bit above.
Perhaps generating something like:
#ifdef masked_user_access_begin
#define masked_user_read_access_begin masked_user_access_begin
#define masked_user_write_access_begin masked_user_access_begin
#endif
#ifdef masked_user_read_access_begin
#define can_do_masked_user_access() 1
#else
#define can_do_masked_user_access() 0
#define masked_user_read_access_begin(src) NULL
#define masked_user_write_access_begin(src) NULL
#define mask_user_address(src) (src)
#endif
Otherwise you'll have to #define masked_user_access_begin even though
it is never used.
Two more patches could change x86-64 to define both and then remove
the 'then unused' first check - but that has to be for later.
David
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-22 9:52 ` [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter() Christophe Leroy
@ 2025-06-22 16:52 ` David Laight
2025-06-22 16:57 ` Linus Torvalds
1 sibling, 0 replies; 36+ messages in thread
From: David Laight @ 2025-06-22 16:52 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Sun, 22 Jun 2025 11:52:40 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> The results of "access_ok()" can be mis-speculated. The result is that
> you can end up speculatively:
>
> if (access_ok(from, size))
> // Right here
>
> For the same reason as done in copy_from_user() by
> commit 74e19ef0ff80 ("uaccess: Add speculation barrier to
> copy_from_user()"), add a speculation barrier to copy_from_user_iter().
I'm sure I sent a patch to change this code to use the 'masked' functions.
Probably ought to be done at the same time.
Would have been early feb, about the time I suggested:
+#ifdef masked_user_access_begin
+#define masked_user_read_access_begin(from, size) \
+ ((*(from) = masked_user_access_begin(*(from))), 1)
+#define masked_user_write_access_begin(from, size) \
+ ((*(from) = masked_user_access_begin(*(from))), 1)
+#else
+#define masked_user_read_access_begin(from, size) \
+ user_read_access_begin(*(from), size)
+#define masked_user_write_access_begin(from, size) \
+ user_write_access_begin(*(from), size)
+#endif
allowing:
- if (!user_read_access_begin(from, sizeof(*from)))
+ if (!masked_user_read_access_begin(&from, sizeof(*from)))
David
>
> See commit 74e19ef0ff80 ("uaccess: Add speculation barrier to
> copy_from_user()") for more details.
>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> lib/iov_iter.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index f9193f952f49..ebf524a37907 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -50,6 +50,13 @@ size_t copy_from_user_iter(void __user *iter_from, size_t progress,
> if (should_fail_usercopy())
> return len;
> if (access_ok(iter_from, len)) {
> + /*
> + * Ensure that bad access_ok() speculation will not
> + * lead to nasty side effects *after* the copy is
> + * finished:
> + */
> + barrier_nospec();
> +
> to += progress;
> instrument_copy_from_user_before(to, iter_from, len);
> res = raw_copy_from_user(to, iter_from, len);
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-22 9:52 ` [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter() Christophe Leroy
2025-06-22 16:52 ` David Laight
@ 2025-06-22 16:57 ` Linus Torvalds
2025-06-22 20:18 ` David Laight
2025-06-24 5:49 ` Christophe Leroy
1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2025-06-22 16:57 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On Sun, 22 Jun 2025 at 02:52, Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> The results of "access_ok()" can be mis-speculated.
Hmm. This code is critical. I think it should be converted to use that
masked address thing if we have to add it here.
And at some point this access_ok() didn't even exist, because we check
the addresses at iter creation time. So this one might be a "belt and
suspenders" check, rather than something critical.
(Although I also suspect that when we added ITER_UBUF we might have
created cases where those user addresses aren't checked at iter
creation time any more).
Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] powerpc: Implement masked user access
2025-06-22 9:52 ` [PATCH 5/5] powerpc: Implement masked user access Christophe Leroy
@ 2025-06-22 17:13 ` David Laight
2025-06-22 17:40 ` Linus Torvalds
2025-06-22 18:57 ` Segher Boessenkool
0 siblings, 2 replies; 36+ messages in thread
From: David Laight @ 2025-06-22 17:13 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Sun, 22 Jun 2025 11:52:43 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> Masked user access avoids the address/size verification by access_ok().
> Although its main purpose is to skip the speculation in the
> verification of user address and size, hence avoiding the need for
> speculation mitigation, it also has the advantage of reducing the
> number of instructions needed, so it also benefits platforms that
> don't need speculation mitigation, especially when the size of the
> copy is not known at build time.
Not checking the size is slightly orthogonal.
It really just depends on the accesses being 'reasonably sequential'.
That is probably always true since access_ok() covers a single copy.
>
> So implement masked user access on powerpc. The only requirement is
> to have a memory gap that faults between the top of user space and the
> real start of the kernel area. On 64-bit platforms it is easy: bit 0 is
> always 0 for user addresses and always 1 for kernel addresses, and
> user addresses stop long before the end of the area. On 32 bits it
> is more tricky. In theory user space can go up to 0xbfffffff while
> the kernel will usually start at 0xc0000000, so a gap needs to be added
> in between. Although in theory a single 4k page would suffice, it
> is easier and more efficient to enforce a 128k gap below the kernel,
> as it simplifies the masking.
The gap isn't strictly necessary - provided the first access is guaranteed
to be at the specified address and the transfers are guaranteed sequential.
But that is hard to guarantee.
Where does the vdso end up?
My guess is 'near the top of userspace' - but maybe not.
>
> Unlike x86_64, which masks the address to 'all bits set' when the
> user address is invalid, here the address is set to an address in
> the gap. It avoids relying on the zero page to catch offset
> accesses.
Not true.
Using 'cmov' also removed an instruction.
>
> e500 has the isel instruction, which allows selecting one value or
> the other without a branch, and that instruction is not speculative, so
> use it. Although GCC usually generates code using that instruction,
> it is safer to use inline assembly to be sure. The result is:
>
> 14: 3d 20 bf fe lis r9,-16386
> 18: 7c 03 48 40 cmplw r3,r9
> 1c: 7c 69 18 5e iselgt r3,r9,r3
>
> On other ones, when kernel space is over 0x80000000 and user space
> is below, the logic in mask_user_address_simple() leads to a
> 3 instruction sequence:
>
> 14: 7c 69 fe 70 srawi r9,r3,31
> 18: 7c 63 48 78 andc r3,r3,r9
> 1c: 51 23 00 00 rlwimi r3,r9,0,0,0
>
> This is the default on powerpc 8xx.
>
> When the limit between user space and kernel space is not 0x80000000,
> mask_user_address_32() is used and a 6 instructions sequence is
> generated:
>
> 24: 54 69 7c 7e srwi r9,r3,17
> 28: 21 29 57 ff subfic r9,r9,22527
> 2c: 7d 29 fe 70 srawi r9,r9,31
> 30: 75 2a b0 00 andis. r10,r9,45056
> 34: 7c 63 48 78 andc r3,r3,r9
> 38: 7c 63 53 78 or r3,r3,r10
>
> The constraint is that TASK_SIZE be aligned to 128K in order to get
> the most optimal number of instructions.
>
> When CONFIG_PPC_BARRIER_NOSPEC is not defined, fallback on the
> test-based masking as it is quicker than the 6 instructions sequence
> but not necessarily quicker than the 3 instructions sequences above.
Doesn't that depend on whether the branch is predicted correctly?
I can't read ppc asm well enough to check the above.
And the C is also a bit tortuous.
>
> On 64 bits, kernel is always above 0x8000000000000000 and user always
> below, which leads to a 4 instructions sequence:
>
> 80: 7c 69 1b 78 mr r9,r3
> 84: 7c 63 fe 76 sradi r3,r3,63
> 88: 7d 29 18 78 andc r9,r9,r3
> 8c: 79 23 00 4c rldimi r3,r9,0,1
>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> arch/powerpc/Kconfig | 2 +-
> arch/powerpc/include/asm/uaccess.h | 100 +++++++++++++++++++++++++++++
> 2 files changed, 101 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index c3e0cc83f120..c26a39b4504a 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -1303,7 +1303,7 @@ config TASK_SIZE
> hex "Size of user task space" if TASK_SIZE_BOOL
> default "0x80000000" if PPC_8xx
> default "0xb0000000" if PPC_BOOK3S_32 && EXECMEM
> - default "0xc0000000"
> + default "0xbffe0000"
>
> config MODULES_SIZE_BOOL
> bool "Set custom size for modules/execmem area"
> diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
> index 89d53d4c2236..19743ee80523 100644
> --- a/arch/powerpc/include/asm/uaccess.h
> +++ b/arch/powerpc/include/asm/uaccess.h
> @@ -2,6 +2,8 @@
> #ifndef _ARCH_POWERPC_UACCESS_H
> #define _ARCH_POWERPC_UACCESS_H
>
> +#include <linux/sizes.h>
> +
> #include <asm/processor.h>
> #include <asm/page.h>
> #include <asm/extable.h>
> @@ -455,6 +457,104 @@ user_write_access_begin(const void __user *ptr, size_t len)
> #define user_write_access_begin user_write_access_begin
> #define user_write_access_end prevent_current_write_to_user
>
> +/*
> + * Masking the user address is an alternative to a conditional
> + * user_access_begin that can avoid the fencing. This only works
> + * for dense accesses starting at the address.
I think you need to say that kernel addresses get converted to an invalid
address between user and kernel addresses.
It works provided accesses are 'reasonably dense'.
David
> + */
> +static inline void __user *mask_user_address_simple(const void __user *ptr)
> +{
> + unsigned long addr = (unsigned long)ptr;
> + unsigned long mask = (unsigned long)((long)addr >> (BITS_PER_LONG - 1));
> +
> + addr = ((addr & ~mask) & (~0UL >> 1)) | (mask & (1UL << (BITS_PER_LONG - 1)));
> +
> + return (void __user *)addr;
> +}
> +
> +static inline void __user *mask_user_address_e500(const void __user *ptr)
> +{
> + unsigned long addr;
> +
> + asm("cmplw %1, %2; iselgt %0, %2, %1" : "=r"(addr) : "r"(ptr), "r"(TASK_SIZE): "cr0");
> +
> + return (void __user *)addr;
> +}
> +
> +/* Make sure TASK_SIZE is a multiple of 128K for shifting by 17 to the right */
> +static inline void __user *mask_user_address_32(const void __user *ptr)
> +{
> + unsigned long addr = (unsigned long)ptr;
> + unsigned long mask = (unsigned long)((long)((TASK_SIZE >> 17) - 1 - (addr >> 17)) >> 31);
> +
> + addr = (addr & ~mask) | (TASK_SIZE & mask);
> +
> + return (void __user *)addr;
> +}
> +
> +static inline void __user *mask_user_address_fallback(const void __user *ptr)
> +{
> + unsigned long addr = (unsigned long)ptr;
> +
> + return (void __user *)(addr < TASK_SIZE ? addr : TASK_SIZE);
> +}
> +
> +static inline void __user *mask_user_address(const void __user *ptr)
> +{
> +#ifdef MODULES_VADDR
> + const unsigned long border = MODULES_VADDR;
> +#else
> + const unsigned long border = PAGE_OFFSET;
> +#endif
> + BUILD_BUG_ON(TASK_SIZE_MAX & (SZ_128K - 1));
> + BUILD_BUG_ON(TASK_SIZE_MAX + SZ_128K > border);
> + BUILD_BUG_ON(TASK_SIZE_MAX & 0x8000000000000000ULL);
> + BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC64) && !(PAGE_OFFSET & 0x8000000000000000ULL));
> +
> + if (IS_ENABLED(CONFIG_PPC64))
> + return mask_user_address_simple(ptr);
> + if (IS_ENABLED(CONFIG_E500))
> + return mask_user_address_e500(ptr);
> + if (TASK_SIZE <= SZ_2G && border >= SZ_2G)
> + return mask_user_address_simple(ptr);
> + if (IS_ENABLED(CONFIG_PPC_BARRIER_NOSPEC))
> + return mask_user_address_32(ptr);
> + return mask_user_address_fallback(ptr);
> +}
> +
> +static inline void __user *masked_user_access_begin(const void __user *p)
> +{
> + void __user *ptr = mask_user_address(p);
> +
> + might_fault();
> + allow_read_write_user(ptr, ptr);
> +
> + return ptr;
> +}
> +#define masked_user_access_begin masked_user_access_begin
> +
> +static inline void __user *masked_user_read_access_begin(const void __user *p)
> +{
> + void __user *ptr = mask_user_address(p);
> +
> + might_fault();
> + allow_read_from_user(ptr);
> +
> + return ptr;
> +}
> +#define masked_user_read_access_begin masked_user_read_access_begin
> +
> +static inline void __user *masked_user_write_access_begin(const void __user *p)
> +{
> + void __user *ptr = mask_user_address(p);
> +
> + might_fault();
> + allow_write_to_user(ptr);
> +
> + return ptr;
> +}
> +#define masked_user_write_access_begin masked_user_write_access_begin
> +
> #define unsafe_get_user(x, p, e) do { \
> __long_type(*(p)) __gu_val; \
> __typeof__(*(p)) __user *__gu_addr = (p); \
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] powerpc: Implement masked user access
2025-06-22 17:13 ` David Laight
@ 2025-06-22 17:40 ` Linus Torvalds
2025-06-22 19:51 ` David Laight
2025-06-22 18:57 ` Segher Boessenkool
1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2025-06-22 17:40 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On Sun, 22 Jun 2025 at 10:13, David Laight <david.laight.linux@gmail.com> wrote:
>
> Not checking the size is slightly orthogonal.
> It really just depends on the accesses being 'reasonably sequential'.
> That is probably always true since access_ok() covers a single copy.
It is probably true in practice, but yeah, it's worth thinking about.
Particularly for various user level structure accesses, we do end up
often accessing the members individually and thus potentially out of
order, but as you say "reasonable sequential" is still true: the
accesses are within a reasonably small offset of each other.
And when we have potentially very big accesses with large offsets from
the beginning (ie things like read/write() calls), we do them
sequentially.
There *might* be odd ioctls and such that get offsets from user space,
though. So any conversion to using 'masked_user_access_begin()' needs
to have at least *some* thought and not be just a mindless conversion
from access_ok().
We have this same issue in access_ok() itself, and on x86-64 that does
static inline bool __access_ok(const void __user *ptr, unsigned long size)
{
if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
return valid_user_address(ptr);
.. do the more careful one that actually uses the 'size' ...
so it turns access_ok() itself into just a simple single-ended
comparison with the starting address for small sizes, because we know
it's ok to overflow by a bit (because of how valid_user_address()
works on x86).
Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] powerpc: Implement masked user access
2025-06-22 17:13 ` David Laight
2025-06-22 17:40 ` Linus Torvalds
@ 2025-06-22 18:57 ` Segher Boessenkool
1 sibling, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2025-06-22 18:57 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Sun, Jun 22, 2025 at 06:13:51PM +0100, David Laight wrote:
> On Sun, 22 Jun 2025 11:52:43 +0200
> Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> > e500 has the isel instruction, which allows selecting one value or
> > the other without a branch, and that instruction is not speculative, so
> > use it. Although GCC usually generates code using that instruction,
> > it is safer to use inline assembly to be sure. The result is:
The instruction (which is a standard Power instruction since
architecture version 2.03, published in 2006) can in principle be
speculative, but there exist no Power implementations that do any data
speculation like this at all.
If you want any particular machine instructions to be generated you have
to manually write it, sure, in inline asm or preferably in actual asm.
But you can be sure that GCC will generate isel or similar (like the
v3.1 set[n]bc[r] insns, best instructions ever!), whenever appropriate,
i.e. when it is a) allowed at all, and b) advantageous.
> > 14: 3d 20 bf fe lis r9,-16386
> > 18: 7c 03 48 40 cmplw r3,r9
> > 1c: 7c 69 18 5e iselgt r3,r9,r3
> >
> > On other ones, when kernel space is over 0x80000000 and user space
> > is below, the logic in mask_user_address_simple() leads to a
> > 3 instruction sequence:
> >
> > 14: 7c 69 fe 70 srawi r9,r3,31
> > 18: 7c 63 48 78 andc r3,r3,r9
> > 1c: 51 23 00 00 rlwimi r3,r9,0,0,0
> >
> > This is the default on powerpc 8xx.
> >
> > When the limit between user space and kernel space is not 0x80000000,
> > mask_user_address_32() is used and a 6 instructions sequence is
> > generated:
> >
> > 24: 54 69 7c 7e srwi r9,r3,17
> > 28: 21 29 57 ff subfic r9,r9,22527
> > 2c: 7d 29 fe 70 srawi r9,r9,31
> > 30: 75 2a b0 00 andis. r10,r9,45056
> > 34: 7c 63 48 78 andc r3,r3,r9
> > 38: 7c 63 53 78 or r3,r3,r10
> >
> > The constraint is that TASK_SIZE be aligned to 128K in order to get
> > the most optimal number of instructions.
> >
> > When CONFIG_PPC_BARRIER_NOSPEC is not defined, fallback on the
> > test-based masking as it is quicker than the 6 instructions sequence
> > but not necessarily quicker than the 3 instructions sequences above.
>
> Doesn't that depend on whether the branch is predicted correctly?
>
> I can't read ppc asm well enough to check the above.
[ PowerPC or Power (or Power Architecture, or Power ISA) ]
> And the C is also a bit tortuous.
I can read the code ;-) All those instructions are normal simple
integer instructions. Shifts, adds, logicals.
In general, correctly predicted non-taken bvranches cost absolutely
nothing. Correctly predicted taken branches cost the same as any taken
branch, so a refetch, maybe resulting in a cycle or so of decode bubble.
And a mispredicted branch can be very expensive, say on the order of a
hundred cycles (but usually more like ten, which is still a lot of insns
worth).
So branches are great for predictable stuff, and "not so great" for
not so predictable stuff.
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] powerpc: Implement masked user access
2025-06-22 17:40 ` Linus Torvalds
@ 2025-06-22 19:51 ` David Laight
0 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2025-06-22 19:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On Sun, 22 Jun 2025 10:40:00 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 22 Jun 2025 at 10:13, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Not checking the size is slightly orthogonal.
> > It really just depends on the accesses being 'reasonably sequential'.
> > That is probably always true since access_ok() covers a single copy.
>
> It is probably true in practice, but yeah, it's worth thinking about.
> Particularly for various user level structure accesses, we do end up
> often accessing the members individually and thus potentially out of
> order, but as you say "reasonable sequential" is still true: the
> accesses are within a reasonably small offset of each other.
I found one that did ptr[4] followed by ptr[0].
Which was potentially problematic if changed to use 'masked' accesses
before you changed the code to use cmov.
>
> And when we have potentially very big accesses with large offsets from
> the beginning (ie things like read/write() calls), we do them
> sequentially.
>
> There *might* be odd ioctls and such that get offsets from user space,
> though. So any conversion to using 'masked_user_access_begin()' needs
> to have at least *some* thought and not be just a mindless conversion
> from access_ok().
True - but the ioctl (like) code is more likely to be using copy_to/from_user()
on the offsetted address rather than trying to be too clever.
>
> We have this same issue in access_ok() itself, and on x86-64 that does
>
> static inline bool __access_ok(const void __user *ptr, unsigned long size)
> {
> if (__builtin_constant_p(size <= PAGE_SIZE) && size <= PAGE_SIZE) {
> return valid_user_address(ptr);
> .. do the more careful one that actually uses the 'size' ...
>
> so it turns access_ok() itself into just a simple single-ended
> comparison with the starting address for small sizes, because we know
> it's ok to overflow by a bit (because of how valid_user_address()
> works on x86).
IIRC there is a comment just below that says the size could (probably)
just be ignored.
Given how few access_ok() calls there ought to be, checking them shouldn't be
a problem.
But I suspect either io_uring or bpf does something strange and unexpected
that is probably a bug waiting to be found.
I remember some very strange code that has two iovec[] for reading data
from a second process.
I think I failed to find all the access_ok() tests.
IIRC it isn't used by anything 'important' and could be nuked on
security grounds.
David
>
> Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-22 16:57 ` Linus Torvalds
@ 2025-06-22 20:18 ` David Laight
2025-06-24 5:49 ` Christophe Leroy
1 sibling, 0 replies; 36+ messages in thread
From: David Laight @ 2025-06-22 20:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On Sun, 22 Jun 2025 09:57:20 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 22 Jun 2025 at 02:52, Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
> >
> > The results of "access_ok()" can be mis-speculated.
>
> Hmm. This code is critical. I think it should be converted to use that
> masked address thing if we have to add it here.
If access_ok() is mis-speculated then you get a read from the user-specified
kernel address - I don't think that matters.
The hacker would need to find somewhere where the read value was used
in a test or memory access so that side effects (typically cache line
evictions) can be detected.
But copy_from_user_iter() is pretty much always used for 'data' not
'control plane' - so you'd be hard pushed to find somewhere 'useful'.
Not only that, the cpu would have to return from copy_from_user_iter()
before correcting the mis-speculation.
I can't imagine that happening - even without all the 'return thunk' stuff.
The same might be true for copy_from_user().
It might only be get_user() that actually has any chance of being exploited.
>
> And at some point this access_ok() didn't even exist, because we check
> the addresses at iter creation time. So this one might be a "belt and
> suspenders" check, rather than something critical.
IIRC there was a patch to move the access_ok() much nearer the actual copy.
But it didn't go as far as removing the one from import_iovec().
Although removing that one might make sense.
(I've also looked at whether the 'direction' is needed in the 'iter'.
98% of the code knows what it should be - and may contain pointless
checks, but some bits seem to rely on it.)
David
>
> (Although I also suspect that when we added ITER_UBUF we might have
> created cases where those user addresses aren't checked at iter
> creation time any more).
>
> Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-22 16:20 ` [PATCH 0/5] " David Laight
@ 2025-06-24 5:27 ` Christophe Leroy
2025-06-24 8:32 ` David Laight
2025-06-24 13:17 ` Segher Boessenkool
0 siblings, 2 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-24 5:27 UTC (permalink / raw)
To: David Laight
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On 22/06/2025 at 18:20, David Laight wrote:
> On Sun, 22 Jun 2025 11:52:38 +0200
> Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
>
>> Masked user access avoids the address/size verification by access_ok().
>> Allthough its main purpose is to skip the speculation in the
>> verification of user address and size hence avoid the need of spec
>> mitigation, it also has the advantage to reduce the amount of
>> instructions needed so it also benefits to platforms that don't
>> need speculation mitigation, especially when the size of the copy is
>> not know at build time.
>
> It also removes a conditional branch that is quite likely to be
> statically predicted 'the wrong way'.
But include/asm-generic/access_ok.h defines access_ok() as:
#define access_ok(addr, size) likely(__access_ok(addr, size))
So GCC uses the 'unlikely' variant of the branch instruction to force
the correct prediction, doesn't it?
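(For reference, the generic likely()/unlikely() helpers boil down to
__builtin_expect() hints; quoting the usual definitions from memory:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

whether that turns into a differently-hinted branch instruction is then
up to the target backend.)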
>
>> Unlike x86_64 which masks the address to 'all bits set' when the
>> user address is invalid, here the address is set to an address in
>> the gap. It avoids relying on the zero page to catch offseted
>> accesses. On book3s/32 it makes sure the opening remains on user
>> segment. The overcost is a single instruction in the masking.
>
> That isn't true (any more).
> Linus changed the check to (approx):
> if (uaddr > TASK_SIZE)
> uaddr = TASK_SIZE;
> (Implemented with a conditional move)
Ah ok, I overlooked that, I didn't know the cmove instruction, it seems
similar to the isel instruction on powerpc e500.
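In C terms I guess the clamp is roughly this (only a sketch of the idea,
not the actual x86 code, and the helper name is made up):

/* Clamp an out-of-range user pointer to the first invalid address; the
 * ternary is expected to become a cmov/isel, so no conditional branch. */
static inline void __user *clamp_user_address(void __user *p)
{
	return (unsigned long)p > TASK_SIZE ? (void __user *)TASK_SIZE : p;
}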
Christophe
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin
2025-06-22 16:35 ` David Laight
@ 2025-06-24 5:34 ` Christophe Leroy
0 siblings, 0 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-24 5:34 UTC (permalink / raw)
To: David Laight
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On 22/06/2025 at 18:35, David Laight wrote:
> On Sun, 22 Jun 2025 11:52:39 +0200
> Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
>
>> Allthough masked_user_access_begin() seems to only be used when reading
>> data from user at the moment, introduce masked_user_read_access_begin()
>> and masked_user_write_access_begin() in order to match
>> user_read_access_begin() and user_write_access_begin().
>>
>> Have them default to masked_user_access_begin() when they are
>> not defined.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>> fs/select.c | 2 +-
>> include/linux/uaccess.h | 8 ++++++++
>> kernel/futex/futex.h | 4 ++--
>> lib/strncpy_from_user.c | 2 +-
>> lib/strnlen_user.c | 2 +-
>> 5 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/select.c b/fs/select.c
>> index 9fb650d03d52..d8547bedf5eb 100644
>> --- a/fs/select.c
>> +++ b/fs/select.c
>> @@ -777,7 +777,7 @@ static inline int get_sigset_argpack(struct sigset_argpack *to,
>> // the path is hot enough for overhead of copy_from_user() to matter
>> if (from) {
>> if (can_do_masked_user_access())
>> - from = masked_user_access_begin(from);
>> + from = masked_user_read_access_begin(from);
>> else if (!user_read_access_begin(from, sizeof(*from)))
>> return -EFAULT;
>> unsafe_get_user(to->p, &from->p, Efault);
>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>> index 7c06f4795670..682a0cd2fe51 100644
>> --- a/include/linux/uaccess.h
>> +++ b/include/linux/uaccess.h
>> @@ -41,6 +41,14 @@
>
>> #ifdef masked_user_access_begin
>> #define can_do_masked_user_access() 1
>> #else
>> #define can_do_masked_user_access() 0
>> #define masked_user_access_begin(src) NULL
>> #define mask_user_address(src) (src)
>> #endif
>>
>> +#ifndef masked_user_write_access_begin
>> +#define masked_user_write_access_begin masked_user_access_begin
>> +#endif
>> +#ifndef masked_user_read_access_begin
>> +#define masked_user_read_access_begin masked_user_access_begin
>> +#endif
>
> I think that needs merging with the bit above.
> Perhaps generating something like:
>
> #ifdef masked_user_access_begin
> #define masked_user_read_access_begin masked_user_access_begin
> #define masked_user_write_access_begin masked_user_access_begin
> #endif
>
> #ifdef masked_user_read_access_begin
> #define can_do_masked_user_access() 1
> #else
> #define can_do_masked_user_access() 0
> #define masked_user_read_access_begin(src) NULL
> #define masked_user_write_access_begin(src) NULL
> #define mask_user_address(src) (src)
> #endif
>
> Otherwise you'll have to #define masked_user_access_begin even though
> it is never used.
I'm not sure I understand what you mean.
masked_user_access_begin() is used, for instance in
arch/x86/include/asm/futex.h, so it will remain.
masked_user_access_begin() is the analogue of user_access_begin(): it
starts a read-write user access, so it is worth keeping.
>
> Two more patches could change x86-64 to define both and then remove
> the 'then unused' first check - but that has to be for later.
>
Christophe
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-22 16:57 ` Linus Torvalds
2025-06-22 20:18 ` David Laight
@ 2025-06-24 5:49 ` Christophe Leroy
2025-06-24 8:07 ` David Laight
2025-06-24 15:15 ` Linus Torvalds
1 sibling, 2 replies; 36+ messages in thread
From: Christophe Leroy @ 2025-06-24 5:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On 22/06/2025 at 18:57, Linus Torvalds wrote:
> On Sun, 22 Jun 2025 at 02:52, Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>> The results of "access_ok()" can be mis-speculated.
>
> Hmm. This code is critical. I think it should be converted to use that
> masked address thing if we have to add it here.
Ok, I'll add it.
>
> And at some point this access_ok() didn't even exist, because we check
> the addresses at iter creation time. So this one might be a "belt and
> suspenders" check, rather than something critical.
>
> (Although I also suspect that when we added ITER_UBUF we might have
> created cases where those user addresses aren't checked at iter
> creation time any more).
>
Let's take the following path as an example:
snd_pcm_ioctl(SNDRV_PCM_IOCTL_WRITEI_FRAMES)
snd_pcm_common_ioctl()
snd_pcm_xferi_frames_ioctl()
snd_pcm_lib_write()
__snd_pcm_lib_xfer()
default_write_copy()
copy_from_iter()
_copy_from_iter()
__copy_from_iter()
iterate_and_advance()
iterate_and_advance2()
iterate_iovec()
copy_from_user_iter()
As far as I can see, none of those functions check the accessibility of
the iovec. Am I missing something?
Christophe
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-24 5:49 ` Christophe Leroy
@ 2025-06-24 8:07 ` David Laight
2025-06-24 15:15 ` Linus Torvalds
1 sibling, 0 replies; 36+ messages in thread
From: David Laight @ 2025-06-24 8:07 UTC (permalink / raw)
To: Christophe Leroy
Cc: Linus Torvalds, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On Tue, 24 Jun 2025 07:49:03 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> On 22/06/2025 at 18:57, Linus Torvalds wrote:
> > On Sun, 22 Jun 2025 at 02:52, Christophe Leroy
> > <christophe.leroy@csgroup.eu> wrote:
> >>
> >> The results of "access_ok()" can be mis-speculated.
> >
> > Hmm. This code is critical. I think it should be converted to use that
> > masked address thing if we have to add it here.
>
> Ok, I'll add it.
>
> >
> > And at some point this access_ok() didn't even exist, because we check
> > the addresses at iter creation time. So this one might be a "belt and
> > suspenders" check, rather than something critical.
> >
> > (Although I also suspect that when we added ITER_UBUF we might have
> > created cases where those user addresses aren't checked at iter
> > creation time any more).
> >
>
> Let's take the follow path as an exemple:
>
> snd_pcm_ioctl(SNDRV_PCM_IOCTL_WRITEI_FRAMES)
> snd_pcm_common_ioctl()
> snd_pcm_xferi_frames_ioctl()
> snd_pcm_lib_write()
> __snd_pcm_lib_xfer()
> default_write_copy()
> copy_from_iter()
> _copy_from_iter()
> __copy_from_iter()
> iterate_and_advance()
> iterate_and_advance2()
> iterate_iovec()
> copy_from_user_iter()
>
> As far as I can see, none of those functions check the accessibility of
> the iovec. Am I missing something ?
The import_ubuf() in do_transfer() ought to contain one.
But really you want the one in copy_from_user_iter() rather than the outer one.
Mind you, that code is horrid.
The code only ever copies a single buffer, so it could be much shorter.
And is that deep call chain really needed for the very common case of one buffer?
David
>
> Christophe
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 5:27 ` Christophe Leroy
@ 2025-06-24 8:32 ` David Laight
2025-06-24 21:37 ` Segher Boessenkool
2025-06-24 13:17 ` Segher Boessenkool
1 sibling, 1 reply; 36+ messages in thread
From: David Laight @ 2025-06-24 8:32 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Tue, 24 Jun 2025 07:27:47 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> On 22/06/2025 at 18:20, David Laight wrote:
> > On Sun, 22 Jun 2025 11:52:38 +0200
> > Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> >
> >> Masked user access avoids the address/size verification by access_ok().
> >> Allthough its main purpose is to skip the speculation in the
> >> verification of user address and size hence avoid the need of spec
> >> mitigation, it also has the advantage to reduce the amount of
> >> instructions needed so it also benefits to platforms that don't
> >> need speculation mitigation, especially when the size of the copy is
> >> not know at build time.
> >
> > It also removes a conditional branch that is quite likely to be
> > statically predicted 'the wrong way'.
>
> But include/asm-generic/access_ok.h defines access_ok() as:
>
> #define access_ok(addr, size) likely(__access_ok(addr, size))
>
> So GCC uses the 'unlikely' variant of the branch instruction to force
> the correct prediction, doesn't it ?
Nope...
Most architectures don't have likely/unlikely variants of branches.
So all gcc can do is decide which path is the fall-through and
whether the branch is forwards or backwards.
Additionally, unless there is code in both the 'if' and 'else' clauses
the [un]likely seems to have no effect.
So on simple cpus that predict 'backwards branches taken' you can get
the desired effect - but it may need an 'asm comment' to force the
compiler to generate the required branches (eg a forwards branch directly
to a backwards unconditional jump).
On x86 it is all more complicated.
I think the pre-fetch code is likely to assume 'not taken' (but might
use stale info on the cache line).
The predictor itself never does 'static prediction' - it is always
based on the referenced branch prediction data structure.
So, unless you are in a loop (eg running a benchmark!) there is pretty
much a 50% chance of a branch mispredict.
I've been trying to benchmark different versions of the u64 * u64 / u64
function - and I think mispredicted branches make a big difference.
I need to sit down and sequence the test cases so that I can see
the effect of each branch!
>
> >
> >> Unlike x86_64 which masks the address to 'all bits set' when the
> >> user address is invalid, here the address is set to an address in
> >> the gap. It avoids relying on the zero page to catch offseted
> >> accesses. On book3s/32 it makes sure the opening remains on user
> >> segment. The overcost is a single instruction in the masking.
> >
> > That isn't true (any more).
> > Linus changed the check to (approx):
> > if (uaddr > TASK_SIZE)
> > uaddr = TASK_SIZE;
> > (Implemented with a conditional move)
>
> Ah ok, I overlooked that, I didn't know the cmove instruction, seem
> similar to the isel instruction on powerpc e500.
It got added with the Pentium Pro (P6) - I learnt 8086 :-)
I suspect x86 got there first...
Although called 'conditional move' I very much suspect the write is
actually unconditional.
So the hardware implementation is much the same as 'add carry' except
the ALU operation is a simple multiplex.
Which means it is unlikely to be speculative.
David
>
> Christophe
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 5:27 ` Christophe Leroy
2025-06-24 8:32 ` David Laight
@ 2025-06-24 13:17 ` Segher Boessenkool
2025-06-24 16:50 ` David Laight
1 sibling, 1 reply; 36+ messages in thread
From: Segher Boessenkool @ 2025-06-24 13:17 UTC (permalink / raw)
To: Christophe Leroy
Cc: David Laight, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> Ah ok, I overlooked that, I didn't know the cmove instruction, seem
> similar to the isel instruction on powerpc e500.
cmove does a move (register or memory) when some condition is true.
isel (which is base PowerPC, not something "e500" only) is a
computational instruction, it copies one of two registers to a third,
which of the two is decided by any bit in the condition register.
But sure, seen from very far off both isel and cmove can be used to
implement the ternary operator ("?:"), are similar in that way :-)
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter()
2025-06-24 5:49 ` Christophe Leroy
2025-06-24 8:07 ` David Laight
@ 2025-06-24 15:15 ` Linus Torvalds
1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2025-06-24 15:15 UTC (permalink / raw)
To: Christophe Leroy
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, David Laight,
Dave Hansen, linux-kernel, linuxppc-dev, linux-fsdevel, linux-mm
On Mon, 23 Jun 2025 at 22:49, Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> >
> > (Although I also suspect that when we added ITER_UBUF we might have
> > created cases where those user addresses aren't checked at iter
> > creation time any more).
> >
>
> Let's take the follow path as an exemple:
>
> snd_pcm_ioctl(SNDRV_PCM_IOCTL_WRITEI_FRAMES)
> snd_pcm_common_ioctl()
> snd_pcm_xferi_frames_ioctl()
> snd_pcm_lib_write()
> __snd_pcm_lib_xfer()
> default_write_copy()
> copy_from_iter()
> _copy_from_iter()
> __copy_from_iter()
> iterate_and_advance()
> iterate_and_advance2()
> iterate_iovec()
> copy_from_user_iter()
>
> As far as I can see, none of those functions check the accessibility of
> the iovec. Am I missing something ?
So we still do this checking at creation time (see import_iovec ->
__import_iovec, and import_ubuf).
In the path you give as an example, the check happens at that
"do_transfer()" stage when it does
err = import_ubuf(type, (__force void __user *)data, bytes, &iter);
but yeah, it's very non-obvious (see __snd_pcm_lib_xfer(), which calls
writer() which is either interleaved_copy() or noninterleaved_copy(),
and then they do that do_transfer() thing which does that
import_ubuf() thing).
So *because* you were supposed to have checked your iov_iters
beforehand, the actual iter code itself at some point just used
__copy_to_user() directly with no checking at all.
And that all was really *much* too subtle, and Al fixed this a few
years ago (see commit 09fc68dc66f7: "iov_iter: saner checks on
copyin/copyout")
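For reference, the access_ok() that covers this path is the one in
import_ubuf() (lib/iov_iter.c), which from memory is roughly (paraphrased,
not the exact source):

int import_ubuf(int rw, void __user *buf, size_t len, struct iov_iter *i)
{
	if (len > MAX_RW_COUNT)
		len = MAX_RW_COUNT;
	if (unlikely(!access_ok(buf, len)))
		return -EFAULT;

	iov_iter_ubuf(i, rw, buf, len);
	return 0;
}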
Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 13:17 ` Segher Boessenkool
@ 2025-06-24 16:50 ` David Laight
2025-06-24 18:25 ` Segher Boessenkool
0 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2025-06-24 16:50 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Tue, 24 Jun 2025 08:17:14 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:
> On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > Ah ok, I overlooked that, I didn't know the cmove instruction, seem
> > similar to the isel instruction on powerpc e500.
>
> cmove does a move (register or memory) when some condition is true.
The destination of x86 'cmov' is always a register (only the source can be
memory - and is probably always read).
It is also a computational instruction.
It may well always do the register write - hard to detect.
There is a planned new instruction that would do a conditional write
to memory - but not on any cpu yet.
> isel (which is base PowerPC, not something "e500" only) is a
> computational instruction, it copies one of two registers to a third,
> which of the two is decided by any bit in the condition register.
Does that mean it could be used for all the ppc cpu variants?
> But sure, seen from very far off both isel and cmove can be used to
> implement the ternary operator ("?:"), are similar in that way :-)
Which is exactly what you want to avoid speculation.
David
>
>
> Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 16:50 ` David Laight
@ 2025-06-24 18:25 ` Segher Boessenkool
2025-06-24 21:08 ` David Laight
0 siblings, 1 reply; 36+ messages in thread
From: Segher Boessenkool @ 2025-06-24 18:25 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> On Tue, 24 Jun 2025 08:17:14 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
> > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > > Ah ok, I overlooked that, I didn't know the cmove instruction, seem
> > > similar to the isel instruction on powerpc e500.
> >
> > cmove does a move (register or memory) when some condition is true.
>
> The destination of x86 'cmov' is always a register (only the source can be
> memory - an is probably always read).
Both source operands can be mem, right? But probably not both at the
same time.
> It is a also a computational instruction.
Terminology...
x86 is not a RISC architecture, or more generally, a load/store
architecture.
A computational instruction is one that doesn't touch memory, do a
branch, or perform some system function (some supervisor or hypervisor
instruction, maybe).
x86 does not have many computational insns, most insns can touch
memory :-)
(The important thing is that most computational insns do not ever cause
exceptions, the only exceptions are if you divide by zero or
similar :-) )
> It may well always do the register write - hard to detect.
>
> There is a planned new instruction that would do a conditional write
> to memory - but not on any cpu yet.
Interesting! Instructions like the atomic store insns we got for p9,
maybe? They can do minimum/maximum and various kinds of more generic
reductions and similar.
> > isel (which is base PowerPC, not something "e500" only) is a
> > computational instruction, it copies one of two registers to a third,
> > which of the two is decided by any bit in the condition register.
>
> Does that mean it could be used for all the ppc cpu variants?
No, only things that implement architecture version of 2.03 or later.
That is from 2006, so essentially everything that is still made
implements it :-)
But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
do not have it yet either, but the newer ones do.
And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
> > But sure, seen from very far off both isel and cmove can be used to
> > implement the ternary operator ("?:"), are similar in that way :-)
>
> Which is exactly what you want to avoid speculation.
There are cheaper / simpler / more effective / better ways to get that,
but sure, everything is better than a conditional branch, always :-)
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 18:25 ` Segher Boessenkool
@ 2025-06-24 21:08 ` David Laight
2025-06-26 5:56 ` Christophe Leroy
2025-06-26 21:39 ` Segher Boessenkool
0 siblings, 2 replies; 36+ messages in thread
From: David Laight @ 2025-06-24 21:08 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Tue, 24 Jun 2025 13:25:05 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:
> Hi!
>
> On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> > On Tue, 24 Jun 2025 08:17:14 -0500
> > Segher Boessenkool <segher@kernel.crashing.org> wrote:
> >
> > > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > > > Ah ok, I overlooked that, I didn't know the cmove instruction, seem
> > > > similar to the isel instruction on powerpc e500.
> > >
> > > cmove does a move (register or memory) when some condition is true.
> >
> > The destination of x86 'cmov' is always a register (only the source can be
> > memory - and is probably always read).
>
> Both source operands can be mem, right? But probably not both at the
> same time.
It only has one 'real' source, but the implementation could easily
read the destination register and then decide which value to write
back - rather than doing a conditional write to the register file.
A conditional write would be a right PITA for the alu result
forwarding logic.
>
> > It is a also a computational instruction.
>
> Terminology...
>
> x86 is not a RISC architecture, or more generally, a load/store
> architecture.
It sort of is these days.
The memory transfers are separate u-ops, so a 'reg += mem' instruction
is split into two by the decoder.
Although some u-ops get merged together and executed in one clock,
an obvious example being some 'compare+branch' pairs.
> A computational instruction is one that doesn't touch memory or does a
> branch, or some system function, some supervisor or hypervisor
> instruction maybe.
>
> x86 does not have many computational insns, most insns can touch
> memory :-)
Except that the memory 'bit' is executed separately from any alu 'stuff'.
So for a 'reg += mem' instruction the memory read can be started as soon
as the registers that contain the address are valid, the 'add' requires
the memory read have completed and the instruction that generated the
old value of 'reg' have completed - which could be waiting on all sorts
of things (like a divide). Once both values are ready the 'add' can be
executed (provided a suitable alu is available).
> (The important thing is that most computational insns do not ever cause
> exceptions, the only exceptions are if you divide by zero or
> similar :-) )
>
> > It may well always do the register write - hard to detect.
> >
> > There is a planned new instruction that would do a conditional write
> > to memory - but not on any cpu yet.
>
> Interesting! Instructions like the atomic store insns we got for p9,
> maybe? They can do minimum/maximum and various kinds of more generic
> reductions and similar.
I think they are only conditional stores.
But they do save a conditional branch.
A late disable of a memory write is far less problematic than a disabled
register file write. No one minds (too much) about slight delays between
writes and reads of the same location (reduced by a store to load forwarder)
but you don't want to lose clocks between adjacent simple alu instructions.
For my sins I re-implemented a soft cpu last year...
Which doesn't have a 'cmov' :-(
>
> > > isel (which is base PowerPC, not something "e500" only) is a
> > > computational instruction, it copies one of two registers to a third,
> > > which of the two is decided by any bit in the condition register.
> >
> > Does that mean it could be used for all the ppc cpu variants?
>
> No, only things that implement architecture version of 2.03 or later.
> That is from 2006, so essentially everything that is still made
> implements it :-)
>
> But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
> have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
> do not have it yet either, but the newer ones do.
>
> And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
That makes more sense than the list in patch 5/5.
>
> > > But sure, seen from very far off both isel and cmove can be used to
> > > implement the ternary operator ("?:"), are similar in that way :-)
> >
> > Which is exactly what you want to avoid speculation.
>
> There are cheaper / simpler / more effective / better ways to get that,
> but sure, everything is better than a conditional branch, always :-)
Everything except a TLB miss :-)
And for access_ok() avoiding the conditional is a good enough reason
to use a 'conditional move' instruction.
Avoiding speculation is actually free.
>
>
> Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 8:32 ` David Laight
@ 2025-06-24 21:37 ` Segher Boessenkool
2025-06-25 8:30 ` David Laight
0 siblings, 1 reply; 36+ messages in thread
From: Segher Boessenkool @ 2025-06-24 21:37 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Tue, Jun 24, 2025 at 09:32:58AM +0100, David Laight wrote:
> > So GCC uses the 'unlikely' variant of the branch instruction to force
> > the correct prediction, doesn't it ?
>
> Nope...
> Most architectures don't have likely/unlikely variants of branches.
In GCC, "likely" means 80%. "Very likely" means 99.95%. Most things get
something more appropriate than such coarse things predicted.
Most of the time GCC uses these predicted branch probabilities to lay
out code in such a way that the fall-through path is the expected one.
Target backends can do special things with it as well, but usually that
isn't necessary.
There are many different predictors. GCC usually can predict things
not bad by just looking at the shape of the code, using various
heuristics. Things like profile-guided optimisation allow to use a
profile from an actual execution to optimise the code such that it will
work faster (assuming that future executions of the code will execute
similarly!)
You also can use __builtin_expect() in the source code, to put coarse
static prediction in. That is what the kernel "{un,}likely" macros do.
If the compiler knows some branch is not very predictable, it can
optimise the code knowing that. Like, it could use other strategies
than conditional branches.
On old CPUs something like "this branch is taken 50% of the time" makes
it a totally unpredictable branch. But if say it branches exactly every
second time, it is 100% predicted correctly by more advanced predictors,
not just 50%.
To properly model modern branch predictors we need to record a "how
predictable is this branch" score as well for every branch, not just a
"how often does it branch instead of falling through" score. We're not
there yet.
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 21:37 ` Segher Boessenkool
@ 2025-06-25 8:30 ` David Laight
0 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2025-06-25 8:30 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Tue, 24 Jun 2025 16:37:12 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:
> Hi!
>
> On Tue, Jun 24, 2025 at 09:32:58AM +0100, David Laight wrote:
> > > So GCC uses the 'unlikely' variant of the branch instruction to force
> > > the correct prediction, doesn't it ?
> >
> > Nope...
> > Most architectures don't have likely/unlikely variants of branches.
>
> In GCC, "likely" means 80%. "Very likely" means 99.95%. Most things get
> something more appropriate than such coarse things predicted.
>
> Most of the time GCC uses these predicted branch probabilities to lay
> out code in such a way that the fall-through path is the expected one.
That is fine provided the cpu doesn't predict the 'taken' path.
If you write:
if (unlikely(x))
continue;
gcc is very likely to generate a backwards conditional branch that
will get predicted taken (by a cpu without dynamic branch prediction).
You need to put something (an asm comment will do) before the 'continue'
to force gcc to generate a forwards (predicted not taken) branch to
the backwards jump.
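Something like this is what I mean (sketch only, assumes gcc, names made up):

#define unlikely(x)	__builtin_expect(!!(x), 0)

long sum_positive(const long *v, int n)
{
	long sum = 0;

	for (int i = 0; i < n; i++) {
		if (unlikely(v[i] < 0)) {
			/* The asm comment keeps this path as a separate block,
			 * so the conditional branch becomes a forwards one
			 * (statically predicted not-taken on simple cpus) that
			 * jumps to a backwards unconditional jump. */
			asm volatile("# unlikely path");
			continue;
		}
		sum += v[i];
	}
	return sum;
}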
> Target backends can do special things with it as well, but usually that
> isn't necessary.
>
> There are many different predictors. GCC usually can predict things
> not bad by just looking at the shape of the code, using various
> heuristics. Things like profile-guided optimisation allow to use a
> profile from an actual execution to optimise the code such that it will
> work faster (assuming that future executions of the code will execute
> similarly!)
Without cpu instructions to force static prediction I don't see how that
helps as much as you might think.
Each time the code is loaded into the I-cache the branch predictor state
is likely to have been destroyed by other code.
So the branches get predicted by 'the other code' regardless of any layout.
>
> You also can use __builtin_expect() in the source code, to put coarse
> static prediction in. That is what the kernel "{un,}likely" macros do.
>
> If the compiler knows some branch is not very predictable, it can
> optimise the code knowing that. Like, it could use other strategies
> than conditional branches.
>
> On old CPUs something like "this branch is taken 50% of the time" makes
> it a totally unpredictable branch. But if say it branches exactly every
> second time, it is 100% predicted correctly by more advanced predictors,
> not just 50%.
Only once you are in a code loop.
Dynamic branch prediction is pretty hopeless for linear code.
The first time you execute a branch it is likely to be predicted taken
50% of the time.
(I guess a bit less than 50% - it will be the percentage of branches that
are taken.)
>
> To properly model modern branch predictors we need to record a "how
> predictable is this branch" score as well for every branch, not just a
> "how often does it branch instead of falling through" score. We're not
> there yet.
If you are going to adjust the source code you want to determine correct
static prediction for most branches.
That probably requires an 'every other' static prediction.
I spent a lot of time optimising some code to minimise the worst case path;
the first thing I had to do was disable the dynamic branch prediction logic.
David
>
>
> Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 21:08 ` David Laight
@ 2025-06-26 5:56 ` Christophe Leroy
2025-06-26 22:01 ` Segher Boessenkool
2025-06-26 21:39 ` Segher Boessenkool
1 sibling, 1 reply; 36+ messages in thread
From: Christophe Leroy @ 2025-06-26 5:56 UTC (permalink / raw)
To: David Laight, Segher Boessenkool
Cc: Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On 24/06/2025 at 23:08, David Laight wrote:
> On Tue, 24 Jun 2025 13:25:05 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
>>>> isel (which is base PowerPC, not something "e500" only) is a
>>>> computational instruction, it copies one of two registers to a third,
>>>> which of the two is decided by any bit in the condition register.
>>>
>>> Does that mean it could be used for all the ppc cpu variants?
>>
>> No, only things that implement architecture version of 2.03 or later.
>> That is from 2006, so essentially everything that is still made
>> implements it :-)
>>
>> But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
>> have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
>> do not have it yet either, but the newer ones do.
For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10
>>
>> And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
>
> That make more sense than the list in patch 5/5.
Sorry for the ambiguity. In patch 5/5 I was addressing only powerpc/32,
and as far as I know the only powerpc/32 supported by Linux that has
isel is the 85xx which has an e500 core.
For powerpc/64 we have fewer constraints than on powerpc/32:
- Kernel memory starts at 0xc000000000000000
- User memory stops at 0x0010000000000000
So we can easily use 0x8000000000000000 as the demarcation line, which
allows a 3-instruction sequence:
sradi r9,r3,63 => shift right algebraic: r9 = 0 when r3 >= 0; r9 = -1
when r3 < 0
andc r3,r3,r9 => and with complement: r3 unchanged when r9 = 0, r3 =
0 when r9 = -1
rldimi r3,r9,0,1 => Rotate left and mask insert: Insert highest bit of
r9 into r3
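In plain C that 3-instruction sequence is roughly (sketch of the logic only,
the helper name is made up):

/* When the top bit is set the address cannot be a valid user address, so
 * force it to 0x8000000000000000, which lies in the gap between user space
 * and the kernel. */
static inline unsigned long mask_user_addr64_sketch(unsigned long addr)
{
	unsigned long mask = (unsigned long)((long)addr >> 63);	/* sradi  */

	addr &= ~mask;						/* andc   */
	addr |= mask << 63;					/* rldimi */
	return addr;
}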
Whereas using isel requires a 4-instruction sequence:
li r9, 1 => load immediate: r9 = 1
rotldi r9, r9, 63 => rotate left: r9 = 0x8000000000000000
cmpld r3, r9 => compare logically
iselgt r3, r9, r3 => integer select greater than: select r3 when result
of above compare is <= ; select r9 when result of compare is >
>
>>
>>>> But sure, seen from very far off both isel and cmove can be used to
>>>> implement the ternary operator ("?:"), are similar in that way :-)
>>>
>>> Which is exactly what you want to avoid speculation.
>>
>> There are cheaper / simpler / more effective / better ways to get that,
>> but sure, everything is better than a conditional branch, always :-)
>
> Everything except a TLB miss :-)
>
> And for access_ok() avoiding the conditional is a good enough reason
> to use a 'conditional move' instruction.
> Avoiding speculation is actually free.
>
And on CPUs that are not affected by Spectre and Meltdown like powerpc
8xx or powerpc 603, masking the pointer is still worth it as it
generates better code, even if the masking involves branching.
That's the reason why I did:
- e500 (affected by Spectre v1) ==> Use isel ==> 3-instruction sequence:
lis r9, TASK_SIZE@h => load immediate shifted => r9 = (TASK_SIZE >> 16)
<< 16
cmplw r3, r9 => compare addr with TASK_SIZE
iselgt r3, r9, r3 => addr = TASK_SIZE when addr > TASK_SIZE; addr =
addr when <= TASK_SIZE
For others:
- If TASK_SIZE <= 0x80000000 and kernel >= 0x80000000 ==> 3-instruction
sequence similar to the 64-bit one above:
srawi r9,r3,31
andc r3,r3,r9
rlwimi r3,r9,0,0,0
- Otherwise, if speculation mitigation is required (e600), use the
6-instruction masking sequence (r3 contains the address to mask)
srwi r9,r3,17 => shift right: shift r3 by 17 to keep 15 bits (positive
16 bits)
subfic r9,r9,22527 => sub from immediate: r9 = (TASK_SIZE >> 17) - r9
=> r9 < 0 when r3 is greater
srawi r9,r9,31 => shift right algebraic: r9 = 0 when r9 >= 0; r9 = -1
when r9 < 0
andis. r10,r9,45056 => and immediate shifted: r10 = r9 and upper part of
TASK_SIZE => r10 = TASK_SIZE when r3 > TASK_SIZE, r10 = 0 otherwise
andc r3,r3,r9 => and with complement: r3 is unchanged when r9 = 0 so
when r3 <= TASK_SIZE, r3 is cleared when r9 = -1 so when r3 > TASK_SIZE
or r3,r3,r10 => r3 = r3 | r10
- If no speculation mitigation is required, let GCC do as it wants
List of affected CPUs is available here:
https://docs.nxp.com/bundle/GUID-805CC0EA-4001-47AD-86CD-4F340751F6B7/page/GUID-17B5D04F-6471-4EC6-BEB9-DE4D0AFA034A.html
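In plain C the 6-instruction sequence amounts to roughly this (sketch only,
relying on TASK_SIZE being 128K-aligned as said above, made-up helper name):

static inline unsigned long mask_user_addr32_sketch(unsigned long addr)
{
	/* compare in 128K units so the limit fits in an immediate */
	long over = (long)((TASK_SIZE >> 17) - 1 - (addr >> 17));	/* srwi + subfic */
	long mask = over >> 31;						/* srawi: 0 or -1 */

	/* keep the address when valid, replace it by TASK_SIZE when not */
	return (addr & ~mask) | (TASK_SIZE & mask);		/* andc + andis. + or */
}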
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-24 21:08 ` David Laight
2025-06-26 5:56 ` Christophe Leroy
@ 2025-06-26 21:39 ` Segher Boessenkool
1 sibling, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2025-06-26 21:39 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Tue, Jun 24, 2025 at 10:08:16PM +0100, David Laight wrote:
> On Tue, 24 Jun 2025 13:25:05 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > On Tue, Jun 24, 2025 at 05:50:01PM +0100, David Laight wrote:
> > > On Tue, 24 Jun 2025 08:17:14 -0500
> > > Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > >
> > > > On Tue, Jun 24, 2025 at 07:27:47AM +0200, Christophe Leroy wrote:
> > > > > Ah ok, I overlooked that, I didn't know the cmove instruction, seem
> > > > > similar to the isel instruction on powerpc e500.
> > > >
> > > > cmove does a move (register or memory) when some condition is true.
> > >
> > > The destination of x86 'cmov' is always a register (only the source can be
> > > memory - and is probably always read).
> >
> > Both source operands can be mem, right? But probably not both at the
> > same time.
>
> It only has one 'real' source, but the implementation could easily
> read the destination register and then decide which value to write
> back - rather than doing a conditional write to the register file.
Yeah, in x86 many (most insns?) can read any reg that they write. Not
a great design, but heh.
> A conditional write would be a right PITA for the alu result
> forwarding logic
Depends. An implementation can always do the register forwarding etc.,
just annul the actual store where appropriate (and not put it in the
various store queues either, heh -- annul all the effects of the store).
> > x86 is not a RISC architecture, or more generally, a load/store
> > architecture.
>
> It sort of is these days.
Not at all. Most *implementations* are, the uarchs, but the
architecture (which determines the required visible semantics) is not.
That impedance difference is quite painful, yes, for code generation
more than for the processor implementation even -- as usual the
compilers have to save the day!
> The memory transfers are separate u-ops, so a 'reg += mem' instruction
> is split into two be the decoder.
Yup. Very expensive. Both for the implementation, and for the
performance of eventual code running on it.
> Although some u-ops get merged together and executed in one clock,
> obvious example is some 'compare+branch' pairs.
On many other architectures such things run in 0 cycles anyway :-)
> > A computational instruction is one that doesn't touch memory or does a
> > branch, or some system function, some supervisor or hypervisor
> > instruction maybe.
> >
> > x86 does not have many computational insns, most insns can touch
> > memory :-)
>
> Except that the memory 'bit' is executed separately from any alu 'stuff'.
On many uarchs, yes. But not in the arch. No uarch can decide to just
not implement these difficult and expensive insns :-)
> > > There is a planned new instruction that would do a conditional write
> > > to memory - but not on any cpu yet.
> >
> > Interesting! Instructions like the atomic store insns we got for p9,
> > maybe? They can do minimum/maximum and various kinds of more generic
> > reductions and similar.
>
> I think they are only conditional stores.
> But they do save a conditional branch.
Yeah, but those are not ever executed *anyway*, there is branch
prediction and we require that to be pretty good to get reasonable
performance anyway.
A branch around the store insns is just fine if it can be predicted
correctly. If it cannot be predicted correctly, you can do the store
always, just have the address that is stored to depend on the condition
(such that the data is stored to some dummy memory if it "should not be
done"). Source code not infrequently gets such a transform done manually
in performance-critical paths already.
GCC does not currently do such a transformation AFAIK, but it is a
pretty basic thing to do. Conditional stores are not often written in
source code programs, or there would probably be an implementation for
this already :-)
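A minimal sketch of that transform (made-up example, not from any real code):

static long dummy;	/* scratch target for the "should not be done" case */

static inline void cond_store(int cond, long *dst, long val)
{
	long *p = cond ? dst : &dummy;	/* expected to become isel/cmov */

	*p = val;	/* the store itself is always done */
}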
> A late disable of a memory write is far less problematic than a disabled
> register file write. No one minds (too much) about slight delays between
> writes and reads of the same location (reduced by a store to load forwarder)
> but you don't want to lose clocks between adjacent simple alu instructions.
Writes to memory take tens of cycles *anyway*, but all of that is hidden
by the various memory load and store queues (which let you do forwarding
in just a few cycles).
> For my sins I re-implemented a soft cpu last year...
Ouch :-) But it was fun to do I hope?
> Which doesn't have a 'cmov' :-(
The x86 flag register bits are so limited and complicated in the first
place, cmov is the easier part there ;-)
> > But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
> > have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
> > do not have it yet either, but the newer ones do.
> >
> > And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
>
> That make more sense than the list in patch 5/5.
Not sure what list that is. I'll find it :-)
> > > > But sure, seen from very far off both isel and cmove can be used to
> > > > implement the ternary operator ("?:"), are similar in that way :-)
> > >
> > > Which is exactly what you want to avoid speculation.
> >
> > There are cheaper / simpler / more effective / better ways to get that,
> > but sure, everything is better than a conditional branch, always :-)
>
> Everything except a TLB miss :-)
Heh. TLBs are just a tiny part of translation on Power. We mostly care
about the ERATs. Look it up, if you want to be introduced to another
level of pain :-)
> And for access_ok() avoiding the conditional is a good enough reason
> to use a 'conditional move' instruction.
> Avoiding speculation is actually free.
*Assuming* that avoiding speculation is actually free, you mean?
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-26 5:56 ` Christophe Leroy
@ 2025-06-26 22:01 ` Segher Boessenkool
2025-07-05 10:55 ` Christophe Leroy
2025-07-05 18:33 ` David Laight
0 siblings, 2 replies; 36+ messages in thread
From: Segher Boessenkool @ 2025-06-26 22:01 UTC (permalink / raw)
To: Christophe Leroy
Cc: David Laight, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> On 24/06/2025 at 23:08, David Laight wrote:
> >On Tue, 24 Jun 2025 13:25:05 -0500
> >Segher Boessenkool <segher@kernel.crashing.org> wrote:
> >>>>isel (which is base PowerPC, not something "e500" only) is a
> >>>>computational instruction, it copies one of two registers to a third,
> >>>>which of the two is decided by any bit in the condition register.
> >>>
> >>>Does that mean it could be used for all the ppc cpu variants?
> >>
> >>No, only things that implement architecture version of 2.03 or later.
> >>That is from 2006, so essentially everything that is still made
> >>implements it :-)
> >>
> >>But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
> >>have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
> >>do not have it yet either, but the newer ones do.
>
> For book3s64, GCC only use isel with -mcpu=power9 or -mcpu=power10
I have no idea what "book3s64" means.
Some ancient Power architecture versions had something called
"Book III-S", which was juxtaposed to "Book III-E", which essentially
corresponds to the old aborted BookE stuff.
I guess you mean almost all non-FSL implementations? Most of those
support the isel insns. Like, Power5+ (GS). And everything after that.
I have no idea why you think power9 has it while older CPUS do not. In
the GCC source code we have this comment:
/* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
altivec is a win so enable it. */
and in fact we do not enable it for ISA 2.06 (p8) either, probably for
a similar reason.
> >>And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
> >
> >That make more sense than the list in patch 5/5.
>
> Sorry for the ambiguity. In patch 5/5 I was addressing only powerpc/32,
> and as far as I know the only powerpc/32 supported by Linux that has
> isel is the 85xx which has an e500 core.
What is "powerpc/32"? It does not help if you use different names from
what everyone else does.
The name "powerpc32" is sometimes used colloquially to mean PowerPC code
running in SF=0 mode (MSR[SF]=0), but perhaps more often it is used for
32-bit only implementations (so, those that do not even have that bit:
it's bit 0 in the 64-bit MSR, so all implementations that have an only
32-bit MSR, for example).
> For powerpc/64 we have less constraint than on powerpc32:
> - Kernel memory starts at 0xc000000000000000
> - User memory stops at 0x0010000000000000
That isn't true, not even if you mean some existing name. Usually
userspace code is mapped at 256MB (0x10000000). On powerpc64-linux
anyway, different default on different ABIs of course :-)
> >And for access_ok() avoiding the conditional is a good enough reason
> >to use a 'conditional move' instruction.
> >Avoiding speculation is actually free.
>
> And on CPUs that are not affected by Spectre and Meltdown like powerpc
> 8xx or powerpc 603,
Erm.
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-26 22:01 ` Segher Boessenkool
@ 2025-07-05 10:55 ` Christophe Leroy
2025-07-05 11:42 ` Segher Boessenkool
2025-07-05 18:33 ` David Laight
1 sibling, 1 reply; 36+ messages in thread
From: Christophe Leroy @ 2025-07-05 10:55 UTC (permalink / raw)
To: Segher Boessenkool
Cc: David Laight, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On 27/06/2025 at 00:01, Segher Boessenkool wrote:
> On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
>> On 24/06/2025 at 23:08, David Laight wrote:
>>> On Tue, 24 Jun 2025 13:25:05 -0500
>>> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>>>>>> isel (which is base PowerPC, not something "e500" only) is a
>>>>>> computational instruction, it copies one of two registers to a third,
>>>>>> which of the two is decided by any bit in the condition register.
>>>>>
>>>>> Does that mean it could be used for all the ppc cpu variants?
>>>>
>>>> No, only things that implement architecture version of 2.03 or later.
>>>> That is from 2006, so essentially everything that is still made
>>>> implements it :-)
>>>>
>>>> But ancient things do not. Both 970 (Apple G5) and Cell BE do not yet
>>>> have it (they are ISA 2.01 and 2.02 respectively). And the older p5's
>>>> do not have it yet either, but the newer ones do.
>>
>> For book3s64, GCC only use isel with -mcpu=power9 or -mcpu=power10
>
> I have no idea what "book3s64" means.
Well, that's the name given in the Linux kernel to the 64-bit Power (Book3S)
processors. See commit subjects:
f5164797284d book3s64/radix : Optimize vmemmap start alignment
58450938f771 book3s64/radix : Handle error conditions properly in
radix_vmemmap_populate
9cf7e13fecba book3s64/radix : Align section vmemmap start address to
PAGE_SIZE
29bdc1f1c1df book3s64/radix: Fix compile errors when
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP=n
d629d7a8efc3 powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump
is active
5959ffabbb67 arch/powerpc: teach book3s64
arch_get_unmapped_area{_topdown} to handle hugetlb mappings
8846d9683884 book3s64/hash: Early detect debug_pagealloc size requirement
76b7d6463fc5 book3s64/hash: Disable kfence if not early init
b5fbf7e2c6a4 book3s64/radix: Refactoring common kfence related functions
8fec58f503b2 book3s64/hash: Add kfence functionality
47dd2e63d42a book3s64/hash: Disable debug_pagealloc if it requires more
memory
...
>
> Some ancient Power architecture versions had something called
> "Book III-S", which was juxtaposed to "Book III-E", which essentially
> corresponds to the old aborted BookE stuff.
>
> I guess you mean almost all non-FSL implementations? Most of those
> support the isel insns. Like, Power5+ (GS). And everything after that.
>
> I have no idea why you think power9 has it while older CPUs do not. In
> the GCC source code we have this comment:
I think nothing; I just observed that GCC doesn't use it unless you tell
it the target is power9 or power10. But OK, the comment explains why.
> /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
> altivec is a win so enable it. */
> and in fact we do not enable it for ISA 2.06 (p8) either, probably for
> a similar reason.
>
>>>> And all classic PowerPC is ISA 1.xx of course. Medieval CPUs :-)
>>>
>>> That make more sense than the list in patch 5/5.
>>
>> Sorry for the ambiguity. In patch 5/5 I was addressing only powerpc/32,
>> and as far as I know the only powerpc/32 supported by Linux that has
>> isel is the 85xx which has an e500 core.
>
> What is "powerpc/32"? It does not help if you use different names from
> what everyone else does.
Again, that's the way it is called in the Linux kernel; see the commit
subjects below:
$ git log --oneline arch/powerpc/ | grep powerpc/32
2bf3caa7cc3b powerpc/32: Stop printing Kernel virtual memory layout
2a17a5bebc9a powerpc/32: Replace mulhdu() by mul_u64_u64_shr()
dca5b1d69aea powerpc/32: Implement validation of emergency stack
2f2b9a3adc66 powerpc/32s: Reduce default size of module/execmem area
5799cd765fea powerpc/32: Convert patch_instruction() to patch_uint()
6035e7e35482 powerpc/32: Curb objtool unannotated intra-function call
warning
b72c066ba85a powerpc/32: fix ADB_CUDA kconfig warning
cb615bbe5526 powerpc/32: fix ADB_CUDA kconfig warning
c8a1634145c2 powerpc/32: Drop unused grackle_set_stg()
aad26d3b6af1 powerpc/32s: Implement local_flush_tlb_page_psize()
bac4cffc7c4a powerpc/32s: Introduce _PAGE_READ and remove _PAGE_USER
46ebef51fd92 powerpc/32s: Add _PAGE_WRITE to supplement _PAGE_RW
f84b727d132c powerpc/32: Enable POWER_RESET in pmac32_defconfig
a3ef2fef198c powerpc/32: Add dependencies of POWER_RESET for pmac32
7cb0094be4a5 powerpc/32s: Cleanup the mess in __set_pte_at()
6958ad05d578 powerpc/32: Rearrange _switch to prepare for 32/64 merge
fc8562c9b69a powerpc/32: Remove sync from _switch
...
It means everything built with CONFIG_PPC32.
>
> The name "powerpc32" is sometimes used colloquially to mean PowerPC code
> running in SF=0 mode (MSR[SF]=0), but perhaps more often it is used for
> 32-bit only implementations (so, those that do not even have that bit:
> it's bit 0 in the 64-bit MSR, so all implementations that have only a
> 32-bit MSR, for example).
>
>> For powerpc/64 we have less constraint than on powerpc32:
>> - Kernel memory starts at 0xc000000000000000
>> - User memory stops at 0x0010000000000000
>
> That isn't true, not even if you mean some existing name. Usually
> userspace code is mapped at 256MB (0x10000000). On powerpc64-linux
> anyway, different default on different ABIs of course :-)
0x10000000 is below 0x0010000000000000, isn't it? So why isn't it true?
On 64-bit powerpc, everything related to userspace is between 0 and
TASK_SIZE_MAX.
TASK_SIZE_MAX is either TASK_SIZE_4PB or TASK_SIZE_64TB depending on the
page size (64k or 4k).
TASK_SIZE_4PB is 0x0010000000000000UL.
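
As a purely illustrative sketch of that masking idea (not the actual
powerpc implementation; TASK_SIZE_MAX is hard-coded to the 64k-page
value above and the helper name is made up):

	#include <stdint.h>

	/* Illustrative constant matching TASK_SIZE_4PB above. */
	#define SKETCH_TASK_SIZE_MAX 0x0010000000000000UL

	/*
	 * Clamp an untrusted user pointer so that, even speculatively,
	 * it can never reach a kernel address: out-of-range pointers
	 * are redirected into the gap between user and kernel space.
	 * The compiler is free to use a conditional move (e.g. isel)
	 * instead of a branch for the comparison.
	 */
	static inline const void *sketch_mask_user_ptr(const void *ptr)
	{
		uintptr_t addr = (uintptr_t)ptr;

		if (addr >= SKETCH_TASK_SIZE_MAX)
			addr = SKETCH_TASK_SIZE_MAX;

		return (const void *)addr;
	}
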
Christophe
>
>>> And for access_ok() avoiding the conditional is a good enough reason
>>> to use a 'conditional move' instruction.
>>> Avoiding speculation is actually free.
>>
>> And on CPUs that are not affected by Spectre and Meltdown like powerpc
>> 8xx or powerpc 603,
>
> Erm.
>
>
> Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-07-05 10:55 ` Christophe Leroy
@ 2025-07-05 11:42 ` Segher Boessenkool
0 siblings, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2025-07-05 11:42 UTC (permalink / raw)
To: Christophe Leroy
Cc: David Laight, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Sat, Jul 05, 2025 at 12:55:06PM +0200, Christophe Leroy wrote:
> > > For book3s64, GCC only uses isel with -mcpu=power9 or -mcpu=power10
> >
> > I have no idea what "book3s64" means.
>
> Well, that's the name given in the Linux kernel to the 64-bit Power
> CPUs.
A fantasy name. Great.
> > What is "powerpc/32"? It does not help if you use different names from
> > what everyone else does.
>
> Again, that's the way it is called in the Linux kernel; see the commit
> subjects below:
And another.
> It means everything built with CONFIG_PPC32
Similar names for very dissimilar concepts, even! Woohoo!
> > > For powerpc/64 we have less constraint than on powerpc32:
> > > - Kernel memory starts at 0xc000000000000000
> > > - User memory stops at 0x0010000000000000
> >
> > That isn't true, not even if you mean some existing name. Usually
> > userspace code is mapped at 256MB (0x10000000). On powerpc64-linux
> > anyway, different default on different ABIs of course :-)
>
> 0x10000000 is below 0x0010000000000000, isn't it? So why isn't it true?
I understood "starts at". I read cross-eyed maybe, hehe.
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-06-26 22:01 ` Segher Boessenkool
2025-07-05 10:55 ` Christophe Leroy
@ 2025-07-05 18:33 ` David Laight
2025-07-05 20:15 ` Segher Boessenkool
1 sibling, 1 reply; 36+ messages in thread
From: David Laight @ 2025-07-05 18:33 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Thu, 26 Jun 2025 17:01:48 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:
> On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
...
> I have no idea why you think power9 has it while older CPUs do not. In
> the GCC source code we have this comment:
> /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
> altivec is a win so enable it. */
> and in fact we do not enable it for ISA 2.06 (p8) either, probably for
> a similar reason.
Odd, I'd have thought that replacing a conditional branch with a
conditional move would pretty much always be a win.
Unless, of course, you only consider benchmark loops where the
branch predictor is 100% accurate.
OTOH, isn't altivec a set of 'simd' instructions?
They pretty much only help for loops with lots of iterations.
I don't know about ppc, but I've seen gcc make a real 'pig's breakfast'
of loop vectorisation on x86.
For the Linux kernel, which (as Linus keeps reminding people) tends
to run 'cold cache', you probably want conditional moves in order
to avoid mis-predicted branches and non-linear execution, but you
don't want loop vectorisation because the setup and end cases
cost too much compared to the gain from each iteration.
David
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-07-05 18:33 ` David Laight
@ 2025-07-05 20:15 ` Segher Boessenkool
2025-07-05 21:05 ` David Laight
0 siblings, 1 reply; 36+ messages in thread
From: Segher Boessenkool @ 2025-07-05 20:15 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Sat, Jul 05, 2025 at 07:33:32PM +0100, David Laight wrote:
> On Thu, 26 Jun 2025 17:01:48 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
> > On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> ...
> > I have no idea why you think power9 has it while older CPUs do not. In
> > the GCC source code we have this comment:
> > /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
> > altivec is a win so enable it. */
> > and in fact we do not enable it for ISA 2.06 (p8) either, probably for
2.07 I meant of course. Sigh.
> > a similar reason.
>
> Odd, I'd have thought that replacing a conditional branch with a
> conditional move would pretty much always be a win.
> Unless, of course, you only consider benchmark loops where the
> branch predictor is 100% accurate.
The isel machine instruction is super expensive on p8: it is marked as
first in an instruction group, and has latency 5 for the GPR sources,
and 8 for the CR field source.
On p7 it wasn't great either; it was actually converted to a branch
sequence internally!
On p8 there are bc+8 optimisations done by the core as well: conditional
branches that skip one insn are faster than equivalent isel insns!
Since p9 it is a lot better :-)
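
A toy example of the kind of select under discussion (illustrative
only; whether GCC emits isel for it depends on the -mcpu tuning
described above and on the GCC version):

	/* May compile to cmpd + isel where isel is enabled, otherwise
	 * to a compare and a short conditional branch. */
	long select_smaller(long a, long b)
	{
		return a < b ? a : b;
	}
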
> OTOH, isn't altivec a set of 'simd' instructions?
AltiVec is the old Motorola marketing name for what is called the
"Vector Facility" in the architecture, and which at IBM is still called
VMX, the name it was developed under ("Vector Multimedia Extension").
Since p7 (ISA 2.06, 2010) there also is the Vector-Scalar Extension
Facility, VSX, which adds another 32 vector registers, and the
traditional floating point registers are physically the same (but those
use only the first half of each vector reg). Many new VSX instructions
can do simple floating point stuff on all 64 VSX registers, either just
on the first lane ("scalar") or on all lanes ("vector").
This largely means that all floating point is stored in IEEE DP
format internally (on older cores some format close to 70 bits was
usually used internally), which in olden times actually allowed the
cores to be made faster. Only when storing a value to memory was it
actually converted to IEEE format (but of course it was always rounded
correctly, etc.)
> They pretty much only help for loops with lots of iterations.
> I don't know about ppc, but I've seen gcc make a real 'pig's breakfast'
> of loop vectorisation on x86.
For PowerPC (or Power, the more modern name) we of course also have our
fair share of problems with vectorisation. It does help that we were
the first architecture used by GCC that had a serious Vector facility;
the C syntax extension for Vector literals is taken from the old
extensions in the AltiVec PIM, but using curly brackets {} instead of
round brackets (), for example.
> For the Linux kernel, which (as Linus keeps reminding people) tends
> to run 'cold cache', you probably want conditional moves in order
> to avoid mis-predicted branches and non-linear execution, but you
> don't want loop vectorisation because the setup and end cases
> cost too much compared to the gain from each iteration.
You are best off using what GCC gives you, usually. It is very well
tuned, both the generic and the machine-specific code :-)
The kernel of course disables all Vector and FP stuff; essentially it
disables use of any of the associated registers, and that's pretty much
the end of it ;-)
(The reason for that is that it would make task switches more expensive:
long ago all task switches, but nowadays still user<->kernel switches.)
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-07-05 20:15 ` Segher Boessenkool
@ 2025-07-05 21:05 ` David Laight
2025-07-05 21:37 ` Segher Boessenkool
0 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2025-07-05 21:05 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
On Sat, 5 Jul 2025 15:15:57 -0500
Segher Boessenkool <segher@kernel.crashing.org> wrote:
...
> The isel machine instruction is super expensive on p8: it is marked as
> first in an instruction group, and has latency 5 for the GPR sources,
> and 8 for the CR field source.
>
> On p7 it wasn't great either, it was actually converted to a branch
> sequence internally!
Ugg...
You'd think they'd add instructions that can be implemented.
It isn't as though isel is any harder than 'add with carry'.
Not that uncommon: IIRC AMD added adox/adcx (add carry using the
overflow/carry flag and without changing any other flags) as very
slow instructions. Intel invented them without making jcxz (dec %cx
and jump non-zero) fast, so you can't (easily) put them in a loop.
Not to mention all the AVX512 fubars.
Conditional move is more of a problem with a mips-like cpu where
alu ops read two registers and write a third.
You don't want to do a conditional write because it messes up
the decision of whether to forward the alu result to the following
instruction.
So I think you might need to do 'cmov odd/even' and read the LSB
from a third copy (or third read port) of the registers indexed
by what would normally be the 'output' register number.
Then tweak the register numbers early in the pipeline so that the
result goes to one of the 'input' registers rather than the normal
'output' one.
Not really that hard - could add to the cpu I did in 1/2 a day :-)
David
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/5] powerpc: Implement masked user access
2025-07-05 21:05 ` David Laight
@ 2025-07-05 21:37 ` Segher Boessenkool
0 siblings, 0 replies; 36+ messages in thread
From: Segher Boessenkool @ 2025-07-05 21:37 UTC (permalink / raw)
To: David Laight
Cc: Christophe Leroy, Michael Ellerman, Nicholas Piggin, Naveen N Rao,
Madhavan Srinivasan, Alexander Viro, Christian Brauner, Jan Kara,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Andre Almeida, Andrew Morton, Dave Hansen,
Linus Torvalds, linux-kernel, linuxppc-dev, linux-fsdevel,
linux-mm
Hi!
On Sat, Jul 05, 2025 at 10:05:38PM +0100, David Laight wrote:
> On Sat, 5 Jul 2025 15:15:57 -0500
> Segher Boessenkool <segher@kernel.crashing.org> wrote:
>
> ...
> > The isel machine instruction is super expensive on p8: it is marked as
> > first in an instruction group, and has latency 5 for the GPR sources,
> > and 8 for the CR field source.
> >
> > On p7 it wasn't great either, it was actually converted to a branch
> > sequence internally!
>
> Ugg...
>
> You'd think they'd add instructions that can be implemented.
> It isn't as though isel is any harder than 'add with carry'.
It is though! isel was the first instruction that takes both GPR inputs
and a CR field input. We now have more: the ISA 3.0 (p9) setb insn,
and esp. the extremely useful ISA 3.1 (p10) set[n]bc[r] insns -- well,
those don't take any GPR inputs actually, but like isel their output is
a GPR :-)
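
A similarly tiny illustration (only a sketch; whether the compiler
actually picks one of these insns here depends on -mcpu and GCC
version):

	/* Materialise a comparison result in a GPR.  On newer targets
	 * this kind of function can become cmpd + setbc (ISA 3.1)
	 * instead of an isel or branch sequence. */
	int less_than(long a, long b)
	{
		return a < b;
	}
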
> Not that uncommon, IIRC amd added adox/adcx (add carry using the
> overflow/carry flag and without changing any other flags) as very
We have had a similar "addex" insn since p9, which allows using the OV
bit instead of the CA bit, and prepares for allowing an extra three
possible bits as carry bits, too. Using it you can run multiple carry
chains in parallel using insns very close to the traditional stuff.
The compiler still doesn't ever generate this; it is mostly useful
for hand-coded assembler routines.
The carry bits are stored in the XER register, the "fixed-point
exception register", while results from comparison instructions are
stored in the CR, which holds eight four-bit CR fields, which you can
use in conditional jumps, or in isel and the like, or in the crlogical
insns (which can do any logic function on two CR field inputs and store
in a third, just like the logical insns on GPRs that also have the full
complement of 14 two-source functions).
> slow instructions. Intel invented them without making jcxz (dec %cx
> and jump non-zero) fast - so you can't (easily) put them in a loop.
> Not to mention all the AVX512 fubars.
Sounds lovely :-)
> Conditional move is more of a problem with a mips-like cpu where
> alu ops read two registers and write a third.
Like most Power implementations as well.
> You don't want to do a conditional write because it messes up
> the decision of whether to forward the alu result to the following
> instruction.
> So I think you might need to do 'cmov odd/even' and read the LSB
> from a third copy (or third read port) of the registers indexed
> by what would normally be the 'output' register number.
> Then tweak the register numbers early in the pipeline so that the
> result goes to one of the 'input' registers rather than the normal
> 'output' one.
> Not really that hard - could add to the cpu I did in 1/2 a day :-)
On p9 and later both GPR (or constant) inputs are fed into the execution
unit as well as some CR bit, and it just writes to a GPR. Easier for
the hardware, easier for the compiler, and easier for the programmer!
Win-win-win. The kind of tradeoffs I like best :-)
Segher
^ permalink raw reply [flat|nested] 36+ messages in thread
Thread overview: 36+ messages
2025-06-22 9:52 [PATCH 0/5] powerpc: Implement masked user access Christophe Leroy
2025-06-22 9:52 ` [PATCH 1/5] uaccess: Add masked_user_{read/write}_access_begin Christophe Leroy
2025-06-22 16:35 ` David Laight
2025-06-24 5:34 ` Christophe Leroy
2025-06-22 9:52 ` [PATCH 2/5] uaccess: Add speculation barrier to copy_from_user_iter() Christophe Leroy
2025-06-22 16:52 ` David Laight
2025-06-22 16:57 ` Linus Torvalds
2025-06-22 20:18 ` David Laight
2025-06-24 5:49 ` Christophe Leroy
2025-06-24 8:07 ` David Laight
2025-06-24 15:15 ` Linus Torvalds
2025-06-22 9:52 ` [PATCH 3/5] powerpc: Remove unused size parametre to KUAP enabling/disabling functions Christophe Leroy
2025-06-22 9:52 ` [PATCH 4/5] powerpc: Move barrier_nospec() out of allow_read_{from/write}_user() Christophe Leroy
2025-06-22 9:52 ` [PATCH 5/5] powerpc: Implement masked user access Christophe Leroy
2025-06-22 17:13 ` David Laight
2025-06-22 17:40 ` Linus Torvalds
2025-06-22 19:51 ` David Laight
2025-06-22 18:57 ` Segher Boessenkool
2025-06-22 16:20 ` [PATCH 0/5] " David Laight
2025-06-24 5:27 ` Christophe Leroy
2025-06-24 8:32 ` David Laight
2025-06-24 21:37 ` Segher Boessenkool
2025-06-25 8:30 ` David Laight
2025-06-24 13:17 ` Segher Boessenkool
2025-06-24 16:50 ` David Laight
2025-06-24 18:25 ` Segher Boessenkool
2025-06-24 21:08 ` David Laight
2025-06-26 5:56 ` Christophe Leroy
2025-06-26 22:01 ` Segher Boessenkool
2025-07-05 10:55 ` Christophe Leroy
2025-07-05 11:42 ` Segher Boessenkool
2025-07-05 18:33 ` David Laight
2025-07-05 20:15 ` Segher Boessenkool
2025-07-05 21:05 ` David Laight
2025-07-05 21:37 ` Segher Boessenkool
2025-06-26 21:39 ` Segher Boessenkool