* [PATCH v1 0/2] epoll: Save one stac/clac pair in epoll_put_uevent().
@ 2025-10-23 0:04 Kuniyuki Iwashima
2025-10-23 0:04 ` [PATCH v1 1/2] uaccess: Add __user_write_access_begin() Kuniyuki Iwashima
2025-10-23 0:04 ` [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent() Kuniyuki Iwashima
0 siblings, 2 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-23 0:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Jens Axboe, Christian Brauner, Linus Torvalds
Cc: Nicholas Piggin, Christophe Leroy, Alexandre Ghiti,
H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima,
Kuniyuki Iwashima, x86, linux-arm-kernel, linuxppc-dev,
linux-riscv, linux-kernel
epoll_put_uevent() calls __put_user() twice, and each call compiles
down to a call to an out-of-line function, __put_user_nocheck_4()
and __put_user_nocheck_8() respectively.
Both functions wrap a single mov in a stac/clac pair, which is
expensive on an AMD EPYC 7B12 64-Core Processor platform.
__put_user_nocheck_4 /proc/kcore [Percent: local period]
Percent │
89.91 │ stac
0.19 │ mov %eax,(%rcx)
0.15 │ xor %ecx,%ecx
9.69 │ clac
0.06 │ ← retq
This was clearly noticeable while testing neper/tcp_rr with 1000
flows per thread.
Overhead Shared O Symbol
10.08% [kernel] [k] _copy_to_iter
7.12% [kernel] [k] ip6_output
6.40% [kernel] [k] sock_poll
5.71% [kernel] [k] move_addr_to_user
4.39% [kernel] [k] __put_user_nocheck_4
...
1.06% [kernel] [k] ep_try_send_events
... ^- epoll_put_uevent() was inlined
0.78% [kernel] [k] __put_user_nocheck_8
Patch 1 adds a new uaccess helper that compiles down to a bare stac
with no address masking or access_ok(), since access_ok() is already
checked in ep_check_params().
Patch 2 uses the helper and unsafe_put_user() in epoll_put_uevent().
Kuniyuki Iwashima (2):
uaccess: Add __user_write_access_begin().
epoll: Use __user_write_access_begin() and unsafe_put_user() in
epoll_put_uevent().
arch/arm64/include/asm/uaccess.h | 1 +
arch/powerpc/include/asm/uaccess.h | 13 ++++++++++---
arch/riscv/include/asm/uaccess.h | 1 +
arch/x86/include/asm/uaccess.h | 1 +
include/linux/eventpoll.h | 13 ++++++++-----
include/linux/uaccess.h | 1 +
6 files changed, 22 insertions(+), 8 deletions(-)
--
2.51.1.814.gb8fa24458f-goog
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v1 1/2] uaccess: Add __user_write_access_begin().
2025-10-23 0:04 [PATCH v1 0/2] epoll: Save one stac/clac pair in epoll_put_uevent() Kuniyuki Iwashima
@ 2025-10-23 0:04 ` Kuniyuki Iwashima
2025-10-23 5:37 ` Linus Torvalds
2025-10-23 0:04 ` [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent() Kuniyuki Iwashima
1 sibling, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-23 0:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Jens Axboe, Christian Brauner, Linus Torvalds
Cc: Nicholas Piggin, Christophe Leroy, Alexandre Ghiti,
H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima,
Kuniyuki Iwashima, x86, linux-arm-kernel, linuxppc-dev,
linux-riscv, linux-kernel
In epoll_wait(2), ep_check_params() performs a bulk check for
the passed user address:
if (!access_ok(evs, maxevents * sizeof(struct epoll_event)))
Later, epoll_put_uevent() uses __put_user() twice to copy
two fields into that region.
unsafe_put_user() can be used to save a stac/clac pair, but
masked_user_access_begin() or user_access_begin() introduces
unnecessary address masking or an access_ok() check.
Add a low-level helper for such a use case.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
arch/arm64/include/asm/uaccess.h | 1 +
arch/powerpc/include/asm/uaccess.h | 13 ++++++++++---
arch/riscv/include/asm/uaccess.h | 1 +
arch/x86/include/asm/uaccess.h | 1 +
include/linux/uaccess.h | 1 +
5 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 1aa4ecb73429..30726ce182cb 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -422,6 +422,7 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
}
#define user_access_begin(a,b) user_access_begin(a,b)
#define user_access_end() uaccess_ttbr0_disable()
+#define __user_write_access_begin(a,b) uaccess_ttbr0_enable()
#define unsafe_put_user(x, ptr, label) \
__raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), label, U)
#define unsafe_get_user(x, ptr, label) \
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 4f5a46a77fa2..910bf469128d 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -437,15 +437,22 @@ user_read_access_begin(const void __user *ptr, size_t len)
#define user_read_access_begin user_read_access_begin
#define user_read_access_end prevent_current_read_from_user
+static __always_inline void
+__user_write_access_begin(const void __user *ptr, size_t len)
+{
+ might_fault();
+
+ allow_write_to_user((void __user *)ptr, len);
+}
+#define __user_write_access_begin __user_write_access_begin
+
static __must_check __always_inline bool
user_write_access_begin(const void __user *ptr, size_t len)
{
if (unlikely(!access_ok(ptr, len)))
return false;
- might_fault();
-
- allow_write_to_user((void __user *)ptr, len);
+ __user_write_access_begin(ptr, len);
return true;
}
#define user_write_access_begin user_write_access_begin
diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h
index f5f4f7f85543..9adc8f0dd1c8 100644
--- a/arch/riscv/include/asm/uaccess.h
+++ b/arch/riscv/include/asm/uaccess.h
@@ -452,6 +452,7 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
}
#define user_access_begin user_access_begin
#define user_access_end __disable_user_access
+#define __user_write_access_begin(a,b) __enable_user_access()
static inline unsigned long user_access_save(void) { return 0UL; }
static inline void user_access_restore(unsigned long enabled) { }
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 91a3fb8ae7ff..23edbaef9f71 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -524,6 +524,7 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
}
#define user_access_begin(a,b) user_access_begin(a,b)
#define user_access_end() __uaccess_end()
+#define __user_write_access_begin(a,b) __uaccess_begin()
#define user_access_save() smap_save()
#define user_access_restore(x) smap_restore(x)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 1beb5b395d81..a6e32784e6cd 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -552,6 +552,7 @@ do { \
#ifndef user_access_begin
#define user_access_begin(ptr,len) access_ok(ptr, len)
#define user_access_end() do { } while (0)
+#define __user_write_access_begin(ptr,len) do { } while (0)
#define unsafe_op_wrap(op, err) do { if (unlikely(op)) goto err; } while (0)
#define unsafe_get_user(x,p,e) unsafe_op_wrap(__get_user(x,p),e)
#define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
--
2.51.1.814.gb8fa24458f-goog
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-23 0:04 [PATCH v1 0/2] epoll: Save one stac/clac pair in epoll_put_uevent() Kuniyuki Iwashima
2025-10-23 0:04 ` [PATCH v1 1/2] uaccess: Add __user_write_access_begin() Kuniyuki Iwashima
@ 2025-10-23 0:04 ` Kuniyuki Iwashima
2025-10-23 19:40 ` Dave Hansen
1 sibling, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-23 0:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Jens Axboe, Christian Brauner, Linus Torvalds
Cc: Nicholas Piggin, Christophe Leroy, Alexandre Ghiti,
H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima,
Kuniyuki Iwashima, x86, linux-arm-kernel, linuxppc-dev,
linux-riscv, linux-kernel
epoll_put_uevent() calls __put_user() twice, and each call compiles
down to a call to an out-of-line function, __put_user_nocheck_4()
and __put_user_nocheck_8() respectively.
Both functions wrap a single mov in a stac/clac pair, which is
expensive on an AMD EPYC 7B12 64-Core Processor platform.
__put_user_nocheck_4 /proc/kcore [Percent: local period]
Percent │
89.91 │ stac
0.19 │ mov %eax,(%rcx)
0.15 │ xor %ecx,%ecx
9.69 │ clac
0.06 │ ← retq
This was clearly noticeable while testing neper/tcp_rr with 1000
flows per thread.
Overhead Shared O Symbol
10.08% [kernel] [k] _copy_to_iter
7.12% [kernel] [k] ip6_output
6.40% [kernel] [k] sock_poll
5.71% [kernel] [k] move_addr_to_user
4.39% [kernel] [k] __put_user_nocheck_4
...
1.06% [kernel] [k] ep_try_send_events
... ^- epoll_put_uevent() was inlined
0.78% [kernel] [k] __put_user_nocheck_8
Use __user_write_access_begin() and unsafe_put_user() in
epoll_put_uevent().
We see a 1% improvement in tcp_rr throughput just by saving a single
stac/clac pair.
Another option would be to use can_do_masked_user_access()
and masked_user_access_begin(), but we saw a ~5% regression from
the three extra instructions for address masking, which duplicates
a check already done by ep_check_params().
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
---
include/linux/eventpoll.h | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index ccb478eb174b..efc0aa2d496f 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -82,11 +82,14 @@ static inline struct epoll_event __user *
epoll_put_uevent(__poll_t revents, __u64 data,
struct epoll_event __user *uevent)
{
- if (__put_user(revents, &uevent->events) ||
- __put_user(data, &uevent->data))
- return NULL;
-
- return uevent+1;
+ __user_write_access_begin(uevent, sizeof(*uevent));
+ unsafe_put_user(revents, &uevent->events, efault);
+ unsafe_put_user(data, &uevent->data, efault);
+ user_access_end();
+ return uevent + 1;
+efault:
+ user_access_end();
+ return NULL;
}
#endif
--
2.51.1.814.gb8fa24458f-goog
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v1 1/2] uaccess: Add __user_write_access_begin().
2025-10-23 0:04 ` [PATCH v1 1/2] uaccess: Add __user_write_access_begin() Kuniyuki Iwashima
@ 2025-10-23 5:37 ` Linus Torvalds
2025-10-23 8:29 ` David Laight
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-10-23 5:37 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Catalin Marinas, Will Deacon, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Jens Axboe, Christian Brauner, Nicholas Piggin, Christophe Leroy,
Alexandre Ghiti, H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima,
x86, linux-arm-kernel, linuxppc-dev, linux-riscv, linux-kernel
On Wed, 22 Oct 2025 at 14:05, Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> unsafe_put_user() can be used to save a stac/clac pair, but
> masked_user_access_begin() or user_access_begin() introduces
> an unnecessary address masking or access_ok().
>
> Add a low-level helper for such a use case.
I really suspect that you cannot actually measure the cost of the
extra masking, and would be much happier if you just used a regular
"user_access_begin()" (perhaps the "user_write_access_begin()"
variant).
The masking is very cheap - literally just a couple of ALU
instructions. And unless you can actually measure some real advantage
of avoiding it, let's not add another helper to this area.
We spent a fair amount of time undoing years of "__get_user()" and
"__put_user()" cases that didn't actually help, and sometimes only
made it hard to see where the actual user pointer validation was done.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 1/2] uaccess: Add __user_write_access_begin().
2025-10-23 5:37 ` Linus Torvalds
@ 2025-10-23 8:29 ` David Laight
2025-10-24 5:31 ` Kuniyuki Iwashima
0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2025-10-23 8:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kuniyuki Iwashima, Catalin Marinas, Will Deacon,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, Jens Axboe, Christian Brauner,
Nicholas Piggin, Christophe Leroy, Alexandre Ghiti,
H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima, x86,
linux-arm-kernel, linuxppc-dev, linux-riscv, linux-kernel
On Wed, 22 Oct 2025 19:37:27 -1000
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 22 Oct 2025 at 14:05, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > unsafe_put_user() can be used to save a stac/clac pair, but
> > masked_user_access_begin() or user_access_begin() introduces
> > an unnecessary address masking or access_ok().
> >
> > Add a low-level helper for such a use case.
>
> I really suspect that you cannot actually measure the cost of the
> extra masking, and would be much happier if you just used a regular
> "user_access_begin()" (perhaps the "user_write_access_begin()"
> variant).
Or wait for scoped_user_write_access() to get committed and then use that.
David
>
> The masking is very cheap - literally just a couple of ALU
> instructions. And unless you can actually measure some real advantage
> of avoiding it, let's not add another helper to this area.
>
> We spent a fair amount of time undoing years of "__get_user()" and
> "__put_user()" cases that didn't actually help, and sometimes only
> made it hard to see where the actual user pointer validation was done.
>
> Linus
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-23 0:04 ` [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent() Kuniyuki Iwashima
@ 2025-10-23 19:40 ` Dave Hansen
2025-10-24 5:16 ` Kuniyuki Iwashima
0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2025-10-23 19:40 UTC (permalink / raw)
To: Kuniyuki Iwashima, Catalin Marinas, Will Deacon,
Madhavan Srinivasan, Michael Ellerman, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, Jens Axboe, Christian Brauner,
Linus Torvalds
Cc: Nicholas Piggin, Christophe Leroy, Alexandre Ghiti,
H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima, x86,
linux-arm-kernel, linuxppc-dev, linux-riscv, linux-kernel
On 10/22/25 17:04, Kuniyuki Iwashima wrote:
> --- a/include/linux/eventpoll.h
> +++ b/include/linux/eventpoll.h
> @@ -82,11 +82,14 @@ static inline struct epoll_event __user *
> epoll_put_uevent(__poll_t revents, __u64 data,
> struct epoll_event __user *uevent)
> {
> - if (__put_user(revents, &uevent->events) ||
> - __put_user(data, &uevent->data))
> - return NULL;
> -
> - return uevent+1;
> + __user_write_access_begin(uevent, sizeof(*uevent));
> + unsafe_put_user(revents, &uevent->events, efault);
> + unsafe_put_user(data, &uevent->data, efault);
> + user_access_end();
> + return uevent + 1;
> +efault:
> + user_access_end();
> + return NULL;
> }
> #endif
This makes me nervous. The access_ok() check is quite a distance away.
I'd kinda want to see some performance numbers before doing this. Is
removing a single access_ok() even measurable?
Also, even if we go do this, shouldn't __user_write_access_begin() be
called something more like unsafe_user_write_access_begin()?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-23 19:40 ` Dave Hansen
@ 2025-10-24 5:16 ` Kuniyuki Iwashima
2025-10-24 14:05 ` Dave Hansen
0 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-24 5:16 UTC (permalink / raw)
To: dave.hansen
Cc: alex, aou, axboe, bp, brauner, catalin.marinas, christophe.leroy,
dave.hansen, edumazet, hpa, kuni1840, kuniyu, linux-arm-kernel,
linux-kernel, linux-riscv, linuxppc-dev, maddy, mingo, mpe,
npiggin, palmer, pjw, tglx, torvalds, will, x86
From: Dave Hansen <dave.hansen@intel.com>
Date: Thu, 23 Oct 2025 12:40:59 -0700
> On 10/22/25 17:04, Kuniyuki Iwashima wrote:
> > --- a/include/linux/eventpoll.h
> > +++ b/include/linux/eventpoll.h
> > @@ -82,11 +82,14 @@ static inline struct epoll_event __user *
> > epoll_put_uevent(__poll_t revents, __u64 data,
> > struct epoll_event __user *uevent)
> > {
> > - if (__put_user(revents, &uevent->events) ||
> > - __put_user(data, &uevent->data))
> > - return NULL;
> > -
> > - return uevent+1;
> > + __user_write_access_begin(uevent, sizeof(*uevent));
> > + unsafe_put_user(revents, &uevent->events, efault);
> > + unsafe_put_user(data, &uevent->data, efault);
> > + user_access_end();
> > + return uevent + 1;
> > +efault:
> > + user_access_end();
> > + return NULL;
> > }
> > #endif
>
> This makes me nervous. The access_ok() check is quite a distance away.
> I'd kinda want to see some performance numbers before doing this. Is
> removing a single access_ok() even measurable?
I noticed I made a typo in the commit message, s/tcp_rr/udp_rr/.
epoll_put_uevent() can be called multiple times in a single
epoll_wait(), and we can see 1.7% more pps on UDP even when
a single thread has only 1000 sockets:
server: $ udp_rr --nolog -6 -F 1000 -T 1 -l 3600
client: $ udp_rr --nolog -6 -F 1000 -T 256 -l 3600 -c -H $SERVER
server: $ nstat > /dev/null; sleep 10; nstat | grep -i udp
Without patch (2 stac/clac):
Udp6InDatagrams 2205209 0.0
With patch (1 stac/clac):
Udp6InDatagrams 2242602 0.0
>>> 2242602 / 2205209 * 100
101.6956669413194
I also ran a microbenchmark with bpftrace: more invocations of
ep_try_send_events() finish faster, and there are 4% more total
calls:
$ sudo bpftrace -e '
k:ep_try_send_events { @start[cpu] = nsecs; }
kr:ep_try_send_events {
if (@start[cpu]) {
$delay = nsecs - @start[cpu];
delete(@start[cpu]);
@ep_try_send_events_ns = hist($delay);
}
}
END { clear(@start); }' -c 'sleep 10'
Without patch:
@ep_try_send_events_ns:
[256, 512) 2483257 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 850735 |@@@@@@@@@@@@@@@@@ |
[1K, 2K) 254027 |@@@@@ |
[2K, 4K) 26646 | |
[4K, 8K) 1358 | |
[8K, 16K) 66 | |
[16K, 32K) 3 | |
With patch:
@ep_try_send_events_ns:
[256, 512) 2844733 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 733956 |@@@@@@@@@@@@@ |
[1K, 2K) 166349 |@@@ |
[2K, 4K) 13495 | |
[4K, 8K) 526 | |
[8K, 16K) 63 | |
[16K, 32K) 5 | |
>>> (2844733 + 733956 + 166349 + 13495 + 526 + 63 + 5) / \
... (2483257 + 850735 + 254027 + 26646 + 1358 + 66 + 3) * 100
103.95551329999347
>
> Also, even if we go do this, shouldn't __user_write_access_begin() be
> called something more like unsafe_user_write_access_begin()?
Sounds good.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 1/2] uaccess: Add __user_write_access_begin().
2025-10-23 8:29 ` David Laight
@ 2025-10-24 5:31 ` Kuniyuki Iwashima
0 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-24 5:31 UTC (permalink / raw)
To: David Laight
Cc: Linus Torvalds, Catalin Marinas, Will Deacon, Madhavan Srinivasan,
Michael Ellerman, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Jens Axboe, Christian Brauner, Nicholas Piggin, Christophe Leroy,
Alexandre Ghiti, H. Peter Anvin, Eric Dumazet, Kuniyuki Iwashima,
x86, linux-arm-kernel, linuxppc-dev, linux-riscv, linux-kernel
On Thu, Oct 23, 2025 at 1:29 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 22 Oct 2025 19:37:27 -1000
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Wed, 22 Oct 2025 at 14:05, Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > >
> > > unsafe_put_user() can be used to save a stac/clac pair, but
> > > masked_user_access_begin() or user_access_begin() introduces
> > > an unnecessary address masking or access_ok().
> > >
> > > Add a low-level helper for such a use case.
> >
> > I really suspect that you cannot actually measure the cost of the
> > extra masking, and would be much happier if you just used a regular
> > "user_access_begin()" (perhaps the "user_write_access_begin()"
> > variant).
>
> Or wait for scoped_user_write_access() to get committed and then use that.
IIUC, scoped_user_write_access() simply expands to
masked_user_access_begin() or user_access_begin(), and that is
the case where I saw no improvement or even worse performance.
>
> David
>
> >
> > The masking is very cheap - literally just a couple of ALU
> > instructions. And unless you can actually measure some real advantage
> > of avoiding it, let's not add another helper to this area.
Yes, it's only 3 instructions on x86_64, but by saving them
I saw consistently better performance. Please see the numbers here.
https://lore.kernel.org/lkml/20251024051653.66329-1-kuniyu@google.com/
> >
> > We spent a fair amount of time undoing years of "__get_user()" and
> > "__put_user()" cases that didn't actually help, and sometimes only
> > made it hard to see where the actual user pointer validation was done.
> >
> > Linus
> >
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-24 5:16 ` Kuniyuki Iwashima
@ 2025-10-24 14:05 ` Dave Hansen
2025-10-24 14:47 ` David Laight
2025-10-28 5:32 ` Kuniyuki Iwashima
0 siblings, 2 replies; 16+ messages in thread
From: Dave Hansen @ 2025-10-24 14:05 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: alex, aou, axboe, bp, brauner, catalin.marinas, christophe.leroy,
dave.hansen, edumazet, hpa, kuni1840, linux-arm-kernel,
linux-kernel, linux-riscv, linuxppc-dev, maddy, mingo, mpe,
npiggin, palmer, pjw, tglx, torvalds, will, x86
On 10/23/25 22:16, Kuniyuki Iwashima wrote:
>> This makes me nervous. The access_ok() check is quite a distance away.
>> I'd kinda want to see some performance numbers before doing this. Is
>> removing a single access_ok() even measurable?
> I noticed I made a typo in commit message, s/tcp_rr/udp_rr/.
>
> epoll_put_uevent() can be called multiple times in a single
> epoll_wait(), and we can see 1.7% more pps on UDP even when
> 1 thread has 1000 sockets only:
>
> server: $ udp_rr --nolog -6 -F 1000 -T 1 -l 3600
> client: $ udp_rr --nolog -6 -F 1000 -T 256 -l 3600 -c -H $SERVER
> server: $ nstat > /dev/null; sleep 10; nstat | grep -i udp
>
> Without patch (2 stac/clac):
> Udp6InDatagrams 2205209 0.0
>
> With patch (1 stac/clac):
> Udp6InDatagrams 2242602 0.0
I'm totally with you about removing a stac/clac:
https://lore.kernel.org/lkml/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com/
The thing I'm worried about is having the access_ok() so distant
from the unsafe_put_user(). I'm wondering if this:
- __user_write_access_begin(uevent, sizeof(*uevent));
+ if (!user_write_access_begin(uevent, sizeof(*uevent)))
+ return NULL;
unsafe_put_user(revents, &uevent->events, efault);
unsafe_put_user(data, &uevent->data, efault);
user_access_end();
is measurably slower than what was in your series. If it is
not measurably slower, then the series gets simpler because it
does not need to refactor user_write_access_begin(). It also ends
up more obviously correct because the access check is closer to
the unsafe_put_user() calls.
Also, the extra access_ok() is *much* cheaper than stac/clac.
^ permalink raw reply [flat|nested] 16+ messages in thread
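For reference, applying the suggestion above to the whole function would look roughly like this (a sketch assembled from the diff in this mail and the patch it comments on, not committed code):

```c
static inline struct epoll_event __user *
epoll_put_uevent(__poll_t revents, __u64 data,
		 struct epoll_event __user *uevent)
{
	/* access check right next to the unsafe accesses */
	if (!user_write_access_begin(uevent, sizeof(*uevent)))
		return NULL;
	unsafe_put_user(revents, &uevent->events, efault);
	unsafe_put_user(data, &uevent->data, efault);
	user_access_end();
	return uevent + 1;
efault:
	user_access_end();
	return NULL;
}
```

The access_ok() implied by user_write_access_begin() sits next to the unsafe_put_user() calls, which addresses the "quite a distance away" concern while still saving one of the two stac/clac pairs.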
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-24 14:05 ` Dave Hansen
@ 2025-10-24 14:47 ` David Laight
2025-10-28 5:32 ` Kuniyuki Iwashima
1 sibling, 0 replies; 16+ messages in thread
From: David Laight @ 2025-10-24 14:47 UTC (permalink / raw)
To: Dave Hansen
Cc: Kuniyuki Iwashima, alex, aou, axboe, bp, brauner, catalin.marinas,
christophe.leroy, dave.hansen, edumazet, hpa, kuni1840,
linux-arm-kernel, linux-kernel, linux-riscv, linuxppc-dev, maddy,
mingo, mpe, npiggin, palmer, pjw, tglx, torvalds, will, x86
On Fri, 24 Oct 2025 07:05:50 -0700
Dave Hansen <dave.hansen@intel.com> wrote:
> On 10/23/25 22:16, Kuniyuki Iwashima wrote:
> >> This makes me nervous. The access_ok() check is quite a distance away.
> >> I'd kinda want to see some performance numbers before doing this. Is
> >> removing a single access_ok() even measurable?
> > I noticed I made a typo in commit message, s/tcp_rr/udp_rr/.
> >
> > epoll_put_uevent() can be called multiple times in a single
> > epoll_wait(), and we can see 1.7% more pps on UDP even when
> > 1 thread has 1000 sockets only:
> >
> > server: $ udp_rr --nolog -6 -F 1000 -T 1 -l 3600
> > client: $ udp_rr --nolog -6 -F 1000 -T 256 -l 3600 -c -H $SERVER
> > server: $ nstat > /dev/null; sleep 10; nstat | grep -i udp
> >
> > Without patch (2 stac/clac):
> > Udp6InDatagrams 2205209 0.0
> >
> > With patch (1 stac/clac):
> > Udp6InDatagrams 2242602 0.0
>
> I'm totally with you about removing a stac/clac:
>
> https://lore.kernel.org/lkml/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com/
>
> The thing I'm worried about is having the access_ok() so distant
> from the unsafe_put_user(). I'm wondering if this:
>
> - __user_write_access_begin(uevent, sizeof(*uevent));
> + if (!user_write_access_begin(uevent, sizeof(*uevent)))
> + return NULL;
> unsafe_put_user(revents, &uevent->events, efault);
> unsafe_put_user(data, &uevent->data, efault);
> user_access_end();
>
> is measurably slower than what was in your series. If it is
> not measurably slower, then the series gets simpler because it
> does not need to refactor user_write_access_begin(). It also ends
> up more obviously correct because the access check is closer to
> the unsafe_put_user() calls.
>
> Also, the extra access_ok() is *much* cheaper than stac/clac.
access_ok() does contain a conditional branch
- just waiting for the misprediction penalty (say 20 clocks).
OTOH you shouldn't get that more than twice for the loop.
I'm pretty sure access_ok() itself contains an lfence - needed for reads.
But that ought to be absent from user_write_access_begin().
The 'masked' version uses ALU operations (on x86-64), doesn't need
an lfence (or anything else), and doesn't contain a mispredictable branch.
They should be faster than the above - unless the code has serious
register pressure and too much gets spilled to stack.
The timings may also depend on the cpu you are using.
I'm sure I remember some of the very recent ones having much faster
stac/clac and/or lfence.
David
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-24 14:05 ` Dave Hansen
2025-10-24 14:47 ` David Laight
@ 2025-10-28 5:32 ` Kuniyuki Iwashima
2025-10-28 9:54 ` David Laight
1 sibling, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-28 5:32 UTC (permalink / raw)
To: dave.hansen
Cc: alex, aou, axboe, bp, brauner, catalin.marinas, christophe.leroy,
dave.hansen, edumazet, hpa, kuni1840, kuniyu, linux-arm-kernel,
linux-kernel, linux-riscv, linuxppc-dev, maddy, mingo, mpe,
npiggin, palmer, pjw, tglx, torvalds, will, x86
From: Dave Hansen <dave.hansen@intel.com>
Date: Fri, 24 Oct 2025 07:05:50 -0700
> On 10/23/25 22:16, Kuniyuki Iwashima wrote:
> >> This makes me nervous. The access_ok() check is quite a distance away.
> >> I'd kinda want to see some performance numbers before doing this. Is
> >> removing a single access_ok() even measurable?
> > I noticed I made a typo in commit message, s/tcp_rr/udp_rr/.
> >
> > epoll_put_uevent() can be called multiple times in a single
> > epoll_wait(), and we can see 1.7% more pps on UDP even when
> > 1 thread has 1000 sockets only:
> >
> > server: $ udp_rr --nolog -6 -F 1000 -T 1 -l 3600
> > client: $ udp_rr --nolog -6 -F 1000 -T 256 -l 3600 -c -H $SERVER
> > server: $ nstat > /dev/null; sleep 10; nstat | grep -i udp
> >
> > Without patch (2 stac/clac):
> > Udp6InDatagrams 2205209 0.0
> >
> > With patch (1 stac/clac):
> > Udp6InDatagrams 2242602 0.0
>
> I'm totally with you about removing a stac/clac:
>
> https://lore.kernel.org/lkml/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com/
>
> The thing I'm worried about is having the access_ok() so distant
> from the unsafe_put_user(). I'm wondering if this:
>
> - __user_write_access_begin(uevent, sizeof(*uevent));
> + if (!user_write_access_begin(uevent, sizeof(*uevent)))
> + return NULL;
> unsafe_put_user(revents, &uevent->events, efault);
> unsafe_put_user(data, &uevent->data, efault);
> user_access_end();
>
> is measurably slower than what was in your series. If it is
> not measurably slower, then the series gets simpler because it
> does not need to refactor user_write_access_begin(). It also ends
> up more obviously correct because the access check is closer to
> the unsafe_put_user() calls.
>
> Also, the extra access_ok() is *much* cheaper than stac/clac.
Sorry for the late reply!
I rebased on 19ab0a22efbd and tested 4 versions on
AMD EPYC 7B12 machine:
1) Base 19ab0a22efbd
2) masked_user_access_begin()
-> 97% pps and 96% calls of ep_try_send_events()
3) user_write_access_begin() (Dave's diff above) (NEW)
-> 102.2% pps and 103% calls of ep_try_send_events()
4) __user_write_access_begin() (This patch)
-> 102.4% pps and 103% calls of ep_try_send_events().
Interestingly, user_write_access_begin() was as fast as
__user_write_access_begin()!
Also, as in the previous result, masked_user_access_begin()
was somehow the worst.
So, I'll drop patch 1 and post v2 with user_write_access_begin().
Thank you!
1) Base (19ab0a22efbd)
# nstat > /dev/null; sleep 10; nstat | grep -i udp
Udp6InDatagrams 2184011 0.0
@ep_try_send_events_ns:
[256, 512) 2796601 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 627863 |@@@@@@@@@@@ |
[1K, 2K) 166403 |@@@ |
[2K, 4K) 10437 | |
[4K, 8K) 1396 | |
[8K, 16K) 116 | |
2) can_do_masked_user_access() + masked_user_access_begin()
97% pps compared to 1).
96% calls of ep_try_send_events().
# nstat > /dev/null; sleep 10; nstat | grep -i udp
Udp6InDatagrams 2120498 0.0
@ep_try_send_events_ns:
[256, 512) 2690803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 533750 |@@@@@@@@@@ |
[1K, 2K) 225969 |@@@@ |
[2K, 4K) 35176 | |
[4K, 8K) 2428 | |
[8K, 16K) 199 | |
3) user_write_access_begin()
102.2% pps compared to 1).
103% calls of ep_try_send_events().
# nstat > /dev/null; sleep 10; nstat | grep -i udp
Udp6InDatagrams 2232730 0.0
@ep_try_send_events_ns:
[256, 512) 2900655 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 622045 |@@@@@@@@@@@ |
[1K, 2K) 172831 |@@@ |
[2K, 4K) 17687 | |
[4K, 8K) 1103 | |
[8K, 16K) 174 | |
4) __user_write_access_begin()
102.4% pps compared to 1).
103% calls of ep_try_send_events().
# nstat > /dev/null; sleep 10; nstat | grep -i udp
Udp6InDatagrams 2238524 0.0
@ep_try_send_events_ns:
[256, 512) 2906752 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 630199 |@@@@@@@@@@@ |
[1K, 2K) 161741 |@@ |
[2K, 4K) 17141 | |
[4K, 8K) 1041 | |
[8K, 16K) 61 | |
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-28 5:32 ` Kuniyuki Iwashima
@ 2025-10-28 9:54 ` David Laight
2025-10-28 16:42 ` Kuniyuki Iwashima
0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2025-10-28 9:54 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: dave.hansen, alex, aou, axboe, bp, brauner, catalin.marinas,
christophe.leroy, dave.hansen, edumazet, hpa, kuni1840,
linux-arm-kernel, linux-kernel, linux-riscv, linuxppc-dev, maddy,
mingo, mpe, npiggin, palmer, pjw, tglx, torvalds, will, x86
On Tue, 28 Oct 2025 05:32:13 +0000
Kuniyuki Iwashima <kuniyu@google.com> wrote:
....
> I rebased on 19ab0a22efbd and tested 4 versions on
> AMD EPYC 7B12 machine:
That is zen5 which I believe has much faster clac/stac than anything else.
(It might also have a faster lfence - not sure.)
Getting a 3% change for that diff also seems unlikely.
Even if you halved the execution time of that code the system would have
to be spending 6% of the time in that loop.
Even your original post only shows 1% in ep_try_send_events().
An 'interesting' test is to replicate the code you are optimising
to see how much slower it goes - you can't gain more than the slowdown.
What is more likely is that breathing on the code changes the cache
line layout and that causes a larger performance change.
A better test for epoll_put_uevent() would be to create 1000 fds (pipes or events).
Then time calls to epoll_wait() that return lots of events.
David
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-28 9:54 ` David Laight
@ 2025-10-28 16:42 ` Kuniyuki Iwashima
2025-10-28 16:58 ` Linus Torvalds
2025-10-28 22:30 ` David Laight
0 siblings, 2 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2025-10-28 16:42 UTC (permalink / raw)
To: David Laight
Cc: dave.hansen, alex, aou, axboe, bp, brauner, catalin.marinas,
christophe.leroy, dave.hansen, edumazet, hpa, kuni1840,
linux-arm-kernel, linux-kernel, linux-riscv, linuxppc-dev, maddy,
mingo, mpe, npiggin, palmer, pjw, tglx, torvalds, will, x86
On Tue, Oct 28, 2025 at 2:54 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Tue, 28 Oct 2025 05:32:13 +0000
> Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> ....
> > I rebased on 19ab0a22efbd and tested 4 versions on
> > AMD EPYC 7B12 machine:
>
> That is zen5 which I believe has much faster clac/stac than anything else.
> (It might also have a faster lfence - not sure.)
This is the Zen 2 platform, so probably the stac/clac cost will be
more expensive than you expect on Zen 5.
>
> Getting a 3% change for that diff also seems unlikely.
> Even if you halved the execution time of that code the system would have
> to be spending 6% of the time in that loop.
> Even your original post only shows 1% in ep_try_send_events().
We saw a similar improvement on the same platform by
1fb0e471611d ("net: remove one stac/clac pair from
move_addr_to_user()").
>
> An 'interesting' test is to replicate the code you are optimising
> to see how much slower it goes - you can't gain more than the slowdown.
>
> What is more likely is that breathing on the code changes the cache
> line layout and that causes a larger performance change.
>
> A better test for epoll_put_uevent() would be to create 1000 fds (pipes or events).
> Then time calls to epoll_wait() that return lots of events.
>
> David
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-28 16:42 ` Kuniyuki Iwashima
@ 2025-10-28 16:58 ` Linus Torvalds
2025-10-29 1:42 ` Andrew Cooper
2025-10-28 22:30 ` David Laight
1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-10-28 16:58 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David Laight, dave.hansen, alex, aou, axboe, bp, brauner,
catalin.marinas, christophe.leroy, dave.hansen, edumazet, hpa,
kuni1840, linux-arm-kernel, linux-kernel, linux-riscv,
linuxppc-dev, maddy, mingo, mpe, npiggin, palmer, pjw, tglx, will,
x86
On Tue, 28 Oct 2025 at 09:42, Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> This is the Zen 2 platform, so probably the stac/clac cost will be
> more expensive than you expect on Zen 5.
Yeah, clac/stac are horrendously expensive on Zen 2. I think they are
microcoded - and not the fast kind - so effectively serializing.
They got enormously faster and pipelined in later Zen microarchitectures.
Sadly, Agner Fog doesn't do timings on those instructions, and I
haven't found any other place that has some overview of the
performance on different microarchitectures.
Linus
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-28 16:42 ` Kuniyuki Iwashima
2025-10-28 16:58 ` Linus Torvalds
@ 2025-10-28 22:30 ` David Laight
1 sibling, 0 replies; 16+ messages in thread
From: David Laight @ 2025-10-28 22:30 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: dave.hansen, alex, aou, axboe, bp, brauner, catalin.marinas,
christophe.leroy, dave.hansen, edumazet, hpa, kuni1840,
linux-arm-kernel, linux-kernel, linux-riscv, linuxppc-dev, maddy,
mingo, mpe, npiggin, palmer, pjw, tglx, torvalds, will, x86
On Tue, 28 Oct 2025 09:42:25 -0700
Kuniyuki Iwashima <kuniyu@google.com> wrote:
> On Tue, Oct 28, 2025 at 2:54 AM David Laight
> <david.laight.linux@gmail.com> wrote:
> >
> > On Tue, 28 Oct 2025 05:32:13 +0000
> > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > ....
> > > I rebased on 19ab0a22efbd and tested 4 versions on
> > > AMD EPYC 7B12 machine:
> >
> > That is zen5 which I believe has much faster clac/stac than anything else.
> > (It might also have a faster lfence - not sure.)
>
> This is the Zen 2 platform, so probably the stac/clac cost will be
> more expensive than you expect on Zen 5.
I must have looked up the cpu type incorrectly.
AMD haven't made it easy to work out the cpu architecture.
I need to get an older zen cpu for my set of test systems
(and some newer Intel ones).
> > Getting a 3% change for that diff also seems unlikely.
> > Even if you halved the execution time of that code the system would have
> > to be spending 6% of the time in that loop.
> > Even your original post only shows 1% in ep_try_send_events().
I realised afterwards that you might be showing a 3% change in that 1%.
>
> We saw a similar improvement on the same platform by
> 1fb0e471611d ("net: remove one stac/clac pair from
> move_addr_to_user()").
Certainly removing one could easily be measurable.
David
* Re: [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent().
2025-10-28 16:58 ` Linus Torvalds
@ 2025-10-29 1:42 ` Andrew Cooper
0 siblings, 0 replies; 16+ messages in thread
From: Andrew Cooper @ 2025-10-29 1:42 UTC (permalink / raw)
To: torvalds
Cc: alex, aou, axboe, bp, brauner, catalin.marinas, christophe.leroy,
dave.hansen, dave.hansen, david.laight.linux, edumazet, hpa,
kuni1840, kuniyu, linux-arm-kernel, linux-kernel, linux-riscv,
linuxppc-dev, maddy, mingo, mpe, npiggin, palmer, pjw, tglx, will,
x86
> Yeah, clac/stac are horrendously expensive on Zen 2. I think they are
> microcoded - and not the fast kind - so effectively serializing.
AIUI, Zen5 is the first uarch capable of renaming the AC flag.
This makes STAC/CLAC "lfence-like" in practice on older CPUs, because of
an implementation constraint of not being able to execute speculatively.
~Andrew
end of thread, other threads:[~2025-10-29 1:42 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-23 0:04 [PATCH v1 0/2] epoll: Save one stac/clac pair in epoll_put_uevent() Kuniyuki Iwashima
2025-10-23 0:04 ` [PATCH v1 1/2] uaccess: Add __user_write_access_begin() Kuniyuki Iwashima
2025-10-23 5:37 ` Linus Torvalds
2025-10-23 8:29 ` David Laight
2025-10-24 5:31 ` Kuniyuki Iwashima
2025-10-23 0:04 ` [PATCH v1 2/2] epoll: Use __user_write_access_begin() and unsafe_put_user() in epoll_put_uevent() Kuniyuki Iwashima
2025-10-23 19:40 ` Dave Hansen
2025-10-24 5:16 ` Kuniyuki Iwashima
2025-10-24 14:05 ` Dave Hansen
2025-10-24 14:47 ` David Laight
2025-10-28 5:32 ` Kuniyuki Iwashima
2025-10-28 9:54 ` David Laight
2025-10-28 16:42 ` Kuniyuki Iwashima
2025-10-28 16:58 ` Linus Torvalds
2025-10-29 1:42 ` Andrew Cooper
2025-10-28 22:30 ` David Laight