* [PATCH v4 1/2] arm64: Optimize __READ_ONCE() with CONFIG_LTO=y
2026-02-16 14:16 [PATCH v4 0/2] arm64: Fixes for __READ_ONCE() with CONFIG_LTO=y Marco Elver
@ 2026-02-16 14:16 ` Marco Elver
2026-02-16 14:16 ` [PATCH v4 2/2] arm64, compiler-context-analysis: Permit alias analysis through " Marco Elver
2026-02-27 3:16 ` [PATCH v4 0/2] arm64: Fixes for " Will Deacon
2 siblings, 0 replies; 6+ messages in thread
From: Marco Elver @ 2026-02-16 14:16 UTC (permalink / raw)
To: elver, Peter Zijlstra, Will Deacon
Cc: Ingo Molnar, Thomas Gleixner, Boqun Feng, Waiman Long,
Bart Van Assche, llvm, David Laight, Al Viro, Catalin Marinas,
Nathan Chancellor, Arnd Bergmann, linux-arm-kernel, linux-kernel
Rework arm64 LTO __READ_ONCE() to improve code generation as follows:
1. Replace the _Generic-based __unqual_scalar_typeof() with the more
complete __rwonce_typeof_unqual(). This strips qualifiers from all
types, not just integer types, which is required so that __u.__val
(which must be non-const) can be assigned to in the non-atomic case
(needed for #2).
One subtle point here is that, with the old __unqual_scalar_typeof(),
non-integer types of __val could end up const or volatile within the
union if the passed variable is const or volatile. A volatile __u.__val
would then force a load from the stack; a const one looks odd if the
underlying storage changes, given the compiler is told said member is
"const" -- it smells like UB.
2. Eliminate the atomic flag and ternary conditional expression. Move
the fallback volatile load into the default case of the switch,
ensuring __u is unconditionally initialized across all paths.
The statement expression now unconditionally returns __u.__val.
This refactoring appears to help the compiler improve (or fix) code
generation.
With a defconfig + LTO + debug options build, we observe different
codegen for the following functions:
btrfs_reclaim_sweep (708 -> 1032 bytes)
btrfs_sinfo_bg_reclaim_threshold_store (200 -> 204 bytes)
check_mem_access (3652 -> 3692 bytes) [inlined bpf_map_is_rdonly]
console_flush_all (1268 -> 1264 bytes)
console_lock_spinning_disable_and_check (180 -> 176 bytes)
igb_add_filter (640 -> 636 bytes)
igb_config_tx_modes (2404 -> 2400 bytes)
kvm_vcpu_on_spin (480 -> 476 bytes)
map_freeze (376 -> 380 bytes)
netlink_bind (1664 -> 1656 bytes)
nmi_cpu_backtrace (404 -> 400 bytes)
set_rps_cpu (516 -> 520 bytes)
swap_cluster_readahead (944 -> 932 bytes)
tcp_accecn_third_ack (328 -> 336 bytes)
tcp_create_openreq_child (1764 -> 1772 bytes)
tcp_data_queue (5784 -> 5892 bytes)
tcp_ecn_rcv_synack (620 -> 628 bytes)
xen_manage_runstate_time (944 -> 896 bytes)
xen_steal_clock (340 -> 296 bytes)
The size increases in some functions are due to more aggressive inlining
enabled by the better codegen (in this build, e.g. bpf_map_is_rdonly no
longer appears as a standalone function because it is inlined
completely).
NOTE: The return-value-of-function-drops-qualifiers hack was first
suggested by Al Viro in [1], which notes some of its limitations that
make it unsuitable as a general __unqual_scalar_typeof() replacement.
Notably, array types are not supported, and GCC 8.1-8.3 still fail. Why
use it here anyway? READ_ONCE() does not support reading whole arrays,
and the GCC problem only affects three point releases of an old (but
still-supported) GCC series; not only that, this arm64 READ_ONCE()
version is currently only activated by LTO builds, which to-date are
*only supported by Clang*!
Link: https://lore.kernel.org/all/20260111182010.GH3634291@ZenIV/ [1]
Signed-off-by: Marco Elver <elver@google.com>
---
v4:
* Use the return-value-of-function-drops-qualifiers hack to paint the
bikeshed.
v3:
* Comment.
v2:
* Add __rwonce_typeof_unqual() as fallback for old compilers.
---
arch/arm64/include/asm/rwonce.h | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
index fc0fb42b0b64..9fd24cef3376 100644
--- a/arch/arm64/include/asm/rwonce.h
+++ b/arch/arm64/include/asm/rwonce.h
@@ -19,6 +19,17 @@
"ldapr" #sfx "\t" #regs, \
ARM64_HAS_LDAPR)
+/*
+ * Replace this with typeof_unqual() when minimum compiler versions are
+ * increased to GCC 14 and Clang 19. For the time being, we need this
+ * workaround, which relies on function return values dropping qualifiers.
+ */
+#define __rwonce_typeof_unqual(x) typeof(({ \
+ __diag_push() \
+ __diag_ignore_all("-Wignored-qualifiers", "") \
+ ((typeof(x)(*)(void))0)(); \
+ __diag_pop() }))
+
/*
* When building with LTO, there is an increased risk of the compiler
* converting an address dependency headed by a READ_ONCE() invocation
@@ -32,8 +43,7 @@
#define __READ_ONCE(x) \
({ \
typeof(&(x)) __x = &(x); \
- int atomic = 1; \
- union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
+ union { __rwonce_typeof_unqual(*__x) __val; char __c[1]; } __u; \
switch (sizeof(x)) { \
case 1: \
asm volatile(__LOAD_RCPC(b, %w0, %1) \
@@ -56,9 +66,9 @@
: "Q" (*__x) : "memory"); \
break; \
default: \
- atomic = 0; \
+ __u.__val = *(volatile typeof(*__x) *)__x; \
} \
- atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(*__x) *)__x);\
+ __u.__val; \
})
#endif /* !BUILD_VDSO */
--
2.53.0.335.g19a08e0c02-goog
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH v4 2/2] arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
2026-02-16 14:16 [PATCH v4 0/2] arm64: Fixes for __READ_ONCE() with CONFIG_LTO=y Marco Elver
2026-02-16 14:16 ` [PATCH v4 1/2] arm64: Optimize " Marco Elver
@ 2026-02-16 14:16 ` Marco Elver
2026-02-16 18:00 ` David Laight
2026-02-27 3:16 ` [PATCH v4 0/2] arm64: Fixes for " Will Deacon
2 siblings, 1 reply; 6+ messages in thread
From: Marco Elver @ 2026-02-16 14:16 UTC (permalink / raw)
To: elver, Peter Zijlstra, Will Deacon
Cc: Ingo Molnar, Thomas Gleixner, Boqun Feng, Waiman Long,
Bart Van Assche, llvm, David Laight, Al Viro, Catalin Marinas,
Nathan Chancellor, Arnd Bergmann, linux-arm-kernel, linux-kernel,
kernel test robot, Boqun Feng
When enabling Clang's Context Analysis (aka. Thread Safety Analysis) on
kernel/futex/core.o (see Peter's changes at [1]), in arm64 LTO builds we
could see:
| kernel/futex/core.c:982:1: warning: spinlock 'atomic ? __u.__val : q->lock_ptr' is still held at the end of function [-Wthread-safety-analysis]
| 982 | }
| | ^
| kernel/futex/core.c:976:2: note: spinlock acquired here
| 976 | spin_lock(lock_ptr);
| | ^
| kernel/futex/core.c:982:1: warning: expecting spinlock 'q->lock_ptr' to be held at the end of function [-Wthread-safety-analysis]
| 982 | }
| | ^
| kernel/futex/core.c:966:6: note: spinlock acquired here
| 966 | void futex_q_lockptr_lock(struct futex_q *q)
| | ^
| 2 warnings generated.
Where we have:
extern void futex_q_lockptr_lock(struct futex_q *q) __acquires(q->lock_ptr);
..
void futex_q_lockptr_lock(struct futex_q *q)
{
spinlock_t *lock_ptr;
/*
* See futex_unqueue() why lock_ptr can change.
*/
guard(rcu)();
retry:
>> lock_ptr = READ_ONCE(q->lock_ptr);
spin_lock(lock_ptr);
...
}
At the time of the above report (prior to removal of the 'atomic' flag),
Clang Thread Safety Analysis's alias analysis resolved 'lock_ptr' to
'atomic ? __u.__val : q->lock_ptr' (now just '__u.__val'), and used
this as the identity of the context lock given it cannot "see through"
the inline assembly; however, we want 'q->lock_ptr' as the canonical
context lock.
While for code generation the compiler simplified this to '__u.__val'
for pointers (the 8-byte case, where 'atomic' was set), TSA's analysis
(a) happens much earlier, on the AST, and (b) would reach the wrong
deduction.
Now that we've gotten rid of the 'atomic' ternary expression, we can
return '__u.__val' through a pointer that is initialized to '&(x)' but
then updated via a pointer-to-pointer. When READ_ONCE()'ing a context
lock pointer, TSA's alias analysis does not invalidate the initial alias
when the pointer is updated through the pointer-to-pointer, effectively
making it "see through" the __READ_ONCE().
Code generation is unchanged.
Link: https://lkml.kernel.org/r/20260121110704.221498346@infradead.org [1]
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202601221040.TeM0ihff-lkp@intel.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Boqun Feng <boqun@kernel.org>
Reviewed-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Marco Elver <elver@google.com>
---
v3:
* Use 'typeof(*__ret)'.
* Commit message.
v2:
* Rebase.
---
arch/arm64/include/asm/rwonce.h | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
index 9fd24cef3376..0f3a01d30f66 100644
--- a/arch/arm64/include/asm/rwonce.h
+++ b/arch/arm64/include/asm/rwonce.h
@@ -42,8 +42,12 @@
*/
#define __READ_ONCE(x) \
({ \
- typeof(&(x)) __x = &(x); \
- union { __rwonce_typeof_unqual(*__x) __val; char __c[1]; } __u; \
+ auto __x = &(x); \
+ auto __ret = (__rwonce_typeof_unqual(*__x) *)__x; \
+ /* Hides alias reassignment from Clang's -Wthread-safety. */ \
+ auto __retp = &__ret; \
+ union { typeof(*__ret) __val; char __c[1]; } __u; \
+ *__retp = &__u.__val; \
switch (sizeof(x)) { \
case 1: \
asm volatile(__LOAD_RCPC(b, %w0, %1) \
@@ -68,7 +72,7 @@
default: \
__u.__val = *(volatile typeof(*__x) *)__x; \
} \
- __u.__val; \
+ *__ret; \
})
#endif /* !BUILD_VDSO */
--
2.53.0.335.g19a08e0c02-goog
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v4 2/2] arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
2026-02-16 14:16 ` [PATCH v4 2/2] arm64, compiler-context-analysis: Permit alias analysis through " Marco Elver
@ 2026-02-16 18:00 ` David Laight
2026-02-16 18:40 ` Marco Elver
0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2026-02-16 18:00 UTC (permalink / raw)
To: Marco Elver
Cc: Peter Zijlstra, Will Deacon, Ingo Molnar, Thomas Gleixner,
Boqun Feng, Waiman Long, Bart Van Assche, llvm, Al Viro,
Catalin Marinas, Nathan Chancellor, Arnd Bergmann,
linux-arm-kernel, linux-kernel, kernel test robot, Boqun Feng
On Mon, 16 Feb 2026 15:16:23 +0100
Marco Elver <elver@google.com> wrote:
> When enabling Clang's Context Analysis (aka. Thread Safety Analysis) on
> kernel/futex/core.o (see Peter's changes at [1]), in arm64 LTO builds we
> could see:
>
> | kernel/futex/core.c:982:1: warning: spinlock 'atomic ? __u.__val : q->lock_ptr' is still held at the end of function [-Wthread-safety-analysis]
> | 982 | }
> | | ^
> | kernel/futex/core.c:976:2: note: spinlock acquired here
> | 976 | spin_lock(lock_ptr);
> | | ^
> | kernel/futex/core.c:982:1: warning: expecting spinlock 'q->lock_ptr' to be held at the end of function [-Wthread-safety-analysis]
> | 982 | }
> | | ^
> | kernel/futex/core.c:966:6: note: spinlock acquired here
> | 966 | void futex_q_lockptr_lock(struct futex_q *q)
> | | ^
> | 2 warnings generated.
>
> Where we have:
>
> extern void futex_q_lockptr_lock(struct futex_q *q) __acquires(q->lock_ptr);
> ..
> void futex_q_lockptr_lock(struct futex_q *q)
> {
> spinlock_t *lock_ptr;
>
> /*
> * See futex_unqueue() why lock_ptr can change.
> */
> guard(rcu)();
> retry:
> >> lock_ptr = READ_ONCE(q->lock_ptr);
Did you try adding OPTIMIZER_HIDE_VAR(lock_ptr) here?
That might force the TSA logic to use 'where lock_ptr points to'
instead of trying to allow an unlock(q->lock_ptr) (or similar)
which is clearly entirely broken.
Testing a compile with the unlock() missing ought to generate the
warning - if not, you've just confused the code enough that it stops
caring.
David
> spin_lock(lock_ptr);
> ...
> }
>
> At the time of the above report (prior to removal of the 'atomic' flag),
> Clang Thread Safety Analysis's alias analysis resolved 'lock_ptr' to
> 'atomic ? __u.__val : q->lock_ptr' (now just '__u.__val'), and used
> this as the identity of the context lock given it cannot "see through"
> the inline assembly; however, we want 'q->lock_ptr' as the canonical
> context lock.
>
> While for code generation the compiler simplified to '__u.__val' for
> pointers (8 byte case -> 'atomic' was set), TSA's analysis (a) happens
> much earlier on the AST, and (b) would be the wrong deduction.
>
> Now that we've gotten rid of the 'atomic' ternary comparison, we can
> return '__u.__val' through a pointer that we initialize with '&x', but
> then update via a pointer-to-pointer. When READ_ONCE()'ing a context
> lock pointer, TSA's alias analysis does not invalidate the initial alias
> when updated through the pointer-to-pointer, and we make it effectively
> "see through" the __READ_ONCE().
>
> Code generation is unchanged.
>
> Link: https://lkml.kernel.org/r/20260121110704.221498346@infradead.org [1]
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202601221040.TeM0ihff-lkp@intel.com/
> Cc: Peter Zijlstra <peterz@infradead.org>
> Tested-by: Boqun Feng <boqun@kernel.org>
> Reviewed-by: David Laight <david.laight.linux@gmail.com>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
> v3:
> * Use 'typeof(*__ret)'.
> * Commit message.
>
> v2:
> * Rebase.
> ---
> arch/arm64/include/asm/rwonce.h | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/rwonce.h b/arch/arm64/include/asm/rwonce.h
> index 9fd24cef3376..0f3a01d30f66 100644
> --- a/arch/arm64/include/asm/rwonce.h
> +++ b/arch/arm64/include/asm/rwonce.h
> @@ -42,8 +42,12 @@
> */
> #define __READ_ONCE(x) \
> ({ \
> - typeof(&(x)) __x = &(x); \
> - union { __rwonce_typeof_unqual(*__x) __val; char __c[1]; } __u; \
> + auto __x = &(x); \
> + auto __ret = (__rwonce_typeof_unqual(*__x) *)__x; \
> + /* Hides alias reassignment from Clang's -Wthread-safety. */ \
> + auto __retp = &__ret; \
> + union { typeof(*__ret) __val; char __c[1]; } __u; \
> + *__retp = &__u.__val; \
> switch (sizeof(x)) { \
> case 1: \
> asm volatile(__LOAD_RCPC(b, %w0, %1) \
> @@ -68,7 +72,7 @@
> default: \
> __u.__val = *(volatile typeof(*__x) *)__x; \
> } \
> - __u.__val; \
> + *__ret; \
> })
>
> #endif /* !BUILD_VDSO */
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v4 2/2] arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
2026-02-16 18:00 ` David Laight
@ 2026-02-16 18:40 ` Marco Elver
0 siblings, 0 replies; 6+ messages in thread
From: Marco Elver @ 2026-02-16 18:40 UTC (permalink / raw)
To: David Laight
Cc: Peter Zijlstra, Will Deacon, Ingo Molnar, Thomas Gleixner,
Boqun Feng, Waiman Long, Bart Van Assche, llvm, Al Viro,
Catalin Marinas, Nathan Chancellor, Arnd Bergmann,
linux-arm-kernel, linux-kernel, kernel test robot, Boqun Feng
On Mon, 16 Feb 2026 at 19:00, David Laight <david.laight.linux@gmail.com> wrote:
>
> On Mon, 16 Feb 2026 15:16:23 +0100
> Marco Elver <elver@google.com> wrote:
>
> > When enabling Clang's Context Analysis (aka. Thread Safety Analysis) on
> > kernel/futex/core.o (see Peter's changes at [1]), in arm64 LTO builds we
> > could see:
> >
> > | kernel/futex/core.c:982:1: warning: spinlock 'atomic ? __u.__val : q->lock_ptr' is still held at the end of function [-Wthread-safety-analysis]
> > | 982 | }
> > | | ^
> > | kernel/futex/core.c:976:2: note: spinlock acquired here
> > | 976 | spin_lock(lock_ptr);
> > | | ^
> > | kernel/futex/core.c:982:1: warning: expecting spinlock 'q->lock_ptr' to be held at the end of function [-Wthread-safety-analysis]
> > | 982 | }
> > | | ^
> > | kernel/futex/core.c:966:6: note: spinlock acquired here
> > | 966 | void futex_q_lockptr_lock(struct futex_q *q)
> > | | ^
> > | 2 warnings generated.
> >
> > Where we have:
> >
> > extern void futex_q_lockptr_lock(struct futex_q *q) __acquires(q->lock_ptr);
> > ..
> > void futex_q_lockptr_lock(struct futex_q *q)
> > {
> > spinlock_t *lock_ptr;
> >
> > /*
> > * See futex_unqueue() why lock_ptr can change.
> > */
> > guard(rcu)();
> > retry:
> > >> lock_ptr = READ_ONCE(q->lock_ptr);
>
> Did you try adding OPTIMZER_HIDE_VAR(lock_ptr) here?
> That might force the TSA logic to use 'where lock_ptr points to'
> instead of trying to allow an unlock(q->lock_ptr) (or similar)
> which is clearly entirely broken.
>
> Testing a compile with the unlock() missing ought to generate the
> warning - if not you've just confused the code enough the it stops
> caring.
OPTIMIZER_HIDE_VAR() is not appropriate, as it might pessimise real
codegen, which we don't want - the warning/analysis happens much
earlier, during semantic analysis. But we have 'context_unsafe_alias()', which could
be used for this purpose, so 'context_unsafe_alias(lock_ptr)' after
the READ_ONCE() would work for what you intended. Except that this
function wants to be annotated with __acquires(q->lock_ptr) so the
compiler needs to be able to resolve the alias properly for this to
work.
Overall we can't really expect the compiler to see through inline asm
during semantic analysis, so we need to trick it - codegen should not
be affected for any compiler that does a reasonable job of folding
these local variable accesses.
Note, it does work for all other architectures as-is, given most don't
use inline asm for READ_ONCE().
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v4 0/2] arm64: Fixes for __READ_ONCE() with CONFIG_LTO=y
2026-02-16 14:16 [PATCH v4 0/2] arm64: Fixes for __READ_ONCE() with CONFIG_LTO=y Marco Elver
2026-02-16 14:16 ` [PATCH v4 1/2] arm64: Optimize " Marco Elver
2026-02-16 14:16 ` [PATCH v4 2/2] arm64, compiler-context-analysis: Permit alias analysis through " Marco Elver
@ 2026-02-27 3:16 ` Will Deacon
2 siblings, 0 replies; 6+ messages in thread
From: Will Deacon @ 2026-02-27 3:16 UTC (permalink / raw)
To: Peter Zijlstra, Marco Elver
Cc: catalin.marinas, kernel-team, Will Deacon, Ingo Molnar,
Waiman Long, Bart Van Assche, llvm, David Laight, Al Viro,
Nathan Chancellor, Arnd Bergmann, linux-arm-kernel, linux-kernel,
Thomas Gleixner, Boqun Feng
On Mon, 16 Feb 2026 15:16:21 +0100, Marco Elver wrote:
> While investigating a Clang Context Analysis [1] false positive [2], I
> started to dig deeper into arm64's __READ_ONCE() implementation with
> LTO. That rabbit hole led me to find one critical bug with the current
> implementation (fixed already in [3]), and subtle improvements that then
> enabled me to fix the original false positive.
>
> Patch 1 refactors the macro to use a different way of getting an
> unqualified type and eliminates the ternary conditional.
>
> [...]
Applied to arm64 (for-next/read-once), thanks!
[1/2] arm64: Optimize __READ_ONCE() with CONFIG_LTO=y
https://git.kernel.org/arm64/c/abf1be684dc2
[2/2] arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
https://git.kernel.org/arm64/c/773b24bcedc1
Peter -- please don't pull this just yet. I'd like it to get some
exposure to the various build bots and CI systems in case of any nasty
toolchain interactions.
Cheers,
--
Will
https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
^ permalink raw reply [flat|nested] 6+ messages in thread