Hi Paul,

In liburcu, `rcu_dereference()' is implemented either as a volatile access with `CMM_LOAD_SHARED()' followed by a memory barrier for dependent loads, or as an atomic load with the CONSUME memory ordering (configurable by users on a per-compilation-unit basis).

However, it is my understanding that the CONSUME memory ordering semantic has some deficiencies [0] and will be promoted to the ACQUIRE memory ordering. This is somewhat inefficient on weakly-ordered architectures (see the benchmarks at the end) [1]:

rcu_dereference_consume:
        sub     sp, sp, #16
        add     x1, sp, 8
        str     x0, [sp, 8]
        ldar    x0, [x1]        ;; Load-acquire
        add     sp, sp, 16
        ret

rcu_dereference_relaxed:
        sub     sp, sp, #16
        add     x1, sp, 8
        str     x0, [sp, 8]
        ldr     x0, [x1]        ;; Plain load
        add     sp, sp, 16
        ret

I had a discussion with Mathieu about this: using the RELAXED memory ordering (on every architecture except Alpha) plus a compiler barrier does not prevent compiler value-speculation optimizations (e.g. VSS: Value Speculation Scheduling). Consider the following code:

#define cmm_barrier() asm volatile ("" : : : "memory")

#define rcu_dereference(p) __atomic_load_n(&(p), __ATOMIC_RELAXED)

// Assume QSBR flavor.
#define rcu_read_lock() do { } while (0)
#define rcu_read_unlock() do { } while (0)

struct foo {
        long x;
};

struct foo *foo;

extern void do_stuff(long);

// Assume that the global pointer `foo' is never NULL for simplicity.
void func(void)
{
        struct foo *a, *b;

        rcu_read_lock();
        {
                a = rcu_dereference(foo);
                do_stuff(a->x);
        }
        rcu_read_unlock();

        cmm_barrier();

        rcu_read_lock();
        {
                b = rcu_dereference(foo);
                if (a == b)
                        do_stuff(b->x);
        }
        rcu_read_unlock();
}

and the resulting assembler on ARM64 (GCC 14.2.0) [2]:

func:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        adrp    x19, .LANCHOR0
        add     x19, x19, :lo12:.LANCHOR0
        ldr     x20, [x19]      ;; a = rcu_dereference  | <-- here ...
        ldr     x0, [x20]       ;; a->x
        bl      do_stuff
        ldr     x0, [x19]       ;; b = rcu_dereference
        cmp     x20, x0
        beq     .L5
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
.L5:
        ldr     x0, [x20]       ;; b->x | can be reordered up to ...
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        b       do_stuff
foo:
        .zero   8

From my understanding of the ARM memory model and its ISA, the processor is within its rights to reorder the `ldr x0, [x20]' in `.L5' up to its address dependency at `ldr x20, [x19]', which happens before the RCU dereference of `b'. This looks similar to what Mathieu described here [3].

Our proposed solution is to keep using the CONSUME memory ordering by default, guaranteeing correctness above all in all cases. However, to allow for better performance, users can opt in to "traditional" volatile access instead of atomic builtins for `rcu_dereference()', as long as pointer comparisons on the result are either avoided or done through the `ptr_eq' wrapper proposed by Mathieu [3]. Thus, `rcu_dereference()' would be defined as something like:

#ifdef URCU_DEREFERENCE_USE_VOLATILE
# define rcu_dereference(p)                                     \
        __extension__ ({                                        \
                __typeof__(p) _p = CMM_LOAD_SHARED(p);          \
                cmm_smp_rmc();                                  \
                _p;                                             \
        })
#else
# define rcu_dereference(p) uatomic_load(&(p), CMM_CONSUME)
#endif

With the volatile variant and `cmm_ptr_eq' for the pointer comparison, the example above compiles to (ARM64, GCC 14.2.0) [4]:

func:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        adrp    x20, .LANCHOR0
        ldr     x19, [x20, #:lo12:.LANCHOR0]    ;; a = rcu_dereference
        ldr     x0, [x19]                       ;; a->x
        bl      do_stuff
        ldr     x2, [x20, #:lo12:.LANCHOR0]     ;; b = rcu_dereference  | <-- here ...
        mov     x0, x19                         ;; side effect of cmm_ptr_eq, forces the use of more registers
        mov     x1, x2                          ;; and more registers
        cmp     x0, x1
        beq     .L5
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret
.L5:
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ldr     x0, [x2]                        ;; b->x | can be re-ordered up to ...
        b       do_stuff
foo:
        .zero   8
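For completeness, here is a minimal sketch of what such a `cmm_ptr_eq' wrapper could look like, modeled on the `ptr_eq' proposal in [3]; the exact name, constraints and header placement are illustrative only, not a final API. The empty asm statements with "+r" constraints hide the compared values from the optimizer, so the compiler cannot substitute one pointer for the other after a successful comparison, which is precisely what re-enables value speculation:

/*
 * Sketch only: compare two pointers without letting the compiler
 * exploit their equality afterwards.  Each value is laundered
 * through a register the optimizer cannot see through, so the
 * address dependency on the second load is preserved.
 */
#define cmm_ptr_eq(a, b)                                \
        __extension__ ({                                \
                __typeof__(a) _a = (a);                 \
                __typeof__(b) _b = (b);                 \
                                                        \
                asm volatile ("" : "+r" (_a));          \
                asm volatile ("" : "+r" (_b));          \
                _a == _b;                               \
        })

In the example above, `if (a == b)' then becomes `if (cmm_ptr_eq(a, b))', which is what produces the extra `mov' instructions visible in the listing.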
The overall pros and cons of selecting the volatile access for `rcu_dereference()':

Pros:

  - Yields better performance on weakly-ordered architectures for all `rcu_dereference()' calls.

Cons:

  - Users would need to use `cmm_ptr_eq' for pointer comparisons, even on strongly-ordered architectures.

  - `cmm_ptr_eq' can increase register pressure, resulting in possible register spilling.

Here is a benchmark summary; you can find more details in the attached file.

CPU: AArch64 Cortex-A57
Program run under perf, tight loop over the above example, 1 000 000 000 iterations.

Variants are:

  - Baseline v0.14.1 :: rcu_dereference() implemented with CMM_ACCESS_ONCE(). Pointer comparisons with the `==' operator.

  - Volatile access  :: rcu_dereference() implemented with CMM_ACCESS_ONCE(). Pointer comparisons with cmm_ptr_eq.

  - Atomic builtins  :: rcu_dereference() implemented with __atomic_load_n() CONSUME. Pointer comparisons with cmm_ptr_eq.

All variants were compiled with _LGPL_SOURCE. The last two rows are relative to the Baseline.

| Variant         |    Time [s] |        Cycles |   Instructions | Branch misses |
|-----------------+-------------+---------------+----------------+---------------|
| Baseline        | 4.217609351 | 8 015 627 017 | 15 008 330 513 |        26 607 |
|-----------------+-------------+---------------+----------------+---------------|
| Volatile access |    +10.95 % |      +11.14 % |        +6.25 % |      +10.81 % |
| Atomic builtins |   +423.18 % |     +425.94 % |        +6.87 % |     +188.37 % |

Any thoughts on that?

Thanks,
Olivier

[0] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[1] https://godbolt.org/z/xxqGPjaxK
[2] https://godbolt.org/z/cPzxq7PKb
[3] https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
[4] https://godbolt.org/z/979jnccc9