[lttng-dev] [Userspace RCU] - rcu_dereference() memory ordering - Olivier Dion via lttng-dev

From: Olivier Dion via lttng-dev <lttng-dev@lists.lttng.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: lttng-dev@lists.lttng.org
Subject: [lttng-dev] [Userspace RCU] - rcu_dereference() memory ordering
Date: Mon, 21 Oct 2024 15:53:04 -0400	[thread overview]
Message-ID: <87y12hmdwf.fsf@laura> (raw)

[-- Attachment #1: Type: text/plain, Size: 6377 bytes --]

Hi Paul,

In liburcu, `rcu_dereference()' is either implemented with volatile
access with `CMM_LOAD_SHARED()' followed by a memory barrier depends, or
a atomic load with CONSUME memory ordering (configurable by users on a
compilation unit basis).

However, it is my understanding that the CONSUME memory ordering
semantic has some deficiency [0] and will be promoted to a ACQUIRE
memory ordering.  This is somewhat inefficient (see benchmarks at the
end) for weakly-ordered architectures [1]:

  rcu_dereference_consume:
          sub     sp, sp, #16
          add     x1, sp, 8
          str     x0, [sp, 8]
          ldar    x0, [x1]     ;; Load acquire
          add     sp, sp, 16
          ret     
  rcu_dereference_relaxed:
          sub     sp, sp, #16
          add     x1, sp, 8
          str     x0, [sp, 8]
          ldr     x0, [x1]     ;; Load
          add     sp, sp, 16
          rer

I had a discussion with Mathieu on that, and using the RELAXED memory
ordering (on every architecture except Alpha) + a compiler barrier would
not prevent compiler value-speculative optimizations (e.g. VSS: Value
Speculation Scheduling).

Consider the following code:

  #define cmm_barrier() asm volatile ("" : : : "memory")
  #define rcu_dereference(p) __atomic_load_n(&(p), __ATOMIC_RELAXED)

  // Assume QSBR flavor
  #define rcu_read_lock() do { } while (0)
  #define rcu_read_unlock() do { } while (0)

  struct foo {
      long x;
  };

  struct foo *foo;

  extern void do_stuff(long);

  // Assume that global pointer `foo' is never NULL for simplicity.
  void func(void)
  {
      struct foo *a, *b;

      rcu_read_lock(); {
          a = rcu_dereference(foo);
          do_stuff(a->x);
      } rcu_read_unlock();
      cmm_barrier();
      rcu_read_lock(); {
          b = rcu_dereference(foo);
          if (a == b)
              do_stuff(b->x);
      } rcu_read_unlock();
  }

and the resulting assembler on ARM64 (GCC 14.2.0) [2]:

  func:
          stp     x29, x30, [sp, -32]!
          mov     x29, sp
          stp     x19, x20, [sp, 16]
          adrp    x19, .LANCHOR0
          add     x19, x19, :lo12:.LANCHOR0
          ldr     x20, [x19] ;; a = rcu_dereference | <-- here ...
          ldr     x0, [x20]  ;; a->x
          bl      do_stuff
          ldr     x0, [x19]  ;; b = rcu_dereference
          cmp     x20, x0
          beq     .L5
          ldp     x19, x20, [sp, 16]
          ldp     x29, x30, [sp], 32
          ret
  .L5:
          ldr     x0, [x20] ;; b->x | can be reordered up-to ...
          ldp     x19, x20, [sp, 16]
          ldp     x29, x30, [sp], 32
          b       do_stuff
  foo:
          .zero   8

From my understanding of the ARM memory ordering and its ISA, the
processor is within its right to reorder the `ldr x0, [x20]' in `.L5',
up to its dependency at `ldr x20, [x19]', which happen before the RCU
dereferencing of `b'.

This looks similar to what Mathieu described here [3].

Our proposed solution is to keep using the CONSUME memory ordering by
default, therefore guaranteeing correctness above all for all cases.
However, to allow for better performance, users can opt-in to use
"traditional" volatile access instead of atomic builtins for
`rcu_dereference()', as long as pointer comparisons are avoided or as
long as the `ptr_eq' wrapper proposed by Mathieu [3] is used for them.

Thus, `rcu_dereference()' would be defined as something like:

  #ifdef URCU_DEREFERENCE_USE_VOLATILE
  #  define rcu_dereference(p) do { CMM_LOAD_SHARED(p); cmm_smp_rmc(); } while(0)
  #else
  #  define rcu_dereference(p)  uatomic_load(&(p), CMM_CONSUME)
  #endif

and would yield, if using `cmm_ptr_eq' (ARM64 (GCC 14.2.0)) [4]:

  func:
          stp     x29, x30, [sp, -32]!
          mov     x29, sp
          stp     x19, x20, [sp, 16]
          adrp    x20, .LANCHOR0
          ldr     x19, [x20, #:lo12:.LANCHOR0] ;; a = rcu_dereference
          ldr     x0, [x19] ;; a->x
          bl      do_stuff
          ldr     x2, [x20, #:lo12:.LANCHOR0] ;; b = rcu_dereference | <-- here ...
          mov     x0, x19 ;; side effect of cmm_ptr_eq, force to use more registers
          mov     x1, x2  ;; and more registers
          cmp     x0, x1
          beq     .L5
          ldp     x19, x20, [sp, 16]
          ldp     x29, x30, [sp], 32
          ret
  .L5:
          ldp     x19, x20, [sp, 16]
          ldp     x29, x30, [sp], 32
          ldr     x0, [x2] ;; b->x | can be re-ordered up-to ...
          b       do_stuff
  foo:
          .zero   8

The Pro & Cons overall for selecting the volatile for rcu_dereference:

  Pro:

    - Yield better performance on weakly-ordered architectures for all
      `rcu_dereference'.

  Cons:

    - Users would need to use the `cmm_ptr_eq' for pointer comparisons,
      even on strongly ordered architectures.

    - `cmm_ptr_eq' can increase register pressure, resulting in possible
      register spilling.

Here is a benchmark summary.  You can find more details in the attached
file.

  CPU: Aarch64 Cortex-A57

  Program ran with perf.  Thight loop of the above example 1 000 000 000
  times.

  Variants are:

    - Baseline v0.14.1:: rcu_dereference() implemented with
      CMM_ACCESS_ONCE().  Pointers comparisons with `==' operator.

    - Volatile access:: rcu_dereference() implemented with
      CMM_ACCESS_ONCE().  Pointers comparisons with cmm_ptr_eq.

    - Atomic builtins:: rcu_dereference() implementd with
      __atomic_load_n CONSUME.  Pointers comparisons with cmm_ptr_eq.

  All variants were compiled with _LGPL_SOURCE.  

  | Variant         | Time [s]    | Cycles        | Instructions   | Branch misses |
  |-----------------+-------------+---------------+----------------+---------------|
  | Baseline        | 4.217609351 | 8 015 627 017 | 15 008 330 513 | 26 607        |
  |-----------------+-------------+---------------+----------------+---------------|
  | Volatile access | +10.95 %    | +11.14 %      | +6.25 %        | +10.81 %      |
  | Atomic builtins | +423.18 %   | +425.94 %     | +6.87 %        | +188.37 %     |

Any thoughts on that?

Thanks,
Olivier

 [0] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
 [1] https://godbolt.org/z/xxqGPjaxK
 [2] https://godbolt.org/z/cPzxq7PKb
 [3] https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
 [4] https://godbolt.org/z/979jnccc9

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 20241018T121818--benchmarks-urcu__urcu.org --]
[-- Type: text/x-org, Size: 5843 bytes --]

#+title:      Benchmarks URCU 
#+date:       [2024-10-18 Fri 12:18]
#+filetags:   :urcu:
#+identifier: 20241018T121818

* Program

  #+begin_src c
    #define _LGPL_SOURCE
    #define URCU_DEREFERENCE_USE_VOLATILE

    #include <assert.h>
    #include <stdlib.h>

    #include <urcu/pointer.h>

    struct foo {
    	long x;
    };

    volatile long flag;

    static void do_stuff(long x)
    {
    	flag = x;
    }

    struct foo *foo;

    __attribute__((noinline))
    static void func()
    {
    	struct foo *a, *b;

    	a = rcu_dereference(foo);

    	do_stuff(a->x);

    	b = rcu_dereference(foo);

    	if (cmm_ptr_eq(a, b)) {
    		do_stuff(b->x);
    	}
    }

    int main(int argc, char *argv[])
    {
    	struct foo fuz;

    	foo = &fuz;

    	cmm_barrier();

    	assert(2 == argc);

    	long loop = atoll(argv[1]);

    	for (long k = 0; k < loop; ++k) {
    		func();
    	}

    	return 0;
    }
  #+end_src

* v0.14.1

  #+begin_src asm
    0000000000000860 <func>:
     860:   90000100        adrp    x0, 20000 <__libc_start_main@GLIBC_2.34>
     864:   91012002        add     x2, x0, #0x48
     868:   f9402401        ldr     x1, [x0, #72]
     86c:   f9400023        ldr     x3, [x1]
     870:   f9000443        str     x3, [x2, #8]
     874:   f9402400        ldr     x0, [x0, #72]
     878:   eb00003f        cmp     x1, x0
     87c:   54000040        b.eq    884 <func+0x24>  // b.none
     880:   d65f03c0        ret
     884:   f9400020        ldr     x0, [x1]
     888:   f9000440        str     x0, [x2, #8]
     88c:   d65f03c0        ret
  #+end_src

   Performance counter stats for './a.out 1000000000':

          4,214.47 msec task-clock                       #    0.999 CPUs utilized          
                14      context-switches                 #    3.322 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
                44      page-faults                      #   10.440 /sec                   
     8,015,627,017      cycles                           #    1.902 GHz                    
    15,008,330,513      instructions                     #    1.87  insn per cycle         
   <not supported>      branches                                                    
            26,607      branch-misses                                                      

       4.217609351 seconds time elapsed

       4.213169000 seconds user
       0.004001000 seconds sys

* Proposal

** Volatile access

   #+begin_src asm
     0000000000000860 <func>:
      860:   90000101        adrp    x1, 20000 <__libc_start_main@GLIBC_2.34>
      864:   91012022        add     x2, x1, #0x48
      868:   f9402420        ldr     x0, [x1, #72]
      86c:   f9400003        ldr     x3, [x0]
      870:   f9000443        str     x3, [x2, #8]
      874:   f9402423        ldr     x3, [x1, #72]
      878:   aa0303e1        mov     x1, x3
      87c:   eb01001f        cmp     x0, x1
      880:   54000040        b.eq    888 <func+0x28>  // b.none
      884:   d65f03c0        ret
      888:   f9400060        ldr     x0, [x3]
      88c:   f9000440        str     x0, [x2, #8]
      890:   d65f03c0        ret
   #+end_src

    Performance counter stats for './a.out 1000000000':

          4,733.47 msec task-clock                       #    0.999 CPUs utilized
                18      context-switches                 #    3.803 /sec
                 0      cpu-migrations                   #    0.000 /sec
                42      page-faults                      #    8.873 /sec
     9,020,349,576      cycles                           #    1.906 GHz
    16,009,538,541      instructions                     #    1.77  insn per cycle
   <not supported>      branches
            29,834      branch-misses

       4.736255910 seconds time elapsed

       4.731968000 seconds user
       0.003999000 seconds sys

** Atomic builtins   

  #+begin_src asm
    0000000000000860 <func>:
     860:   90000101        adrp    x1, 20000 <__libc_start_main@GLIBC_2.34>
     864:   91012021        add     x1, x1, #0x48
     868:   c8dffc20        ldar    x0, [x1]
     86c:   f9400002        ldr     x2, [x0]
     870:   f9000422        str     x2, [x1, #8]
     874:   c8dffc23        ldar    x3, [x1]
     878:   aa0303e2        mov     x2, x3
     87c:   eb00005f        cmp     x2, x0
     880:   54000040        b.eq    888 <func+0x28>  // b.none
     884:   d65f03c0        ret
     888:   f9400060        ldr     x0, [x3]
     88c:   f9000420        str     x0, [x1, #8]
     890:   d65f03c0        ret
  #+end_src

   Performance counter stats for './a.out 1000000000':

         22,062.47 msec task-clock                       #    1.000 CPUs utilized
                17      context-switches                 #    0.771 /sec
                 0      cpu-migrations                   #    0.000 /sec
                43      page-faults                      #    1.949 /sec
    42,157,489,783      cycles                           #    1.911 GHz
    16,039,077,935      instructions                     #    0.38  insn per cycle
   <not supported>      branches
            76,727      branch-misses

      22.065725405 seconds time elapsed

      22.061101000 seconds user
       0.004000000 seconds sys

* Summary

  Aarch64 Cortex-A57

  | Variant         | Time [s]    | Cycles        | Instructions   | Branch misses |
  |-----------------+-------------+---------------+----------------+---------------|
  | Baseline        | 4.217609351 | 8 015 627 017 | 15 008 330 513 | 26 607        |
  |-----------------+-------------+---------------+----------------+---------------|
  | Volatile access | +10.95 %    | +11.14 %      | +6.25 %        | +10.81 %      |
  | Atomic builtins | +423.18 %   | +425.94 %     | +6.87 %        | +188.37 %     |

[-- Attachment #3: Type: text/plain, Size: 57 bytes --]

-- 
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

[-- Attachment #4: Type: text/plain, Size: 156 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev