public inbox for bpf@vger.kernel.org
* Supporting New Memory Barrier Types in BPF
@ 2024-07-29 18:32 Peilin Ye
  2024-07-30  1:28 ` Alexei Starovoitov
  0 siblings, 1 reply; 20+ messages in thread
From: Peilin Ye @ 2024-07-29 18:32 UTC (permalink / raw)
  To: bpf
  Cc: Josh Don, Barret Rhoden, Neel Natu, Benjamin Segall,
	Paul E. McKenney, Alexei Starovoitov, David Vernet,
	Dave Marchevsky, Peilin Ye

Hi list!

As we are looking at running sched_ext-style BPF scheduling on architectures
with a more relaxed memory model (e.g., ARM), we would like to:

  1. have fine-grained control over memory ordering in BPF (instead of
     defaulting to a full barrier), for performance reasons
  2. pay closer attention to whether memory barriers are being used correctly in
     BPF

To that end, our main goal here is to support more types of memory barriers in
BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
model [1], Paul agreed that it makes sense to support some basic types first.
Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
related to memory ordering, which will be described in detail below.

I. We need more types of BPF memory barriers
--------------------------------------------

Currently, when it comes to BPF memory barriers, our choices are effectively
limited to:

  * compiler barrier: 'asm volatile ("" ::: "memory");'
  * full memory barriers implied by compiler built-ins like
    __sync_val_compare_and_swap()
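
For illustration, a minimal sketch (names are made up) of how these two
options are typically used in BPF C today:

  int64_t foo;

  static void example(void) {
      /* (1) Compiler barrier: stops Clang/GCC from reordering accesses
       *     across it, but emits no BPF instruction, so it does not
       *     constrain the CPU on a weakly ordered architecture.
       */
      asm volatile ("" ::: "memory");

      /* (2) Full barrier implied by a __sync_*() built-in: the
       *     compare-and-swap below also orders surrounding accesses,
       *     but always at full-barrier cost.
       */
      __sync_val_compare_and_swap(&foo, 0, 1);
  }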

We need more.  During offline discussion with Paul, we agreed we can start
from:

  * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
  * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
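
As a sketch of what this could look like in BPF C once supported (the
built-ins and memory-order constants are the standard GCC/Clang ones; whether
they lower to new BPF instructions is exactly what is being proposed here):

  int64_t payload;
  int64_t ready;

  /* Producer: plain store to payload, then store-release to ready. */
  static void produce(int64_t val) {
      payload = val;
      __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
  }

  /* Consumer: load-acquire of ready orders the later read of payload. */
  static int64_t consume(void) {
      if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
          return payload;
      return -1;
  }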

Theoretically, the BPF JIT compiler could also reorder instructions just like
Clang or GCC, though it might not currently do so.  If we ever developed a more
optimizing BPF JIT compiler, it would also be nice to have an optimization
barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.

II. Implicit barriers can get confusing
---------------------------------------

We noticed, somewhat to our surprise, that the __sync_*fetch*() built-ins do
not always imply a full barrier for BPF on ARM.  For example, when using LLVM,
the frequently-used __sync_fetch_and_add() can imply either "relaxed" (no
barrier) or "acquire and release" (full barrier) semantics, depending on
whether its return value is used:

Case (a): return value is used

  SEC("...")
  int64_t foo;

  int64_t func(...) {
      return __sync_fetch_and_add(&foo, 1);
  }

For case (a), Clang gave us:

  3:	db 01 00 00 01 00 00 00	r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)

  opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
  imm (0x00000001): BPF_ADD | BPF_FETCH

Case (b): return value is ignored

  SEC("...")
  int64_t foo;

  int64_t func(...) {
      __sync_fetch_and_add(&foo, 1);

      return foo;
  }

For case (b), Clang gave us:

  3:	db 12 00 00 00 00 00 00	lock *(u64 *)(r2 + 0x0) += r1

  opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
  imm (0x00000000): BPF_ADD

LLVM decided to drop BPF_FETCH, since the return value of
__sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
emit_lse_atomic() in the BPF JIT compiler code for ARM64 (assuming that LSE
atomic instructions are being used):

  case BPF_ADD:
          emit(A64_STADD(isdw, reg, src), ctx);
          break;
  <...>
  case BPF_ADD | BPF_FETCH:
          emit(A64_LDADDAL(isdw, src, reg, src), ctx);
          break;

STADD is an alias for LDADD.  According to [3]:

  * LDADDAL for case (a) has "acquire" plus "release" semantics
  * LDADD for case (b) "has neither acquire nor release semantics"

This is pretty unintuitive: a compiler built-in should not have inconsistent
implications for memory ordering, and it is better not to require all BPF
programmers to memorize this.

The GCC documentation seems a bit ambiguous [4] about whether the
__sync_*fetch*() built-ins always imply a full barrier.  GCC considers these
__sync_*() built-ins "legacy", and introduced a new set of __atomic_*()
built-ins ("Memory Model Aware Atomic Operations") [5] to replace them.  The
__atomic_*() built-ins are designed to be much more explicit about memory
ordering, for example:

  type __atomic_fetch_add (type *ptr, type val, int memorder)

This requires the programmer to specify a memory order (relaxed, acquire,
release...) via the "memorder" parameter.  Currently in LLVM, for BPF, these
__atomic_*fetch*() built-ins appear to be aliases of their __sync_*fetch*()
counterparts (the "memorder" parameter is effectively ignored), and are not
fully supported.
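
As a sketch of where this is heading once the __atomic_*() built-ins are fully
supported for BPF (the refcount example below is made up; the point is that
the caller states exactly how much ordering each operation needs):

  int64_t refcount;

  static void get(void) {
      /* Pure counter bump: no ordering needed, so ask for relaxed. */
      __atomic_fetch_add(&refcount, 1, __ATOMIC_RELAXED);
  }

  static int put(void) {
      /* The final drop must be ordered against earlier writes to the
       * object, so ask for acquire+release on this one operation.
       */
      return __atomic_fetch_sub(&refcount, 1, __ATOMIC_ACQ_REL) == 1;
  }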

III. Next steps
---------------

Roughly, the scope of this work includes:

  * decide how to extend the BPF ISA (add new instructions and/or extend
    current ones)
  * teach LLVM and GCC to generate the new/extended instructions
  * teach the BPF verifier to understand them
  * teach the BPF JIT compiler to compile them
  * update BPF memory model and tooling
  * update IETF specification

Additionally, for the issue described in the previous section, we need to:

  * check if GCC has the same behavior
  * at least clearly document the implied effects on BPF memory ordering of
    current __sync_*fetch*() built-ins (especially for architectures like ARM),
    as described
  * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
    __sync_*fetch*() ones

Any suggestions or corrections would be most welcome!

Thanks,
Peilin Ye


[1] Instruction-Level BPF Memory Model
https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing

[2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
    instructions").  Search for "LLVM will check the return value" in the
    commit message.

[3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
    0487K.a, ID032224), C6.2.149, page 2006

[4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
    6.58 Legacy __sync Built-in Functions for Atomic Memory Access
    "In most cases, these built-in functions are considered a full barrier."

[5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
    6.59 Built-in Functions for Memory Model Aware Atomic Operations



* Re: Supporting New Memory Barrier Types in BPF
  2024-07-29 18:32 Supporting New Memory Barrier Types in BPF Peilin Ye
@ 2024-07-30  1:28 ` Alexei Starovoitov
  2024-07-30  3:49   ` Paul E. McKenney
                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2024-07-30  1:28 UTC (permalink / raw)
  To: Peilin Ye, Yonghong Song, Jose E. Marchesi
  Cc: bpf, Josh Don, Barret Rhoden, Neel Natu, Benjamin Segall,
	Paul E. McKenney, Alexei Starovoitov, David Vernet,
	Dave Marchevsky

On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>
> Hi list!
>
> As we are looking at running sched_ext-style BPF scheduling on architectures
> with a more relaxed memory model (i.e. ARM), we would like to:
>
>   1. have fine-grained control over memory ordering in BPF (instead of
>      defaulting to a full barrier), for performance reasons
>   2. pay closer attention to if memory barriers are being used correctly in
>      BPF
>
> To that end, our main goal here is to support more types of memory barriers in
> BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
> model [1], Paul agreed that it makes sense to support some basic types first.
> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
> related to memory ordering, which will be described in details below.
>
> I. We need more types of BPF memory barriers
> --------------------------------------------
>
> Currently, when it comes to BPF memory barriers, our choices are effectively
> limited to:
>
>   * compiler barrier: 'asm volatile ("" ::: "memory");'
>   * full memory barriers implied by compiler built-ins like
>     __sync_val_compare_and_swap()
>
> We need more.  During offline discussion with Paul, we agreed we can start
> from:
>
>   * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);

We would need an inline asm equivalent too, similar to the kernel's
smp_load_acquire() macro.

> Theoretically, the BPF JIT compiler could also reorder instructions just like
> Clang or GCC, though it might not currently do so.  If we ever developed a more
> optimizing BPF JIT compiler, it would also be nice to have an optimization
> barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.

It could be a standalone insn that is only a compiler barrier, but that feels
like a waste of an instruction.  So, depending on how we end up encoding the
various real barriers, there may be a spare bit to spend in such a barrier
insn to mark it as a compiler barrier only; in this case, a barrier for an
optimizing JIT.

> II. Implicit barriers can get confusing
> ---------------------------------------
>
> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
> always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
> or "acquire and release" (full barrier) semantics, depending on if its return
> value is used:
>
> Case (a): return value is used
>
>   SEC("...")
>   int64_t foo;
>
>   int64_t func(...) {
>       return __sync_fetch_and_add(&foo, 1);
>   }
>
> For case (a), Clang gave us:
>
>   3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>
>   opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>   imm (0x00000001): BPF_ADD | BPF_FETCH
>
> Case (b): return value is ignored
>
>   SEC("...")
>   int64_t foo;
>
>   int64_t func(...) {
>       __sync_fetch_and_add(&foo, 1);
>
>       return foo;
>   }
>
> For case (b), Clang gave us:
>
>   3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>
>   opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>   imm (0x00000000): BPF_ADD
>
> LLVM decided to drop BPF_FETCH, since the return value of
> __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
> atomic instructions are being used):
>
>   case BPF_ADD:
>           emit(A64_STADD(isdw, reg, src), ctx);
>           break;
>   <...>
>   case BPF_ADD | BPF_FETCH:
>           emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>           break;
>
> STADD is an alias for LDADD.  According to [3]:
>
>   * LDADDAL for case (a) has "acquire" plus "release" semantics
>   * LDADD for case (b) "has neither acquire nor release semantics"
>
> This is pretty non-intuitive; a compiler built-in should not have inconsistent
> implications on memory ordering, and it is better not to require all BPF
> programmers to memorize this.
>
> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
> always imply a full barrier.  GCC considers these __sync_*() built-ins as
> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
> Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
> are designed to be a lot more explicit on memory ordering, for example:
>
>   type __atomic_fetch_add (type *ptr, type val, int memorder)
>
> This requires the programmer to specify a memory order type (relaxed, acquire,
> release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
> counterparts (the "memorder" parameter seems effectively ignored), and are not
> fully supported.

This sounds like a compiler bug.

Yonghong, Jose,
do you know what compilers do for other backends?
Is it allowed to convert sync_fetch_add into sync_add when the fetch part is unused?

> III. Next steps
> ---------------
>
> Roughly, the scope of this work includes:
>
>   * decide how to extend the BPF ISA (add new instructions and/or extend
>     current ones)

ldx/stx insns support MEM and MEMSX modifiers.
Adding MEM_ACQ_REL feels like a natural fit. Better name?
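
For illustration, a rough sketch against the existing mode encodings in the
uapi headers (the new name and value below are only placeholders; nothing is
decided yet):

  /* existing mode values in include/uapi/linux/bpf*.h */
  #define BPF_MEM         0x60    /* plain ldx/stx */
  #define BPF_MEMSX       0x80    /* sign-extending ldx */
  #define BPF_ATOMIC      0xc0    /* atomic read-modify-write */

  /* hypothetical new mode for acquire/release accesses */
  #define BPF_MEM_ACQ_REL 0xa0    /* placeholder value, TBD */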

For barriers we would need a new insn.  Not sure which class would fit best.
Maybe BPF_LD?

Another alternative for barriers is to use nocsr kfuncs.
Then we have the freedom to make mistakes and fix them later.
One kfunc per barrier would do.
JITs would inline them into appropriate insns.
In bpf progs they would be used just like smp_mb(), smp_rmb(), etc. are
used in kernel code.

I don't think compilers have to emit barriers from C code, so my
preference is kfuncs atm.
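
A sketch of how the kfunc route could look from a BPF program, with made-up
kfunc names (one kfunc per barrier, declared with libbpf's __ksym attribute
and inlined by the JIT):

  extern void bpf_smp_mb(void) __ksym;
  extern void bpf_smp_wmb(void) __ksym;
  extern void bpf_smp_rmb(void) __ksym;

  static void publish(long *data, long *flag) {
      *data = 1;
      bpf_smp_wmb();    /* the JIT would inline the right barrier insn */
      *flag = 1;
  }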

>   * teach LLVM and GCC to generate the new/extended instructions
>   * teach the BPF verifier to understand them
>   * teach the BPF JIT compiler to compile them
>   * update BPF memory model and tooling
>   * update IETF specification
>
> Additionally, for the issue described in the previous section, we need to:
>
>   * check if GCC has the same behavior
>   * at least clearly document the implied effects on BPF memory ordering of
>     current __sync_*fetch*() built-ins (especially for architectures like ARM),
>     as described
>   * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
>     __sync_*fetch*() ones
>
> Any suggestions or corrections would be most welcome!
>
> Thanks,
> Peilin Ye
>
>
> [1] Instruction-Level BPF Memory Model
> https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
>
> [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
>     instructions").  Search for "LLVM will check the return value" in the
>     commit message.
>
> [3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
>     0487K.a, ID032224), C6.2.149, page 2006
>
> [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>     6.58 Legacy __sync Built-in Functions for Atomic Memory Access
>     "In most cases, these built-in functions are considered a full barrier."
>
> [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>     6.59 Built-in Functions for Memory Model Aware Atomic Operations
>


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-30  1:28 ` Alexei Starovoitov
@ 2024-07-30  3:49   ` Paul E. McKenney
  2024-07-30  4:03     ` Alexei Starovoitov
  2024-07-30  5:14   ` Yonghong Song
  2024-08-06 19:22   ` Peilin Ye
  2 siblings, 1 reply; 20+ messages in thread
From: Paul E. McKenney @ 2024-07-30  3:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peilin Ye, Yonghong Song, Jose E. Marchesi, bpf, Josh Don,
	Barret Rhoden, Neel Natu, Benjamin Segall, Alexei Starovoitov,
	David Vernet, Dave Marchevsky

On Mon, Jul 29, 2024 at 06:28:16PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
> >
> > Hi list!
> >
> > As we are looking at running sched_ext-style BPF scheduling on architectures
> > with a more relaxed memory model (i.e. ARM), we would like to:
> >
> >   1. have fine-grained control over memory ordering in BPF (instead of
> >      defaulting to a full barrier), for performance reasons
> >   2. pay closer attention to if memory barriers are being used correctly in
> >      BPF
> >
> > To that end, our main goal here is to support more types of memory barriers in
> > BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
> > model [1], Paul agreed that it makes sense to support some basic types first.
> > Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
> > related to memory ordering, which will be described in details below.
> >
> > I. We need more types of BPF memory barriers
> > --------------------------------------------
> >
> > Currently, when it comes to BPF memory barriers, our choices are effectively
> > limited to:
> >
> >   * compiler barrier: 'asm volatile ("" ::: "memory");'
> >   * full memory barriers implied by compiler built-ins like
> >     __sync_val_compare_and_swap()
> >
> > We need more.  During offline discussion with Paul, we agreed we can start
> > from:
> >
> >   * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
> >   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
> 
> we would need inline asm equivalent too. Similar to kernel
> smp_load_acquire() macro.
> 
> > Theoretically, the BPF JIT compiler could also reorder instructions just like
> > Clang or GCC, though it might not currently do so.  If we ever developed a more
> > optimizing BPF JIT compiler, it would also be nice to have an optimization
> > barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
> > instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
> 
> It can be a standalone insn that is a compiler barrier only but that feels like
> a waste of an instruction. So depending how we end up encoding various
> real barriers
> there may be a bit to spend in such a barrier insn that is only a
> compiler barrier.
> In this case optimizing JIT barrier.

When reading BPF instructions back into a compiler backend, would
it make sense to convert an acquire-load instruction back to
__atomic_load_n(... memorder=__ATOMIC_ACQUIRE)?  Or the internal
representation thereof?

							Thanx, Paul

> > II. Implicit barriers can get confusing
> > ---------------------------------------
> >
> > We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
> > always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
> > frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
> > or "acquire and release" (full barrier) semantics, depending on if its return
> > value is used:
> >
> > Case (a): return value is used
> >
> >   SEC("...")
> >   int64_t foo;
> >
> >   int64_t func(...) {
> >       return __sync_fetch_and_add(&foo, 1);
> >   }
> >
> > For case (a), Clang gave us:
> >
> >   3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
> >
> >   opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
> >   imm (0x00000001): BPF_ADD | BPF_FETCH
> >
> > Case (b): return value is ignored
> >
> >   SEC("...")
> >   int64_t foo;
> >
> >   int64_t func(...) {
> >       __sync_fetch_and_add(&foo, 1);
> >
> >       return foo;
> >   }
> >
> > For case (b), Clang gave us:
> >
> >   3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
> >
> >   opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
> >   imm (0x00000000): BPF_ADD
> >
> > LLVM decided to drop BPF_FETCH, since the return value of
> > __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
> > emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
> > atomic instructions are being used):
> >
> >   case BPF_ADD:
> >           emit(A64_STADD(isdw, reg, src), ctx);
> >           break;
> >   <...>
> >   case BPF_ADD | BPF_FETCH:
> >           emit(A64_LDADDAL(isdw, src, reg, src), ctx);
> >           break;
> >
> > STADD is an alias for LDADD.  According to [3]:
> >
> >   * LDADDAL for case (a) has "acquire" plus "release" semantics
> >   * LDADD for case (b) "has neither acquire nor release semantics"
> >
> > This is pretty non-intuitive; a compiler built-in should not have inconsistent
> > implications on memory ordering, and it is better not to require all BPF
> > programmers to memorize this.
> >
> > GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
> > always imply a full barrier.  GCC considers these __sync_*() built-ins as
> > "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
> > Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
> > are designed to be a lot more explicit on memory ordering, for example:
> >
> >   type __atomic_fetch_add (type *ptr, type val, int memorder)
> >
> > This requires the programmer to specify a memory order type (relaxed, acquire,
> > release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
> > __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
> > counterparts (the "memorder" parameter seems effectively ignored), and are not
> > fully supported.
> 
> This sounds like a compiler bug.
> 
> Yonghong, Jose,
> do you know what compilers do for other backends?
> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
> 
> > III. Next steps
> > ---------------
> >
> > Roughly, the scope of this work includes:
> >
> >   * decide how to extend the BPF ISA (add new instructions and/or extend
> >     current ones)
> 
> ldx/stx insns support MEM and MEMSX modifiers.
> Adding MEM_ACQ_REL feels like a natural fit. Better name?
> 
> For barriers we would need a new insn. Not sure which class would fit the best.
> Maybe BPF_LD ?
> 
> Another alternative for barriers is to use nocsr kfuncs.
> Then we have the freedom to make mistakes and fix them later.
> One kfunc per barrier would do.
> JITs would inline them into appropriate insns.
> In bpf progs they will be used just like in the kernel code smp_mb(),
> smp_rmb(), etc.
> 
> I don't think compilers have to emit barriers from C code, so my
> preference is kfuncs atm.
> 
> >   * teach LLVM and GCC to generate the new/extended instructions
> >   * teach the BPF verifier to understand them
> >   * teach the BPF JIT compiler to compile them
> >   * update BPF memory model and tooling
> >   * update IETF specification
> >
> > Additionally, for the issue described in the previous section, we need to:
> >
> >   * check if GCC has the same behavior
> >   * at least clearly document the implied effects on BPF memory ordering of
> >     current __sync_*fetch*() built-ins (especially for architectures like ARM),
> >     as described
> >   * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
> >     __sync_*fetch*() ones
> >
> > Any suggestions or corrections would be most welcome!
> >
> > Thanks,
> > Peilin Ye
> >
> >
> > [1] Instruction-Level BPF Memory Model
> > https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
> >
> > [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
> >     instructions").  Search for "LLVM will check the return value" in the
> >     commit message.
> >
> > [3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
> >     0487K.a, ID032224), C6.2.149, page 2006
> >
> > [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> >     6.58 Legacy __sync Built-in Functions for Atomic Memory Access
> >     "In most cases, these built-in functions are considered a full barrier."
> >
> > [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> >     6.59 Built-in Functions for Memory Model Aware Atomic Operations
> >


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-30  3:49   ` Paul E. McKenney
@ 2024-07-30  4:03     ` Alexei Starovoitov
  0 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2024-07-30  4:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peilin Ye, Yonghong Song, Jose E. Marchesi, bpf, Josh Don,
	Barret Rhoden, Neel Natu, Benjamin Segall, Alexei Starovoitov,
	David Vernet, Dave Marchevsky

On Mon, Jul 29, 2024 at 8:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
> >
> > It can be a standalone insn that is a compiler barrier only but that feels like
> > a waste of an instruction. So depending how we end up encoding various
> > real barriers
> > there may be a bit to spend in such a barrier insn that is only a
> > compiler barrier.
> > In this case optimizing JIT barrier.
>
> When reading BPF instructions back into a compiler backend, would
> it make sense to convert an acquire-load instruction back to
> __atomic_load_n(... memorder=__ATOMIC_ACQUIRE)?  Or the internal
> representation thereof?

Internal representation?
We need to pick asm mnemonics for the ld_acq insn, but pushing
C-like asm that emits __atomic_load_n() in disasm is imo too much.
Currently an LDX insn is disassembled to r1 = *(u32 *)r2;
For a load-acquire insn we could emit r1 = smp_load_acquire_u32(r2);
or r1 = load_acquire((u32 *)r2);


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-30  1:28 ` Alexei Starovoitov
  2024-07-30  3:49   ` Paul E. McKenney
@ 2024-07-30  5:14   ` Yonghong Song
  2024-07-31  1:19     ` Alexei Starovoitov
  2024-08-01 12:47     ` Jose E. Marchesi
  2024-08-06 19:22   ` Peilin Ye
  2 siblings, 2 replies; 20+ messages in thread
From: Yonghong Song @ 2024-07-30  5:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Peilin Ye, Jose E. Marchesi
  Cc: bpf, Josh Don, Barret Rhoden, Neel Natu, Benjamin Segall,
	Paul E. McKenney, Alexei Starovoitov, David Vernet,
	Dave Marchevsky


On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>> Hi list!
>>
>> As we are looking at running sched_ext-style BPF scheduling on architectures
>> with a more relaxed memory model (i.e. ARM), we would like to:
>>
>>    1. have fine-grained control over memory ordering in BPF (instead of
>>       defaulting to a full barrier), for performance reasons
>>    2. pay closer attention to if memory barriers are being used correctly in
>>       BPF
>>
>> To that end, our main goal here is to support more types of memory barriers in
>> BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
>> model [1], Paul agreed that it makes sense to support some basic types first.
>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>> related to memory ordering, which will be described in details below.
>>
>> I. We need more types of BPF memory barriers
>> --------------------------------------------
>>
>> Currently, when it comes to BPF memory barriers, our choices are effectively
>> limited to:
>>
>>    * compiler barrier: 'asm volatile ("" ::: "memory");'
>>    * full memory barriers implied by compiler built-ins like
>>      __sync_val_compare_and_swap()
>>
>> We need more.  During offline discussion with Paul, we agreed we can start
>> from:
>>
>>    * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>    * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
> we would need inline asm equivalent too. Similar to kernel
> smp_load_acquire() macro.
>
>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>> Clang or GCC, though it might not currently do so.  If we ever developed a more
>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>> barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
> It can be a standalone insn that is a compiler barrier only but that feels like
> a waste of an instruction. So depending how we end up encoding various
> real barriers
> there may be a bit to spend in such a barrier insn that is only a
> compiler barrier.
> In this case optimizing JIT barrier.
>
>> II. Implicit barriers can get confusing
>> ---------------------------------------
>>
>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
>> always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
>> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
>> or "acquire and release" (full barrier) semantics, depending on if its return
>> value is used:
>>
>> Case (a): return value is used
>>
>>    SEC("...")
>>    int64_t foo;
>>
>>    int64_t func(...) {
>>        return __sync_fetch_and_add(&foo, 1);
>>    }
>>
>> For case (a), Clang gave us:
>>
>>    3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>
>>    opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>    imm (0x00000001): BPF_ADD | BPF_FETCH
>>
>> Case (b): return value is ignored
>>
>>    SEC("...")
>>    int64_t foo;
>>
>>    int64_t func(...) {
>>        __sync_fetch_and_add(&foo, 1);
>>
>>        return foo;
>>    }
>>
>> For case (b), Clang gave us:
>>
>>    3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>>
>>    opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>    imm (0x00000000): BPF_ADD
>>
>> LLVM decided to drop BPF_FETCH, since the return value of
>> __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>> atomic instructions are being used):
>>
>>    case BPF_ADD:
>>            emit(A64_STADD(isdw, reg, src), ctx);
>>            break;
>>    <...>
>>    case BPF_ADD | BPF_FETCH:
>>            emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>            break;
>>
>> STADD is an alias for LDADD.  According to [3]:
>>
>>    * LDADDAL for case (a) has "acquire" plus "release" semantics
>>    * LDADD for case (b) "has neither acquire nor release semantics"
>>
>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>> implications on memory ordering, and it is better not to require all BPF
>> programmers to memorize this.
>>
>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
>> always imply a full barrier.  GCC considers these __sync_*() built-ins as
>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
>> Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
>> are designed to be a lot more explicit on memory ordering, for example:
>>
>>    type __atomic_fetch_add (type *ptr, type val, int memorder)
>>
>> This requires the programmer to specify a memory order type (relaxed, acquire,
>> release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
>> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
>> counterparts (the "memorder" parameter seems effectively ignored), and are not
>> fully supported.
> This sounds like a compiler bug.
>
> Yonghong, Jose,
> do you know what compilers do for other backends?
> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?

This behavior was introduced by the following llvm commit:
https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744

Specifically the following commit message:

=======
Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
instructions are added for atomic operations which do not
have return values. LLVM will check the return value for
__sync_fetch_and_{add,and,or,xor}.
If the return value is used, instructions atomic_fetch_<op>
will be used. Otherwise, atomic_<op> instructions will be used.
======

Basically, if the return value is not used, __sync_fetch_and_add() will use
the xadd insn.  That decision was made at the time to maintain backward
compatibility.  For example, in bcc
   https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
we have
   #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))

Should we always use atomic_fetch_*() regardless of whether the return
value is used?  It should probably still work.  Not sure what gcc does in
this case.
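
To spell out the current behavior described above (a sketch; foo is a made-up
global), the instruction selection hinges entirely on whether the result is
consumed, which is exactly what the bcc lock_xadd() macro relies on:

  long foo;

  static void add_no_fetch(void) {
      /* Result discarded: the BPF backend emits BPF_ADD without
       * BPF_FETCH ("lock *(u64 *)(rX + 0) += rY"), which the arm64 JIT
       * turns into STADD, with no acquire/release semantics.
       */
      (void)__sync_fetch_and_add(&foo, 1);
  }

  static long add_fetch(void) {
      /* Result used: the BPF backend emits BPF_ADD | BPF_FETCH, which
       * the arm64 JIT turns into LDADDAL, i.e. acquire plus release.
       */
      return __sync_fetch_and_add(&foo, 1);
  }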

>
>> III. Next steps
>> ---------------
>>
>> Roughly, the scope of this work includes:
>>
>>    * decide how to extend the BPF ISA (add new instructions and/or extend
>>      current ones)
> ldx/stx insns support MEM and MEMSX modifiers.
> Adding MEM_ACQ_REL feels like a natural fit. Better name?
>
> For barriers we would need a new insn. Not sure which class would fit the best.
> Maybe BPF_LD ?
>
> Another alternative for barriers is to use nocsr kfuncs.
> Then we have the freedom to make mistakes and fix them later.
> One kfunc per barrier would do.
> JITs would inline them into appropriate insns.
> In bpf progs they will be used just like in the kernel code smp_mb(),
> smp_rmb(), etc.
>
> I don't think compilers have to emit barriers from C code, so my
> preference is kfuncs atm.
>
>>    * teach LLVM and GCC to generate the new/extended instructions
>>    * teach the BPF verifier to understand them
>>    * teach the BPF JIT compiler to compile them
>>    * update BPF memory model and tooling
>>    * update IETF specification
>>
>> Additionally, for the issue described in the previous section, we need to:
>>
>>    * check if GCC has the same behavior
>>    * at least clearly document the implied effects on BPF memory ordering of
>>      current __sync_*fetch*() built-ins (especially for architectures like ARM),
>>      as described
>>    * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
>>      __sync_*fetch*() ones
>>
>> Any suggestions or corrections would be most welcome!
>>
>> Thanks,
>> Peilin Ye
>>
>>
>> [1] Instruction-Level BPF Memory Model
>> https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
>>
>> [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
>>      instructions").  Search for "LLVM will check the return value" in the
>>      commit message.
>>
>> [3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
>>      0487K.a, ID032224), C6.2.149, page 2006
>>
>> [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>>      6.58 Legacy __sync Built-in Functions for Atomic Memory Access
>>      "In most cases, these built-in functions are considered a full barrier."
>>
>> [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>>      6.59 Built-in Functions for Memory Model Aware Atomic Operations
>>


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-30  5:14   ` Yonghong Song
@ 2024-07-31  1:19     ` Alexei Starovoitov
  2024-07-31  3:51       ` Yonghong Song
  2024-08-01 12:47     ` Jose E. Marchesi
  1 sibling, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2024-07-31  1:19 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Peilin Ye, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky

On Mon, Jul 29, 2024 at 10:14 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> > This sounds like a compiler bug.
> >
> > Yonghong, Jose,
> > do you know what compilers do for other backends?
> > Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>
> This behavior is introduced by the following llvm commit:
> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>
> Specifically the following commit message:
>
> =======
> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
> instructions are added for atomic operations which do not
> have return values. LLVM will check the return value for
> __sync_fetch_and_{add,and,or,xor}.
> If the return value is used, instructions atomic_fetch_<op>
> will be used. Otherwise, atomic_<op> instructions will be used.

So it's a bpf backend bug. Great. That's fixable.
Would have been much harder if this transformation was performed
by the middle end.

> ======
>
> Basically, if no return value, __sync_fetch_and_add() will use
> xadd insn. The decision is made at that time to maintain backward compatibility.
> For one example, in bcc
>    https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
> we have
>    #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>
> Should we use atomic_fetch_*() always regardless of whether the return
> val is used or not? Probably, it should still work. Not sure what gcc
> does for this case.

Right. We did it for backward compat. Older llvm was
completely wrong to generate xadd for __sync_fetch_and_add.
That was my hack from 10 years ago when xadd was all we had.
So we fixed that old llvm bug, but introduced another with all
good intentions.
Since proper atomic insns were introduced 3 years ago we should
remove this backward compat feature/bug from llvm.
The only breakage is for kernels older than 5.12.
I think that's an acceptable risk.


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-31  1:19     ` Alexei Starovoitov
@ 2024-07-31  3:51       ` Yonghong Song
  2024-07-31 20:44         ` Peilin Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Yonghong Song @ 2024-07-31  3:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peilin Ye, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky


On 7/30/24 6:19 PM, Alexei Starovoitov wrote:
> On Mon, Jul 29, 2024 at 10:14 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> This sounds like a compiler bug.
>>>
>>> Yonghong, Jose,
>>> do you know what compilers do for other backends?
>>> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>> This behavior is introduced by the following llvm commit:
>> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>
>> Specifically the following commit message:
>>
>> =======
>> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
>> instructions are added for atomic operations which do not
>> have return values. LLVM will check the return value for
>> __sync_fetch_and_{add,and,or,xor}.
>> If the return value is used, instructions atomic_fetch_<op>
>> will be used. Otherwise, atomic_<op> instructions will be used.
> So it's a bpf backend bug. Great. That's fixable.
> Would have been much harder if this transformation was performed
> by the middle end.
>
>> ======
>>
>> Basically, if no return value, __sync_fetch_and_add() will use
>> xadd insn. The decision is made at that time to maintain backward compatibility.
>> For one example, in bcc
>>     https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
>> we have
>>     #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>>
>> Should we use atomic_fetch_*() always regardless of whether the return
>> val is used or not? Probably, it should still work. Not sure what gcc
>> does for this case.
> Right. We did it for backward compat. Older llvm was
> completely wrong to generate xadd for __sync_fetch_and_add.
> That was my hack from 10 years ago when xadd was all we had.
> So we fixed that old llvm bug, but introduced another with all
> good intentions.
> Since proper atomic insns were introduced 3 years ago we should
> remove this backward compat feature/bug from llvm.
> The only breakage is for kernels older than 5.12.
> I think that's an acceptable risk.

Sounds good, I will remove the backward compat part in llvm20.



* Re: Supporting New Memory Barrier Types in BPF
  2024-07-31  3:51       ` Yonghong Song
@ 2024-07-31 20:44         ` Peilin Ye
  2024-07-31 23:17           ` Yonghong Song
  0 siblings, 1 reply; 20+ messages in thread
From: Peilin Ye @ 2024-07-31 20:44 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Jose E. Marchesi, bpf, Josh Don,
	Barret Rhoden, Neel Natu, Benjamin Segall, Paul E. McKenney,
	Alexei Starovoitov, David Vernet, Dave Marchevsky

Hi Alexei, Yonghong,

On Tue, Jul 30, 2024 at 08:51:15PM -0700, Yonghong Song wrote:
> > > > This sounds like a compiler bug.
> > > > 
> > > > Yonghong, Jose,
> > > > do you know what compilers do for other backends?
> > > > Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
> > >
> > > This behavior is introduced by the following llvm commit:
> > > https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
> > > 
> > > Specifically the following commit message:
> > > 
> > > =======
> > > Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
> > > instructions are added for atomic operations which do not
> > > have return values. LLVM will check the return value for
> > > __sync_fetch_and_{add,and,or,xor}.
> > > If the return value is used, instructions atomic_fetch_<op>
> > > will be used. Otherwise, atomic_<op> instructions will be used.
> >
> > So it's a bpf backend bug. Great. That's fixable.
> > Would have been much harder if this transformation was performed
> > by the middle end.
> > 
> > > ======
> > > 
> > > Basically, if no return value, __sync_fetch_and_add() will use
> > > xadd insn. The decision is made at that time to maintain backward compatibility.
> > > For one example, in bcc
> > >     https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
> > > we have
> > >     #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
> > > 
> > > Should we use atomic_fetch_*() always regardless of whether the return
> > > val is used or not? Probably, it should still work. Not sure what gcc
> > > does for this case.
> >
> > Right. We did it for backward compat. Older llvm was
> > completely wrong to generate xadd for __sync_fetch_and_add.
> > That was my hack from 10 years ago when xadd was all we had.
> > So we fixed that old llvm bug, but introduced another with all
> > good intentions.
> > Since proper atomic insns were introduced 3 years ago we should
> > remove this backward compat feature/bug from llvm.
> > The only breakage is for kernels older than 5.12.
> > I think that's an acceptable risk.
> 
> Sounds good, I will remove the backward compat part in llvm20.

Thanks for confirming!  Would you mind if I fix it myself?  It may
affect some of the BPF code that we will be running on ARM, so we would
like to get it fixed sooner.  Also, I would love to gain some
experience in LLVM development!

Thanks,
Peilin Ye



* Re: Supporting New Memory Barrier Types in BPF
  2024-07-31 20:44         ` Peilin Ye
@ 2024-07-31 23:17           ` Yonghong Song
  2024-08-01  0:11             ` Peilin Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Yonghong Song @ 2024-07-31 23:17 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Alexei Starovoitov, Jose E. Marchesi, bpf, Josh Don,
	Barret Rhoden, Neel Natu, Benjamin Segall, Paul E. McKenney,
	Alexei Starovoitov, David Vernet, Dave Marchevsky


On 7/31/24 1:44 PM, Peilin Ye wrote:
> Hi Alexei, Yonghong,
>
> On Tue, Jul 30, 2024 at 08:51:15PM -0700, Yonghong Song wrote:
>>>>> This sounds like a compiler bug.
>>>>>
>>>>> Yonghong, Jose,
>>>>> do you know what compilers do for other backends?
>>>>> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>>>> This behavior is introduced by the following llvm commit:
>>>> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>>>
>>>> Specifically the following commit message:
>>>>
>>>> =======
>>>> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
>>>> instructions are added for atomic operations which do not
>>>> have return values. LLVM will check the return value for
>>>> __sync_fetch_and_{add,and,or,xor}.
>>>> If the return value is used, instructions atomic_fetch_<op>
>>>> will be used. Otherwise, atomic_<op> instructions will be used.
>>> So it's a bpf backend bug. Great. That's fixable.
>>> Would have been much harder if this transformation was performed
>>> by the middle end.
>>>
>>>> ======
>>>>
>>>> Basically, if no return value, __sync_fetch_and_add() will use
>>>> xadd insn. The decision is made at that time to maintain backward compatibility.
>>>> For one example, in bcc
>>>>      https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
>>>> we have
>>>>      #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>>>>
>>>> Should we use atomic_fetch_*() always regardless of whether the return
>>>> val is used or not? Probably, it should still work. Not sure what gcc
>>>> does for this case.
>>> Right. We did it for backward compat. Older llvm was
>>> completely wrong to generate xadd for __sync_fetch_and_add.
>>> That was my hack from 10 years ago when xadd was all we had.
>>> So we fixed that old llvm bug, but introduced another with all
>>> good intentions.
>>> Since proper atomic insns were introduced 3 years ago we should
>>> remove this backward compat feature/bug from llvm.
>>> The only breakage is for kernels older than 5.12.
>>> I think that's an acceptable risk.
>> Sounds good, I will remove the backward compat part in llvm20.
> Thanks for confirming!  Would you mind if I fix it myself?  It may
> affect some of the BPF code that we will be running on ARM, so we would
> like to get it fixed sooner.  Also, I would love to gain some
> experience in LLVM development!

Peilin, when I saw your email, I had almost finished the change.
Below is the llvm patch:
   https://github.com/llvm/llvm-project/pull/101428

Please help take a look.  You are certainly welcome to do llvm-related
work.  Just mention up front that you intend to work on a particular llvm
patch; we are happy for you to contribute and will help when you have any
questions.

>
> Thanks,
> Peilin Ye
>


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-31 23:17           ` Yonghong Song
@ 2024-08-01  0:11             ` Peilin Ye
  0 siblings, 0 replies; 20+ messages in thread
From: Peilin Ye @ 2024-08-01  0:11 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Jose E. Marchesi, bpf, Josh Don,
	Barret Rhoden, Neel Natu, Benjamin Segall, Paul E. McKenney,
	Alexei Starovoitov, David Vernet, Dave Marchevsky

Hi Yonghong,

On Wed, Jul 31, 2024 at 04:17:36PM -0700, Yonghong Song wrote:
> On 7/31/24 1:44 PM, Peilin Ye wrote:
> > Thanks for confirming!  Would you mind if I fix it myself?  It may
> > affect some of the BPF code that we will be running on ARM, so we would
> > like to get it fixed sooner.  Also, I would love to gain some
> > experience in LLVM development!
> 
> Peilin, when I saw your email, I have almost done with the change.
> The below is the llvm patch:
>   https://github.com/llvm/llvm-project/pull/101428

Wow, that is really fast!  Thanks for the quick fix.

> Please help take a look. You are certainly welcome to do llvm
> related work. Just respond earlier to mention you intend to do
> a particular llvm patch and we are happy for you to contribute
> and will help when you have any questions.

Sure!  I'll look into that PR.

Thanks,
Peilin Ye



* Re: Supporting New Memory Barrier Types in BPF
  2024-07-30  5:14   ` Yonghong Song
  2024-07-31  1:19     ` Alexei Starovoitov
@ 2024-08-01 12:47     ` Jose E. Marchesi
  2024-08-01 14:20       ` Jose E. Marchesi
  1 sibling, 1 reply; 20+ messages in thread
From: Jose E. Marchesi @ 2024-08-01 12:47 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Peilin Ye, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky


> On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
>> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>>> Hi list!
>>>
>>> As we are looking at running sched_ext-style BPF scheduling on architectures
>>> with a more relaxed memory model (i.e. ARM), we would like to:
>>>
>>>    1. have fine-grained control over memory ordering in BPF (instead of
>>>       defaulting to a full barrier), for performance reasons
>>>    2. pay closer attention to if memory barriers are being used correctly in
>>>       BPF
>>>
>>> To that end, our main goal here is to support more types of memory barriers in
>>> BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
>>> model [1], Paul agreed that it makes sense to support some basic types first.
>>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>>> related to memory ordering, which will be described in details below.
>>>
>>> I. We need more types of BPF memory barriers
>>> --------------------------------------------
>>>
>>> Currently, when it comes to BPF memory barriers, our choices are effectively
>>> limited to:
>>>
>>>    * compiler barrier: 'asm volatile ("" ::: "memory");'
>>>    * full memory barriers implied by compiler built-ins like
>>>      __sync_val_compare_and_swap()
>>>
>>> We need more.  During offline discussion with Paul, we agreed we can start
>>> from:
>>>
>>>    * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>>    * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>> we would need inline asm equivalent too. Similar to kernel
>> smp_load_acquire() macro.
>>
>>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>>> Clang or GCC, though it might not currently do so.  If we ever developed a more
>>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>>> barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
>>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
>> It can be a standalone insn that is a compiler barrier only but that feels like
>> a waste of an instruction. So depending how we end up encoding various
>> real barriers
>> there may be a bit to spend in such a barrier insn that is only a
>> compiler barrier.
>> In this case optimizing JIT barrier.
>>
>>> II. Implicit barriers can get confusing
>>> ---------------------------------------
>>>
>>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
>>> always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
>>> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
>>> or "acquire and release" (full barrier) semantics, depending on if its return
>>> value is used:
>>>
>>> Case (a): return value is used
>>>
>>>    SEC("...")
>>>    int64_t foo;
>>>
>>>    int64_t func(...) {
>>>        return __sync_fetch_and_add(&foo, 1);
>>>    }
>>>
>>> For case (a), Clang gave us:
>>>
>>>    3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>>
>>>    opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>    imm (0x00000001): BPF_ADD | BPF_FETCH
>>>
>>> Case (b): return value is ignored
>>>
>>>    SEC("...")
>>>    int64_t foo;
>>>
>>>    int64_t func(...) {
>>>        __sync_fetch_and_add(&foo, 1);
>>>
>>>        return foo;
>>>    }
>>>
>>> For case (b), Clang gave us:
>>>
>>>    3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>>>
>>>    opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>    imm (0x00000000): BPF_ADD
>>>
>>> LLVM decided to drop BPF_FETCH, since the return value of
>>> __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
>>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>>> atomic instructions are being used):
>>>
>>>    case BPF_ADD:
>>>            emit(A64_STADD(isdw, reg, src), ctx);
>>>            break;
>>>    <...>
>>>    case BPF_ADD | BPF_FETCH:
>>>            emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>>            break;
>>>
>>> STADD is an alias for LDADD.  According to [3]:
>>>
>>>    * LDADDAL for case (a) has "acquire" plus "release" semantics
>>>    * LDADD for case (b) "has neither acquire nor release semantics"
>>>
>>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>>> implications on memory ordering, and it is better not to require all BPF
>>> programmers to memorize this.
>>>
>>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
>>> always imply a full barrier.  GCC considers these __sync_*() built-ins as
>>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
>>> Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
>>> are designed to be a lot more explicit on memory ordering, for example:
>>>
>>>    type __atomic_fetch_add (type *ptr, type val, int memorder)
>>>
>>> This requires the programmer to specify a memory order type (relaxed, acquire,
>>> release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
>>> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
>>> counterparts (the "memorder" parameter seems effectively ignored), and are not
>>> fully supported.
>> This sounds like a compiler bug.
>>
>> Yonghong, Jose,
>> do you know what compilers do for other backends?
>> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>
> This behavior is introduced by the following llvm commit:
> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>
> Specifically the following commit message:
>
> =======
> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
> instructions are added for atomic operations which do not
> have return values. LLVM will check the return value for
> __sync_fetch_and_{add,and,or,xor}.
> If the return value is used, instructions atomic_fetch_<op>
> will be used. Otherwise, atomic_<op> instructions will be used.
> ======
>
> Basically, if no return value, __sync_fetch_and_add() will use
> xadd insn. The decision is made at that time to maintain backward compatibility.
> For one example, in bcc
>   https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
> we have
>   #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>
> Should we use atomic_fetch_*() always regardless of whether the return
> val is used or not? Probably, it should still work. Not sure what gcc
> does for this case.

GCC behaves similarly.

For program A:

  long foo;
  
  long func () {
        return __sync_fetch_and_add(&foo, 1);
  }

bpf-unknown-none-gcc -O2 compiles to:

  0000000000000000 <func>:
     0:	18 00 00 00 00 00 00 00 	r0=0 ll
     8:	00 00 00 00 00 00 00 00 
    10:	b7 01 00 00 01 00 00 00 	r1=1
    18:	db 10 00 00 01 00 00 00 	r1=atomic_fetch_add((u64*)(r0+0),r1)
    20:	bf 10 00 00 00 00 00 00 	r0=r1
    28:	95 00 00 00 00 00 00 00 	exit

And for program B:

  long foo;
  
  long func () {
       __sync_fetch_and_add(&foo, 1);
        return foo;
  }

bpf-unknown-none-gcc -O2 compiles to:

  0000000000000000 <func>:
     0:	18 00 00 00 00 00 00 00 	r0=0 ll
     8:	00 00 00 00 00 00 00 00 
    10:	b7 01 00 00 01 00 00 00 	r1=1
    18:	db 10 00 00 00 00 00 00 	lock *(u64*)(r0+0)+=r1
    20:	79 00 00 00 00 00 00 00 	r0=*(u64*)(r0+0)
    28:	95 00 00 00 00 00 00 00 	exit

Internally:

- When compiling program A, GCC decides to emit an
  `atomic_fetch_addDI' insn, documented as:

  'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
  'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
  'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'

     These patterns emit code for an atomic operation on memory with
     memory model semantics, and return the original value.  Operand 0
     is an output operand which contains the value of the memory
     location before the operation was performed.  Operand 1 is the
     memory on which the atomic operation is performed.  Operand 2 is
     the second operand to the binary operator.  Operand 3 is the memory
     model to be used by the operation.

  The BPF backend defines atomic_fetch_add for DI modes (long) to expand
  to this BPF instruction:

      %w0 = atomic_fetch_add((<smop> *)%1, %w0)

- When compiling program B, GCC decides to emit an `atomic_addDI'
  insn, documented as:

  'atomic_addMODE', 'atomic_subMODE'
  'atomic_orMODE', 'atomic_andMODE'
  'atomic_xorMODE', 'atomic_nandMODE'

     These patterns emit code for an atomic operation on memory with
     memory model semantics.  Operand 0 is the memory on which the
     atomic operation is performed.  Operand 1 is the second operand to
     the binary operator.  Operand 2 is the memory model to be used by
     the operation.

  The BPF backend defines atomic_add for DI modes (long) to expand
  to this BPF instruction:

      lock *(<smop> *)%w0 += %w1

This is done for all targets. In x86-64, for example, case A compiles
to:

  0000000000000000 <func>:
     0:	b8 01 00 00 00       	mov    $0x1,%eax
     5:	f0 48 0f c1 05 00 00 	lock xadd %rax,0x0(%rip)        # e <func+0xe>
     c:	00 00 
     e:	c3                   	retq   

And case B compiles to:

  0000000000000000 <func>:
     0:	f0 48 83 05 00 00 00 	lock addq $0x1,0x0(%rip)        # 9 <func+0x9>
     7:	00 01 
     9:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 10 <func+0x10>
    10:	c3                   	retq   

Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
to atomic_add in this case?

>
>>
>>> III. Next steps
>>> ---------------
>>>
>>> Roughly, the scope of this work includes:
>>>
>>>    * decide how to extend the BPF ISA (add new instructions and/or extend
>>>      current ones)
>> ldx/stx insns support MEM and MEMSX modifiers.
>> Adding MEM_ACQ_REL feels like a natural fit. Better name?
>>
>> For barriers we would need a new insn. Not sure which class would fit the best.
>> Maybe BPF_LD ?
>>
>> Another alternative for barriers is to use nocsr kfuncs.
>> Then we have the freedom to make mistakes and fix them later.
>> One kfunc per barrier would do.
>> JITs would inline them into appropriate insns.
>> In bpf progs they will be used just like in the kernel code smp_mb(),
>> smp_rmb(), etc.
>>
>> I don't think compilers have to emit barriers from C code, so my
>> preference is kfuncs atm.
>>
>>>    * teach LLVM and GCC to generate the new/extended instructions
>>>    * teach the BPF verifier to understand them
>>>    * teach the BPF JIT compiler to compile them
>>>    * update BPF memory model and tooling
>>>    * update IETF specification
>>>
>>> Additionally, for the issue described in the previous section, we need to:
>>>
>>>    * check if GCC has the same behavior
>>>    * at least clearly document the implied effects on BPF memory ordering of
>>>      current __sync_*fetch*() built-ins (especially for architectures like ARM),
>>>      as described
>>>    * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
>>>      __sync_*fetch*() ones
>>>
>>> Any suggestions or corrections would be most welcome!
>>>
>>> Thanks,
>>> Peilin Ye
>>>
>>>
>>> [1] Instruction-Level BPF Memory Model
>>> https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
>>>
>>> [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
>>>      instructions").  Search for "LLVM will check the return value" in the
>>>      commit message.
>>>
>>> [3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
>>>      0487K.a, ID032224), C6.2.149, page 2006
>>>
>>> [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>>>      6.58 Legacy __sync Built-in Functions for Atomic Memory Access
>>>      "In most cases, these built-in functions are considered a full barrier."
>>>
>>> [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>>>      6.59 Built-in Functions for Memory Model Aware Atomic Operations
>>>


* Re: Supporting New Memory Barrier Types in BPF
  2024-08-01 12:47     ` Jose E. Marchesi
@ 2024-08-01 14:20       ` Jose E. Marchesi
  2024-08-01 16:44         ` Yonghong Song
  2024-08-01 22:00         ` Peilin Ye
  0 siblings, 2 replies; 20+ messages in thread
From: Jose E. Marchesi @ 2024-08-01 14:20 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Peilin Ye, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky


>> On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
>>> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>>>> Hi list!
>>>>
>>>> As we are looking at running sched_ext-style BPF scheduling on architectures
>>>> with a more relaxed memory model (i.e. ARM), we would like to:
>>>>
>>>>    1. have fine-grained control over memory ordering in BPF (instead of
>>>>       defaulting to a full barrier), for performance reasons
>>>>    2. pay closer attention to if memory barriers are being used correctly in
>>>>       BPF
>>>>
>>>> To that end, our main goal here is to support more types of memory barriers in
>>>> BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
>>>> model [1], Paul agreed that it makes sense to support some basic types first.
>>>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>>>> related to memory ordering, which will be described in details below.
>>>>
>>>> I. We need more types of BPF memory barriers
>>>> --------------------------------------------
>>>>
>>>> Currently, when it comes to BPF memory barriers, our choices are effectively
>>>> limited to:
>>>>
>>>>    * compiler barrier: 'asm volatile ("" ::: "memory");'
>>>>    * full memory barriers implied by compiler built-ins like
>>>>      __sync_val_compare_and_swap()
>>>>
>>>> We need more.  During offline discussion with Paul, we agreed we can start
>>>> from:
>>>>
>>>>    * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>>>    * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>>> we would need inline asm equivalent too. Similar to kernel
>>> smp_load_acquire() macro.
>>>
>>>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>>>> Clang or GCC, though it might not currently do so.  If we ever developed a more
>>>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>>>> barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
>>>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
>>> It can be a standalone insn that is a compiler barrier only but that feels like
>>> a waste of an instruction. So depending how we end up encoding various
>>> real barriers
>>> there may be a bit to spend in such a barrier insn that is only a
>>> compiler barrier.
>>> In this case optimizing JIT barrier.
>>>
>>>> II. Implicit barriers can get confusing
>>>> ---------------------------------------
>>>>
>>>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
>>>> always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
>>>> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
>>>> or "acquire and release" (full barrier) semantics, depending on if its return
>>>> value is used:
>>>>
>>>> Case (a): return value is used
>>>>
>>>>    SEC("...")
>>>>    int64_t foo;
>>>>
>>>>    int64_t func(...) {
>>>>        return __sync_fetch_and_add(&foo, 1);
>>>>    }
>>>>
>>>> For case (a), Clang gave us:
>>>>
>>>>    3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>>>
>>>>    opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>>    imm (0x00000001): BPF_ADD | BPF_FETCH
>>>>
>>>> Case (b): return value is ignored
>>>>
>>>>    SEC("...")
>>>>    int64_t foo;
>>>>
>>>>    int64_t func(...) {
>>>>        __sync_fetch_and_add(&foo, 1);
>>>>
>>>>        return foo;
>>>>    }
>>>>
>>>> For case (b), Clang gave us:
>>>>
>>>>    3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>>>>
>>>>    opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>>    imm (0x00000000): BPF_ADD
>>>>
>>>> LLVM decided to drop BPF_FETCH, since the return value of
>>>> __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
>>>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>>>> atomic instructions are being used):
>>>>
>>>>    case BPF_ADD:
>>>>            emit(A64_STADD(isdw, reg, src), ctx);
>>>>            break;
>>>>    <...>
>>>>    case BPF_ADD | BPF_FETCH:
>>>>            emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>>>            break;
>>>>
>>>> STADD is an alias for LDADD.  According to [3]:
>>>>
>>>>    * LDADDAL for case (a) has "acquire" plus "release" semantics
>>>>    * LDADD for case (b) "has neither acquire nor release semantics"
>>>>
>>>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>>>> implications on memory ordering, and it is better not to require all BPF
>>>> programmers to memorize this.
>>>>
>>>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
>>>> always imply a full barrier.  GCC considers these __sync_*() built-ins as
>>>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
>>>> Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
>>>> are designed to be a lot more explicit on memory ordering, for example:
>>>>
>>>>    type __atomic_fetch_add (type *ptr, type val, int memorder)
>>>>
>>>> This requires the programmer to specify a memory order type (relaxed, acquire,
>>>> release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
>>>> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
>>>> counterparts (the "memorder" parameter seems effectively ignored), and are not
>>>> fully supported.
>>> This sounds like a compiler bug.
>>>
>>> Yonghong, Jose,
>>> do you know what compilers do for other backends?
>>> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>>
>> This behavior is introduced by the following llvm commit:
>> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>
>> Specifically the following commit message:
>>
>> =======
>> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
>> instructions are added for atomic operations which do not
>> have return values. LLVM will check the return value for
>> __sync_fetch_and_{add,and,or,xor}.
>> If the return value is used, instructions atomic_fetch_<op>
>> will be used. Otherwise, atomic_<op> instructions will be used.
>> ======
>>
>> Basically, if no return value, __sync_fetch_and_add() will use
>> xadd insn. The decision is made at that time to maintain backward compatibility.
>> For one example, in bcc
>>   https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
>> we have
>>   #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>>
>> Should we use atomic_fetch_*() always regardless of whether the return
>> val is used or not? Probably, it should still work. Not sure what gcc
>> does for this case.
>
> GCC behaves similarly.
>
> For program A:
>
>   long foo;
>   
>   long func () {
>         return __sync_fetch_and_add(&foo, 1);
>   }
>
> bpf-unknown-none-gcc -O2 compiles to:
>
>   0000000000000000 <func>:
>      0:	18 00 00 00 00 00 00 00 	r0=0 ll
>      8:	00 00 00 00 00 00 00 00 
>     10:	b7 01 00 00 01 00 00 00 	r1=1
>     18:	db 10 00 00 01 00 00 00 	r1=atomic_fetch_add((u64*)(r0+0),r1)
>     20:	bf 10 00 00 00 00 00 00 	r0=r1
>     28:	95 00 00 00 00 00 00 00 	exit
>
> And for program B:
>
>   long foo;
>   
>   long func () {
>        __sync_fetch_and_add(&foo, 1);
>         return foo;
>   }
>
> bpf-unknown-none-gcc -O2 compiles to:
>
>   0000000000000000 <func>:
>      0:	18 00 00 00 00 00 00 00 	r0=0 ll
>      8:	00 00 00 00 00 00 00 00 
>     10:	b7 01 00 00 01 00 00 00 	r1=1
>     18:	db 10 00 00 00 00 00 00 	lock *(u64*)(r0+0)+=r1
>     20:	79 00 00 00 00 00 00 00 	r0=*(u64*)(r0+0)
>     28:	95 00 00 00 00 00 00 00 	exit
>
> Internally:
>
> - When compiling the program A GCC decides to emit an
>   `atomic_fetch_addDI' insn, documented as:
>
>   'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
>   'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
>   'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
>
>      These patterns emit code for an atomic operation on memory with
>      memory model semantics, and return the original value.  Operand 0
>      is an output operand which contains the value of the memory
>      location before the operation was performed.  Operand 1 is the
>      memory on which the atomic operation is performed.  Operand 2 is
>      the second operand to the binary operator.  Operand 3 is the memory
>      model to be used by the operation.
>
>   The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>   to this BPF instruction:
>
>       %w0 = atomic_fetch_add((<smop> *)%1, %w0)
>
> - When compiling the program B GCC decides to emit an `atomic_addDI'
>   insn, documented as:
>
>   'atomic_addMODE', 'atomic_subMODE'
>   'atomic_orMODE', 'atomic_andMODE'
>   'atomic_xorMODE', 'atomic_nandMODE'
>
>      These patterns emit code for an atomic operation on memory with
>      memory model semantics.  Operand 0 is the memory on which the
>      atomic operation is performed.  Operand 1 is the second operand to
>      the binary operator.  Operand 2 is the memory model to be used by
>      the operation.
>
>   The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>   to this BPF instruction:
>
>       lock *(<smop> *)%w0 += %w1
>
> This is done for all targets. In x86-64, for example, case A compiles
> to:
>
>   0000000000000000 <func>:
>      0:	b8 01 00 00 00       	mov    $0x1,%eax
>      5:	f0 48 0f c1 05 00 00 	lock xadd %rax,0x0(%rip)        # e <func+0xe>
>      c:	00 00 
>      e:	c3                   	retq   
>
> And case B compiles to:
>
>   0000000000000000 <func>:
>      0:	f0 48 83 05 00 00 00 	lock addq $0x1,0x0(%rip)        # 9 <func+0x9>
>      7:	00 01 
>      9:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 10 <func+0x10>
>     10:	c3                   	retq   
>
> Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
> to atomic_add in this case?

Ok I see.  The generic compiler optimization is ok.  It is the backend
that is buggy because it emits BPF instruction sequences with different
memory ordering semantics for atomic_OP and atomic_fetch_OP.

The only difference between fetching and non-fetching builtins is that
in one case the original value is returned, in the other the new value.
Other than that they should be equivalent.

For ARM64, GCC generates for case A:

  0000000000000000 <func>:
     0:	90000001 	adrp	x1, 0 <func>
     4:	d2800020 	mov	x0, #0x1                   	// #1
     8:	91000021 	add	x1, x1, #0x0
     c:	f8e00020 	ldaddal	x0, x0, [x1]
    10:	d65f03c0 	ret

And this for case B:

  0000000000000000 <func>:
     0:	90000000 	adrp	x0, 0 <func>
     4:	d2800022 	mov	x2, #0x1                   	// #1
     8:	91000001 	add	x1, x0, #0x0
     c:	f8e20021 	ldaddal	x2, x1, [x1]
    10:	f9400000 	ldr	x0, [x0]
    14:	d65f03c0 	ret

i.e. GCC emits LDADDAL for both the atomic_add and atomic_fetch_add
internal insns.  As on x86-64, both sequences have the same memory
ordering semantics.

Alright, we are changing GCC to always emit the fetch versions of the
sequences for all the supported atomic operations: add, and, or, xor.
After the change, the compiler itself will never generate the `lock'
versions of the instructions; they will only appear when written
explicitly in inline asm.
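
For example, once this change is in, a program that still wants the
non-fetching form has to ask for it explicitly, roughly like this (a
sketch only: it assumes the pseudo-C asm dialect used elsewhere in this
thread, and the operand constraints are illustrative):

  /* Inline asm is passed through untouched, so this still produces the
     non-fetching "lock" BPF instruction: */
  static inline void lock_add64(long *p, long v)
  {
          asm volatile ("lock *(u64 *)(%0 + 0) += %1"
                        : : "r"(p), "r"(v)
                        : "memory");
  }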

Will send a headsup when done.
Salud!

>>>
>>>> III. Next steps
>>>> ---------------
>>>>
>>>> Roughly, the scope of this work includes:
>>>>
>>>>    * decide how to extend the BPF ISA (add new instructions and/or extend
>>>>      current ones)
>>> ldx/stx insns support MEM and MEMSX modifiers.
>>> Adding MEM_ACQ_REL feels like a natural fit. Better name?
>>>
>>> For barriers we would need a new insn. Not sure which class would fit the best.
>>> Maybe BPF_LD ?
>>>
>>> Another alternative for barriers is to use nocsr kfuncs.
>>> Then we have the freedom to make mistakes and fix them later.
>>> One kfunc per barrier would do.
>>> JITs would inline them into appropriate insns.
>>> In bpf progs they will be used just like in the kernel code smp_mb(),
>>> smp_rmb(), etc.
>>>
>>> I don't think compilers have to emit barriers from C code, so my
>>> preference is kfuncs atm.
>>>
>>>>    * teach LLVM and GCC to generate the new/extended instructions
>>>>    * teach the BPF verifier to understand them
>>>>    * teach the BPF JIT compiler to compile them
>>>>    * update BPF memory model and tooling
>>>>    * update IETF specification
>>>>
>>>> Additionally, for the issue described in the previous section, we need to:
>>>>
>>>>    * check if GCC has the same behavior
>>>>    * at least clearly document the implied effects on BPF memory ordering of
>>>>      current __sync_*fetch*() built-ins (especially for architectures like ARM),
>>>>      as described
>>>>    * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
>>>>      __sync_*fetch*() ones
>>>>
>>>> Any suggestions or corrections would be most welcome!
>>>>
>>>> Thanks,
>>>> Peilin Ye
>>>>
>>>>
>>>> [1] Instruction-Level BPF Memory Model
>>>> https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
>>>>
>>>> [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
>>>>      instructions").  Search for "LLVM will check the return value" in the
>>>>      commit message.
>>>>
>>>> [3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
>>>>      0487K.a, ID032224), C6.2.149, page 2006
>>>>
>>>> [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>>>>      6.58 Legacy __sync Built-in Functions for Atomic Memory Access
>>>>      "In most cases, these built-in functions are considered a full barrier."
>>>>
>>>> [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>>>>      6.59 Built-in Functions for Memory Model Aware Atomic Operations
>>>>


* Re: Supporting New Memory Barrier Types in BPF
  2024-08-01 14:20       ` Jose E. Marchesi
@ 2024-08-01 16:44         ` Yonghong Song
  2024-08-05 16:13           ` Jose E. Marchesi
  2024-08-01 22:00         ` Peilin Ye
  1 sibling, 1 reply; 20+ messages in thread
From: Yonghong Song @ 2024-08-01 16:44 UTC (permalink / raw)
  To: Jose E. Marchesi
  Cc: Alexei Starovoitov, Peilin Ye, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky


On 8/1/24 7:20 AM, Jose E. Marchesi wrote:
>>> On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
>>>> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>>>>> Hi list!
>>>>>
>>>>> As we are looking at running sched_ext-style BPF scheduling on architectures
>>>>> with a more relaxed memory model (i.e. ARM), we would like to:
>>>>>
>>>>>     1. have fine-grained control over memory ordering in BPF (instead of
>>>>>        defaulting to a full barrier), for performance reasons
>>>>>     2. pay closer attention to if memory barriers are being used correctly in
>>>>>        BPF
>>>>>
>>>>> To that end, our main goal here is to support more types of memory barriers in
>>>>> BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
>>>>> model [1], Paul agreed that it makes sense to support some basic types first.
>>>>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>>>>> related to memory ordering, which will be described in details below.
>>>>>
>>>>> I. We need more types of BPF memory barriers
>>>>> --------------------------------------------
>>>>>
>>>>> Currently, when it comes to BPF memory barriers, our choices are effectively
>>>>> limited to:
>>>>>
>>>>>     * compiler barrier: 'asm volatile ("" ::: "memory");'
>>>>>     * full memory barriers implied by compiler built-ins like
>>>>>       __sync_val_compare_and_swap()
>>>>>
>>>>> We need more.  During offline discussion with Paul, we agreed we can start
>>>>> from:
>>>>>
>>>>>     * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>>>>     * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>>>> we would need inline asm equivalent too. Similar to kernel
>>>> smp_load_acquire() macro.
>>>>
>>>>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>>>>> Clang or GCC, though it might not currently do so.  If we ever developed a more
>>>>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>>>>> barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
>>>>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
>>>> It can be a standalone insn that is a compiler barrier only but that feels like
>>>> a waste of an instruction. So depending how we end up encoding various
>>>> real barriers
>>>> there may be a bit to spend in such a barrier insn that is only a
>>>> compiler barrier.
>>>> In this case optimizing JIT barrier.
>>>>
>>>>> II. Implicit barriers can get confusing
>>>>> ---------------------------------------
>>>>>
>>>>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
>>>>> always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
>>>>> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
>>>>> or "acquire and release" (full barrier) semantics, depending on if its return
>>>>> value is used:
>>>>>
>>>>> Case (a): return value is used
>>>>>
>>>>>     SEC("...")
>>>>>     int64_t foo;
>>>>>
>>>>>     int64_t func(...) {
>>>>>         return __sync_fetch_and_add(&foo, 1);
>>>>>     }
>>>>>
>>>>> For case (a), Clang gave us:
>>>>>
>>>>>     3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>>>>
>>>>>     opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>>>     imm (0x00000001): BPF_ADD | BPF_FETCH
>>>>>
>>>>> Case (b): return value is ignored
>>>>>
>>>>>     SEC("...")
>>>>>     int64_t foo;
>>>>>
>>>>>     int64_t func(...) {
>>>>>         __sync_fetch_and_add(&foo, 1);
>>>>>
>>>>>         return foo;
>>>>>     }
>>>>>
>>>>> For case (b), Clang gave us:
>>>>>
>>>>>     3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>>>>>
>>>>>     opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>>>     imm (0x00000000): BPF_ADD
>>>>>
>>>>> LLVM decided to drop BPF_FETCH, since the return value of
>>>>> __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
>>>>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>>>>> atomic instructions are being used):
>>>>>
>>>>>     case BPF_ADD:
>>>>>             emit(A64_STADD(isdw, reg, src), ctx);
>>>>>             break;
>>>>>     <...>
>>>>>     case BPF_ADD | BPF_FETCH:
>>>>>             emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>>>>             break;
>>>>>
>>>>> STADD is an alias for LDADD.  According to [3]:
>>>>>
>>>>>     * LDADDAL for case (a) has "acquire" plus "release" semantics
>>>>>     * LDADD for case (b) "has neither acquire nor release semantics"
>>>>>
>>>>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>>>>> implications on memory ordering, and it is better not to require all BPF
>>>>> programmers to memorize this.
>>>>>
>>>>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
>>>>> always imply a full barrier.  GCC considers these __sync_*() built-ins as
>>>>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
>>>>> Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
>>>>> are designed to be a lot more explicit on memory ordering, for example:
>>>>>
>>>>>     type __atomic_fetch_add (type *ptr, type val, int memorder)
>>>>>
>>>>> This requires the programmer to specify a memory order type (relaxed, acquire,
>>>>> release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
>>>>> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
>>>>> counterparts (the "memorder" parameter seems effectively ignored), and are not
>>>>> fully supported.
>>>> This sounds like a compiler bug.
>>>>
>>>> Yonghong, Jose,
>>>> do you know what compilers do for other backends?
>>>> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>>> This behavior is introduced by the following llvm commit:
>>> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>>
>>> Specifically the following commit message:
>>>
>>> =======
>>> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
>>> instructions are added for atomic operations which do not
>>> have return values. LLVM will check the return value for
>>> __sync_fetch_and_{add,and,or,xor}.
>>> If the return value is used, instructions atomic_fetch_<op>
>>> will be used. Otherwise, atomic_<op> instructions will be used.
>>> ======
>>>
>>> Basically, if no return value, __sync_fetch_and_add() will use
>>> xadd insn. The decision is made at that time to maintain backward compatibility.
>>> For one example, in bcc
>>>    https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
>>> we have
>>>    #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>>>
>>> Should we use atomic_fetch_*() always regardless of whether the return
>>> val is used or not? Probably, it should still work. Not sure what gcc
>>> does for this case.
>> GCC behaves similarly.
>>
>> For program A:
>>
>>    long foo;
>>    
>>    long func () {
>>          return __sync_fetch_and_add(&foo, 1);
>>    }
>>
>> bpf-unknown-none-gcc -O2 compiles to:
>>
>>    0000000000000000 <func>:
>>       0:	18 00 00 00 00 00 00 00 	r0=0 ll
>>       8:	00 00 00 00 00 00 00 00
>>      10:	b7 01 00 00 01 00 00 00 	r1=1
>>      18:	db 10 00 00 01 00 00 00 	r1=atomic_fetch_add((u64*)(r0+0),r1)
>>      20:	bf 10 00 00 00 00 00 00 	r0=r1
>>      28:	95 00 00 00 00 00 00 00 	exit
>>
>> And for program B:
>>
>>    long foo;
>>    
>>    long func () {
>>         __sync_fetch_and_add(&foo, 1);
>>          return foo;
>>    }
>>
>> bpf-unknown-none-gcc -O2 compiles to:
>>
>>    0000000000000000 <func>:
>>       0:	18 00 00 00 00 00 00 00 	r0=0 ll
>>       8:	00 00 00 00 00 00 00 00
>>      10:	b7 01 00 00 01 00 00 00 	r1=1
>>      18:	db 10 00 00 00 00 00 00 	lock *(u64*)(r0+0)+=r1
>>      20:	79 00 00 00 00 00 00 00 	r0=*(u64*)(r0+0)
>>      28:	95 00 00 00 00 00 00 00 	exit
>>
>> Internally:
>>
>> - When compiling the program A GCC decides to emit an
>>    `atomic_fetch_addDI' insn, documented as:
>>
>>    'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
>>    'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
>>    'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
>>
>>       These patterns emit code for an atomic operation on memory with
>>       memory model semantics, and return the original value.  Operand 0
>>       is an output operand which contains the value of the memory
>>       location before the operation was performed.  Operand 1 is the
>>       memory on which the atomic operation is performed.  Operand 2 is
>>       the second operand to the binary operator.  Operand 3 is the memory
>>       model to be used by the operation.
>>
>>    The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>>    to this BPF instruction:
>>
>>        %w0 = atomic_fetch_add((<smop> *)%1, %w0)
>>
>> - When compiling the program B GCC decides to emit an `atomic_addDI'
>>    insn, documented as:
>>
>>    'atomic_addMODE', 'atomic_subMODE'
>>    'atomic_orMODE', 'atomic_andMODE'
>>    'atomic_xorMODE', 'atomic_nandMODE'
>>
>>       These patterns emit code for an atomic operation on memory with
>>       memory model semantics.  Operand 0 is the memory on which the
>>       atomic operation is performed.  Operand 1 is the second operand to
>>       the binary operator.  Operand 2 is the memory model to be used by
>>       the operation.
>>
>>    The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>>    to this BPF instruction:
>>
>>        lock *(<smop> *)%w0 += %w1
>>
>> This is done for all targets. In x86-64, for example, case A compiles
>> to:
>>
>>    0000000000000000 <func>:
>>       0:	b8 01 00 00 00       	mov    $0x1,%eax
>>       5:	f0 48 0f c1 05 00 00 	lock xadd %rax,0x0(%rip)        # e <func+0xe>
>>       c:	00 00
>>       e:	c3                   	retq
>>
>> And case B compiles to:
>>
>>    0000000000000000 <func>:
>>       0:	f0 48 83 05 00 00 00 	lock addq $0x1,0x0(%rip)        # 9 <func+0x9>
>>       7:	00 01
>>       9:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 10 <func+0x10>
>>      10:	c3                   	retq
>>
>> Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
>> to atomic_add in this case?
> Ok I see.  The generic compiler optimization is ok.  It is the backend
> that is buggy because it emits BPF instruction sequences with different
> memory ordering semantics for atomic_OP and atomic_fetch_OP.
>
> The only difference between fetching and non-fetching builtins is that
> in one case the original value is returned, in the other the new value.
> Other than that they should be equivalent.
>
> For ARM64, GCC generates for case A:
>
>    0000000000000000 <func>:
>       0:	90000001 	adrp	x1, 0 <func>
>       4:	d2800020 	mov	x0, #0x1                   	// #1
>       8:	91000021 	add	x1, x1, #0x0
>       c:	f8e00020 	ldaddal	x0, x0, [x1]
>      10:	d65f03c0 	ret
>
> And this for case B:
>
>    0000000000000000 <func>:
>       0:	90000000 	adrp	x0, 0 <func>
>       4:	d2800022 	mov	x2, #0x1                   	// #1
>       8:	91000001 	add	x1, x0, #0x0
>       c:	f8e20021 	ldaddal	x2, x1, [x1]
>      10:	f9400000 	ldr	x0, [x0]
>      14:	d65f03c0 	ret
>
> i.e. GCC emits LDADDAL for both atomic_add and atomic_fetch_add internal
> insns.  Like in x86-64, both sequences have same memory ordering
> semantics.
>
> Allright we are changing GCC to always emit fetch versions of sequences
> for all the supported atomic operations: add, and, or, xor.  After the
> change the `lock' versions of the instructions will not be generated by
> the compiler at all out of inline asm.
>
> Will send a headsup when done.

Thanks! https://github.com/llvm/llvm-project/pull/101428
is the corresponding change on the LLVM side.

[...]


* Re: Supporting New Memory Barrier Types in BPF
  2024-08-01 14:20       ` Jose E. Marchesi
  2024-08-01 16:44         ` Yonghong Song
@ 2024-08-01 22:00         ` Peilin Ye
  1 sibling, 0 replies; 20+ messages in thread
From: Peilin Ye @ 2024-08-01 22:00 UTC (permalink / raw)
  To: Jose E. Marchesi
  Cc: Yonghong Song, Alexei Starovoitov, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky

Hi Jose,

On Thu, Aug 01, 2024 at 04:20:30PM +0200, Jose E. Marchesi wrote:
> > GCC behaves similarly.
> >
> > For program A:
> >
> >   long foo;
> >   
> >   long func () {
> >         return __sync_fetch_and_add(&foo, 1);
> >   }
> >
> > bpf-unknown-none-gcc -O2 compiles to:
> >
> >   0000000000000000 <func>:
> >      0:	18 00 00 00 00 00 00 00 	r0=0 ll
> >      8:	00 00 00 00 00 00 00 00 
> >     10:	b7 01 00 00 01 00 00 00 	r1=1
> >     18:	db 10 00 00 01 00 00 00 	r1=atomic_fetch_add((u64*)(r0+0),r1)
> >     20:	bf 10 00 00 00 00 00 00 	r0=r1
> >     28:	95 00 00 00 00 00 00 00 	exit
> >
> > And for program B:
> >
> >   long foo;
> >   
> >   long func () {
> >        __sync_fetch_and_add(&foo, 1);
> >         return foo;
> >   }
> >
> > bpf-unknown-none-gcc -O2 compiles to:
> >
> >   0000000000000000 <func>:
> >      0:	18 00 00 00 00 00 00 00 	r0=0 ll
> >      8:	00 00 00 00 00 00 00 00 
> >     10:	b7 01 00 00 01 00 00 00 	r1=1
> >     18:	db 10 00 00 00 00 00 00 	lock *(u64*)(r0+0)+=r1
> >     20:	79 00 00 00 00 00 00 00 	r0=*(u64*)(r0+0)
> >     28:	95 00 00 00 00 00 00 00 	exit
> >
> > Internally:
> >
> > - When compiling the program A GCC decides to emit an
> >   `atomic_fetch_addDI' insn, documented as:
> >
> >   'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
> >   'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
> >   'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
> >
> >      These patterns emit code for an atomic operation on memory with
> >      memory model semantics, and return the original value.  Operand 0
> >      is an output operand which contains the value of the memory
> >      location before the operation was performed.  Operand 1 is the
> >      memory on which the atomic operation is performed.  Operand 2 is
> >      the second operand to the binary operator.  Operand 3 is the memory
> >      model to be used by the operation.
> >
> >   The BPF backend defines atomic_fetch_add for DI modes (long) to expand
> >   to this BPF instruction:
> >
> >       %w0 = atomic_fetch_add((<smop> *)%1, %w0)
> >
> > - When compiling the program B GCC decides to emit an `atomic_addDI'
> >   insn, documented as:
> >
> >   'atomic_addMODE', 'atomic_subMODE'
> >   'atomic_orMODE', 'atomic_andMODE'
> >   'atomic_xorMODE', 'atomic_nandMODE'
> >
> >      These patterns emit code for an atomic operation on memory with
> >      memory model semantics.  Operand 0 is the memory on which the
> >      atomic operation is performed.  Operand 1 is the second operand to
> >      the binary operator.  Operand 2 is the memory model to be used by
> >      the operation.
> >
> >   The BPF backend defines atomic_fetch_add for DI modes (long) to expand
> >   to this BPF instruction:
> >
> >       lock *(<smop> *)%w0 += %w1
> >
> > This is done for all targets. In x86-64, for example, case A compiles
> > to:
> >
> >   0000000000000000 <func>:
> >      0:	b8 01 00 00 00       	mov    $0x1,%eax
> >      5:	f0 48 0f c1 05 00 00 	lock xadd %rax,0x0(%rip)        # e <func+0xe>
> >      c:	00 00 
> >      e:	c3                   	retq   
> >
> > And case B compiles to:
> >
> >   0000000000000000 <func>:
> >      0:	f0 48 83 05 00 00 00 	lock addq $0x1,0x0(%rip)        # 9 <func+0x9>
> >      7:	00 01 
> >      9:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 10 <func+0x10>
> >     10:	c3                   	retq   
> >
> > Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
> > to atomic_add in this case?
> 
> Ok I see.  The generic compiler optimization is ok.  It is the backend
> that is buggy because it emits BPF instruction sequences with different
> memory ordering semantics for atomic_OP and atomic_fetch_OP.
> 
> The only difference between fetching and non-fetching builtins is that
> in one case the original value is returned, in the other the new value.
> Other than that they should be equivalent.
> 
> For ARM64, GCC generates for case A:
> 
>   0000000000000000 <func>:
>      0:	90000001 	adrp	x1, 0 <func>
>      4:	d2800020 	mov	x0, #0x1                   	// #1
>      8:	91000021 	add	x1, x1, #0x0
>      c:	f8e00020 	ldaddal	x0, x0, [x1]
>     10:	d65f03c0 	ret
> 
> And this for case B:
> 
>   0000000000000000 <func>:
>      0:	90000000 	adrp	x0, 0 <func>
>      4:	d2800022 	mov	x2, #0x1                   	// #1
>      8:	91000001 	add	x1, x0, #0x0
>      c:	f8e20021 	ldaddal	x2, x1, [x1]
>     10:	f9400000 	ldr	x0, [x0]
>     14:	d65f03c0 	ret
> 
> i.e. GCC emits LDADDAL for both atomic_add and atomic_fetch_add internal
> insns.  Like in x86-64, both sequences have same memory ordering
> semantics.
> 
> Allright we are changing GCC to always emit fetch versions of sequences
> for all the supported atomic operations: add, and, or, xor.  After the
> change the `lock' versions of the instructions will not be generated by
> the compiler at all out of inline asm.
> 
> Will send a headsup when done.

Thanks for taking care of this!

Peilin Ye



* Re: Supporting New Memory Barrier Types in BPF
  2024-08-01 16:44         ` Yonghong Song
@ 2024-08-05 16:13           ` Jose E. Marchesi
  0 siblings, 0 replies; 20+ messages in thread
From: Jose E. Marchesi @ 2024-08-05 16:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Jose E. Marchesi, Alexei Starovoitov, Peilin Ye, bpf, Josh Don,
	Barret Rhoden, Neel Natu, Benjamin Segall, Paul E. McKenney,
	Alexei Starovoitov, David Vernet, Dave Marchevsky


> On 8/1/24 7:20 AM, Jose E. Marchesi wrote:
>>>> On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
>>>>> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>>>>>> Hi list!
>>>>>>
>>>>>> As we are looking at running sched_ext-style BPF scheduling on architectures
>>>>>> with a more relaxed memory model (i.e. ARM), we would like to:
>>>>>>
>>>>>>     1. have fine-grained control over memory ordering in BPF (instead of
>>>>>>        defaulting to a full barrier), for performance reasons
>>>>>>     2. pay closer attention to if memory barriers are being used correctly in
>>>>>>        BPF
>>>>>>
>>>>>> To that end, our main goal here is to support more types of memory barriers in
>>>>>> BPF.  While Paul E. McKenney et al. are working on the formalized BPF memory
>>>>>> model [1], Paul agreed that it makes sense to support some basic types first.
>>>>>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>>>>>> related to memory ordering, which will be described in details below.
>>>>>>
>>>>>> I. We need more types of BPF memory barriers
>>>>>> --------------------------------------------
>>>>>>
>>>>>> Currently, when it comes to BPF memory barriers, our choices are effectively
>>>>>> limited to:
>>>>>>
>>>>>>     * compiler barrier: 'asm volatile ("" ::: "memory");'
>>>>>>     * full memory barriers implied by compiler built-ins like
>>>>>>       __sync_val_compare_and_swap()
>>>>>>
>>>>>> We need more.  During offline discussion with Paul, we agreed we can start
>>>>>> from:
>>>>>>
>>>>>>     * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>>>>>     * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
>>>>> we would need inline asm equivalent too. Similar to kernel
>>>>> smp_load_acquire() macro.
>>>>>
>>>>>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>>>>>> Clang or GCC, though it might not currently do so.  If we ever developed a more
>>>>>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>>>>>> barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
>>>>>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
>>>>> It can be a standalone insn that is a compiler barrier only but that feels like
>>>>> a waste of an instruction. So depending how we end up encoding various
>>>>> real barriers
>>>>> there may be a bit to spend in such a barrier insn that is only a
>>>>> compiler barrier.
>>>>> In this case optimizing JIT barrier.
>>>>>
>>>>>> II. Implicit barriers can get confusing
>>>>>> ---------------------------------------
>>>>>>
>>>>>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
>>>>>> always imply a full barrier for BPF on ARM.  For example, when using LLVM, the
>>>>>> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no barrier),
>>>>>> or "acquire and release" (full barrier) semantics, depending on if its return
>>>>>> value is used:
>>>>>>
>>>>>> Case (a): return value is used
>>>>>>
>>>>>>     SEC("...")
>>>>>>     int64_t foo;
>>>>>>
>>>>>>     int64_t func(...) {
>>>>>>         return __sync_fetch_and_add(&foo, 1);
>>>>>>     }
>>>>>>
>>>>>> For case (a), Clang gave us:
>>>>>>
>>>>>>     3:    db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>>>>>
>>>>>>     opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>>>>     imm (0x00000001): BPF_ADD | BPF_FETCH
>>>>>>
>>>>>> Case (b): return value is ignored
>>>>>>
>>>>>>     SEC("...")
>>>>>>     int64_t foo;
>>>>>>
>>>>>>     int64_t func(...) {
>>>>>>         __sync_fetch_and_add(&foo, 1);
>>>>>>
>>>>>>         return foo;
>>>>>>     }
>>>>>>
>>>>>> For case (b), Clang gave us:
>>>>>>
>>>>>>     3:    db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>>>>>>
>>>>>>     opcode    (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>>>>>     imm (0x00000000): BPF_ADD
>>>>>>
>>>>>> LLVM decided to drop BPF_FETCH, since the return value of
>>>>>> __sync_fetch_and_add() is being ignored [2].  Now, if we take a look at
>>>>>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>>>>>> atomic instructions are being used):
>>>>>>
>>>>>>     case BPF_ADD:
>>>>>>             emit(A64_STADD(isdw, reg, src), ctx);
>>>>>>             break;
>>>>>>     <...>
>>>>>>     case BPF_ADD | BPF_FETCH:
>>>>>>             emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>>>>>             break;
>>>>>>
>>>>>> STADD is an alias for LDADD.  According to [3]:
>>>>>>
>>>>>>     * LDADDAL for case (a) has "acquire" plus "release" semantics
>>>>>>     * LDADD for case (b) "has neither acquire nor release semantics"
>>>>>>
>>>>>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>>>>>> implications on memory ordering, and it is better not to require all BPF
>>>>>> programmers to memorize this.
>>>>>>
>>>>>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
>>>>>> always imply a full barrier.  GCC considers these __sync_*() built-ins as
>>>>>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
>>>>>> Aware Atomic Operations") [5] to replace them.  These __atomic_*() built-ins
>>>>>> are designed to be a lot more explicit on memory ordering, for example:
>>>>>>
>>>>>>     type __atomic_fetch_add (type *ptr, type val, int memorder)
>>>>>>
>>>>>> This requires the programmer to specify a memory order type (relaxed, acquire,
>>>>>> release...) via the "memorder" parameter.  Currently in LLVM, for BPF, those
>>>>>> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*()
>>>>>> counterparts (the "memorder" parameter seems effectively ignored), and are not
>>>>>> fully supported.
>>>>> This sounds like a compiler bug.
>>>>>
>>>>> Yonghong, Jose,
>>>>> do you know what compilers do for other backends?
>>>>> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is unused?
>>>> This behavior is introduced by the following llvm commit:
>>>> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>>>
>>>> Specifically the following commit message:
>>>>
>>>> =======
>>>> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
>>>> instructions are added for atomic operations which do not
>>>> have return values. LLVM will check the return value for
>>>> __sync_fetch_and_{add,and,or,xor}.
>>>> If the return value is used, instructions atomic_fetch_<op>
>>>> will be used. Otherwise, atomic_<op> instructions will be used.
>>>> ======
>>>>
>>>> Basically, if no return value, __sync_fetch_and_add() will use
>>>> xadd insn. The decision is made at that time to maintain backward compatibility.
>>>> For one example, in bcc
>>>>    https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
>>>> we have
>>>>    #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>>>>
>>>> Should we use atomic_fetch_*() always regardless of whether the return
>>>> val is used or not? Probably, it should still work. Not sure what gcc
>>>> does for this case.
>>> GCC behaves similarly.
>>>
>>> For program A:
>>>
>>>    long foo;
>>>       long func () {
>>>          return __sync_fetch_and_add(&foo, 1);
>>>    }
>>>
>>> bpf-unknown-none-gcc -O2 compiles to:
>>>
>>>    0000000000000000 <func>:
>>>       0:	18 00 00 00 00 00 00 00 	r0=0 ll
>>>       8:	00 00 00 00 00 00 00 00
>>>      10:	b7 01 00 00 01 00 00 00 	r1=1
>>>      18:	db 10 00 00 01 00 00 00 	r1=atomic_fetch_add((u64*)(r0+0),r1)
>>>      20:	bf 10 00 00 00 00 00 00 	r0=r1
>>>      28:	95 00 00 00 00 00 00 00 	exit
>>>
>>> And for program B:
>>>
>>>    long foo;
>>>       long func () {
>>>         __sync_fetch_and_add(&foo, 1);
>>>          return foo;
>>>    }
>>>
>>> bpf-unknown-none-gcc -O2 compiles to:
>>>
>>>    0000000000000000 <func>:
>>>       0:	18 00 00 00 00 00 00 00 	r0=0 ll
>>>       8:	00 00 00 00 00 00 00 00
>>>      10:	b7 01 00 00 01 00 00 00 	r1=1
>>>      18:	db 10 00 00 00 00 00 00 	lock *(u64*)(r0+0)+=r1
>>>      20:	79 00 00 00 00 00 00 00 	r0=*(u64*)(r0+0)
>>>      28:	95 00 00 00 00 00 00 00 	exit
>>>
>>> Internally:
>>>
>>> - When compiling the program A GCC decides to emit an
>>>    `atomic_fetch_addDI' insn, documented as:
>>>
>>>    'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
>>>    'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
>>>    'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
>>>
>>>       These patterns emit code for an atomic operation on memory with
>>>       memory model semantics, and return the original value.  Operand 0
>>>       is an output operand which contains the value of the memory
>>>       location before the operation was performed.  Operand 1 is the
>>>       memory on which the atomic operation is performed.  Operand 2 is
>>>       the second operand to the binary operator.  Operand 3 is the memory
>>>       model to be used by the operation.
>>>
>>>    The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>>>    to this BPF instruction:
>>>
>>>        %w0 = atomic_fetch_add((<smop> *)%1, %w0)
>>>
>>> - When compiling the program B GCC decides to emit an `atomic_addDI'
>>>    insn, documented as:
>>>
>>>    'atomic_addMODE', 'atomic_subMODE'
>>>    'atomic_orMODE', 'atomic_andMODE'
>>>    'atomic_xorMODE', 'atomic_nandMODE'
>>>
>>>       These patterns emit code for an atomic operation on memory with
>>>       memory model semantics.  Operand 0 is the memory on which the
>>>       atomic operation is performed.  Operand 1 is the second operand to
>>>       the binary operator.  Operand 2 is the memory model to be used by
>>>       the operation.
>>>
>>>    The BPF backend defines atomic_fetch_add for DI modes (long) to expand
>>>    to this BPF instruction:
>>>
>>>        lock *(<smop> *)%w0 += %w1
>>>
>>> This is done for all targets. In x86-64, for example, case A compiles
>>> to:
>>>
>>>    0000000000000000 <func>:
>>>       0:	b8 01 00 00 00       	mov    $0x1,%eax
>>>       5:	f0 48 0f c1 05 00 00 	lock xadd %rax,0x0(%rip)        # e <func+0xe>
>>>       c:	00 00
>>>       e:	c3                   	retq
>>>
>>> And case B compiles to:
>>>
>>>    0000000000000000 <func>:
>>>       0:	f0 48 83 05 00 00 00 	lock addq $0x1,0x0(%rip)        # 9 <func+0x9>
>>>       7:	00 01
>>>       9:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # 10 <func+0x10>
>>>      10:	c3                   	retq
>>>
>>> Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
>>> to atomic_add in this case?
>> Ok I see.  The generic compiler optimization is ok.  It is the backend
>> that is buggy because it emits BPF instruction sequences with different
>> memory ordering semantics for atomic_OP and atomic_fetch_OP.
>>
>> The only difference between fetching and non-fetching builtins is that
>> in one case the original value is returned, in the other the new value.
>> Other than that they should be equivalent.
>>
>> For ARM64, GCC generates for case A:
>>
>>    0000000000000000 <func>:
>>       0:	90000001 	adrp	x1, 0 <func>
>>       4:	d2800020 	mov	x0, #0x1                   	// #1
>>       8:	91000021 	add	x1, x1, #0x0
>>       c:	f8e00020 	ldaddal	x0, x0, [x1]
>>      10:	d65f03c0 	ret
>>
>> And this for case B:
>>
>>    0000000000000000 <func>:
>>       0:	90000000 	adrp	x0, 0 <func>
>>       4:	d2800022 	mov	x2, #0x1                   	// #1
>>       8:	91000001 	add	x1, x0, #0x0
>>       c:	f8e20021 	ldaddal	x2, x1, [x1]
>>      10:	f9400000 	ldr	x0, [x0]
>>      14:	d65f03c0 	ret
>>
>> i.e. GCC emits LDADDAL for both atomic_add and atomic_fetch_add internal
>> insns.  Like in x86-64, both sequences have same memory ordering
>> semantics.
>>
>> Allright we are changing GCC to always emit fetch versions of sequences
>> for all the supported atomic operations: add, and, or, xor.  After the
>> change the `lock' versions of the instructions will not be generated by
>> the compiler at all out of inline asm.
>>
>> Will send a headsup when done.
>
> Thanks! https://github.com/llvm/llvm-project/pull/101428
> is the change in llvm side.

This is now fixed in GCC upstream as well.
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659454.html

Salud!


* Re: Supporting New Memory Barrier Types in BPF
  2024-07-30  1:28 ` Alexei Starovoitov
  2024-07-30  3:49   ` Paul E. McKenney
  2024-07-30  5:14   ` Yonghong Song
@ 2024-08-06 19:22   ` Peilin Ye
  2024-08-08 16:33     ` Alexei Starovoitov
  2 siblings, 1 reply; 20+ messages in thread
From: Peilin Ye @ 2024-08-06 19:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky

Hi Alexei,

Thanks for all the suggestions!  Some questions:

On Mon, Jul 29, 2024 at 06:28:16PM -0700, Alexei Starovoitov wrote:
> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
> > We need more.  During offline discussion with Paul, we agreed we can start
> > from:
> >
> >   * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
> >   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
> 
> we would need inline asm equivalent too. Similar to kernel
> smp_load_acquire() macro.

I see, so something like:

    asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))" :
                 "=r"(ret) : "r"(&foo) : "memory");

and e.g. this in disassembly:

    r0 = load_acquire((u64 *)(r1 + 0x0))

I agree that we'd better not put the entire e.g.
"r0 = __atomic_load_n((u64 *)(r1 + 0x0), __ATOMIC_ACQUIRE)" into
disassembly.
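
For instance, a kernel-style wrapper could look roughly like this (the
macro name is made up, and the sketch only handles 64-bit loads):

    /* Minimal smp_load_acquire()-style helper for BPF C code, built on
     * the proposed "load_acquire" asm form above: */
    #define bpf_smp_load_acquire64(p) ({                              \
            __u64 ___v;                                               \
            asm volatile ("%0 = load_acquire((u64 *)(%1 + 0x0))"      \
                          : "=r"(___v)                                \
                          : "r"(p)                                    \
                          : "memory");                                \
            ___v;                                                     \
    })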

> > Theoretically, the BPF JIT compiler could also reorder instructions just like
> > Clang or GCC, though it might not currently do so.  If we ever developed a more
> > optimizing BPF JIT compiler, it would also be nice to have an optimization
> > barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
> > instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
> 
> It can be a standalone insn that is a compiler barrier only but that feels like
> a waste of an instruction. So depending how we end up encoding various
> real barriers
> there may be a bit to spend in such a barrier insn that is only a
> compiler barrier.
> In this case optimizing JIT barrier.

[...]

> > Roughly, the scope of this work includes:
> >
> >   * decide how to extend the BPF ISA (add new instructions and/or extend
> >     current ones)
> 
> ldx/stx insns support MEM and MEMSX modifiers.
> Adding MEM_ACQ_REL feels like a natural fit. Better name?

Do we allow aliases?  E.g., can we have "MEMACQ" for LDX and "MEMREL"
for STX, but let them share the same numeric value?

Speaking of numeric value, out of curiosity:

    IMM    0
    ABS    1
    IND    2
    MEM    3
    MEMSX  4
    ATOMIC 6

Was there a reason that we skipped 5?  Is 5 reserved?

> For barriers we would need a new insn. Not sure which class would fit the best.
> Maybe BPF_LD ?
> 
> Another alternative for barriers is to use nocsr kfuncs.
> Then we have the freedom to make mistakes and fix them later.
> One kfunc per barrier would do.
> JITs would inline them into appropriate insns.
> In bpf progs they will be used just like in the kernel code smp_mb(),
> smp_rmb(), etc.
> 
> I don't think compilers have to emit barriers from C code, so my
> preference is kfuncs atm.

Ah, I see; we recently added support [1] for nocsr BPF helper
functions.  The cover letter says:

  """
  This patch-set seeks to allow using no_caller_saved_registers
  gcc/clang attribute with some BPF helper functions (and kfuncs in the
  future).
  """

It seems that nocsr BPF kfuncs are not supported yet.  Do we have a
schedule for it?

Thanks,
Peilin Ye

[1] [PATCH bpf-next v4 00/10] no_caller_saved_registers attribute for helper calls
    https://lore.kernel.org/bpf/20240722233844.1406874-1-eddyz87@gmail.com/



* Re: Supporting New Memory Barrier Types in BPF
  2024-08-06 19:22   ` Peilin Ye
@ 2024-08-08 16:33     ` Alexei Starovoitov
  2024-08-08 20:59       ` Peilin Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2024-08-08 16:33 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Yonghong Song, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky, Eddy Z

On Tue, Aug 6, 2024 at 12:22 PM Peilin Ye <yepeilin@google.com> wrote:
>
> Hi Alexei,
>
> Thanks for all the suggestions!  Some questions:
>
> On Mon, Jul 29, 2024 at 06:28:16PM -0700, Alexei Starovoitov wrote:
> > On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
> > > We need more.  During offline discussion with Paul, we agreed we can start
> > > from:
> > >
> > >   * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
> > >   * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
> >
> > we would need inline asm equivalent too. Similar to kernel
> > smp_load_acquire() macro.
>
> I see, so something like:
>
>     asm volatile("%0 = load_acquire((u64 *)(%1 + 0x0))" :
>                  "=r"(ret) : "r"(&foo) : "memory");
>
> and e.g. this in disassembly:
>
>     r0 = load_acquire((u64 *)(r1 + 0x0))

yes.

> I agree that we'd better not put the entire e.g.
> "r0 = __atomic_load_n((u64 *)(r1 + 0x0), __ATOMIC_ACQUIRE)" into
> disassembly.

exactly. That's too verbose.

> > > Theoretically, the BPF JIT compiler could also reorder instructions just like
> > > Clang or GCC, though it might not currently do so.  If we ever developed a more
> > > optimizing BPF JIT compiler, it would also be nice to have an optimization
> > > barrier for it.  However, Alexei Starovoitov has expressed that defining a BPF
> > > instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
> >
> > It can be a standalone insn that is a compiler barrier only but that feels like
> > a waste of an instruction. So depending how we end up encoding various
> > real barriers
> > there may be a bit to spend in such a barrier insn that is only a
> > compiler barrier.
> > In this case optimizing JIT barrier.
>
> [...]
>
> > > Roughly, the scope of this work includes:
> > >
> > >   * decide how to extend the BPF ISA (add new instructions and/or extend
> > >     current ones)
> >
> > ldx/stx insns support MEM and MEMSX modifiers.
> > Adding MEM_ACQ_REL feels like a natural fit. Better name?
>
> Do we allow aliases?  E.g., can we have "MEMACQ" for LDX and "MEMREL"
> for STX, but let them share the same numeric value?

yes. See
#define BPF_ATOMIC      0xc0    /* atomic memory ops - op type in immediate */
#define BPF_XADD        0xc0    /* exclusive add - legacy name */

but it has to be backward compatible.
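
i.e., something along these lines (a hypothetical sketch; neither the
names nor the numeric value are settled, the value below is just a
placeholder):

  #define BPF_MEMACQ  0xe0  /* load-acquire; meaningful for BPF_LDX */
  #define BPF_MEMREL  0xe0  /* store-release; meaningful for BPF_STX,
                               alias of BPF_MEMACQ */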

> Speaking of numeric value, out of curiosity:
>
>     IMM    0
>     ABS    1
>     IND    2
>     MEM    3
>     MEMSX  4
>     ATOMIC 6
>
> Was there a reason that we skipped 5?  Is 5 reserved?

See
/* unused opcode to mark special load instruction. Same as BPF_ABS */
#define BPF_PROBE_MEM   0x20

/* unused opcode to mark special ldsx instruction. Same as BPF_IND */
#define BPF_PROBE_MEMSX 0x40

/* unused opcode to mark special load instruction. Same as BPF_MSH */
#define BPF_PROBE_MEM32 0xa0

the value 5 (0xa0) is used internally by the verifier when it remaps
opcodes to tell the JIT.  It can still be used for a new mode, but then
the internal opcode would need to change too.
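
Roughly (not verbatim kernel code; the condition name is made up), the
remap looks like this:

  /* For a load that may fault (e.g. through a PTR_TO_BTF_ID pointer),
   * the verifier rewrites only the mode bits; the JIT then keys off the
   * internal BPF_PROBE_MEM mode to emit an exception-table-protected
   * load. */
  if (load_may_fault)
          insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE(insn->code);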

> > For barriers we would need a new insn. Not sure which class would fit the best.
> > Maybe BPF_LD ?
> >
> > Another alternative for barriers is to use nocsr kfuncs.
> > Then we have the freedom to make mistakes and fix them later.
> > One kfunc per barrier would do.
> > JITs would inline them into appropriate insns.
> > In bpf progs they will be used just like in the kernel code smp_mb(),
> > smp_rmb(), etc.
> >
> > I don't think compilers have to emit barriers from C code, so my
> > preference is kfuncs atm.
>
> Ah, I see; we recently supported [1] nocsr BPF helper functions.  The
> cover letter says:
>
>   """
>   This patch-set seeks to allow using no_caller_saved_registers
>   gcc/clang attribute with some BPF helper functions (and kfuncs in the
>   future).
>   """
>
> It seems that nocsr BPF kfuncs are not supported yet.  Do we have a
> schedule for it?

Support for nocsr for kfuncs is being added.
Assume it's already available :)
It's not a blocker to add barrier kfuncs.
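
To illustrate the idea (all names below are made up; nothing like this
exists today), the kernel side could be a handful of trivial kfuncs
that JITs later learn to inline:

  __bpf_kfunc void bpf_smp_mb(void)  { smp_mb();  }
  __bpf_kfunc void bpf_smp_rmb(void) { smp_rmb(); }
  __bpf_kfunc void bpf_smp_wmb(void) { smp_wmb(); }

and a BPF program would call them just like kernel code does:

  extern void bpf_smp_wmb(void) __ksym;

  static void publish(volatile long *data, volatile long *ready)
  {
          *data = 1;
          bpf_smp_wmb();  /* order the two stores */
          *ready = 1;
  }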


* Re: Supporting New Memory Barrier Types in BPF
  2024-08-08 16:33     ` Alexei Starovoitov
@ 2024-08-08 20:59       ` Peilin Ye
  2024-09-16 21:14         ` Peilin Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Peilin Ye @ 2024-08-08 20:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky, Eddy Z

On Thu, Aug 08, 2024 at 09:33:31AM -0700, Alexei Starovoitov wrote:
> > > ldx/stx insns support MEM and MEMSX modifiers.
> > > Adding MEM_ACQ_REL feels like a natural fit. Better name?
> >
> > Do we allow aliases?  E.g., can we have "MEMACQ" for LDX and "MEMREL"
> > for STX, but let them share the same numeric value?
> 
> yes. See
> #define BPF_ATOMIC      0xc0    /* atomic memory ops - op type in immediate */
> #define BPF_XADD        0xc0    /* exclusive add - legacy name */
> 
> but it has to be backward compatible.
> 
> > Speaking of numeric value, out of curiosity:
> >
> >     IMM    0
> >     ABS    1
> >     IND    2
> >     MEM    3
> >     MEMSX  4
> >     ATOMIC 6
> >
> > Was there a reason that we skipped 5?  Is 5 reserved?
> 
> See
> /* unused opcode to mark special load instruction. Same as BPF_ABS */
> #define BPF_PROBE_MEM   0x20
> 
> /* unused opcode to mark special ldsx instruction. Same as BPF_IND */
> #define BPF_PROBE_MEMSX 0x40
> 
> /* unused opcode to mark special load instruction. Same as BPF_MSH */
> #define BPF_PROBE_MEM32 0xa0
> 
> these are used by the verifier when it remaps opcodes to pass extra information
> to the JIT.  Mode value 5 can be used, but then the internal opcodes would need
> to change too.

[...]

> > It seems that nocsr BPF kfuncs are not supported yet.  Do we have a
> > schedule for it?
> 
> Support for nocsr for kfuncs is being added.
> Assume it's already available :)
> It's not a blocker to add barrier kfuncs.

Got it!  I'll start cooking (kernel and LLVM) patches for "MEMACQ" and
"MEMREL" (using 0x7) first.

Thanks,
Peilin Ye



* Re: Supporting New Memory Barrier Types in BPF
  2024-08-08 20:59       ` Peilin Ye
@ 2024-09-16 21:14         ` Peilin Ye
  2024-09-17  0:08           ` Peilin Ye
  0 siblings, 1 reply; 20+ messages in thread
From: Peilin Ye @ 2024-09-16 21:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky, Eddy Z, Peilin Ye

Hi all,

LLVM changes for load-acquire and store-release are available for review
at: https://github.com/llvm/llvm-project/pull/108636

I've tentatively put them under -mcpu=v5.  Please take a look when you
have a chance.  Thanks!
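
For reference, the source-level usage those changes are meant to enable is just
the built-ins with an explicit memory order (a usage sketch, assuming the PR
lands roughly as posted; compiled with something like
'clang --target=bpf -mcpu=v5 -O2'):

  long shared;

  long reader(void)
  {
          return __atomic_load_n(&shared, __ATOMIC_ACQUIRE);  /* load-acquire  */
  }

  void writer(long v)
  {
          __atomic_store_n(&shared, v, __ATOMIC_RELEASE);     /* store-release */
  }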

I just noticed, however:

> On Thu, Aug 08, 2024 at 09:33:31AM -0700, Alexei Starovoitov wrote:
> > > Speaking of numeric value, out of curiosity:
> > >
> > >     IMM    0
> > >     ABS    1
> > >     IND    2
> > >     MEM    3
> > >     MEMSX  4
> > >     ATOMIC 6
> > >
> > > Was there a reason that we skipped 5?  Is 5 reserved?
> > 
> > See
> > /* unused opcode to mark special load instruction. Same as BPF_ABS */
> > #define BPF_PROBE_MEM   0x20
> > 
> > /* unused opcode to mark special ldsx instruction. Same as BPF_IND */
> > #define BPF_PROBE_MEMSX 0x40
> > 
> > /* unused opcode to mark special load instruction. Same as BPF_MSH */
> > #define BPF_PROBE_MEM32 0xa0
> > 
> > these are used by the verifier when it remaps opcodes to pass extra information
> > to the JIT.  Mode value 5 can be used, but then the internal opcodes would need
> > to change too.

There's also:

  /* unused opcode to mark special atomic instruction */
  #define BPF_PROBE_ATOMIC 0xe0

0xe0 is (7 << 5), so it seems like we've already run out of bits for
mode modifiers?  Can we delete these internal opcodes and re-implement
them in other ways, to make room for MEMACQ (MEMREL)?
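
Putting the UAPI and internal values side by side (summary mine, based on the
defines quoted above), all eight encodings of the 3-bit mode field appear to be
spoken for:

    IMM     0  (0x00)
    ABS     1  (0x20)  also BPF_PROBE_MEM (internal)
    IND     2  (0x40)  also BPF_PROBE_MEMSX (internal)
    MEM     3  (0x60)
    MEMSX   4  (0x80)
    MSH     5  (0xa0)  classic BPF only; BPF_PROBE_MEM32 (internal)
    ATOMIC  6  (0xc0)  also BPF_XADD (legacy alias)
    -       7  (0xe0)  BPF_PROBE_ATOMIC (internal only)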

Thanks,
Peilin Ye



* Re: Supporting New Memory Barrier Types in BPF
  2024-09-16 21:14         ` Peilin Ye
@ 2024-09-17  0:08           ` Peilin Ye
  0 siblings, 0 replies; 20+ messages in thread
From: Peilin Ye @ 2024-09-17  0:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Jose E. Marchesi, bpf, Josh Don, Barret Rhoden,
	Neel Natu, Benjamin Segall, Paul E. McKenney, Alexei Starovoitov,
	David Vernet, Dave Marchevsky, Eddy Z

On Mon, Sep 16, 2024 at 09:14:02PM +0000, Peilin Ye wrote:
> I just noticed, however:
> 
> > On Thu, Aug 08, 2024 at 09:33:31AM -0700, Alexei Starovoitov wrote:
> > > > Speaking of numeric value, out of curiosity:
> > > >
> > > >     IMM    0
> > > >     ABS    1
> > > >     IND    2
> > > >     MEM    3
> > > >     MEMSX  4
> > > >     ATOMIC 6
> > > >
> > > > Was there a reason that we skipped 5?  Is 5 reserved?
> > > 
> > > See
> > > /* unused opcode to mark special load instruction. Same as BPF_ABS */
> > > #define BPF_PROBE_MEM   0x20
> > > 
> > > /* unused opcode to mark special ldsx instruction. Same as BPF_IND */
> > > #define BPF_PROBE_MEMSX 0x40
> > > 
> > > /* unused opcode to mark special load instruction. Same as BPF_MSH */
> > > #define BPF_PROBE_MEM32 0xa0
> > > 
> > > these are used by the verifier when it remaps opcodes to pass extra information
> > > to the JIT.  Mode value 5 can be used, but then the internal opcodes would need
> > > to change too.
> 
> There's also:
> 
>   /* unused opcode to mark special atomic instruction */
>   #define BPF_PROBE_ATOMIC 0xe0
> 
> 0xe0 is (7 << 5), so it seems like we've already run out of bits for
> mode modifiers?  Can we delete these internal opcodes and re-implement
> them in other ways, to make room for MEMACQ (MEMREL)?

Can we instead use the unused higher bits of bpf_insn::imm as a scratch
pad?                                                   ^^^

For example, instead of letting the verifier change bpf_insn::code from
BPF_ATOMIC to BPF_PROBE_ATOMIC, keep it as-is, but set the highest bit
in bpf_insn::imm to let the JIT know it is a "_PROBE_" instruction, and
there's a corresponding entry in the exception table.
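
A minimal sketch of that idea (the flag name and bit choice below are made up
for illustration, and whether the high imm bits are genuinely unused for every
atomic op would still need auditing):

  /* illustrative only: not a real kernel define */
  #define BPF_PROBE_FLAG  (1U << 31)

  /* verifier: keep insn->code as-is, just mark the insn */
  insn->imm |= BPF_PROBE_FLAG;

  /* JIT: detect the mark, then strip it before decoding the atomic op */
  bool needs_extable = insn->imm & BPF_PROBE_FLAG;
  s32  atomic_op     = insn->imm & ~BPF_PROBE_FLAG;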

Thanks,
Peilin Ye



end of thread

Thread overview: 20+ messages
2024-07-29 18:32 Supporting New Memory Barrier Types in BPF Peilin Ye
2024-07-30  1:28 ` Alexei Starovoitov
2024-07-30  3:49   ` Paul E. McKenney
2024-07-30  4:03     ` Alexei Starovoitov
2024-07-30  5:14   ` Yonghong Song
2024-07-31  1:19     ` Alexei Starovoitov
2024-07-31  3:51       ` Yonghong Song
2024-07-31 20:44         ` Peilin Ye
2024-07-31 23:17           ` Yonghong Song
2024-08-01  0:11             ` Peilin Ye
2024-08-01 12:47     ` Jose E. Marchesi
2024-08-01 14:20       ` Jose E. Marchesi
2024-08-01 16:44         ` Yonghong Song
2024-08-05 16:13           ` Jose E. Marchesi
2024-08-01 22:00         ` Peilin Ye
2024-08-06 19:22   ` Peilin Ye
2024-08-08 16:33     ` Alexei Starovoitov
2024-08-08 20:59       ` Peilin Ye
2024-09-16 21:14         ` Peilin Ye
2024-09-17  0:08           ` Peilin Ye
