From: "Jose E. Marchesi" <jemarch@gnu.org>
To: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Peilin Ye <yepeilin@google.com>, bpf <bpf@vger.kernel.org>,
Josh Don <joshdon@google.com>, Barret Rhoden <brho@google.com>,
Neel Natu <neelnatu@google.com>,
Benjamin Segall <bsegall@google.com>,
"Paul E. McKenney" <paulmck@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
David Vernet <dvernet@meta.com>,
Dave Marchevsky <davemarchevsky@meta.com>
Subject: Re: Supporting New Memory Barrier Types in BPF
Date: Thu, 01 Aug 2024 14:47:49 +0200
Message-ID: <87h6c4h0ju.fsf@gnu.org>
In-Reply-To: <24b57380-c829-4033-a7b1-06a4ed413a49@linux.dev> (Yonghong Song's message of "Mon, 29 Jul 2024 22:14:26 -0700")
> On 7/29/24 6:28 PM, Alexei Starovoitov wrote:
>> On Mon, Jul 29, 2024 at 11:33 AM Peilin Ye <yepeilin@google.com> wrote:
>>> Hi list!
>>>
>>> As we are looking at running sched_ext-style BPF scheduling on architectures
>>> with a more relaxed memory model (e.g., ARM), we would like to:
>>>
>>> 1. have fine-grained control over memory ordering in BPF (instead of
>>> defaulting to a full barrier), for performance reasons
>>> 2. pay closer attention to whether memory barriers are being used correctly in
>>> BPF
>>>
>>> To that end, our main goal here is to support more types of memory barriers in
>>> BPF. While Paul E. McKenney et al. are working on the formalized BPF memory
>>> model [1], Paul agreed that it makes sense to support some basic types first.
>>> Additionally, we noticed an issue with the __sync_*fetch*() compiler built-ins
>>> related to memory ordering, which will be described in detail below.
>>>
>>> I. We need more types of BPF memory barriers
>>> --------------------------------------------
>>>
>>> Currently, when it comes to BPF memory barriers, our choices are effectively
>>> limited to:
>>>
>>> * compiler barrier: 'asm volatile ("" ::: "memory");'
>>> * full memory barriers implied by compiler built-ins like
>>> __sync_val_compare_and_swap()
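
For illustration, a minimal sketch of the two forms above; `counter' is a
made-up variable:

  static long counter;

  static long example (void)
  {
    asm volatile ("" ::: "memory");  /* compiler barrier: emits no code, but
                                        keeps the compiler from reordering
                                        memory accesses across it */
    /* documented (in most cases) as implying a full barrier [4] */
    return __sync_val_compare_and_swap(&counter, 0, 1);
  }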
>>>
>>> We need more. During offline discussion with Paul, we agreed we can start
>>> from:
>>>
>>> * load-acquire: __atomic_load_n(... memorder=__ATOMIC_ACQUIRE);
>>> * store-release: __atomic_store_n(... memorder=__ATOMIC_RELEASE);
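
A sketch of how these would be written in C; `shared' is a made-up variable,
and note that, as discussed below, the BPF backends do not fully support
these built-ins yet:

  long shared;

  long reader (void)
  {
    /* load-acquire: later accesses cannot be reordered before it */
    return __atomic_load_n(&shared, __ATOMIC_ACQUIRE);
  }

  void writer (long v)
  {
    /* store-release: earlier accesses cannot be reordered after it */
    __atomic_store_n(&shared, v, __ATOMIC_RELEASE);
  }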
>> we would need inline asm equivalent too. Similar to kernel
>> smp_load_acquire() macro.
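
For reference, the kernel usage that a BPF inline-asm equivalent would
presumably mirror (`flag', `data' and `val' are made-up variables):

  int flag;
  long data, val;

  /* writer: publish data, then set the flag with release semantics */
  WRITE_ONCE(data, 42);
  smp_store_release(&flag, 1);

  /* reader: observe the flag with acquire semantics, then read data */
  if (smp_load_acquire(&flag))
    val = READ_ONCE(data);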
>>
>>> Theoretically, the BPF JIT compiler could also reorder instructions just like
>>> Clang or GCC, though it might not currently do so. If we ever developed a more
>>> optimizing BPF JIT compiler, it would also be nice to have an optimization
>>> barrier for it. However, Alexei Starovoitov has expressed that defining a BPF
>>> instruction with 'asm volatile ("" ::: "memory");' semantics might be tricky.
>> It can be a standalone insn that is a compiler barrier only, but that feels
>> like a waste of an instruction. So, depending on how we end up encoding the
>> various real barriers, there may be a spare bit to spend in such a barrier
>> insn for one that is only a compiler barrier: in this case, an
>> optimizing-JIT barrier.
>>
>>> II. Implicit barriers can get confusing
>>> ---------------------------------------
>>>
>>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins do not
>>> always imply a full barrier for BPF on ARM. For example, when using LLVM, the
>>> frequently-used __sync_fetch_and_add() can imply either "relaxed" (no barrier)
>>> or "acquire and release" (full barrier) semantics, depending on whether its
>>> return value is used:
>>>
>>> Case (a): return value is used
>>>
>>> SEC("...")
>>> int64_t foo;
>>>
>>> int64_t func(...) {
>>> return __sync_fetch_and_add(&foo, 1);
>>> }
>>>
>>> For case (a), Clang gave us:
>>>
>>> 3: db 01 00 00 01 00 00 00 r0 = atomic_fetch_add((u64 *)(r1 + 0x0), r0)
>>>
>>> opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>> imm (0x00000001): BPF_ADD | BPF_FETCH
>>>
>>> Case (b): return value is ignored
>>>
>>> SEC("...")
>>> int64_t foo;
>>>
>>> int64_t func(...) {
>>> __sync_fetch_and_add(&foo, 1);
>>>
>>> return foo;
>>> }
>>>
>>> For case (b), Clang gave us:
>>>
>>> 3: db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) += r1
>>>
>>> opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW
>>> imm (0x00000000): BPF_ADD
>>>
>>> LLVM decided to drop BPF_FETCH, since the return value of
>>> __sync_fetch_and_add() is being ignored [2]. Now, if we take a look at
>>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that LSE
>>> atomic instructions are being used):
>>>
>>> case BPF_ADD:
>>> emit(A64_STADD(isdw, reg, src), ctx);
>>> break;
>>> <...>
>>> case BPF_ADD | BPF_FETCH:
>>> emit(A64_LDADDAL(isdw, src, reg, src), ctx);
>>> break;
>>>
>>> STADD is an alias of LDADD with XZR, the zero register, as the destination
>>> (i.e. the loaded value is discarded). According to [3]:
>>>
>>> * LDADDAL for case (a) has "acquire" plus "release" semantics
>>> * LDADD for case (b) "has neither acquire nor release semantics"
>>>
>>> This is pretty non-intuitive; a compiler built-in should not have inconsistent
>>> implications for memory ordering, and it is better not to require all BPF
>>> programmers to memorize this.
>>>
>>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins should
>>> always imply a full barrier. GCC considers these __sync_*() built-ins as
>>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory Model
>>> Aware Atomic Operations") [5] to replace them. These __atomic_*() built-ins
>>> are designed to be a lot more explicit on memory ordering, for example:
>>>
>>> type __atomic_fetch_add (type *ptr, type val, int memorder)
>>>
>>> This requires the programmer to specify a memory order type (relaxed, acquire,
>>> release...) via the "memorder" parameter. Currently in LLVM, for BPF, those
>>> __atomic_*fetch*() built-ins seem to be aliases of their __sync_*fetch*()
>>> counterparts (the "memorder" parameter appears to be ignored), and are not
>>> fully supported.
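
For instance, once fully supported, these built-ins would let a BPF
programmer request exactly the ordering needed (a sketch; `foo' as above):

  /* relaxed: no barrier implied */
  long a = __atomic_fetch_add(&foo, 1, __ATOMIC_RELAXED);

  /* sequentially consistent: full barrier */
  long b = __atomic_fetch_add(&foo, 1, __ATOMIC_SEQ_CST);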
>> This sounds like a compiler bug.
>>
>> Yonghong, Jose,
>> do you know what compilers do for other backends?
>> Is it allowed to convert sync_fetch_add into sync_add when the fetch part is unused?
>
> This behavior was introduced by the following llvm commit:
> https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>
> Specifically, the relevant part of the commit message:
>
> =======
> Similar to xadd, atomic xadd, xor and xxor (atomic_<op>)
> instructions are added for atomic operations which do not
> have return values. LLVM will check the return value for
> __sync_fetch_and_{add,and,or,xor}.
> If the return value is used, instructions atomic_fetch_<op>
> will be used. Otherwise, atomic_<op> instructions will be used.
> =======
>
> Basically, if the return value is unused, __sync_fetch_and_add() will use
> the xadd insn. That decision was made at the time to maintain backward
> compatibility. For example, in bcc
> https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444
> we have
> #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val))
>
> Should we always use atomic_fetch_*() regardless of whether the return
> value is used? Probably; it should still work. Not sure what gcc
> does in this case.
GCC behaves similarly.

For program A:

  long foo;

  long func () {
    return __sync_fetch_and_add(&foo, 1);
  }

bpf-unknown-none-gcc -O2 compiles to:
0000000000000000 <func>:
0: 18 00 00 00 00 00 00 00 r0=0 ll
8: 00 00 00 00 00 00 00 00
10: b7 01 00 00 01 00 00 00 r1=1
18: db 10 00 00 01 00 00 00 r1=atomic_fetch_add((u64*)(r0+0),r1)
20: bf 10 00 00 00 00 00 00 r0=r1
28: 95 00 00 00 00 00 00 00 exit
And for program B:

  long foo;

  long func () {
    __sync_fetch_and_add(&foo, 1);
    return foo;
  }

bpf-unknown-none-gcc -O2 compiles to:
0000000000000000 <func>:
0: 18 00 00 00 00 00 00 00 r0=0 ll
8: 00 00 00 00 00 00 00 00
10: b7 01 00 00 01 00 00 00 r1=1
18: db 10 00 00 00 00 00 00 lock *(u64*)(r0+0)+=r1
20: 79 00 00 00 00 00 00 00 r0=*(u64*)(r0+0)
28: 95 00 00 00 00 00 00 00 exit
Internally:

- When compiling program A, GCC decides to emit an
  `atomic_fetch_addDI' insn, documented as:
'atomic_fetch_addMODE', 'atomic_fetch_subMODE'
'atomic_fetch_orMODE', 'atomic_fetch_andMODE'
'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE'
These patterns emit code for an atomic operation on memory with
memory model semantics, and return the original value. Operand 0
is an output operand which contains the value of the memory
location before the operation was performed. Operand 1 is the
memory on which the atomic operation is performed. Operand 2 is
the second operand to the binary operator. Operand 3 is the memory
model to be used by the operation.
The BPF backend defines atomic_fetch_add for DI mode (long) to expand
to this BPF instruction:
%w0 = atomic_fetch_add((<smop> *)%1, %w0)
- When compiling program B, GCC decides to emit an `atomic_addDI'
  insn, documented as:
'atomic_addMODE', 'atomic_subMODE'
'atomic_orMODE', 'atomic_andMODE'
'atomic_xorMODE', 'atomic_nandMODE'
These patterns emit code for an atomic operation on memory with
memory model semantics. Operand 0 is the memory on which the
atomic operation is performed. Operand 1 is the second operand to
the binary operator. Operand 2 is the memory model to be used by
the operation.
The BPF backend defines atomic_add for DI mode (long) to expand
to this BPF instruction:
lock *(<smop> *)%w0 += %w1
This is done for all targets. On x86-64, for example, case A compiles to:
0000000000000000 <func>:
0: b8 01 00 00 00 mov $0x1,%eax
5: f0 48 0f c1 05 00 00 lock xadd %rax,0x0(%rip) # e <func+0xe>
c: 00 00
e: c3 retq
And case B compiles to:
0000000000000000 <func>:
0: f0 48 83 05 00 00 00 lock addq $0x1,0x0(%rip) # 9 <func+0x9>
7: 00 01
9: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 10 <func+0x10>
10: c3 retq
Why wouldn't the compiler be allowed to optimize from atomic_fetch_add
to atomic_add in this case?
>
>>
>>> III. Next steps
>>> ---------------
>>>
>>> Roughly, the scope of this work includes:
>>>
>>> * decide how to extend the BPF ISA (add new instructions and/or extend
>>> current ones)
>> ldx/stx insns support MEM and MEMSX modifiers.
>> Adding MEM_ACQ_REL feels like a natural fit. Better name?
>>
>> For barriers we would need a new insn. Not sure which class would fit the best.
>> Maybe BPF_LD ?
>>
>> Another alternative for barriers is to use nocsr kfuncs.
>> Then we have the freedom to make mistakes and fix them later.
>> One kfunc per barrier would do.
>> JITs would inline them into appropriate insns.
>> In bpf progs they will be used just like in the kernel code smp_mb(),
>> smp_rmb(), etc.
>>
>> I don't think compilers have to emit barriers from C code, so my
>> preference is kfuncs atm.
>>
>>> * teach LLVM and GCC to generate the new/extended instructions
>>> * teach the BPF verifier to understand them
>>> * teach the BPF JIT compiler to compile them
>>> * update BPF memory model and tooling
>>> * update IETF specification
>>>
>>> Additionally, for the issue described in the previous section, we need to:
>>>
>>> * check whether GCC has the same behavior
>>> * at least clearly document the effects on BPF memory ordering implied by the
>>> current __sync_*fetch*() built-ins (especially on architectures like ARM),
>>> as described above
>>> * fully support the new __atomic_*fetch*() built-ins for BPF to replace the
>>> __sync_*fetch*() ones
>>>
>>> Any suggestions or corrections would be most welcome!
>>>
>>> Thanks,
>>> Peilin Ye
>>>
>>>
>>> [1] Instruction-Level BPF Memory Model
>>> https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15qcbxIsFb0/edit?usp=sharing
>>>
>>> [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support atomic
>>> instructions"). Search for "LLVM will check the return value" in the
>>> commit message.
>>>
>>> [3] Arm Architecture Reference Manual for A-profile architecture (ARM DDI
>>> 0487K.a, ID032224), C6.2.149, page 2006
>>>
>>> [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>>> 6.58 Legacy __sync Built-in Functions for Atomic Memory Access
>>> "In most cases, these built-in functions are considered a full barrier."
>>>
>>> [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>>> 6.59 Built-in Functions for Memory Model Aware Atomic Operations
>>>