From: "Jose E. Marchesi" <jose.marchesi@oracle.com>
To: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>,
Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v2] selftests/bpf: Fix arena_atomics selftest failure due to llvm change
Date: Tue, 06 Aug 2024 06:48:49 +0200 [thread overview]
Message-ID: <87cymmqmry.fsf@oracle.com> (raw)
In-Reply-To: <e42a26b6-1520-40b9-850a-28d660bd9149@linux.dev> (Yonghong Song's message of "Mon, 5 Aug 2024 15:36:17 -0700")
> On 8/5/24 10:53 AM, Alexei Starovoitov wrote:
>> On Fri, Aug 2, 2024 at 7:59 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> Peilin Ye reported an issue ([1]) where for __sync_fetch_and_add(...) without
>>> a return value, like
>>> __sync_fetch_and_add(&foo, 1);
>>> the llvm BPF backend generates a locked insn, e.g.
>>> lock *(u32 *)(r1 + 0) += r2
>>>
>>> If the return value of __sync_fetch_and_add(...) is used, like
>>> res = __sync_fetch_and_add(&foo, 1);
>>> the llvm BPF backend generates
>>> r2 = atomic_fetch_add((u32 *)(r1 + 0), r2)
>>>
>>> But 'lock *(u32 *)(r1 + 0) += r2' causes a problem in the jit,
>>> since no proper barrier is inserted as required by __sync_fetch_and_add() semantics.
>>>
>>> The above discrepancy is due to commit [2], which tries to maintain backward
>>> compatibility since, before commit [2], __sync_fetch_and_add(...) generated a
>>> lock insn in the BPF backend.
>>>
>>> Based on the discussion in [1], it is now time to fix the above discrepancy so we can
>>> have proper barrier support in the jit. llvm patch [3] made sure that __sync_fetch_and_add(...)
>>> always generates atomic_fetch_add(...) insns. Now 'lock *(u32 *)(r1 + 0) += r2' can only
>>> be generated by inline asm. The same holds for __sync_fetch_and_and(), __sync_fetch_and_or()
>>> and __sync_fetch_and_xor().
>>>
>>> But the change in [3] caused arena_atomics selftest failure.
>>>
>>> test_arena_atomics:PASS:arena atomics skeleton open 0 nsec
>>> libbpf: prog 'and': BPF program load failed: Permission denied
>>> libbpf: prog 'and': -- BEGIN PROG LOAD LOG --
>>> arg#0 reference type('UNKNOWN ') size cannot be determined: -22
>>> 0: R1=ctx() R10=fp0
>>> ; if (pid != (bpf_get_current_pid_tgid() >> 32)) @ arena_atomics.c:87
>>> 0: (18) r1 = 0xffffc90000064000 ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4)
>>> 2: (61) r6 = *(u32 *)(r1 +0) ; R1_w=map_value(map=arena_at.bss,ks=4,vs=4) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
>>> 3: (85) call bpf_get_current_pid_tgid#14 ; R0_w=scalar()
>>> 4: (77) r0 >>= 32 ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
>>> 5: (5d) if r0 != r6 goto pc+11 ; R0_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0x)
>>> ; __sync_fetch_and_and(&and64_value, 0x011ull << 32); @ arena_atomics.c:91
>>> 6: (18) r1 = 0x100000000060 ; R1_w=scalar()
>>> 8: (bf) r1 = addr_space_cast(r1, 0, 1) ; R1_w=arena
>>> 9: (18) r2 = 0x1100000000 ; R2_w=0x1100000000
>>> 11: (db) r2 = atomic64_fetch_and((u64 *)(r1 +0), r2)
>>> BPF_ATOMIC stores into R1 arena is not allowed
>>> processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
>>> -- END PROG LOAD LOG --
>>> libbpf: prog 'and': failed to load: -13
>>> libbpf: failed to load object 'arena_atomics'
>>> libbpf: failed to load BPF skeleton 'arena_atomics': -13
>>> test_arena_atomics:FAIL:arena atomics skeleton load unexpected error: -13 (errno 13)
>>> #3 arena_atomics:FAIL
>>>
>>> The reason for the failure is [4], where atomic{64,}_fetch_{and,or,xor}() are not
>>> allowed on arena addresses. Without llvm patch [3], the compiler generates a 'lock ...'
>>> insn and everything works fine.
>>>
>>> This patch fixes the problem by using inline asm. Instead of the __sync_fetch_and_{and,or,xor}()
>>> functions, inline asm with a 'lock' insn is used, which works with or without [3].
>>> Note that the three bpf programs ('and', 'or' and 'xor') are also guarded with
>>> __BPF_FEATURE_ADDR_SPACE_CAST to avoid compilation failure with llvm <= 18. For llvm <= 18,
>>> where addr_space_cast is not supported, all arena_atomics subtests are skipped with the below message:
>>> test_arena_atomics:SKIP:no ENABLE_ATOMICS_TESTS or no addr_space_cast support in clang
>>> #3 arena_atomics:SKIP
>>>
>>> [1] https://lore.kernel.org/bpf/ZqqiQQWRnz7H93Hc@google.com/T/#mb68d67bc8f39e35a0c3db52468b9de59b79f021f
>>> [2] https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca74488615744
>>> [3] https://github.com/llvm/llvm-project/pull/101428
>>> [4] d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT")
>>>
>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>> .../selftests/bpf/progs/arena_atomics.c | 63 ++++++++++++++++---
>>> 1 file changed, 54 insertions(+), 9 deletions(-)
>>>
>>> Changelog:
>>> v1 -> v2:
>>> - Add __BPF_FEATURE_ADDR_SPACE_CAST to guard newly added asm codes for llvm >= 19
>>>
>>> diff --git a/tools/testing/selftests/bpf/progs/arena_atomics.c b/tools/testing/selftests/bpf/progs/arena_atomics.c
>>> index bb0acd79d28a..dea54557fc00 100644
>>> --- a/tools/testing/selftests/bpf/progs/arena_atomics.c
>>> +++ b/tools/testing/selftests/bpf/progs/arena_atomics.c
>>> @@ -5,6 +5,7 @@
>>> #include <bpf/bpf_tracing.h>
>>> #include <stdbool.h>
>>> #include "bpf_arena_common.h"
>>> +#include "bpf_misc.h"
>>>
>>> struct {
>>> __uint(type, BPF_MAP_TYPE_ARENA);
>>> @@ -85,10 +86,24 @@ int and(const void *ctx)
>>> {
>>> if (pid != (bpf_get_current_pid_tgid() >> 32))
>>> return 0;
>>> -#ifdef ENABLE_ATOMICS_TESTS
>>> +#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
>>>
>>> - __sync_fetch_and_and(&and64_value, 0x011ull << 32);
>>> - __sync_fetch_and_and(&and32_value, 0x011);
>>> + asm volatile(
>>> + "r1 = addr_space_cast(%[and64_value], 0, 1);"
>>> + "lock *(u64 *)(r1 + 0) &= %[val]"
>>> + :
>>> + : __imm_ptr(and64_value),
>>> + [val]"r"(0x011ull << 32)
>>> + : "r1"
>>> + );
>>> + asm volatile(
>>> + "r1 = addr_space_cast(%[and32_value], 0, 1);"
>>> + "lock *(u32 *)(r1 + 0) &= %[val]"
>>> + :
>>> + : __imm_ptr(and32_value),
>>> + [val]"w"(0x011)
>>> + : "r1"
>>> + );
>> Instead of inline asm, there is a better way to do the same in C.
>> https://godbolt.org/z/71PYx1oqE
>>
>> void foo(int a, _Atomic int *b)
>> {
>> *b += a;
>> }
>>
>> generates:
>> lock *(u32 *)(r2 + 0) += r1
>
> If you use the latest llvm-project with
> https://github.com/llvm/llvm-project/pull/101428
> included, the above code generates:
>
> $ clang --target=bpf -O2 -c t.c && llvm-objdump -d t.o
> t.o: file format elf64-bpf
> Disassembly of section .text:
> 0000000000000000 <foo>:
> 0: c3 12 00 00 01 00 00 00 r1 = atomic_fetch_add((u32 *)(r2 + 0x0), r1)
> 1: 95 00 00 00 00 00 00 00 exit
>
>
> With -mcpu=v3 the same code can be generated.
Same for current GCC.
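For instance, a quick way to check (a sketch; the file name and flags
are just an example, assuming the GCC BPF toolchain binaries):

$ bpf-unknown-none-gcc -O2 -mcpu=v3 -c t.c
$ bpf-unknown-none-objdump -M pseudoc -d t.o

This is expected to show an atomic_fetch_add insn for the foo() above,
rather than a lock insn.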
>>
>> but
>> *b &= a;
>>
>> crashes llvm :( with
>>
>> <source>:3:5: error: unsupported atomic operation, please use 64 bit version
>> 3 | *b &= a;
>
> It failed with the following llvm error message:
> t.c:1:6: error: unsupported atomic operation, please use 64 bit version
> 1 | void foo(int a, _Atomic int *b)
> | ^
> fatal error: error in backend: Cannot select: t8: i64,ch = AtomicLoadAnd<(load store seq_cst (s32) on %ir.b)> t0, t4, t2
> t4: i64,ch = CopyFromReg t0, Register:i64 %1
> t3: i64 = Register %1
> t2: i64,ch = CopyFromReg t0, Register:i64 %0
> t1: i64 = Register %0
> In function: foo
>
>>
>> but works with -mcpu=v3
>
> Yes, it does work with -mcpu=v3:
In GCC, if you specify -mcpu=v2 and use atomic built-ins, you get
workaround code in the form of libcalls (like calls to
__atomic_fetch_and_4), which are of course of no use in BPF atm.
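A minimal sketch of the kind of code that triggers this (the function
and flags are my own example, assuming the GCC BPF toolchain):

/* t.c */
void bar(unsigned int *p, unsigned int v)
{
  /* With -mcpu=v2 this is expected to expand to a libcall to
     __atomic_fetch_and_4 instead of an inline atomic insn.  */
  (void) __atomic_fetch_and (p, v, __ATOMIC_SEQ_CST);
}

$ bpf-unknown-none-gcc -O2 -mcpu=v2 -c t.c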
> $ clang --target=bpf -O2 -mcpu=v3 -c t.c && llvm-objdump -d --mcpu=v3 t.o
>
> t.o: file format elf64-bpf
> Disassembly of section .text:
> 0000000000000000 <foo>:
> 0: c3 12 00 00 51 00 00 00 w1 = atomic_fetch_and((u32 *)(r2 + 0x0), w1)
> 1: 95 00 00 00 00 00 00 00 exit
>
> NOTE: I need -mcpu=v3 for llvm-objdump to print the asm code 'atomic_fetch_and' properly.
> I will double-check this.
The GNU binutils objdump will try to recognize instructions in the
latest supported cpu version, unless an explicit option is passed to
specify a particular ISA version:
$ bpf-unknown-none-objdump -M pseudoc -d foo.o
foo.o: file format elf64-bpfle
Disassembly of section .text:
0000000000000000 <foo>:
0: bf 11 20 00 00 00 00 00 r1 = (s32) r1
8: c3 12 00 00 51 00 00 00 w1=atomic_fetch_and((u32*)(r2+0),w1)
10: 95 00 00 00 00 00 00 00 exit
$ bpf-unknown-none-objdump -M v2,pseudoc -d foo.o
foo.o: file format elf64-bpfle
Disassembly of section .text:
0000000000000000 <foo>:
0: bf 11 20 00 00 00 00 00 r1=r1
8: c3 12 00 00 51 00 00 00 <unknown>
10: 95 00 00 00 00 00 00 00 exit
> For code:
> void foo(int a, _Atomic int *b)
> {
> *b &= a;
> }
>
> The initial IR generated by clang frontend is:
>
> define dso_local void @foo(i32 noundef %a, ptr noundef %b) #0 {
> entry:
> %a.addr = alloca i32, align 4
> %b.addr = alloca ptr, align 8
> store i32 %a, ptr %a.addr, align 4, !tbaa !3
> store ptr %b, ptr %b.addr, align 8, !tbaa !7
> %0 = load i32, ptr %a.addr, align 4, !tbaa !3
> %1 = load ptr, ptr %b.addr, align 8, !tbaa !7
> %2 = atomicrmw and ptr %1, i32 %0 seq_cst, align 4
> %3 = and i32 %2, %0
> ret void
> }
>
> Note the atomicrmw in the above. Eventually it is optimized to
>
> define dso_local void @foo(i32 noundef %a, ptr noundef %b) #0 {
> entry:
> %0 = atomicrmw and ptr %b, i32 %a seq_cst, align 4
> ret void
> }
>
> The 'atomicrmw' is the same IR as generated by
> __sync_fetch_and_*(), and it will eventually generate an atomic_fetch_*() bpf
> insn.
> I discussed this with Andrii, and
> another option is to specify relaxed consistency, so llvm
> internals could translate it into a locked insn. For example,
>
> $ cat t1.c
> #include <stdatomic.h>
>
> void f(_Atomic int *i) {
> __c11_atomic_fetch_and(i, 1, memory_order_relaxed);
> }
I think this makes sense. We currently removed the GCC insns
atomic_{add,or,xor,and} altogether, so the compiler is forced to
implement them in terms of atomic_fetch_{add,or,xor,and} regardless
of the memory ordering policy specified to the __sync_fetch_and_OP. So
even if you specify relaxed memory ordering, the resulting ordering is
whatever is implied by the fetching operation.
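A minimal sketch of what I mean (my own example, assuming the GCC BPF
toolchain):

void baz(int *p)
{
  /* Even with relaxed ordering, GCC is expected to go through the
     atomic_fetch_and pattern, since the non-fetching atomic_and
     pattern has been removed.  */
  (void) __atomic_fetch_and (p, 1, __ATOMIC_RELAXED);
}

So at the moment the memory order argument does not influence which BPF
insn gets selected.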
> # gnu/stubs-32.h is needed in the current directory to make it compile
> [yhs@devvm1513.prn0 ~/tmp6]$ ls gnu
> stubs-32.h
> [yhs@devvm1513.prn0 ~/tmp6]$ clang --target=bpf -O2 -I. -c -mcpu=v3 t1.c
>
> The initial IR:
> define dso_local void @f(ptr noundef %i) #0 {
> entry:
> %i.addr = alloca ptr, align 8
> %.atomictmp = alloca i32, align 4
> %atomic-temp = alloca i32, align 4
> store ptr %i, ptr %i.addr, align 8, !tbaa !3
> %0 = load ptr, ptr %i.addr, align 8, !tbaa !3
> store i32 1, ptr %.atomictmp, align 4, !tbaa !7
> %1 = load i32, ptr %.atomictmp, align 4
> %2 = atomicrmw and ptr %0, i32 %1 monotonic, align 4
> store i32 %2, ptr %atomic-temp, align 4
> %3 = load i32, ptr %atomic-temp, align 4, !tbaa !7
> ret void
> }
>
> The IR right before machine code generation:
>
> define dso_local void @f(ptr nocapture noundef %i) local_unnamed_addr #0 {
> entry:
> %0 = atomicrmw and ptr %i, i32 1 monotonic, align 4
> ret void
> }
>
> Maybe we could special-case the above to generate
> a locked insn if
> - the operator is atomicrmw
> - the consistency is monotonic (relaxed)
> - the return value is not used
>
> So this would not violate the original program semantics.
> Does this sound like a reasonable approach?
Whether monotonic consistency is desired (ordered writes) can probably
be deduced from the memory_order_* flag of the built-ins, but I
don't know what atomicrmw is... what is it in non-llvm terms?
>>
>> So let's make this test -mcpu=v3 only and use normal C?
>>
>> pw-bot: cr