From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org (eggs.gnu.org [209.51.188.92]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5660114D443 for ; Thu, 1 Aug 2024 12:48:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.51.188.92 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722516484; cv=none; b=eJBLeMVw3ErDK/BvNxwo9BsDYqdoRxVwRjAcF89EN8PEvHwbUuv+t6MiBeGkFXZP1hrNtZNzBcMVqSXunTXERKWHpFYwZIFMwLsmI2AXq+H+6XVHTMC4HOdViPavwDXrX+6vKuaCQICqSRw4FC7zSw+oZuCd/780LMRZXjgwxT0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722516484; c=relaxed/simple; bh=Z77NADmlUIvyX7sMFWC1uQQH9ZuuNAQwuvGbKhrXejM=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=dDk9/Hi3dgLfVDQUTHkJkVr1inFTR13iKoBsQ2Jmb2WXRr6ihw6seUWifu1+uXekO2qkRw2h5Wbl5gXfd4enoxSYLrqJsKqMJQHXKzQsHxo6tVjHAcwPCejQY2cqHbIiKrAoZdj3A0Sx5F8xmWDS3AO2kDlSHXsWqk1+0U5iIY0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gnu.org; spf=pass smtp.mailfrom=gnu.org; dkim=pass (2048-bit key) header.d=gnu.org header.i=@gnu.org header.b=NjSl2lNH; arc=none smtp.client-ip=209.51.188.92 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gnu.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gnu.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gnu.org header.i=@gnu.org header.b="NjSl2lNH" Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1sZVDm-0005x0-VY; Thu, 01 Aug 2024 08:47:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=cLtTI5aDpVkqi3h3wegi8eiUwEKQQHh2T2w0eEXQR+Y=; b=NjSl2lNHAdJFD1iounFr Sqr6TgFefzWUkxW7ihLNmd1N+tJOEJkMG1LX1QgsomiI4Wfi2SGicVQa+Tk+sL9XP6MXFed66bvV2 quSS60GurxGq+y7AMS+toBO3sv8AiXgudlqKeGXZ3TjfK4siC1PZyRXPmRJ/jf5LGQ8m4MpaKK5at jnJ4rHwW6b3fYIGt71UC6SdfnhVG27nwnhElFRFa9JeNG4hdMZ+xDlHBdGr0o8FfebscwihtUEu92 7+fZtnFdtUsVCmiQICQQ8VMH+8zOLLnFRnZAh/L+i6OlqwSf4xNX9CHkwCjWMzW+SxDH6rLnoPMOw LpgpdkdYDBdjSA==; From: "Jose E. Marchesi" To: Yonghong Song Cc: Alexei Starovoitov , Peilin Ye , bpf , Josh Don , Barret Rhoden , Neel Natu , Benjamin Segall , "Paul E. McKenney" , Alexei Starovoitov , David Vernet , Dave Marchevsky Subject: Re: Supporting New Memory Barrier Types in BPF In-Reply-To: <24b57380-c829-4033-a7b1-06a4ed413a49@linux.dev> (Yonghong Song's message of "Mon, 29 Jul 2024 22:14:26 -0700") References: <20240729183246.4110549-1-yepeilin@google.com> <24b57380-c829-4033-a7b1-06a4ed413a49@linux.dev> Date: Thu, 01 Aug 2024 14:47:49 +0200 Message-ID: <87h6c4h0ju.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable > On 7/29/24 6:28 PM, Alexei Starovoitov wrote: >> On Mon, Jul 29, 2024 at 11:33=E2=80=AFAM Peilin Ye = wrote: >>> Hi list! >>> >>> As we are looking at running sched_ext-style BPF scheduling on architec= tures >>> with a more relaxed memory model (i.e. ARM), we would like to: >>> >>> 1. have fine-grained control over memory ordering in BPF (instead of >>> defaulting to a full barrier), for performance reasons >>> 2. pay closer attention to if memory barriers are being used correct= ly in >>> BPF >>> >>> To that end, our main goal here is to support more types of memory barr= iers in >>> BPF. While Paul E. McKenney et al. are working on the formalized BPF m= emory >>> model [1], Paul agreed that it makes sense to support some basic types = first. >>> Additionally, we noticed an issue with the __sync_*fetch*() compiler bu= ilt-ins >>> related to memory ordering, which will be described in details below. >>> >>> I. We need more types of BPF memory barriers >>> -------------------------------------------- >>> >>> Currently, when it comes to BPF memory barriers, our choices are effect= ively >>> limited to: >>> >>> * compiler barrier: 'asm volatile ("" ::: "memory");' >>> * full memory barriers implied by compiler built-ins like >>> __sync_val_compare_and_swap() >>> >>> We need more. During offline discussion with Paul, we agreed we can st= art >>> from: >>> >>> * load-acquire: __atomic_load_n(... memorder=3D__ATOMIC_ACQUIRE); >>> * store-release: __atomic_store_n(... memorder=3D__ATOMIC_RELEASE); >> we would need inline asm equivalent too. Similar to kernel >> smp_load_acquire() macro. >> >>> Theoretically, the BPF JIT compiler could also reorder instructions jus= t like >>> Clang or GCC, though it might not currently do so. If we ever develope= d a more >>> optimizing BPF JIT compiler, it would also be nice to have an optimizat= ion >>> barrier for it. However, Alexei Starovoitov has expressed that definin= g a BPF >>> instruction with 'asm volatile ("" ::: "memory");' semantics might be t= ricky. >> It can be a standalone insn that is a compiler barrier only but that fee= ls like >> a waste of an instruction. So depending how we end up encoding various >> real barriers >> there may be a bit to spend in such a barrier insn that is only a >> compiler barrier. >> In this case optimizing JIT barrier. >> >>> II. Implicit barriers can get confusing >>> --------------------------------------- >>> >>> We noticed that, as a bit of a surprise, the __sync_*fetch*() built-ins= do not >>> always imply a full barrier for BPF on ARM. For example, when using LL= VM, the >>> frequently-used __sync_fetch_and_add() can either imply "relaxed" (no b= arrier), >>> or "acquire and release" (full barrier) semantics, depending on if its = return >>> value is used: >>> >>> Case (a): return value is used >>> >>> SEC("...") >>> int64_t foo; >>> >>> int64_t func(...) { >>> return __sync_fetch_and_add(&foo, 1); >>> } >>> >>> For case (a), Clang gave us: >>> >>> 3: db 01 00 00 01 00 00 00 r0 =3D atomic_fetch_add((u64 *)(r1 + 0= x0), r0) >>> >>> opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW >>> imm (0x00000001): BPF_ADD | BPF_FETCH >>> >>> Case (b): return value is ignored >>> >>> SEC("...") >>> int64_t foo; >>> >>> int64_t func(...) { >>> __sync_fetch_and_add(&foo, 1); >>> >>> return foo; >>> } >>> >>> For case (b), Clang gave us: >>> >>> 3: db 12 00 00 00 00 00 00 lock *(u64 *)(r2 + 0x0) +=3D r1 >>> >>> opcode (0xdb): BPF_STX | BPF_ATOMIC | BPF_DW >>> imm (0x00000000): BPF_ADD >>> >>> LLVM decided to drop BPF_FETCH, since the return value of >>> __sync_fetch_and_add() is being ignored [2]. Now, if we take a look at >>> emit_lse_atomic() in the BPF JIT compiler code for ARM64 (suppose that = LSE >>> atomic instructions are being used): >>> >>> case BPF_ADD: >>> emit(A64_STADD(isdw, reg, src), ctx); >>> break; >>> <...> >>> case BPF_ADD | BPF_FETCH: >>> emit(A64_LDADDAL(isdw, src, reg, src), ctx); >>> break; >>> >>> STADD is an alias for LDADD. According to [3]: >>> >>> * LDADDAL for case (a) has "acquire" plus "release" semantics >>> * LDADD for case (b) "has neither acquire nor release semantics" >>> >>> This is pretty non-intuitive; a compiler built-in should not have incon= sistent >>> implications on memory ordering, and it is better not to require all BPF >>> programmers to memorize this. >>> >>> GCC seems a bit ambiguous [4] on whether __sync_*fetch*() built-ins sho= uld >>> always imply a full barrier. GCC considers these __sync_*() built-ins = as >>> "legacy", and introduced a new set of __atomic_*() built-ins ("Memory M= odel >>> Aware Atomic Operations") [5] to replace them. These __atomic_*() buil= t-ins >>> are designed to be a lot more explicit on memory ordering, for example: >>> >>> type __atomic_fetch_add (type *ptr, type val, int memorder) >>> >>> This requires the programmer to specify a memory order type (relaxed, a= cquire, >>> release...) via the "memorder" parameter. Currently in LLVM, for BPF, = those >>> __atomic_*fetch*() built-ins seem to be aliases to their __sync_*fetch*= () >>> counterparts (the "memorder" parameter seems effectively ignored), and = are not >>> fully supported. >> This sounds like a compiler bug. >> >> Yonghong, Jose, >> do you know what compilers do for other backends? >> Is it allowed to convert sycn_fetch_add into sync_add when fetch part is= unused? > > This behavior is introduced by the following llvm commit: > https://github.com/llvm/llvm-project/commit/286daafd65129228e08a1d07aa4ca= 74488615744 > > Specifically the following commit message: > > =3D=3D=3D=3D=3D=3D=3D > Similar to xadd, atomic xadd, xor and xxor (atomic_) > instructions are added for atomic operations which do not > have return values. LLVM will check the return value for > __sync_fetch_and_{add,and,or,xor}. > If the return value is used, instructions atomic_fetch_ > will be used. Otherwise, atomic_ instructions will be used. > =3D=3D=3D=3D=3D=3D > > Basically, if no return value, __sync_fetch_and_add() will use > xadd insn. The decision is made at that time to maintain backward compati= bility. > For one example, in bcc > https://github.com/iovisor/bcc/blob/master/src/cc/export/helpers.h#L1444 > we have > #define lock_xadd(ptr, val) ((void)__sync_fetch_and_add(ptr, val)) > > Should we use atomic_fetch_*() always regardless of whether the return > val is used or not? Probably, it should still work. Not sure what gcc > does for this case. GCC behaves similarly. For program A: long foo; =20=20 long func () { return __sync_fetch_and_add(&foo, 1); } bpf-unknown-none-gcc -O2 compiles to: 0000000000000000 : 0: 18 00 00 00 00 00 00 00 r0=3D0 ll 8: 00 00 00 00 00 00 00 00=20 10: b7 01 00 00 01 00 00 00 r1=3D1 18: db 10 00 00 01 00 00 00 r1=3Datomic_fetch_add((u64*)(r0+0),r1) 20: bf 10 00 00 00 00 00 00 r0=3Dr1 28: 95 00 00 00 00 00 00 00 exit And for program B: long foo; =20=20 long func () { __sync_fetch_and_add(&foo, 1); return foo; } bpf-unknown-none-gcc -O2 compiles to: 0000000000000000 : 0: 18 00 00 00 00 00 00 00 r0=3D0 ll 8: 00 00 00 00 00 00 00 00=20 10: b7 01 00 00 01 00 00 00 r1=3D1 18: db 10 00 00 00 00 00 00 lock *(u64*)(r0+0)+=3Dr1 20: 79 00 00 00 00 00 00 00 r0=3D*(u64*)(r0+0) 28: 95 00 00 00 00 00 00 00 exit Internally: - When compiling the program A GCC decides to emit an `atomic_fetch_addDI' insn, documented as: 'atomic_fetch_addMODE', 'atomic_fetch_subMODE' 'atomic_fetch_orMODE', 'atomic_fetch_andMODE' 'atomic_fetch_xorMODE', 'atomic_fetch_nandMODE' These patterns emit code for an atomic operation on memory with memory model semantics, and return the original value. Operand 0 is an output operand which contains the value of the memory location before the operation was performed. Operand 1 is the memory on which the atomic operation is performed. Operand 2 is the second operand to the binary operator. Operand 3 is the memory model to be used by the operation. The BPF backend defines atomic_fetch_add for DI modes (long) to expand to this BPF instruction: %w0 =3D atomic_fetch_add(( *)%1, %w0) - When compiling the program B GCC decides to emit an `atomic_addDI' insn, documented as: 'atomic_addMODE', 'atomic_subMODE' 'atomic_orMODE', 'atomic_andMODE' 'atomic_xorMODE', 'atomic_nandMODE' These patterns emit code for an atomic operation on memory with memory model semantics. Operand 0 is the memory on which the atomic operation is performed. Operand 1 is the second operand to the binary operator. Operand 2 is the memory model to be used by the operation. The BPF backend defines atomic_fetch_add for DI modes (long) to expand to this BPF instruction: lock *( *)%w0 +=3D %w1 This is done for all targets. In x86-64, for example, case A compiles to: 0000000000000000 : 0: b8 01 00 00 00 mov $0x1,%eax 5: f0 48 0f c1 05 00 00 lock xadd %rax,0x0(%rip) # e c: 00 00=20 e: c3 retq=20=20=20 And case B compiles to: 0000000000000000 : 0: f0 48 83 05 00 00 00 lock addq $0x1,0x0(%rip) # 9 7: 00 01=20 9: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 10 10: c3 retq=20=20=20 Why wouldn't the compiler be allowed to optimize from atomic_fetch_add to atomic_add in this case? > >> >>> III. Next steps >>> --------------- >>> >>> Roughly, the scope of this work includes: >>> >>> * decide how to extend the BPF ISA (add new instructions and/or exte= nd >>> current ones) >> ldx/stx insns support MEM and MEMSX modifiers. >> Adding MEM_ACQ_REL feels like a natural fit. Better name? >> >> For barriers we would need a new insn. Not sure which class would fit th= e best. >> Maybe BPF_LD ? >> >> Another alternative for barriers is to use nocsr kfuncs. >> Then we have the freedom to make mistakes and fix them later. >> One kfunc per barrier would do. >> JITs would inline them into appropriate insns. >> In bpf progs they will be used just like in the kernel code smp_mb(), >> smp_rmb(), etc. >> >> I don't think compilers have to emit barriers from C code, so my >> preference is kfuncs atm. >> >>> * teach LLVM and GCC to generate the new/extended instructions >>> * teach the BPF verifier to understand them >>> * teach the BPF JIT compiler to compile them >>> * update BPF memory model and tooling >>> * update IETF specification >>> >>> Additionally, for the issue described in the previous section, we need = to: >>> >>> * check if GCC has the same behavior >>> * at least clearly document the implied effects on BPF memory orderi= ng of >>> current __sync_*fetch*() built-ins (especially for architectures l= ike ARM), >>> as described >>> * fully support the new __atomic_*fetch*() built-ins for BPF to repl= ace the >>> __sync_*fetch*() ones >>> >>> Any suggestions or corrections would be most welcome! >>> >>> Thanks, >>> Peilin Ye >>> >>> >>> [1] Instruction-Level BPF Memory Model >>> https://docs.google.com/document/d/1TaSEfWfLnRUi5KqkavUQyL2tThJXYWHS15q= cbxIsFb0/edit?usp=3Dsharing >>> >>> [2] For more information, see LLVM commit 286daafd6512 ("[BPF] support = atomic >>> instructions"). Search for "LLVM will check the return value" in = the >>> commit message. >>> >>> [3] Arm Architecture Reference Manual for A-profile architecture (ARM D= DI >>> 0487K.a, ID032224), C6.2.149, page 2006 >>> >>> [4] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html >>> 6.58 Legacy __sync Built-in Functions for Atomic Memory Access >>> "In most cases, these built-in functions are considered a full bar= rier." >>> >>> [5] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html >>> 6.59 Built-in Functions for Memory Model Aware Atomic Operations >>>