Subject: Re: asm register constraint. Was: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro
From: Yonghong Song
Date: Mon, 8 Jan 2024 15:22:58 -0800
To: Alexei Starovoitov, Eduard Zingerman, "Jose E. Marchesi", "Jose E. Marchesi"
Cc: Kumar Kartikeya Dwivedi, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Daniel Xu, John Fastabend, bpf, Kernel Team

On 1/8/24 1:33 PM, Alexei Starovoitov wrote:
> On Fri, Jan 5, 2024 at 1:47 PM Eduard Zingerman wrote:
>> On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote:
>> [...]
>>> It turned out there are indeed a bunch of redundant shifts
>>> when a u32 or s32 is passed into the "r" asm constraint.
>>>
>>> Strangely, the shifts are there when compiled with -mcpu=v3 or v4,
>>> and there are no shifts with -mcpu=v1 and v2.
>>>
>>> Also weird that u8 and u16 are passed into "r" without redundant shifts.
>>> Hence I found a "workaround": cast the u32 into u16 while passing.
>>> The truncation of the u32 doesn't happen, and the shifts to zero the
>>> upper 32 bits are gone as well.
>>>
>>> https://godbolt.org/z/Kqszr6q3v
>>
>> Regarding the unnecessary shifts.
>> Sorry, a long email about a minor feature/defect.
>>
>> So, currently the following C program
>> (and its variations with implicit casts):
>>
>>   extern unsigned long bar(void);
>>   void foo(void) {
>>     asm volatile ("%[reg] += 1"::[reg]"r"((unsigned)bar()));
>>   }
>>
>> is translated to the following BPF:
>>
>>   $ clang -O2 --target=bpf -mcpu=v3 -c -o - t.c | llvm-objdump --no-show-raw-insn -d -
>>
>>   <stdin>: file format elf64-bpf
>>
>>   Disassembly of section .text:
>>
>>   0000000000000000 <foo>:
>>     0: call -0x1
>>     1: r0 <<= 0x20
>>     2: r0 >>= 0x20
>>     3: r0 += 0x1
>>     4: exit
>>
>> Note: no additional shifts are generated when the "w" (32-bit register)
>> constraint is used instead of "r".
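For concreteness, the workaround quoted above boils down to the sketch
below (function names invented for illustration; the godbolt link has the
exact variants):

  /* clang -O2 --target=bpf -mcpu=v3 */
  extern unsigned long bar(void);

  void with_shifts(void)
  {
      /* u32 through "r": on -mcpu=v3/v4 the compiler emits
       * r0 <<= 0x20; r0 >>= 0x20 before the inline asm */
      asm volatile ("%[reg] += 1" :: [reg]"r"((unsigned int)bar()));
  }

  void with_workaround(void)
  {
      /* u16 cast: no truncation is actually emitted, and the shifts
       * zeroing the upper 32 bits disappear */
      asm volatile ("%[reg] += 1" :: [reg]"r"((unsigned short)bar()));
  }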
>> First, is this right or wrong?
>> ------------------------------
>>
>> The C language spec [1], paragraph 6.5.4.6 (Cast operators -> Semantics),
>> says the following:
>>
>>   If the value of the expression is represented with greater range or
>>   precision than required by the type named by the cast (6.3.1.8),
>>   then the cast specifies a conversion even if the type of the
>>   expression is the same as the named type and removes any extra range
>>   and precision.                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>   ^^^^^^^^^^^^^
>>
>> What do other LLVM backends do in such situations?
>> Consider the following program translated to amd64 [2] and aarch64 [3]:
>>
>>   void foo(void) {
>>     asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long)  bar())); // 1
>>     asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int)   bar())); // 2
>>     asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); // 3
>>   }
>>
>> - for amd64, a register of the proper size is selected for `reg`;
>> - for aarch64, warnings about wrong operand size are emitted at (2) and (3),
>>   and a 64-bit register is used w/o generating any additional instructions.
>>
>> (Note, however, that 'arm' silently ignores the issue and uses 32-bit
>> registers for all three points.)
>>
>> So, it looks like something of this sort should be done:
>> - either the extra precision should be removed via additional instructions;
>> - or a 32-bit register should be picked for `reg`;
>> - or a warning should be emitted, as in the aarch64 case.
>>
>> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf
>> [2] https://godbolt.org/z/9nKxaMc5j
>> [3] https://godbolt.org/z/1zxEr5b3f
>>
>>
>> Second, what to do?
>> -------------------
>>
>> I think the following steps are needed:
>> - The investigation described in the next section shows that currently the
>>   two shifts are generated accidentally, w/o real intent to shed precision.
>>   I have a patch [6] that removes the shift generation; it should be applied.
>> - When a 32-bit value is passed to the "r" constraint:
>>   - for cpu v3/v4, a 32-bit register should be selected;
>>   - for cpu v1/v2, a warning should be reported.

> Thank you for the detailed analysis.
>
> Agree that llvm fix [6] is a necessary step, then

[6] is probably not needed for normal C code. Eduard may help to confirm.
But yes, it is needed to simplify asm with 'r' registers.

> using 'w' in v3/v4 and warn on v1/v2 makes sense too,
> but we have this macro:
>   #define barrier_var(var) asm volatile("" : "+r"(var))
> that is used in various bpf production programs.
> tetragon is also a heavy user of inline asm.

asm volatile("" : "+r"(var)) is expected to be consumed by the compiler
only, and no code will be generated for it in the final binary. This is
true for all the instances I see. If, in unusual cases, we have code
generated like (r1 = r1, w1 = w1), we can remove that inline asm in the
bpf backend in a very late peephole optimization.

> Right now a full 64-bit register is allocated,
> so switching to 'w' might cause unexpected behavior
> including rejection by the verifier.
> I think it makes sense to align the bpf backend with arm64 and x86,
> but we need to broadcast this change widely.

Indeed, [6] alone might cause wrong results in certain cases. Or, if [6]
lands, some code might need to change from 'r' to 'w' in its source.
Agree we should do a similar thing to arm64/x86: for a 4-byte value on
cpuv3/cpuv4, change from 'r' to 'w'; for other 4-byte values, or 1/2-byte
values, issue warnings.

> Also need to align with GCC.

(Jose cc-ed)
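Coming back to the barrier_var() point above for a moment, here is a
sketch of what is at stake for existing programs. barrier_var is the real
macro quoted above; the 32-bit "w" variant below is hypothetical and only
illustrates the proposed direction for cpu v3/v4:

  /* Existing macro: forces 'var' through a register, inhibiting
   * compiler optimizations around it. Today this always binds the
   * operand to a full 64-bit rN register via the "r" constraint. */
  #define barrier_var(var) asm volatile("" : "+r"(var))

  /* Hypothetical 32-bit flavor: on cpu v3/v4 the "w" constraint binds
   * a 32-bit wN register, so no zero-extension shifts are needed. */
  #define barrier_var32(var) asm volatile("" : "+w"(var))

  unsigned int example(unsigned int x)
  {
      barrier_var(x);   /* with the proposed change, a u32 here would
                         * pick a wN register instead of rN */
      return x + 1;
  }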
> And, most importantly, we need a way to go back to the old behavior,
> since
>   u32 var; asm("...":: "r"(var));
> will now allocate a "w" register or warn.
>
> What should be the workaround?
>
> I've tried:
>   u32 var; asm("...":: "r"((u64)var));
>
> https://godbolt.org/z/n4ejvWj7v
>
> and x86/arm64 generate a 32-bit truncation.
> Sounds like the bpf backend has to do it as well.
> We should be doing 'wX = wX' in such a case (just like x86)
> instead of <<= 32, >>= 32.

Indeed, this is preferred (a small illustration follows below).

> I think this should be done as a separate diff.
> Our current pattern of using shifts is inefficient and guaranteed
> to screw up verifier range analysis, while wX = wX is faster
> and more verifier friendly.

Yes, it's still not going to be 1-1 with the old (our current) behavior.
With [6] and replacing 'r' with 'w' in inline asm, it will be a new
behavior.

> We probably need some macro (like we did for __BPF_CPU_VERSION__)
> to identify such a fixed llvm, so existing users with the '(short)'
> workaround and other tricks can detect new vs old compiler.

We can add a feature macro like
https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/BPF.cpp#L55

> Looks like we opened a can of worms.
> Aligning with x86/arm64 makes sense, but the cost of doing
> the right thing is hard to estimate.

>> Third, why are two shifts generated?
>> ------------------------------------
>>
>> (Details here might be interesting to Yonghong; a regular reader could
>> skip this section.)
>>
>> The two shifts are the result of an interaction between two IR constructs,
>> `trunc` and `asm`. The C program above looks as follows in LLVM IR
>> before machine code generation:
>>
>>   declare dso_local i64 @bar()
>>   define dso_local void @foo(i32 %p) {
>>   entry:
>>     %call = call i64 @bar()
>>     %v32 = trunc i64 %call to i32
>>     tail call void asm sideeffect "$0 += 1", "r"(i32 %v32)
>>     ret void
>>   }
>>
>> Initial selection DAG:
>>
>>   $ llc -debug-only=isel -march=bpf -mcpu=v3 --filetype=asm -o - t.ll
>>   SelectionDAG has 21 nodes:
>>   ...
>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>> !     t11: i32 = truncate t10
>> !     t15: i64 = zero_extend t11
>>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t15
>>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch,
>>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>   ...
>>
>> Note the values t11 and t15, marked with (!).
>>
>> Optimized lowered selection DAG for this fragment:
>>
>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>> !     t22: i64 = and t10, Constant:i64<4294967295>
>>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>       t19: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch,
>>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>
>> Note (zext (truncate ...)) converted to (and ... 0xffff_ffff).
>>
>> DAG after instruction selection:
>>
>>       t10: i64,ch,glue = CopyFromReg t8:1, Register:i64 $r0, t8:2
>> !     t25: i64 = SLL_ri t10, TargetConstant:i64<32>
>> !     t22: i64 = SRL_ri t25, TargetConstant:i64<32>
>>       t17: ch,glue = CopyToReg t10:1, Register:i64 %1, t22
>>       t23: ch,glue = inlineasm t17, TargetExternalSymbol:i64'$0 += 1', MDNode:ch,
>>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t17:1
>>
>> Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)).
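As a small aside, here is the 'wX = wX' illustration promised earlier in
this email: both sequences below perform the same zero-extension. The
single 32-bit mov is preferable because on cpu v3/v4 every 32-bit ALU
operation zeroes the upper 32 bits of the destination register (register
number arbitrary):

  # current lowering: two shifts, screws up verifier range analysis
  r1 <<= 32
  r1 >>= 32

  # preferred lowering: 32-bit mov, implicitly zero-extends into r1
  w1 = w1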
>> This happens because of the following pattern from BPFInstrInfo.td:
>>
>>   // 0xffffFFFF doesn't fit into simm32, optimize common case
>>   def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)),
>>             (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>;
>>
>> So, the two shift instructions are the result of the translation of
>> (zext (trunc ...)). However, closer examination shows that the zext DAG
>> node was generated almost by accident. Here is the backtrace for when
>> this node was created:
>>
>>   Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=202) ;; 202 is opcode for ZERO_EXTEND
>>       at .../SelectionDAG.cpp:5605
>>   (gdb) bt
>>   #0 llvm::SelectionDAG::getNode (...)
>>       at ...SelectionDAG.cpp:5605
>>   #1 0x... in getCopyToParts (..., ExtendKind=llvm::ISD::ZERO_EXTEND)
>>       at .../SelectionDAGBuilder.cpp:537
>>   #2 0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType=llvm::ISD::ANY_EXTEND)
>>       at .../SelectionDAGBuilder.cpp:958
>>   #3 0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...)
>>       at .../SelectionDAGBuilder.cpp:9640
>>   ...
>>
>> Stack frame #2 is interesting; here is the code for it [4]:
>>
>>   void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG,
>>                                    const SDLoc &dl, SDValue &Chain, SDValue *Glue,
>>                                    const Value *V,
>>                                    ISD::NodeType PreferredExtendType) const {
>>                                                  ^
>>                                                  '-- this is ANY_EXTEND
>>     ...
>>     for (unsigned Value = 0, Part = 0, e = ValueVTs.size(); Value != e; ++Value) {
>>       ...
>>            .-- this returns true
>>            v
>>       if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
>>         ExtendKind = ISD::ZERO_EXTEND;
>>
>>                           .-- this is ZERO_EXTEND
>>                           v
>>       getCopyToParts(..., ExtendKind);
>>       Part += NumParts;
>>     }
>>     ...
>>   }
>>
>> The getCopyToRegs() function was called with the ANY_EXTEND preference,
>> but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns
>> true for any 32 to 64-bit conversion [5].
>> However, in this case that is clearly a mistake, as the zero extension
>> (zext i64 (truncate i32 ...)) costs two instructions.
>>
>> The isZExtFree() behavior could be changed to report false for such
>> situations, as in my patch [6]. This in turn removes the zext and, with
>> it, the two shifts from the final asm.
>> Here is how the DAG/asm look after patch [6]:
>>
>> Initial selection DAG:
>>
>>   ...
>>       t10: i64,ch,glue = CopyFromReg t8, Register:i64 $r0, t8:1
>> !     t11: i32 = truncate t10
>>       t16: ch,glue = CopyToReg t10:1, Register:i64 %1, t10
>>       t18: ch,glue = inlineasm t16, TargetExternalSymbol:i64'$0 += 1', MDNode:ch,
>>              TargetConstant:i64<1>, TargetConstant:i32<131081>, Register:i64 %1, t16:1
>>   ...
>>
>> Final asm:
>>
>>   ...
>>   # %bb.0:
>>     call bar
>>     #APP
>>     r0 += 1
>>     #NO_APP
>>     exit
>>   ...
>>
>> Note that [6] is a very minor change: it does not affect code generation
>> for the selftests at all, and I was unable to conjure examples where it
>> has an effect aside from inline asm parameters.
>>
>> [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L937C10
>> [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895db4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21
>> [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40fef022d08b0ef
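Regarding the feature macro discussed earlier in the thread: from the BPF
program side, detection could look like the sketch below. The macro name
is invented here purely for illustration; the real one would be whatever
the LLVM patch ends up defining, next to the existing __BPF_CPU_VERSION__:

  /* Hypothetical feature test: a fixed compiler (with [6] plus 'w'
   * register selection) takes the u32 as-is; an old compiler keeps
   * the '(short)' truncation-avoidance trick. */
  #ifdef __BPF_FEATURE_ASM_R_FIXED            /* invented name */
  # define ASM_U32(x) ((unsigned int)(x))
  #else
  # define ASM_U32(x) ((unsigned short)(x))
  #endif

  extern unsigned long bar(void);

  void foo(void)
  {
      asm volatile ("%[reg] += 1" :: [reg]"r"(ASM_U32(bar())));
  }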