From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-wm1-f54.google.com (mail-wm1-f54.google.com [209.85.128.54]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A678936093 for ; Fri, 5 Jan 2024 21:47:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="hed1qo6Y" Received: by mail-wm1-f54.google.com with SMTP id 5b1f17b1804b1-40e3a67ca9fso266595e9.0 for ; Fri, 05 Jan 2024 13:47:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1704491270; x=1705096070; darn=vger.kernel.org; h=mime-version:user-agent:content-transfer-encoding:autocrypt :references:in-reply-to:date:cc:to:from:subject:message-id:from:to :cc:subject:date:message-id:reply-to; bh=2TwumnDWyEHWwqsTw4aAlQjOE5GhST1uznuHWGlCq74=; b=hed1qo6YLOvMgqDPezc6ffYFCFcUnFaoj+qF55VcXjXMb/hbwcLGAFdtfqfOSfMqLg lF/2YKEJr3jWO/Kma7PBLPTwjw8i6RdO90RQqh/yd50wY6CuaJ5hzjXcfQmfTz1s6dbE LahlvkRHbVBcS0juCM6sdZrDMOjCCvrQq8OoCQhVBgM8azUfSbvKipT2DxFt2/gIT5it Uo+8tA6N2qzy+P9bZfZLUteOpkoH7kAb9+4TSld8OVSyI2VDxBbhomya9g7nIg3CC2KO Ovtfl+xoDwU9BeB3EVyFbt/gsOVciropg8EEfkhJ/Jk8ZF8aI4KQG5qfFdvGfqZAoPxN r0+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1704491270; x=1705096070; h=mime-version:user-agent:content-transfer-encoding:autocrypt :references:in-reply-to:date:cc:to:from:subject:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=2TwumnDWyEHWwqsTw4aAlQjOE5GhST1uznuHWGlCq74=; b=URjqmtN9Hd8zirzyzBgnJrf5TpGShZAUrwEw8FoIejnc0AvWGA8cu5B3S5jzBINT4z txyFZusavvnSCl1XONqX7dz2d4Y/fxw7NT6quBjk5RuFBwlPRh3sL2KV0mYCVhY7D9ah rOd9zpbm8Flt04sco9okcsPWZzoU9G7qTgUQT2ISvO8cTQGsBVIzwJA9Tj6vP/m94eC8 AmKHbOTPP1xa098MM8znwNLTuYuwJ45x997xYsF7cs+qGTSNkufXOQpw6TslW7KCdsCw ZPwXn+3fmkMHXqHzjXfSKzOgfXNUudF0oUwxh37yHb4T5Za3SgdvTDG8CCkS0wwQLXQ8 LzOQ== X-Gm-Message-State: AOJu0YwL7zXQYeIw8fNYOkXcoOMsrcrgKrBDRS1XAyUYr81p9oWNBgiL C66FxxWqzIQkuPKDDtJ3+yk= X-Google-Smtp-Source: AGHT+IGD9Q/bEgBcp109+48zXxh1/B1TNKLQ97MYHTxbrJH92WJOOJ+i1rnujcVBXoADekOe/tvP0g== X-Received: by 2002:a1c:7c1a:0:b0:40d:6e05:fcc0 with SMTP id x26-20020a1c7c1a000000b0040d6e05fcc0mr77936wmc.41.1704491269467; Fri, 05 Jan 2024 13:47:49 -0800 (PST) Received: from [192.168.1.95] (host-176-36-0-241.b024.la.net.ua. [176.36.0.241]) by smtp.gmail.com with ESMTPSA id l8-20020a05600c1d0800b0040d6e07a147sm2732307wms.23.2024.01.05.13.47.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 05 Jan 2024 13:47:48 -0800 (PST) Message-ID: <44a9223b6638673487850eb9d70cc01ef58e9d93.camel@gmail.com> Subject: Re: [PATCH v2 bpf-next 2/5] bpf: Introduce "volatile compare" macro From: Eduard Zingerman To: Alexei Starovoitov , Kumar Kartikeya Dwivedi , Yonghong Song Cc: Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Daniel Xu , John Fastabend , bpf , Kernel Team Date: Fri, 05 Jan 2024 23:47:47 +0200 In-Reply-To: References: <20231221033854.38397-1-alexei.starovoitov@gmail.com> <20231221033854.38397-3-alexei.starovoitov@gmail.com> Autocrypt: addr=eddyz87@gmail.com; prefer-encrypt=mutual; keydata=mQGNBGKNNQEBDACwcUNXZOGTzn4rr7Sd18SA5Wv0Wna/ONE0ZwZEx+sIjyGrPOIhR14/DsOr3ZJer9UJ/WAJwbxOBj6E5Y2iF7grehljNbLr/jMjzPJ+hJpfOEAb5xjCB8xIqDoric1WRcCaRB+tDSk7jcsIIiMish0diTK3qTdu4MB6i/sh4aeFs2nifkNi3LdBuk8Xnk+RJHRoKFJ+C+EoSmQPuDQIRaF9N2m4yO0eG36N8jLwvUXnZzGvHkphoQ9ztbRJp58oh6xT7uH62m98OHbsVgzYKvHyBu/IU2ku5kVG9pLrFp25xfD4YdlMMkJH6l+jk+cpY0cvMTS1b6/g+1fyPM+uzD8Wy+9LtZ4PHwLZX+t4ONb/48i5AKq/jSsb5HWdciLuKEwlMyFAihZamZpEj+9n91NLPX4n7XeThXHaEvaeVVl4hfW/1Qsao7l1YjU/NCHuLaDeH4U1P59bagjwo9d1n5/PESeuD4QJFNqW+zkmE4tmyTZ6bPV6T5xdDRHeiITGc00AEQEAAbQkRWR1YXJkIFppbmdlcm1hbiA8ZWRkeXo4N0BnbWFpbC5jb20+iQHUBBMBCgA+FiEEx+6LrjApQyqnXCYELgxleklgRAkFAmKNNQECGwMFCQPCZwAFCwkIBwIGFQoJCAsCBBYCAwECHgECF4AACgkQLgxleklgRAlWZAv/cJ5v3zlEyP0/jMKQBqbVCCHTirPEw+nqxbkeSO6r2FUds0NnGA9a6NPOpBH+qW7a6+n6q3sIbvH7jlss4pzLI7LYlDC6z+egTv7KR5X1xFrY1uR5UGs1beAjnzYeV2hK4yqRUfygsT0Wk5e4FiNBv4+DUZ8r0cNDkO6swJxU55DO21mcteC147+4aDoHZ40R0tsAu+brDGSSoOPpb0RWVsEf9XOBJqWWA+T7mluw nYzhLWGcczc6J71q1Dje0l5vIPaSFOgwmWD4DA+WvuxM/shH4rtWeodbv iCTce6yYIygHgUAtJcHozAlgRrL0jz44cggBTcoeXp/atckXK546OugZPnl00J3qmm5uWAznU6T5YDv2vCvAMEbz69ib+kHtnOSBvR0Jb86UZZqSb4ATfwMOWe9htGTjKMb0QQOLK0mTcrk/TtymaG+T4Fsos0kgrxqjgfrxxEhYcVNW8v8HISmFGFbqsJmFbVtgk68BcU0wgF8oFxo7u+XYQDdKbI1uQGNBGKNNQEBDADbQIdo8L3sdSWGQtu+LnFqCZoAbYurZCmUjLV3df1b+sg+GJZvVTmMZnzDP/ADufcbjopBBjGTRAY4L76T2niu2EpjclMMM3mtrOc738Kr3+RvPjUupdkZ1ZEZaWpf4cZm+4wH5GUfyu5pmD5WXX2i1r9XaUjeVtebvbuXWmWI1ZDTfOkiz/6Z0GDSeQeEqx2PXYBcepU7S9UNWttDtiZ0+IH4DZcvyKPUcK3tOj4u8GvO3RnOrglERzNCM/WhVdG1+vgU9fXO83TB/PcfAsvxYSie7u792s/I+yA4XKKh82PSTvTzg2/4vEDGpI9yubkfXRkQN28w+HKF5qoRB8/L1ZW/brlXkNzA6SveJhCnH7aOF0Yezl6TfX27w1CW5Xmvfi7X33V/SPvo0tY1THrO1c+bOjt5F+2/K3tvejmXMS/I6URwa8n1e767y5ErFKyXAYRweE9zarEgpNZTuSIGNNAqK+SiLLXt51G7P30TVavIeB6s2lCt1QKt62ccLqUAEQEAAYkBvAQYAQoAJhYhBMfui64wKUMqp1wmBC4MZXpJYEQJBQJijTUBAhsMBQkDwmcAAAoJEC4MZXpJYEQJkRAMAKNvWVwtXm/WxWoiLnXyF2WGXKoDe5+itTLvBmKcV/b1OKZF1s90V7WfSBz712eFAynEzyeezPbwU8QBiTpZcHXwQni3IYKvsh7s t1iq+gsfnXbPz5AnS598ScZI1oP7OrPSFJkt/z4acEbOQDQs8aUqrd46PV jsdqGvKnXZxzylux29UTNby4jTlz9pNJM+wPrDRmGfchLDUmf6CffaUYCbu4FiId+9+dcTCDvxbABRy1C3OJ8QY7cxfJ+pEZW18fRJ0XCl/fiV/ecAOfB3HsqgTzAn555h0rkFgay0hAvMU/mAW/CFNSIxV397zm749ZNLA0L2dMy1AKuOqH+/B+/ImBfJMDjmdyJQ8WU/OFRuGLdqOd2oZrA1iuPIa+yUYyZkaZfz/emQwpIL1+Q4p1R/OplA4yc301AqruXXUcVDbEB+joHW3hy5FwK5t5OwTKatrSJBkydSF9zdXy98fYzGniRyRA65P0Ix/8J3BYB4edY2/w0Ip/mdYsYQljBY0A== Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.50.2 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 On Mon, 2023-12-25 at 12:33 -0800, Alexei Starovoitov wrote: [...] > It turned out there are indeed a bunch of redundant shifts > when u32 or s32 is passed into "r" asm constraint. >=20 > Strangely the shifts are there when compiled with -mcpu=3Dv3 or v4 > and no shifts with -mcpu=3Dv1 and v2. >=20 > Also weird that u8 and u16 are passed into "r" without redundant shifts. > Hence I found a "workaround": cast u32 into u16 while passing. > The truncation of u32 doesn't happen and shifts to zero upper 32-bit > are gone as well. >=20 > https://godbolt.org/z/Kqszr6q3v Regarding unnecessary shifts. Sorry, a long email about minor feature/defect. So, currently the following C program (and it's variations with implicit casts): extern unsigned long bar(void); void foo(void) { asm volatile ("%[reg] +=3D 1"::[reg]"r"((unsigned)bar())); } Is translated to the following BPF: $ clang -mcpu=3Dv3 -O2 --target=3Dbpf -mcpu=3Dv3 -c -o - t.c | llvm-obj= dump --no-show-raw-insn -d - =20 : file format elf64-bpf =20 Disassembly of section .text: =20 0000000000000000 : 0: call -0x1 1: r0 <<=3D 0x20 2: r0 >>=3D 0x20 3: r0 +=3D 0x1 4: exit =20 Note: no additional shifts are generated when "w" (32-bit register) constraint is used instead of "r". First, is this right or wrong? ------------------------------ C language spec [1] paragraph 6.5.4.6 (Cast operators -> Semantics) says the following: If the value of the expression is represented with greater range or precision than required by the type named by the cast (6.3.1.8), then the cast specifies a conversion even if the type of the expression is the same as the named type and removes any extra range and precision. ^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^ =20 What other LLVM backends do in such situations? Consider the following program translated to amd64 [2] and aarch64 [3]: void foo(void) { asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned long) bar())); = // 1 asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned int) bar())); = // 2 asm volatile("mov %[reg],%[reg]"::[reg]"r"((unsigned short) bar())); = // 3 } - for amd64 register of proper size is selected for `reg`; - for aarch64 warnings about wrong operand size are emitted at (2) and (3) and 64-bit register is used w/o generating any additional instructions. (Note, however, that 'arm' silently ignores the issue and uses 32-bit registers for all three points). So, it looks like that something of this sort should be done: - either extra precision should be removed via additional instructions; - or 32-bit register should be picked for `reg`; - or warning should be emitted as in aarch64 case. [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf [2] https://godbolt.org/z/9nKxaMc5j [3] https://godbolt.org/z/1zxEr5b3f Second, what to do? ------------------- I think that the following steps are needed: - Investigation described in the next section shows that currently two shifts are generated accidentally w/o real intent to shed precision. I have a patch [6] that removes shifts generation, it should be applied. - When 32-bit value is passed to "r" constraint: - for cpu v3/v4 a 32-bit register should be selected; - for cpu v1/v2 a warning should be reported. Third, why two shifts are generated? ------------------------------------ (Details here might be interesting to Yonghong, regular reader could skip this section). The two shifts are result of interaction between two IR constructs `trunc` and `asm`. The C program above looks as follows in LLVM IR before machine code generation: declare dso_local i64 @bar() define dso_local void @foo(i32 %p) { entry: %call =3D call i64 @bar() %v32 =3D trunc i64 %call to i32 tail call void asm sideeffect "$0 +=3D 1", "r"(i32 %v32) ret void } Initial selection DAG: $ llc -debug-only=3Disel -march=3Dbpf -mcpu=3Dv3 --filetype=3Dasm -o - = t.ll SelectionDAG has 21 nodes: ... t10: i64,ch,glue =3D CopyFromReg t8, Register:i64 $r0, t8:1 ! t11: i32 =3D truncate t10 ! t15: i64 =3D zero_extend t11 t17: ch,glue =3D CopyToReg t10:1, Register:i64 %1, t15 t19: ch,glue =3D inlineasm t17, TargetExternalSymbol:i64'$0 +=3D 1'= , MDNode:ch, TargetConstant:i64<1>, TargetConstant:i32<131081>,= Register:i64 %1, t17:1 ... Note values t11 and t15 marked with (!). Optimized lowered selection DAG for this fragment: t10: i64,ch,glue =3D CopyFromReg t8, Register:i64 $r0, t8:1 ! t22: i64 =3D and t10, Constant:i64<4294967295> t17: ch,glue =3D CopyToReg t10:1, Register:i64 %1, t22 t19: ch,glue =3D inlineasm t17, TargetExternalSymbol:i64'$0 +=3D 1', = MDNode:ch, TargetConstant:i64<1>, TargetConstant:i32<131081>, R= egister:i64 %1, t17:1 Note (zext (truncate ...)) converted to (and ... 0xffff_ffff). DAG after instruction selection: t10: i64,ch,glue =3D CopyFromReg t8:1, Register:i64 $r0, t8:2 ! t25: i64 =3D SLL_ri t10, TargetConstant:i64<32> ! t22: i64 =3D SRL_ri t25, TargetConstant:i64<32> t17: ch,glue =3D CopyToReg t10:1, Register:i64 %1, t22 t23: ch,glue =3D inlineasm t17, TargetExternalSymbol:i64'$0 +=3D 1', = MDNode:ch, TargetConstant:i64<1>, TargetConstant:i32<131081>, R= egister:i64 %1, t17:1 Note (and ... 0xffff_ffff) converted to (SRL_ri (SLL_ri ...)). This happens because of the following pattern from BPFInstrInfo.td: // 0xffffFFFF doesn't fit into simm32, optimize common case def : Pat<(i64 (and (i64 GPR:$src), 0xffffFFFF)), (SRL_ri (SLL_ri (i64 GPR:$src), 32), 32)>; So, the two shift instructions are result of translation of (zext (trunc ..= .)). However, closer examination shows that zext DAG node was generated almost by accident. Here is the backtrace for when this node was created: Breakpoint 1, llvm::SelectionDAG::getNode (... Opcode=3D202) ;; 202 is = opcode for ZERO_EXTEND at .../SelectionDAG.cpp:5605 (gdb) bt #0 llvm::SelectionDAG::getNode (...) at ...SelectionDAG.cpp:5605 #1 0x... in getCopyToParts (..., ExtendKind=3Dllvm::ISD::ZERO_EXTEND) at .../SelectionDAGBuilder.cpp:537 #2 0x... in llvm::RegsForValue::getCopyToRegs (... PreferredExtendType= =3Dllvm::ISD::ANY_EXTEND) at .../SelectionDAGBuilder.cpp:958 #3 0x... in llvm::SelectionDAGBuilder::visitInlineAsm(...) at .../SelectionDAGBuilder.cpp:9640 ... The stack frame #2 is interesting, here is the code for it [4]: void RegsForValue::getCopyToRegs(SDValue Val, SelectionDAG &DAG, const SDLoc &dl, SDValue &Chain, SDVal= ue *Glue, const Value *V, ISD::NodeType PreferredExtendType) con= st { ^ '-- this is ANY_EXTEND ... for (unsigned Value =3D 0, Part =3D 0, e =3D ValueVTs.size(); Value != =3D e; ++Value) { ... .-- this returns true v if (ExtendKind =3D=3D ISD::ANY_EXTEND && TLI.isZExtFree(Val, Regist= erVT)) ExtendKind =3D ISD::ZERO_EXTEND; =20 .-- this is ZERO_EXTEND v getCopyToParts(..., ExtendKind); Part +=3D NumParts; } ... } The getCopyToRegs() function was called with ANY_EXTEND preference, but switched to ZERO_EXTEND because TLI.isZExtFree() currently returns true for any 32 to 64-bit conversion [5]. However, in this case this is clearly a mistake, as zero extension of=20 (zext i64 (truncate i32 ...)) costs two instructions. The isZExtFree() behavior could be changed to report false for such situations, as in my patch [6]. This in turn removes zext =3D> removes two shifts from final asm. Here is how DAG/asm look after patch [6]: Initial selection DAG: ... t10: i64,ch,glue =3D CopyFromReg t8, Register:i64 $r0, t8:1 ! t11: i32 =3D truncate t10 t16: ch,glue =3D CopyToReg t10:1, Register:i64 %1, t10 t18: ch,glue =3D inlineasm t16, TargetExternalSymbol:i64'$0 +=3D 1'= , MDNode:ch, TargetConstant:i64<1>, TargetConstant:i32<131081>,= Register:i64 %1, t16:1 ... =20 Final asm: ... # %bb.0: call bar #APP r0 +=3D 1 #NO_APP exit ... =20 Note that [6] is a very minor change, it does not affect code generation for selftests at all and I was unable to conjure examples where it has effect aside from inline asm parameters. [4] https://github.com/llvm/llvm-project/blob/365fbbfbcfefb8766f7716109b9c3= 767b58e6058/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L937C10-L= 937C10 [5] https://github.com/llvm/llvm-project/blob/6f4cc1310b12bc59210e4596a895d= b4cb9ad6075/llvm/lib/Target/BPF/BPFISelLowering.cpp#L213C21-L213C21 [6] https://github.com/llvm/llvm-project/commit/cf8e142e5eac089cc786c671a40= fef022d08b0ef