From: Jie Meng <jmeng@fb.com>
To: KP Singh <kpsingh@kernel.org>
Cc: <bpf@vger.kernel.org>, <ast@kernel.org>, <andrii@kernel.org>,
<daniel@iogearbox.net>
Subject: Re: [PATCH bpf-next v4 2/3] bpf,x64: use shrx/sarx/shlx when available
Date: Thu, 6 Oct 2022 20:14:35 -0700 [thread overview]
Message-ID: <Yz+Zm6qar+nWrLZs@fb.com> (raw)
In-Reply-To: <CACYkzJ4RvEZVp5-sybdn2tOuV-h6KyGJRjvEMZWBoqTBVrK1aQ@mail.gmail.com>
On Wed, Oct 05, 2022 at 09:11:01PM -0700, KP Singh wrote:
> On Sat, Oct 1, 2022 at 10:12 PM Jie Meng <jmeng@fb.com> wrote:
> >
> > Instead of shr/sar/shl that implicitly use %cl, emit their more flexible
> > alternatives provided in BMI2 when advantageous; keep using the non BMI2
> > instructions when shift count is already in BPF_REG_4/rcx as non BMI2
> > instructions are shorter.
>
> This is confusing, you mention %CL in the first sentence and then RCX in the
> second sentence. Can you clarify this more here?
%cl is the lowest 8 bit of %rcx. In assembly, non BMI2 shifts with shift
count in register is written as
SHR eax, cl
Although the use of CL is mandatory and if the shift count is in another
register it has to be moved into RCX first, unless of course when the
shift count is already in BPF_REG_4, which is mapped to RCX in x86-64.
It is indeed awkward but exactly what one would see in assembly: a MOV
to RCX and a shift that uses CL as the source register.
>
> Also, It would be good to have some explanations about the
> performance benefits here as well.
>
> i.e. a load + store + non vector instruction v/s a single vector instruction
> omitting the load. How many cycles do we expect in each case, I do expect the
> latter to be lesser, but mentioning it in the commit removes any ambiguity.
Although it uses similar encoding as AVX instructions BMI2 actually
operates on general purpose registers and no vector register is ever
involved [1]. Inside a CPU all shifts instructions (both baseline and BMI2
flavors) are almost always handled by the same units and have the same
latency and throughput [2].
[1] https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
[2] https://www.agner.org/optimize/instruction_tables.pdf
>
> >
> > To summarize, when BMI2 is available:
> > -------------------------------------------------
> > | arbitrary dst
> > =================================================
> > src == ecx | shl dst, cl
> > -------------------------------------------------
> > src != ecx | shlx dst, dst, src
> > -------------------------------------------------
> >
> > A concrete example between non BMI2 and BMI2 codegen. To shift %rsi by
> > %rdi:
> >
> > Without BMI2:
> >
> > ef3: push %rcx
> > 51
> > ef4: mov %rdi,%rcx
> > 48 89 f9
> > ef7: shl %cl,%rsi
> > 48 d3 e6
> > efa: pop %rcx
> > 59
> >
> > With BMI2:
> >
> > f0b: shlx %rdi,%rsi,%rsi
> > c4 e2 c1 f7 f6
> >
> > Signed-off-by: Jie Meng <jmeng@fb.com>
> > ---
> > arch/x86/net/bpf_jit_comp.c | 64 +++++++++++++++++++++++++++++++++++++
> > 1 file changed, 64 insertions(+)
> >
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index d9ba997c5891..d09c54f3d2e0 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -889,6 +889,48 @@ static void emit_nops(u8 **pprog, int len)
> > *pprog = prog;
> > }
> >
> > +/* emit the 3-byte VEX prefix */
> > +static void emit_3vex(u8 **pprog, bool r, bool x, bool b, u8 m,
> > + bool w, u8 src_reg2, bool l, u8 p)
>
> Can you please use somewhat more descriptive variable names here?
>
> or add more information about what x, b, m, w, l and p mean?
Apart from src_reg2, the rest is the same as what Intel has chosen to
name the various fields in the VEX prefix. Would rather keep them
consistent so that people won't get confused when comparing with other
documents across the Internet.
>
> > +{
> > + u8 *prog = *pprog;
> > + u8 b0 = 0xc4, b1, b2;
> > + u8 src2 = reg2hex[src_reg2];
> > +
> > + if (is_ereg(src_reg2))
> > + src2 |= 1 << 3;
> > +
> > + /*
> > + * 7 0
> > + * +---+---+---+---+---+---+---+---+
> > + * |~R |~X |~B | m |
> > + * +---+---+---+---+---+---+---+---+
> > + */
> > + b1 = (!r << 7) | (!x << 6) | (!b << 5) | (m & 0x1f);
>
> Some explanation here would help, not everyone is aware of x86 vex encoding :)
The comment is the exact rule how different pieces of information is
encoded into the 3-byte VEX prefix i.e. their position and length, and
whether a field needs to be bit inverted. Combined with code the comment
should give one clear idea what the intent is here.
>
> > + /*
> > + * 7 0
> > + * +---+---+---+---+---+---+---+---+
> > + * | W | ~vvvv | L | pp |
> > + * +---+---+---+---+---+---+---+---+
> > + */
> > + b2 = (w << 7) | ((~src2 & 0xf) << 3) | (l << 2) | (p & 3);
> > +
> > + EMIT3(b0, b1, b2);
> > + *pprog = prog;
> > +}
> > +
> > +/* emit BMI2 shift instruction */
> > +static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
> > +{
> > + u8 *prog = *pprog;
> > + bool r = is_ereg(dst_reg);
> > + u8 m = 2; /* escape code 0f38 */
> > +
> > + emit_3vex(&prog, r, false, r, m, is64, src_reg, false, op);
> > + EMIT2(0xf7, add_2reg(0xC0, dst_reg, dst_reg));
> > + *pprog = prog;
> > +}
> > +
> > #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
> >
> > static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
> > @@ -1135,6 +1177,28 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> > case BPF_ALU64 | BPF_LSH | BPF_X:
> > case BPF_ALU64 | BPF_RSH | BPF_X:
> > case BPF_ALU64 | BPF_ARSH | BPF_X:
> > + /* BMI2 shifts aren't better when shift count is already in rcx */
> > + if (boot_cpu_has(X86_FEATURE_BMI2) && src_reg != BPF_REG_4) {
> > + /* shrx/sarx/shlx dst_reg, dst_reg, src_reg */
> > + bool w = (BPF_CLASS(insn->code) == BPF_ALU64);
> > + u8 op;
> > +
> > + switch (BPF_OP(insn->code)) {
> > + case BPF_LSH:
> > + op = 1; /* prefix 0x66 */
> > + break;
> > + case BPF_RSH:
> > + op = 3; /* prefix 0xf2 */
> > + break;
> > + case BPF_ARSH:
> > + op = 2; /* prefix 0xf3 */
> > + break;
> > + }
> > +
> > + emit_shiftx(&prog, dst_reg, src_reg, w, op);
> > +
> > + break;
> > + }
> >
> > if (src_reg != BPF_REG_4) { /* common case */
> > /* Check for bad case when dst_reg == rcx */
> > --
> > 2.30.2
> >
next prev parent reply other threads:[~2022-10-07 3:14 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-21 2:21 [PATCH bpf-next] bpf,x64: use shrx/sarx/shlx when available Jie Meng
2022-09-22 15:07 ` Daniel Borkmann
2022-09-24 0:32 ` [PATCH bpf-next v2 0/2] bpf,x64: Use BMI2 for shifts Jie Meng
2022-09-24 0:32 ` [PATCH bpf-next v2 1/2] bpf,x64: use shrx/sarx/shlx when available Jie Meng
2022-09-26 19:16 ` Daniel Borkmann
2022-09-27 0:38 ` Jie Meng
2022-09-27 9:45 ` Daniel Borkmann
2022-09-27 18:57 ` [PATCH bpf-next v3 0/3] bpf,x64: Use BMI2 for shifts Jie Meng
2022-09-27 18:57 ` [PATCH bpf-next v3 1/3] bpf,x64: avoid unnecessary instructions when shift dest is ecx Jie Meng
2022-09-27 18:58 ` [PATCH bpf-next v3 2/3] bpf,x64: use shrx/sarx/shlx when available Jie Meng
2022-09-27 18:58 ` [PATCH bpf-next v3 3/3] bpf: add selftests for lsh, rsh, arsh with reg operand Jie Meng
2022-10-02 5:11 ` [PATCH bpf-next v4 0/3] bpf,x64: Use BMI2 for shifts Jie Meng
2022-10-02 5:11 ` [PATCH bpf-next v4 1/3] bpf,x64: avoid unnecessary instructions when shift dest is ecx Jie Meng
2022-10-02 5:11 ` [PATCH bpf-next v4 2/3] bpf,x64: use shrx/sarx/shlx when available Jie Meng
2022-10-06 4:11 ` KP Singh
2022-10-07 3:14 ` Jie Meng [this message]
2022-10-07 18:11 ` KP Singh
2022-10-07 20:23 ` [PATCH bpf-next v5 0/3] bpf,x64: Use BMI2 for shifts Jie Meng
2022-10-20 0:00 ` patchwork-bot+netdevbpf
2022-10-07 20:23 ` [PATCH bpf-next v5 1/3] bpf,x64: avoid unnecessary instructions when shift dest is ecx Jie Meng
2022-10-07 20:23 ` [PATCH bpf-next v5 2/3] bpf,x64: use shrx/sarx/shlx when available Jie Meng
2022-10-07 20:23 ` [PATCH bpf-next v5 3/3] bpf: add selftests for lsh, rsh, arsh with reg operand Jie Meng
2022-10-02 5:11 ` [PATCH bpf-next v4 " Jie Meng
2022-09-24 0:32 ` [PATCH bpf-next v2 2/2] bpf: Add " Jie Meng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yz+Zm6qar+nWrLZs@fb.com \
--to=jmeng@fb.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kpsingh@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox