From: "Harris, James R" <james.r.harris@intel.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>, bpf <bpf@vger.kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Dave Thaler <dthaler@microsoft.com>,
KP Singh <kpsingh@kernel.org>,
Brendan Gregg <brendan.d.gregg@gmail.com>,
Joe Stringer <joe@cilium.io>
Subject: Re: LSF/MM session: eBPF standardization
Date: Tue, 17 May 2022 22:29:37 +0000
Message-ID: <8DA9E260-AE56-4B21-90BF-CF0049CFD04D@intel.com>
In-Reply-To: <20220517091011.GA18723@lst.de>
> On May 17, 2022, at 2:10 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 11, 2022 at 07:39:34PM -0700, Alexei Starovoitov wrote:
>> Turns out that clang -mcpu=v1,v2,v3 are not exactly correct.
>> We've extended the ISA more than three times.
>> For example when we added more atomics insns in
>> https://lore.kernel.org/bpf/20210114181751.768687-1-jackmanb@google.com/
>>
>> The corresponding llvm diff didn't add a new -mcpu flavor.
>> There was no need to do it.
>
> ... because all uses of these new instructions are through builtins
> that wouldn't otherwise be available, yes.
>
>> Also, llvm flags can turn a subset of insns on and off.
>> For example, llvm can turn on alu32 but without the <,<= insns.
>> -mcpu=v3 includes both; -mcpu=v2 has only <,<=.
>>
>> So we need a plan B.
>
> Yes.
>
>> How about using the commit sha where support was added to the verifier
>> as a 'version' of the ISA ?
>>
>> We can try to use a kernel version, but backports
>> will ruin that picture.
>> Looks like upstream 'commit sha' is the only stable number.
>
> Using kernel commit hashes is a pretty horrible interface, especially
> for non-kernel users. I also think the compilers and other tools
> would really like some vaguely meaningful identifiers.
>
>> Another approach would be to declare the current ISA as
>> 1.0 (or bpf-isa-may-2022) and
>> in the future bump it with every new insn.
>
> I think that is a much more reasonable starting position. However, are
> we sure the ISA actually evolves linearly? As far as I can tell support
> for full atomics only exists for a few JITs so far.
>
> So maybe starting with a baseline, and then just having a name for
> each meaningful extension (e.g. the full atomics as a start) might be
> even better. For the Linux kernel case we can then also have a user
> interface where userspace programs can query which extensions are
> supported before loading eBPF programs that rely on them instead of
> doing a roundtrip through the verifier.
>
>>> - we need to decide what to do about the legacy BPF packet access
>>> instructions. Alexei mentioned that the modern JIT doesn't
>>> even use those internally any more.
>>
>> I think we need to document them as supported in the linux kernel,
>> but deprecated in general.
>> The standard might say "implementation defined" meaning that
>> different run-times don't have to support them.
>
> Yeah. If we do the extensions proposal above we could make these
> a specific extension as well.
>
>> [...]
>> I don't think it's worth documenting all that.
>> I would group all undefined/underflow/overflow as implementation
>> defined and document only things that matter.
>
> Makes sense.
>
>>> Discussion on where to host a definitive version of the document:
>>>
>>> - I think the rough consensus is to just host regular (hopefully
>>> low-cadence) documents and maybe the latest and greatest at an eBPF
>>> Foundation website. Whom do we need to work with at the foundation
>>> to make this happen?
>>
>> foundation folks cc-ed.
>
> I'd be really glad if we could kick this off ASAP. Feel free to contact
> me privately if we want to keep it off the list.
>
>>> - as an idea it was brought up to write a document with the minimal
>>> verification requirements required for any eBPF implementation,
>>> independent of the program type. Again I can volunteer to
>>> draft a document, but I need input on what such a consensus
>>> would be. In this case input from the non-Linux verifier
>>> implementers (I only know the Microsoft research one) would
>>> be very helpful as well.
>>
>> The verifier is a moving target.
>
> Absolutely.
>
>> I'd say minimal verification is the one that checks that:
>> - instructions are formed correctly
>> - opcode is valid
>> - no reserved bits are used
>> - registers are within range (r11+ are not used)
>> - combination of opcode+regs+off+imm is valid
>> - simple things like that
>
> Sounds good. One useful thing for this would be an opcode table
> with all the optional field usage in machine-readable format.
>
> Jim, who is on CC, has already built a nice table of all opcodes based
> on existing material that might be a good starting point.
The table is inline below. I used the tables in the iovisor project
documentation (https://github.com/iovisor/bpf-docs/blob/master/eBPF.md)
as a starting point. Feedback welcome; the atomic sections at the bottom
could especially use careful review for correctness.
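For reference, every "opc" value in the tables below is the code byte of
the kernel's fixed 64-bit instruction layout (struct bpf_insn in
include/uapi/linux/bpf.h), which looks like this:

    /* 64-bit eBPF instruction layout, from include/uapi/linux/bpf.h */
    struct bpf_insn {
            __u8    code;           /* opcode ("opc" column below) */
            __u8    dst_reg:4;      /* destination register */
            __u8    src_reg:4;      /* source register */
            __s16   off;            /* signed offset */
            __s32   imm;            /* signed immediate */
    };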
BPF_ALU64 (opc & 0x07 == 0x07)
==============================
0x07 BPF_ALU64 | BPF_K | BPF_ADD bpf_add dst, imm dst += imm
0x0f BPF_ALU64 | BPF_X | BPF_ADD bpf_add dst, src dst += src
0x17 BPF_ALU64 | BPF_K | BPF_SUB bpf_sub dst, imm dst -= imm
0x1f BPF_ALU64 | BPF_X | BPF_SUB bpf_sub dst, src dst -= src
0x27 BPF_ALU64 | BPF_K | BPF_MUL bpf_mul dst, imm dst *= imm
0x2f BPF_ALU64 | BPF_X | BPF_MUL bpf_mul dst, src dst *= src
0x37 BPF_ALU64 | BPF_K | BPF_DIV bpf_div dst, imm dst /= imm
0x3f BPF_ALU64 | BPF_X | BPF_DIV bpf_div dst, src dst /= src
0x47 BPF_ALU64 | BPF_K | BPF_OR bpf_or dst, imm dst |= imm
0x4f BPF_ALU64 | BPF_X | BPF_OR bpf_or dst, src dst |= src
0x57 BPF_ALU64 | BPF_K | BPF_AND bpf_and dst, imm dst &= imm
0x5f BPF_ALU64 | BPF_X | BPF_AND bpf_and dst, src dst &= src
0x67 BPF_ALU64 | BPF_K | BPF_LSH bpf_lsh dst, imm dst <<= imm
0x6f BPF_ALU64 | BPF_X | BPF_LSH bpf_lsh dst, src dst <<= src
0x77 BPF_ALU64 | BPF_K | BPF_RSH bpf_rsh dst, imm dst >>= imm (logical)
0x7f BPF_ALU64 | BPF_X | BPF_RSH bpf_rsh dst, src dst >>= src (logical)
0x87 BPF_ALU64 | BPF_K | BPF_NEG bpf_neg dst dst = -dst
0x97 BPF_ALU64 | BPF_K | BPF_MOD bpf_mod dst, imm dst %= imm
0x9f BPF_ALU64 | BPF_X | BPF_MOD bpf_mod dst, src dst %= src
0xa7 BPF_ALU64 | BPF_K | BPF_XOR bpf_xor dst, imm dst ^= imm
0xaf BPF_ALU64 | BPF_X | BPF_XOR bpf_xor dst, src dst ^= src
0xb7 BPF_ALU64 | BPF_K | BPF_MOV bpf_mov dst, imm dst = imm
0xbf BPF_ALU64 | BPF_X | BPF_MOV bpf_mov dst, src dst = src
0xc7 BPF_ALU64 | BPF_K | BPF_ARSH bpf_arsh dst, imm dst >>= imm (arithmetic)
0xcf BPF_ALU64 | BPF_X | BPF_ARSH bpf_arsh dst, src dst >>= src (arithmetic)
BPF_ALU (opc & 0x07 == 0x04)
============================
0x04 BPF_ALU | BPF_K | BPF_ADD bpf_add32 dst, imm dst32 += imm
0x0c BPF_ALU | BPF_X | BPF_ADD bpf_add32 dst, src dst32 += src32
0x14 BPF_ALU | BPF_K | BPF_SUB bpf_sub32 dst, imm dst32 -= imm
0x1c BPF_ALU | BPF_X | BPF_SUB bpf_sub32 dst, src dst32 -= src32
0x24 BPF_ALU | BPF_K | BPF_MUL bpf_mul32 dst, imm dst32 *= imm
0x2c BPF_ALU | BPF_X | BPF_MUL bpf_mul32 dst, src dst32 *= src32
0x34 BPF_ALU | BPF_K | BPF_DIV bpf_div32 dst, imm dst32 /= imm
0x3c BPF_ALU | BPF_X | BPF_DIV bpf_div32 dst, src dst32 /= src32
0x44 BPF_ALU | BPF_K | BPF_OR bpf_or32 dst, imm dst32 |= imm
0x4c BPF_ALU | BPF_X | BPF_OR bpf_or32 dst, src dst32 |= src32
0x54 BPF_ALU | BPF_K | BPF_AND bpf_and32 dst, imm dst32 &= imm
0x5c BPF_ALU | BPF_X | BPF_AND bpf_and32 dst, src dst32 &= src32
0x64 BPF_ALU | BPF_K | BPF_LSH bpf_lsh32 dst, imm dst32 <<= imm
0x6c BPF_ALU | BPF_X | BPF_LSH bpf_lsh32 dst, src dst32 <<= src32
0x74 BPF_ALU | BPF_K | BPF_RSH bpf_rsh32 dst, imm dst32 >>= imm (logical)
0x7c BPF_ALU | BPF_X | BPF_RSH bpf_rsh32 dst, src dst32 >>= src32 (logical)
0x84 BPF_ALU | BPF_K | BPF_NEG bpf_neg32 dst dst32 = -dst32
0x94 BPF_ALU | BPF_K | BPF_MOD bpf_mod32 dst, imm dst32 %= imm
0x9c BPF_ALU | BPF_X | BPF_MOD bpf_mod32 dst, src dst32 %= src32
0xa4 BPF_ALU | BPF_K | BPF_XOR bpf_xor32 dst, imm dst32 ^= imm
0xac BPF_ALU | BPF_X | BPF_XOR bpf_xor32 dst, src dst32 ^= src32
0xb4 BPF_ALU | BPF_K | BPF_MOV bpf_mov32 dst, imm dst32 = imm
0xbc BPF_ALU | BPF_X | BPF_MOV bpf_mov32 dst, src dst32 = src32
0xc4 BPF_ALU | BPF_K | BPF_ARSH bpf_arsh32 dst, imm dst32 >>= imm (arithmetic)
0xcc BPF_ALU | BPF_X | BPF_ARSH bpf_arsh32 dst, src dst32 >>= src32 (arithmetic)
For BPF_END instructions, BPF_K selects conversion to little-endian (htole) and BPF_X selects conversion to big-endian (htobe).
The operation width (16, 32, or 64 bits) is given by the instruction's 'imm' field (the upper 32 bits of the 64-bit encoding).
0xd4 (.imm = 16) BPF_ALU | BPF_K | BPF_END bpf_le16 dst dst16 = htole16(dst16)
0xd4 (.imm = 32) BPF_ALU | BPF_K | BPF_END bpf_le32 dst dst32 = htole32(dst32)
0xd4 (.imm = 64) BPF_ALU | BPF_K | BPF_END bpf_le64 dst dst = htole64(dst)
0xdc (.imm = 16) BPF_ALU | BPF_X | BPF_END bpf_be16 dst dst16 = htobe16(dst16)
0xdc (.imm = 32) BPF_ALU | BPF_X | BPF_END bpf_be32 dst dst32 = htobe32(dst32)
0xdc (.imm = 64) BPF_ALU | BPF_X | BPF_END bpf_be64 dst dst = htobe64(dst)
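As a concrete example of the width selection, a bpf_be16 on r1 would be
encoded (using the struct layout shown above) roughly as:

    /* bpf_be16 r1: dst16 = htobe16(dst16) */
    struct bpf_insn be16 = {
            .code    = 0xdc,        /* BPF_ALU | BPF_X | BPF_END */
            .dst_reg = 1,
            .imm     = 16,          /* operation width in bits */
    };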
BPF_JMP (opc & 0x07 == 0x05)
============================
0x05 BPF_JMP | BPF_K | BPF_JA bpf_ja +off PC += off
0x15 BPF_JMP | BPF_K | BPF_JEQ bpf_jeq dst, imm, +off PC += off if dst == imm
0x1d BPF_JMP | BPF_X | BPF_JEQ bpf_jeq dst, src, +off PC += off if dst == src
0x25 BPF_JMP | BPF_K | BPF_JGT bpf_jgt dst, imm, +off PC += off if dst > imm
0x2d BPF_JMP | BPF_X | BPF_JGT bpf_jgt dst, src, +off PC += off if dst > src
0x35 BPF_JMP | BPF_K | BPF_JGE bpf_jge dst, imm, +off PC += off if dst >= imm
0x3d BPF_JMP | BPF_X | BPF_JGE bpf_jge dst, src, +off PC += off if dst >= src
0x45 BPF_JMP | BPF_K | BPF_JSET bpf_jset dst, imm, +off PC += off if dst & imm
0x4d BPF_JMP | BPF_X | BPF_JSET bpf_jset dst, src, +off PC += off if dst & src
0x55 BPF_JMP | BPF_K | BPF_JNE bpf_jne dst, imm, +off PC += off if dst != imm
0x5d BPF_JMP | BPF_X | BPF_JNE bpf_jne dst, src, +off PC += off if dst != src
0x65 BPF_JMP | BPF_K | BPF_JSGT bpf_jsgt dst, imm, +off PC += off if dst > imm (signed)
0x6d BPF_JMP | BPF_X | BPF_JSGT bpf_jsgt dst, src, +off PC += off if dst > src (signed)
0x75 BPF_JMP | BPF_K | BPF_JSGE bpf_jsge dst, imm, +off PC += off if dst >= imm (signed)
0x7d BPF_JMP | BPF_X | BPF_JSGE bpf_jsge dst, src, +off PC += off if dst >= src (signed)
0x85 BPF_JMP | BPF_K | BPF_CALL bpf_call imm Function call
0x95 BPF_JMP | BPF_K | BPF_EXIT bpf_exit return r0
0xa5 BPF_JMP | BPF_K | BPF_JLT bpf_jlt dst, imm, +off PC += off if dst < imm
0xad BPF_JMP | BPF_X | BPF_JLT bpf_jlt dst, src, +off PC += off if dst < src
0xb5 BPF_JMP | BPF_K | BPF_JLE bpf_jle dst, imm, +off PC += off if dst <= imm
0xbd BPF_JMP | BPF_X | BPF_JLE bpf_jle dst, src, +off PC += off if dst <= src
0xc5 BPF_JMP | BPF_K | BPF_JSLT bpf_jslt dst, imm, +off PC += off if dst < imm (signed)
0xcd BPF_JMP | BPF_X | BPF_JSLT bpf_jslt dst, src, +off PC += off if dst < src (signed)
0xd5 BPF_JMP | BPF_K | BPF_JSLE bpf_jsle dst, imm, +off PC += off if dst <= imm (signed)
0xdd BPF_JMP | BPF_X | BPF_JSLE bpf_jsle dst, src, +off PC += off if dst <= src (signed)
BPF_JMP32 (opc & 0x07 == 0x06)
==============================
0x16 BPF_JMP32 | BPF_K | BPF_JEQ bpf_jeq32 dst, imm, +off PC += off if dst32 == imm
0x1e BPF_JMP32 | BPF_X | BPF_JEQ bpf_jeq32 dst, src, +off PC += off if dst32 == src32
0x26 BPF_JMP32 | BPF_K | BPF_JGT bpf_jgt32 dst, imm, +off PC += off if dst32 > imm
0x2e BPF_JMP32 | BPF_X | BPF_JGT bpf_jgt32 dst, src, +off PC += off if dst32 > src32
0x36 BPF_JMP32 | BPF_K | BPF_JGE bpf_jge32 dst, imm, +off PC += off if dst32 >= imm
0x3e BPF_JMP32 | BPF_X | BPF_JGE bpf_jge32 dst, src, +off PC += off if dst32 >= src32
0x46 BPF_JMP32 | BPF_K | BPF_JSET bpf_jset32 dst, imm, +off PC += off if dst32 & imm
0x4e BPF_JMP32 | BPF_X | BPF_JSET bpf_jset32 dst, src, +off PC += off if dst32 & src32
0x56 BPF_JMP32 | BPF_K | BPF_JNE bpf_jne32 dst, imm, +off PC += off if dst32 != imm
0x5e BPF_JMP32 | BPF_X | BPF_JNE bpf_jne32 dst, src, +off PC += off if dst32 != src32
0x66 BPF_JMP32 | BPF_K | BPF_JSGT bpf_jsgt32 dst, imm, +off PC += off if dst32 > imm (signed)
0x6e BPF_JMP32 | BPF_X | BPF_JSGT bpf_jsgt32 dst, src, +off PC += off if dst32 > src32 (signed)
0x76 BPF_JMP32 | BPF_K | BPF_JSGE bpf_jsge32 dst, imm, +off PC += off if dst32 >= imm (signed)
0x7e BPF_JMP32 | BPF_X | BPF_JSGE bpf_jsge32 dst, src, +off PC += off if dst32 >= src32 (signed)
0xa6 BPF_JMP32 | BPF_K | BPF_JLT bpf_jlt32 dst, imm, +off PC += off if dst32 < imm
0xae BPF_JMP32 | BPF_X | BPF_JLT bpf_jlt32 dst, src, +off PC += off if dst32 < src32
0xb6 BPF_JMP32 | BPF_K | BPF_JLE bpf_jle32 dst, imm, +off PC += off if dst32 <= imm
0xbe BPF_JMP32 | BPF_X | BPF_JLE bpf_jle32 dst, src, +off PC += off if dst32 <= src32
0xc6 BPF_JMP32 | BPF_K | BPF_JSLT bpf_jslt32 dst, imm, +off PC += off if dst32 < imm (signed)
0xce BPF_JMP32 | BPF_X | BPF_JSLT bpf_jslt32 dst, src, +off PC += off if dst32 < src32 (signed)
0xd6 BPF_JMP32 | BPF_K | BPF_JSLE bpf_jsle32 dst, imm, +off PC += off if dst32 <= imm (signed)
0xde BPF_JMP32 | BPF_X | BPF_JSLE bpf_jsle32 dst, src, +off PC += off if dst32 <= src32 (signed)
BPF_LD (opc & 0x07 == 0x00)
===========================
0x18 BPF_LD | BPF_DW | BPF_IMM bpf_lddw dst, imm64 dst = imm64
Note: bpf_lddw occupies two 8-byte instruction slots; the low 32 bits of imm64 go in the first slot's imm field and the high 32 bits in the second (pseudo) slot's imm field.
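To make the two-slot form concrete, here is a sketch of assembling such a
load by hand (encode_lddw is just a hypothetical helper; the kernel's
BPF_LD_IMM64 macro in filter.h builds the same pair, if I'm reading it
right):

    #include <linux/bpf.h>  /* struct bpf_insn, BPF_LD, BPF_DW, BPF_IMM */

    /* Hypothetical helper: encode "bpf_lddw dst, value" into two slots. */
    static void encode_lddw(struct bpf_insn insn[2], __u8 dst, __u64 value)
    {
            insn[0] = (struct bpf_insn) {
                    .code    = BPF_LD | BPF_DW | BPF_IMM,   /* 0x18 */
                    .dst_reg = dst,
                    .imm     = (__u32)value,                /* low 32 bits */
            };
            insn[1] = (struct bpf_insn) {
                    .code    = 0,                           /* pseudo second slot */
                    .imm     = (__u32)(value >> 32),        /* high 32 bits */
            };
    }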
BPF_LDX (opc & 0x07 == 0x01)
============================
0x61 BPF_LDX | BPF_W | BPF_MEM bpf_ldxw dst, [src + off] dst32 = *(u32 *)(src + off)
0x69 BPF_LDX | BPF_H | BPF_MEM bpf_ldxh dst, [src + off] dst16 = *(u16 *)(src + off)
0x71 BPF_LDX | BPF_B | BPF_MEM bpf_ldxb dst, [src + off] dst8 = *(u8 *)(src + off)
0x79 BPF_LDX | BPF_DW | BPF_MEM bpf_ldxdw dst, [src + off] dst = *(u64 *)(src + off)
BPF_ST (opc & 0x07 == 0x02)
===========================
0x62 BPF_ST | BPF_W | BPF_MEM bpf_stw [dst + off], imm *(u32 *)(dst + off) = imm
0x6a BPF_ST | BPF_H | BPF_MEM bpf_sth [dst + off], imm *(u16 *)(dst + off) = imm
0x72 BPF_ST | BPF_B | BPF_MEM bpf_stb [dst + off], imm *(u8 *)(dst + off) = imm
0x7a BPF_ST | BPF_DW | BPF_MEM bpf_stdw [dst + off], imm *(u64 *)(dst + off) = imm
BPF_STX (opc & 0x07 == 0x03)
============================
0x63 BPF_STX | BPF_W | BPF_MEM bpf_stxw [dst + off], src *(u32 *)(dst + off) = src32
0x6b BPF_STX | BPF_H | BPF_MEM bpf_stxh [dst + off], src *(u16 *)(dst + off) = src16
0x73 BPF_STX | BPF_B | BPF_MEM bpf_stxb [dst + off], src *(u8 *)(dst + off) = src8
0x7b BPF_STX | BPF_DW | BPF_MEM bpf_stxdw [dst + off], src *(u64 *)(dst + off) = src
0xdb BPF_STX | BPF_DW | BPF_ATOMIC 64-bit atomic instructions (see below)
0xc3 BPF_STX | BPF_W | BPF_ATOMIC 32-bit atomic instructions (see below)
Note: what should the mnemonics for the atomic instructions be? eBPF originally had only the XADD atomic instruction,
which could be represented as bpf_xadd, but with the addition of atomic AND, OR, and XOR operations that pattern no
longer works. So for now, the tables below just show pseudocode for each atomic operation.
64-bit atomic instructions (opc == 0xdb)
========================================
The following table applies to opc 0xdb (BPF_STX | BPF_DW | BPF_ATOMIC) which are 64-bit atomic operations.
The imm (immediate) value in column 1 specifies the type of atomic operation.
0x00 BPF_ADD lock *(u64 *)(dst + off) += src
0x01 BPF_ADD | BPF_FETCH src = atomic_fetch_add((u64 *)(dst + off), src)
0x40 BPF_OR lock *(u64 *)(dst + off) |= src
0x41 BPF_OR | BPF_FETCH src = atomic_fetch_or((u64 *)(dst + off), src)
0x50 BPF_AND lock *(u64 *)(dst + off) &= src
0x51 BPF_AND | BPF_FETCH src = atomic_fetch_and((u64 *)(dst + off), src)
0xa0 BPF_XOR lock *(u64 *)(dst + off) ^= src
0xa1 BPF_XOR | BPF_FETCH src = atomic_fetch_xor((u64 *)(dst + off), src)
0xe1 BPF_XCHG | BPF_FETCH src = xchg((u64 *)(dst + off), src)
0xf1 BPF_CMPXCHG | BPF_FETCH r0 = cmpxchg((u64 *)(dst + off), r0, src)
32-bit atomic instructions (opc == 0xc3)
========================================
The following table applies to opc 0xc3 (BPF_STX | BPF_W | BPF_ATOMIC) which are 32-bit atomic operations.
The imm (immediate) value in column 1 specifies the type of atomic operation.
0x00 BPF_ADD lock *(u32 *)(dst + off) += src32
0x01 BPF_ADD | BPF_FETCH src32 = atomic_fetch_add((u32 *)(dst + off), src32)
0x40 BPF_OR lock *(u32 *)(dst + off) |= src32
0x41 BPF_OR | BPF_FETCH src32 = atomic_fetch_or((u32 *)(dst + off), src32)
0x50 BPF_AND lock *(u32 *)(dst + off) &= src32
0x51 BPF_AND | BPF_FETCH src32 = atomic_fetch_and((u32 *)(dst + off), src32)
0xa0 BPF_XOR lock *(u32 *)(dst + off) ^= src32
0xa1 BPF_XOR | BPF_FETCH src32 = atomic_fetch_xor((u32 *)(dst + off), src32)
0xe1 BPF_XCHG | BPF_FETCH src32 = xchg((u32 *)(dst + off), src32)
0xf1 BPF_CMPXCHG | BPF_FETCH r0 = cmpxchg((u32 *)(dst + off), r0, src32)
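For the review of the atomic tables, it may help to see the C-level
builtins that clang lowers to these opcodes. This mapping is my reading of
the kernel selftests and the llvm support, so please double-check it; my
understanding is that the FETCH variants are emitted when the return value
is actually used, and that these instructions require -mcpu=v3 (alu32 for
the 32-bit forms):

    #include <linux/types.h>

    /* Sketch: BPF C that should lower to the atomic opcodes above. */
    void atomics_example(__u64 *p64, __u32 *p32, __u64 v64, __u32 v32)
    {
            __sync_fetch_and_add(p64, v64);            /* 0xdb, imm 0x00 (0x01 with FETCH) */
            __sync_fetch_and_or(p64, v64);             /* 0xdb, imm 0x40/0x41 */
            __sync_fetch_and_and(p32, v32);            /* 0xc3, imm 0x50/0x51 */
            __sync_fetch_and_xor(p32, v32);            /* 0xc3, imm 0xa0/0xa1 */
            v64 = __sync_lock_test_and_set(p64, v64);       /* 0xdb, imm 0xe1 (XCHG) */
            v64 = __sync_val_compare_and_swap(p64, 0, v64); /* 0xdb, imm 0xf1 (CMPXCHG) */
    }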