BPF List
* [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
@ 2024-04-02  2:13 Andrii Nakryiko
  2024-04-02  2:13 ` [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs Andrii Nakryiko
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02  2:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Add a new BPF instruction for resolving per-CPU memory addresses.

The new instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
insn->off set to BPF_ADDR_PERCPU (== -1). It resolves the provided per-CPU
offset to an absolute address where per-CPU data resides for "this" CPU.
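
Semantically, this is the same resolution the kernel's this_cpu_ptr()
performs. A rough C equivalent of the instruction (a sketch of the
semantics only; the helper name here is made up, not actual kernel code):

	/* dst_reg = per-CPU offset from src_reg, resolved for "this" CPU */
	static inline void *percpu_offset_to_addr(void __percpu *off)
	{
		return this_cpu_ptr(off);
	}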

This patch set implements support for it in x86-64 BPF JIT only.

Using the new instruction, we also implement inlining for three cases:
  - bpf_get_smp_processor_id(), which avoids an unnecessary trivial
    function call, saving a bit of performance and also not polluting LBR
    records with unnecessary function call/return records;
  - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
    performance in line with per-CPU data structures implemented using
    global variables in BPF (which is an awesome improvement, see benchmarks
    below);
  - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like for
    the non-PERCPU HASH map; this still saves a bit of overhead.

To validate performance benefits, I hacked together a tiny benchmark doing
only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
(arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
on a plain global variable array, using bpf_get_smp_processor_id() to index
the array for the current CPU (glob-arr-inc benchmark below).
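
For reference, the glob-arr-inc program looks roughly like this (a
hypothetical sketch, not the actual benchmark source; assumes the usual
vmlinux.h + bpf_helpers.h includes, and the attach point is illustrative):

	#define MAX_CPUS 256

	__u64 vals[MAX_CPUS]; /* plain global array, one counter per CPU */

	SEC("fentry/bpf_fentry_test1")
	int glob_arr_inc(void *ctx)
	{
		__u32 cpu = bpf_get_smp_processor_id();

		if (cpu < MAX_CPUS) /* bound check keeps the verifier happy */
			vals[cpu] += 1;
		return 0;
	}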

BEFORE
======
glob-arr-inc   :  163.685 ± 0.092M/s
arr-inc        :  138.096 ± 0.160M/s
hash-inc       :   66.855 ± 0.123M/s

AFTER
=====
glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
arr-inc        :  170.729 ± 0.210M/s (+23.7%)
hash-inc       :   68.673 ± 0.070M/s (+2.7%)

As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().

But what's really important is that arr-inc benchmark basically catches up
with glob-arr-inc, resulting in +23.7% improvement. This means that in
practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
critical (e.g., high-frequency stats collection, which is often a practical use
for PERCPU_ARRAY today).

v1->v2:
  - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
  - dropped the direct per-CPU memory read instruction, it can always be added
    back, if necessary;
  - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
  - switched all per-cpu addr casts to (unsigned long) to avoid sparse
    warnings.

Andrii Nakryiko (4):
  bpf: add special internal-only MOV instruction to resolve per-CPU
    addrs
  bpf: inline bpf_get_smp_processor_id() helper
  bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
  bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map

 arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
 include/linux/filter.h      | 20 ++++++++++++++++++++
 kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
 kernel/bpf/core.c           |  5 +++++
 kernel/bpf/disasm.c         | 14 ++++++++++++++
 kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
 kernel/bpf/verifier.c       | 24 ++++++++++++++++++++++++
 7 files changed, 133 insertions(+)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
@ 2024-04-02  2:13 ` Andrii Nakryiko
  2024-04-02  4:35   ` John Fastabend
  2024-04-02  2:13 ` [PATCH v2 bpf-next 2/4] bpf: inline bpf_get_smp_processor_id() helper Andrii Nakryiko
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02  2:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Add a new BPF instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use it directly. For now, it will only be used
for internal inlining optimizations between the BPF verifier and BPF JITs.

We use a special BPF_MOV | BPF_ALU64 | BPF_X form with the insn->off field
set to BPF_ADDR_PERCPU = -1. I used a negative offset value to distinguish
it from the positive ones used by user-exposed instructions.

The instruction resolves a per-CPU offset stored in a register into a valid
kernel address which can be dereferenced. It is useful in any use case where
the absolute address of per-CPU data has to be resolved (e.g., when inlining
bpf_map_lookup_elem()).

The BPF disassembler is also taught to recognize the new instruction to
support dumping the final BPF assembly code (non-JIT'ed version).

Add an arch-specific way for BPF JITs to mark support for this instruction.

This patch also adds support for this instruction in the x86-64 BPF JIT.
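
For reference, the x86-64 lowering below amounts to a two-instruction
sequence (pseudo-assembly sketch mirroring the comments in the JIT code):

	mov <dst>, <src>              /* copy per-CPU offset (if dst != src) */
	add <dst>, gs:[&this_cpu_off] /* add this CPU's per-CPU base */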

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
 include/linux/filter.h      | 20 ++++++++++++++++++++
 kernel/bpf/core.c           |  5 +++++
 kernel/bpf/disasm.c         | 14 ++++++++++++++
 4 files changed, 55 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 3b639d6f2f54..af89dd117dce 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1382,6 +1382,17 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 				maybe_emit_mod(&prog, AUX_REG, dst_reg, true);
 				EMIT3(0x0F, 0x44, add_2reg(0xC0, AUX_REG, dst_reg));
 				break;
+			} else if (insn_is_mov_percpu_addr(insn)) {
+				u32 off = (u32)(unsigned long)&this_cpu_off;
+
+				/* mov <dst>, <src> (if necessary) */
+				EMIT_mov(dst_reg, src_reg);
+
+				/* add <dst>, gs:[<off>] */
+				EMIT2(0x65, add_1mod(0x48, dst_reg));
+				EMIT3(0x03, add_1reg(0x04, dst_reg), 0x25);
+				EMIT(off, 4);
+				break;
 			}
 			fallthrough;
 		case BPF_ALU | BPF_MOV | BPF_X:
@@ -3365,6 +3376,11 @@ bool bpf_jit_supports_subprog_tailcalls(void)
 	return true;
 }
 
+bool bpf_jit_supports_percpu_insn(void)
+{
+	return true;
+}
+
 void bpf_jit_free(struct bpf_prog *prog)
 {
 	if (prog->jited) {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 44934b968b57..8bcfe6800400 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -178,6 +178,25 @@ struct ctl_table_header;
 		.off   = 0,					\
 		.imm   = 0 })
 
+/* Special (internal-only) form of mov, used to resolve per-CPU addrs:
+ * dst_reg = src_reg + <percpu_base_off>
+ * BPF_ADDR_PERCPU is used as a special insn->off value.
+ */
+#define BPF_ADDR_PERCPU	(-1)
+
+#define BPF_MOV64_PERCPU_REG(DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = BPF_ADDR_PERCPU,			\
+		.imm   = 0 })
+
+static inline bool insn_is_mov_percpu_addr(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) && insn->off == BPF_ADDR_PERCPU;
+}
+
 /* Short form of mov, dst_reg = imm32 */
 
 #define BPF_MOV64_IMM(DST, IMM)					\
@@ -970,6 +989,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
 bool bpf_jit_needs_zext(void);
 bool bpf_jit_supports_subprog_tailcalls(void);
+bool bpf_jit_supports_percpu_insn(void);
 bool bpf_jit_supports_kfunc_call(void);
 bool bpf_jit_supports_far_kfunc_call(void);
 bool bpf_jit_supports_exceptions(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ab400cdd7d7a..a996fd8dd303 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2945,6 +2945,11 @@ bool __weak bpf_jit_supports_subprog_tailcalls(void)
 	return false;
 }
 
+bool __weak bpf_jit_supports_percpu_insn(void)
+{
+	return false;
+}
+
 bool __weak bpf_jit_supports_kfunc_call(void)
 {
 	return false;
diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index bd2e2dd04740..309c4aa1b026 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c
@@ -172,6 +172,17 @@ static bool is_addr_space_cast(const struct bpf_insn *insn)
 		insn->off == BPF_ADDR_SPACE_CAST;
 }
 
+/* Special (internal-only) form of mov, used to resolve per-CPU addrs:
+ * dst_reg = src_reg + <percpu_base_off>
+ * BPF_ADDR_PERCPU is used as a special insn->off value.
+ */
+#define BPF_ADDR_PERCPU	(-1)
+
+static inline bool is_mov_percpu_addr(const struct bpf_insn *insn)
+{
+	return insn->code == (BPF_ALU64 | BPF_MOV | BPF_X) && insn->off == BPF_ADDR_PERCPU;
+}
+
 void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 		    const struct bpf_insn *insn,
 		    bool allow_ptr_leaks)
@@ -194,6 +205,9 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs,
 			verbose(cbs->private_data, "(%02x) r%d = addr_space_cast(r%d, %d, %d)\n",
 				insn->code, insn->dst_reg,
 				insn->src_reg, ((u32)insn->imm) >> 16, (u16)insn->imm);
+		} else if (is_mov_percpu_addr(insn)) {
+			verbose(cbs->private_data, "(%02x) r%d = &(void __percpu *)(r%d)\n",
+				insn->code, insn->dst_reg, insn->src_reg);
 		} else if (BPF_SRC(insn->code) == BPF_X) {
 			verbose(cbs->private_data, "(%02x) %c%d %s %s%c%d\n",
 				insn->code, class == BPF_ALU ? 'w' : 'r',
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 2/4] bpf: inline bpf_get_smp_processor_id() helper
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
  2024-04-02  2:13 ` [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs Andrii Nakryiko
@ 2024-04-02  2:13 ` Andrii Nakryiko
  2024-04-02  4:41   ` John Fastabend
  2024-04-02  2:13 ` [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps Andrii Nakryiko
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02  2:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

If the BPF JIT supports the per-CPU MOV instruction, inline
bpf_get_smp_processor_id() to eliminate an unnecessary function call.
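
After this change, the helper call is patched into a three-instruction
sequence (annotated sketch, using the disassembler's notation for the new
internal instruction):

	r0 = <low 32 bits of &pcpu_hot.cpu_number> /* BPF_MOV32_IMM */
	r0 = &(void __percpu *)(r0)                /* new per-CPU MOV */
	r0 = *(u32 *)(r0 + 0)                      /* this CPU's number */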

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/verifier.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index edb650667f44..af0274b090bb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20072,6 +20072,30 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
+#ifdef CONFIG_X86_64
+		/* Implement bpf_get_smp_processor_id() inline. */
+		if (insn->imm == BPF_FUNC_get_smp_processor_id &&
+		    prog->jit_requested && bpf_jit_supports_percpu_insn()) {
+			/* BPF_FUNC_get_smp_processor_id inlining is an
+			 * optimization, so if pcpu_hot.cpu_number is ever
+			 * changed in some incompatible and hard to support
+			 * way, it's fine to back out this inlining logic
+			 */
+			insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
+			insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
+			insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
+			cnt = 3;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			goto next_insn;
+		}
+#endif
 		/* Implement bpf_get_func_arg inline. */
 		if (prog_type == BPF_PROG_TYPE_TRACING &&
 		    insn->imm == BPF_FUNC_get_func_arg) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
  2024-04-02  2:13 ` [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs Andrii Nakryiko
  2024-04-02  2:13 ` [PATCH v2 bpf-next 2/4] bpf: inline bpf_get_smp_processor_id() helper Andrii Nakryiko
@ 2024-04-02  2:13 ` Andrii Nakryiko
  2024-04-02  5:02   ` John Fastabend
  2024-04-02  2:13 ` [PATCH v2 bpf-next 4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map Andrii Nakryiko
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02  2:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Using the new per-CPU BPF instruction, implement inlining of the per-CPU
ARRAY map lookup helper, if BPF JIT support is present.
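
The generated instructions mirror the C logic of
percpu_array_map_lookup_elem() (rough sketch; the emitted code additionally
skips the index masking when Spectre v1 mitigation is bypassed):

	u32 index = *(u32 *)key;

	if (index >= map->max_entries)
		return NULL;
	return this_cpu_ptr(array->pptrs[index & array->index_mask]);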

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/arraymap.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 13358675ff2e..8c1e6d7654bb 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -246,6 +246,38 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
 	return this_cpu_ptr(array->pptrs[index & array->index_mask]);
 }
 
+/* emit BPF instructions equivalent to C code of percpu_array_map_lookup_elem() */
+static int percpu_array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	struct bpf_insn *insn = insn_buf;
+
+	if (!bpf_jit_supports_percpu_insn())
+		return -EOPNOTSUPP;
+
+	if (map->map_flags & BPF_F_INNER_MAP)
+		return -EOPNOTSUPP;
+
+	BUILD_BUG_ON(offsetof(struct bpf_array, map) != 0);
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, offsetof(struct bpf_array, pptrs));
+
+	*insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0);
+	if (!map->bypass_spec_v1) {
+		*insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 6);
+		*insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_0, array->index_mask);
+	} else {
+		*insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 5);
+	}
+
+	*insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3);
+	*insn++ = BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1);
+	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
+	*insn++ = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
+	*insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
+	*insn++ = BPF_MOV64_IMM(BPF_REG_0, 0);
+	return insn - insn_buf;
+}
+
 static void *percpu_array_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
@@ -776,6 +808,7 @@ const struct bpf_map_ops percpu_array_map_ops = {
 	.map_free = array_map_free,
 	.map_get_next_key = array_map_get_next_key,
 	.map_lookup_elem = percpu_array_map_lookup_elem,
+	.map_gen_lookup = percpu_array_map_gen_lookup,
 	.map_update_elem = array_map_update_elem,
 	.map_delete_elem = array_map_delete_elem,
 	.map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
                   ` (2 preceding siblings ...)
  2024-04-02  2:13 ` [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps Andrii Nakryiko
@ 2024-04-02  2:13 ` Andrii Nakryiko
  2024-04-02  5:04   ` John Fastabend
  2024-04-02  4:57 ` [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction John Fastabend
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02  2:13 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Using the new per-CPU BPF instruction, partially inline the
bpf_map_lookup_elem() helper for the per-CPU hashmap BPF map. Just like for
the normal HASH map, we still generate a call into __htab_map_lookup_elem(),
but after that we resolve the per-CPU element address using the new
instruction, saving on extra function calls.
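
In C terms, the generated sequence corresponds roughly to (sketch):

	struct htab_elem *l = __htab_map_lookup_elem(map, key);

	if (!l)
		return NULL;
	/* the per-CPU value pointer is stored right after the key */
	return this_cpu_ptr(*(void __percpu **)(l->key + map->key_size));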

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/hashtab.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index e81059faae63..83a9a74260e9 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -2308,6 +2308,26 @@ static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
 		return NULL;
 }
 
+/* inline bpf_map_lookup_elem() call for per-CPU hashmap */
+static int htab_percpu_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	if (!bpf_jit_supports_percpu_insn())
+		return -EOPNOTSUPP;
+
+	BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+		     (void *(*)(struct bpf_map *map, void *key))NULL));
+	*insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3);
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_0,
+				offsetof(struct htab_elem, key) + map->key_size);
+	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
+	*insn++ = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
+
+	return insn - insn_buf;
+}
+
 static void *htab_percpu_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
 {
 	struct htab_elem *l;
@@ -2436,6 +2456,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_free = htab_map_free,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_percpu_map_lookup_elem,
+	.map_gen_lookup = htab_percpu_map_gen_lookup,
 	.map_lookup_and_delete_elem = htab_percpu_map_lookup_and_delete_elem,
 	.map_update_elem = htab_percpu_map_update_elem,
 	.map_delete_elem = htab_map_delete_elem,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs
  2024-04-02  2:13 ` [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs Andrii Nakryiko
@ 2024-04-02  4:35   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2024-04-02  4:35 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Andrii Nakryiko wrote:
> Add a new BPF instruction for resolving absolute addresses of per-CPU
> data from their per-CPU offsets. This instruction is internal-only and
> users are not allowed to use them directly. They will only be used for
> internal inlining optimizations for now between BPF verifier and BPF JITs.
> 
> We use a special BPF_MOV | BPF_ALU64 | BPF_X form with insn->off field
> set to BPF_ADDR_PERCPU = -1. I used negative offset value to distinguish
> them from positive ones used by user-exposed instructions.
> 
> Such instruction performs a resolution of a per-CPU offset stored in
> a register to a valid kernel address which can be dereferenced. It is
> useful in any use case where absolute address of a per-CPU data has to
> be resolved (e.g., in inlining bpf_map_lookup_elem()).
> 
> BPF disassembler is also taught to recognize them to support dumping
> final BPF assembly code (non-JIT'ed version).
> 
> Add arch-specific way for BPF JITs to mark support for this instructions.
> 
> This patch also adds support for these instructions in x86-64 BPF JIT.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
>  include/linux/filter.h      | 20 ++++++++++++++++++++
>  kernel/bpf/core.c           |  5 +++++
>  kernel/bpf/disasm.c         | 14 ++++++++++++++
>  4 files changed, 55 insertions(+)

LGTM

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 bpf-next 2/4] bpf: inline bpf_get_smp_processor_id() helper
  2024-04-02  2:13 ` [PATCH v2 bpf-next 2/4] bpf: inline bpf_get_smp_processor_id() helper Andrii Nakryiko
@ 2024-04-02  4:41   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2024-04-02  4:41 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Andrii Nakryiko wrote:
> If BPF JIT supports per-CPU MOV instruction, inline bpf_get_smp_processor_id()
> to eliminate unnecessary function calls.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/verifier.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index edb650667f44..af0274b090bb 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20072,6 +20072,30 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>  			goto next_insn;
>  		}
>  
> +#ifdef CONFIG_X86_64
> +		/* Implement bpf_get_smp_processor_id() inline. */
> +		if (insn->imm == BPF_FUNC_get_smp_processor_id &&
> +		    prog->jit_requested && bpf_jit_supports_percpu_insn()) {
> +			/* BPF_FUNC_get_smp_processor_id inlining is an
> +			 * optimization, so if pcpu_hot.cpu_number is ever
> +			 * changed in some incompatible and hard to support
> +			 * way, it's fine to back out this inlining logic
> +			 */
> +			insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number);
> +			insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> +			insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);
> +			cnt = 3;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta    += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn      = new_prog->insnsi + i + delta;
> +			goto next_insn;
> +		}
> +#endif
>  		/* Implement bpf_get_func_arg inline. */
>  		if (prog_type == BPF_PROG_TYPE_TRACING &&
>  		    insn->imm == BPF_FUNC_get_func_arg) {
> -- 
> 2.43.0
> 
> 

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
                   ` (3 preceding siblings ...)
  2024-04-02  2:13 ` [PATCH v2 bpf-next 4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map Andrii Nakryiko
@ 2024-04-02  4:57 ` John Fastabend
  2024-04-02 16:20   ` Andrii Nakryiko
  2024-04-03 17:40 ` patchwork-bot+netdevbpf
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: John Fastabend @ 2024-04-02  4:57 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Andrii Nakryiko wrote:
> Add a new BPF instruction for resolving per-CPU memory addresses.
> 
> New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> to an absolute address where per-CPU data resides for "this" CPU.
> 
> This patch set implements support for it in x86-64 BPF JIT only.
> 
> Using the new instruction, we also implement inlining for three cases:
>   - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
>     function call, saving a bit of performance and also not polluting LBR
>     records with unnecessary function call/return records;
>   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
>     performance to implementing per-CPU data structures using global variables
>     in BPF (which is an awesome improvement, see benchmarks below);
>   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
>     same for non-PERCPU HASH map; this still saves a bit of overhead.
> 
> To validate performance benefits, I hacked together a tiny benchmark doing
> only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> on global variable array using bpf_get_smp_processor_id() to index array for
> current CPU (glob-arr-inc benchmark below).
> 
> BEFORE
> ======
> glob-arr-inc   :  163.685 ± 0.092M/s
> arr-inc        :  138.096 ± 0.160M/s
> hash-inc       :   66.855 ± 0.123M/s
> 
> AFTER
> =====
> glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
> arr-inc        :  170.729 ± 0.210M/s (+23.7%)
> hash-inc       :   68.673 ± 0.070M/s (+2.7%)
> 
> As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
> array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().
> 
> But what's really important is that arr-inc benchmark basically catches up
> with glob-arr-inc, resulting in +23.7% improvement. This means that in
> practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
> critical (e.g., high-frequent stats collection, which is often a practical use
> for PERCPU_ARRAY today).

Out of curiosity, did we consider exposing this instruction outside of
internal inlining? It seems it would help the compiler some to not believe
it's doing a function call.

We could do some runtime rewrites to find the address for global vars, for
example.

FWIW I don't think this should necessarily block anything; perhaps as a
follow up? Or at least worth considering, if I didn't miss some reason it's
not plausible.

> 
> v1->v2:
>   - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
>   - dropped the direct per-CPU memory read instruction, it can always be added
>     back, if necessary;
>   - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
>   - switched all per-cpu addr casts to (unsigned long) to avoid sparse
>     warnings.
> 
> Andrii Nakryiko (4):
>   bpf: add special internal-only MOV instruction to resolve per-CPU
>     addrs
>   bpf: inline bpf_get_smp_processor_id() helper
>   bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
>   bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
> 
>  arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
>  include/linux/filter.h      | 20 ++++++++++++++++++++
>  kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c           |  5 +++++
>  kernel/bpf/disasm.c         | 14 ++++++++++++++
>  kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
>  kernel/bpf/verifier.c       | 24 ++++++++++++++++++++++++
>  7 files changed, 133 insertions(+)
> 
> -- 
> 2.43.0
> 
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
  2024-04-02  2:13 ` [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps Andrii Nakryiko
@ 2024-04-02  5:02   ` John Fastabend
  2024-04-02 16:20     ` Andrii Nakryiko
  0 siblings, 1 reply; 24+ messages in thread
From: John Fastabend @ 2024-04-02  5:02 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Andrii Nakryiko wrote:
> Using new per-CPU BPF instruction implement inlining for per-CPU ARRAY
> map lookup helper, if BPF JIT support is present.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/arraymap.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 13358675ff2e..8c1e6d7654bb 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -246,6 +246,38 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
>  	return this_cpu_ptr(array->pptrs[index & array->index_mask]);
>  }
>  
> +/* emit BPF instructions equivalent to C code of percpu_array_map_lookup_elem() */
> +static int percpu_array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
> +{
> +	struct bpf_array *array = container_of(map, struct bpf_array, map);
> +	struct bpf_insn *insn = insn_buf;

Nit, If you wanted to be consistent with array_*_map_gen_lookup,

	const int ret = BPF_REG_0;
	const int map_ptr = BPF_REG_1;
	const int index = BPF_REG_2;

But, I think its easier to read as is.

> +
> +	if (!bpf_jit_supports_percpu_insn())
> +		return -EOPNOTSUPP;
> +
> +	if (map->map_flags & BPF_F_INNER_MAP)
> +		return -EOPNOTSUPP;
> +
> +	BUILD_BUG_ON(offsetof(struct bpf_array, map) != 0);
> +	*insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, offsetof(struct bpf_array, pptrs));
> +
> +	*insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0);
> +	if (!map->bypass_spec_v1) {
> +		*insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 6);
> +		*insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_0, array->index_mask);
> +	} else {
> +		*insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 5);
> +	}
> +
> +	*insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3);
> +	*insn++ = BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1);
> +	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
> +	*insn++ = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> +	*insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
> +	*insn++ = BPF_MOV64_IMM(BPF_REG_0, 0);
> +	return insn - insn_buf;
> +}
> +
>  static void *percpu_array_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
>  {
>  	struct bpf_array *array = container_of(map, struct bpf_array, map);
> @@ -776,6 +808,7 @@ const struct bpf_map_ops percpu_array_map_ops = {
>  	.map_free = array_map_free,
>  	.map_get_next_key = array_map_get_next_key,
>  	.map_lookup_elem = percpu_array_map_lookup_elem,
> +	.map_gen_lookup = percpu_array_map_gen_lookup,
>  	.map_update_elem = array_map_update_elem,
>  	.map_delete_elem = array_map_delete_elem,
>  	.map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v2 bpf-next 4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
  2024-04-02  2:13 ` [PATCH v2 bpf-next 4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map Andrii Nakryiko
@ 2024-04-02  5:04   ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2024-04-02  5:04 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Andrii Nakryiko wrote:
> Using new per-CPU BPF instruction, partially inline
> bpf_map_lookup_elem() helper for per-CPU hashmap BPF map. Just like for
> normal HASH map, we still generate a call into __htab_map_lookup_elem(),
> but after that we resolve per-CPU element address using a new
> instruction, saving on extra functions calls.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/hashtab.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index e81059faae63..83a9a74260e9 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -2308,6 +2308,26 @@ static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
>  		return NULL;
>  }
>  
> +/* inline bpf_map_lookup_elem() call for per-CPU hashmap */
> +static int htab_percpu_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +
> +	if (!bpf_jit_supports_percpu_insn())
> +		return -EOPNOTSUPP;
> +
> +	BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
> +		     (void *(*)(struct bpf_map *map, void *key))NULL));
> +	*insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
> +	*insn++ = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3);
> +	*insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_0,
> +				offsetof(struct htab_elem, key) + map->key_size);
> +	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
> +	*insn++ = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> +
> +	return insn - insn_buf;
> +}
> +
>  static void *htab_percpu_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
>  {
>  	struct htab_elem *l;
> @@ -2436,6 +2456,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
>  	.map_free = htab_map_free,
>  	.map_get_next_key = htab_map_get_next_key,
>  	.map_lookup_elem = htab_percpu_map_lookup_elem,
> +	.map_gen_lookup = htab_percpu_map_gen_lookup,
>  	.map_lookup_and_delete_elem = htab_percpu_map_lookup_and_delete_elem,
>  	.map_update_elem = htab_percpu_map_update_elem,
>  	.map_delete_elem = htab_map_delete_elem,


Thanks, I'll test on Tetragon as well to see if we can get some perf
improvement; we have a few per-CPU maps in there as well.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-02  4:57 ` [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction John Fastabend
@ 2024-04-02 16:20   ` Andrii Nakryiko
  2024-04-03 22:01     ` John Fastabend
  0 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02 16:20 UTC (permalink / raw)
  To: John Fastabend; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, kernel-team

On Mon, Apr 1, 2024 at 9:57 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > Add a new BPF instruction for resolving per-CPU memory addresses.
> >
> > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > to an absolute address where per-CPU data resides for "this" CPU.
> >
> > This patch set implements support for it in x86-64 BPF JIT only.
> >
> > Using the new instruction, we also implement inlining for three cases:
> >   - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
> >     function call, saving a bit of performance and also not polluting LBR
> >     records with unnecessary function call/return records;
> >   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
> >     performance to implementing per-CPU data structures using global variables
> >     in BPF (which is an awesome improvement, see benchmarks below);
> >   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
> >     same for non-PERCPU HASH map; this still saves a bit of overhead.
> >
> > To validate performance benefits, I hacked together a tiny benchmark doing
> > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > on global variable array using bpf_get_smp_processor_id() to index array for
> > current CPU (glob-arr-inc benchmark below).
> >
> > BEFORE
> > ======
> > glob-arr-inc   :  163.685 ± 0.092M/s
> > arr-inc        :  138.096 ± 0.160M/s
> > hash-inc       :   66.855 ± 0.123M/s
> >
> > AFTER
> > =====
> > glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
> > arr-inc        :  170.729 ± 0.210M/s (+23.7%)
> > hash-inc       :   68.673 ± 0.070M/s (+2.7%)
> >
> > As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
> > array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().
> >
> > But what's really important is that arr-inc benchmark basically catches up
> > with glob-arr-inc, resulting in +23.7% improvement. This means that in
> > practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
> > critical (e.g., high-frequent stats collection, which is often a practical use
> > for PERCPU_ARRAY today).
>
> Out of curiousity did we consider exposing this instruction outside internal
> inlining? It seems it would help compiler some to not believe its doing a
> function call.

We decided to start with this as internal-only, to try it out and get the
inlining benefits without getting stuck in a longer discussion defining the
exact user-visible semantics, guarantees, etc. Given this instruction
calculates memory addresses that are meant to be dereferenced directly,
we'd need to make sure all the safety aspects are carefully considered.

Though you reminded me that we should probably also implement inlining
of bpf_this_cpu_ptr() and maybe even bpf_per_cpu_ptr() (I'll need to
double-check how an arbitrary CPU's address is calculated). Maybe as a
follow up.

>
> We could do some runtime rewrites to find the address for global vars for
> example.
>
> FWIW I don't think one should block this necessarily perhaps as follow up?
> Or at least worth considering if I didn't miss some reason its not
> plausible.

No blocker in principle, but certainly we'd need to be more careful if
we expose the instruction to users.

>
> >
> > v1->v2:
> >   - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
> >   - dropped the direct per-CPU memory read instruction, it can always be added
> >     back, if necessary;
> >   - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
> >   - switched all per-cpu addr casts to (unsigned long) to avoid sparse
> >     warnings.
> >
> > Andrii Nakryiko (4):
> >   bpf: add special internal-only MOV instruction to resolve per-CPU
> >     addrs
> >   bpf: inline bpf_get_smp_processor_id() helper
> >   bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
> >   bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
> >
> >  arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
> >  include/linux/filter.h      | 20 ++++++++++++++++++++
> >  kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
> >  kernel/bpf/core.c           |  5 +++++
> >  kernel/bpf/disasm.c         | 14 ++++++++++++++
> >  kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
> >  kernel/bpf/verifier.c       | 24 ++++++++++++++++++++++++
> >  7 files changed, 133 insertions(+)
> >
> > --
> > 2.43.0
> >
> >
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
  2024-04-02  5:02   ` John Fastabend
@ 2024-04-02 16:20     ` Andrii Nakryiko
  0 siblings, 0 replies; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-02 16:20 UTC (permalink / raw)
  To: John Fastabend; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, kernel-team

On Mon, Apr 1, 2024 at 10:02 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > Using new per-CPU BPF instruction implement inlining for per-CPU ARRAY
> > map lookup helper, if BPF JIT support is present.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/bpf/arraymap.c | 33 +++++++++++++++++++++++++++++++++
> >  1 file changed, 33 insertions(+)
> >
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index 13358675ff2e..8c1e6d7654bb 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -246,6 +246,38 @@ static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
> >       return this_cpu_ptr(array->pptrs[index & array->index_mask]);
> >  }
> >
> > +/* emit BPF instructions equivalent to C code of percpu_array_map_lookup_elem() */
> > +static int percpu_array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
> > +{
> > +     struct bpf_array *array = container_of(map, struct bpf_array, map);
> > +     struct bpf_insn *insn = insn_buf;
>
> Nit, If you wanted to be consistent with array_*_map_gen_lookup,
>

I didn't in this case, I found these "aliases" more confusing than helpful.

>         const int ret = BPF_REG_0;
>         const int map_ptr = BPF_REG_1;
>         const int index = BPF_REG_2;
>
> But, I think its easier to read as is.
>

Yep, that's what I thought as well.


> > +
> > +     if (!bpf_jit_supports_percpu_insn())
> > +             return -EOPNOTSUPP;
> > +
> > +     if (map->map_flags & BPF_F_INNER_MAP)
> > +             return -EOPNOTSUPP;
> > +
> > +     BUILD_BUG_ON(offsetof(struct bpf_array, map) != 0);
> > +     *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, offsetof(struct bpf_array, pptrs));
> > +
> > +     *insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 0);
> > +     if (!map->bypass_spec_v1) {
> > +             *insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 6);
> > +             *insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_0, array->index_mask);
> > +     } else {
> > +             *insn++ = BPF_JMP_IMM(BPF_JGE, BPF_REG_0, map->max_entries, 5);
> > +     }
> > +
> > +     *insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3);
> > +     *insn++ = BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1);
> > +     *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
> > +     *insn++ = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
> > +     *insn++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
> > +     *insn++ = BPF_MOV64_IMM(BPF_REG_0, 0);
> > +     return insn - insn_buf;
> > +}
> > +
> >  static void *percpu_array_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
> >  {
> >       struct bpf_array *array = container_of(map, struct bpf_array, map);
> > @@ -776,6 +808,7 @@ const struct bpf_map_ops percpu_array_map_ops = {
> >       .map_free = array_map_free,
> >       .map_get_next_key = array_map_get_next_key,
> >       .map_lookup_elem = percpu_array_map_lookup_elem,
> > +     .map_gen_lookup = percpu_array_map_gen_lookup,
> >       .map_update_elem = array_map_update_elem,
> >       .map_delete_elem = array_map_delete_elem,
> >       .map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
                   ` (4 preceding siblings ...)
  2024-04-02  4:57 ` [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction John Fastabend
@ 2024-04-03 17:40 ` patchwork-bot+netdevbpf
  2024-04-03 17:41 ` Alexei Starovoitov
  2024-04-04 16:12 ` Andrii Nakryiko
  7 siblings, 0 replies; 24+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-04-03 17:40 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, kernel-team

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Mon,  1 Apr 2024 19:13:01 -0700 you wrote:
> Add a new BPF instruction for resolving per-CPU memory addresses.
> 
> New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> to an absolute address where per-CPU data resides for "this" CPU.
> 
> This patch set implements support for it in x86-64 BPF JIT only.
> 
> [...]

Here is the summary with links:
  - [v2,bpf-next,1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs
    https://git.kernel.org/bpf/bpf-next/c/7bdbf7446305
  - [v2,bpf-next,2/4] bpf: inline bpf_get_smp_processor_id() helper
    https://git.kernel.org/bpf/bpf-next/c/1ae6921009e5
  - [v2,bpf-next,3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
    https://git.kernel.org/bpf/bpf-next/c/db69718b8efa
  - [v2,bpf-next,4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
    https://git.kernel.org/bpf/bpf-next/c/0b56e637f705

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
                   ` (5 preceding siblings ...)
  2024-04-03 17:40 ` patchwork-bot+netdevbpf
@ 2024-04-03 17:41 ` Alexei Starovoitov
  2024-04-03 18:03   ` Andrii Nakryiko
  2024-04-04 16:12 ` Andrii Nakryiko
  7 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2024-04-03 17:41 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kernel Team

On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Add a new BPF instruction for resolving per-CPU memory addresses.
>
> New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> to an absolute address where per-CPU data resides for "this" CPU.

Fixed the typo s/BPF_DW/BPF_X/ while applying.

I'm still a bit worried that there is no BUILD_BUG_ON
for 32-bit archs, while both the array and htab percpu gen_lookup callbacks
assume sizeof(void *) == 8 in a few places:
htab:
        *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);

array:
        *insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3);
        *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
Worth following up, or am I overthinking this?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-03 17:41 ` Alexei Starovoitov
@ 2024-04-03 18:03   ` Andrii Nakryiko
  2024-04-03 18:14     ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-03 18:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team

On Wed, Apr 3, 2024 at 10:41 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Add a new BPF instruction for resolving per-CPU memory addresses.
> >
> > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > to an absolute address where per-CPU data resides for "this" CPU.
>
> Fixed the typo s/BPF_DW/BPF_X/ while applying.

Oops, thanks!

>
> I'm still a bit worried that there is no BUILD_BUG_ON

BUILD_BUG_ON is problematic, because those gen_lookup callbacks are
compiled on 32-bit arches just fine (they are just not used). I'm totally
fine with adding extra checks as a follow up. Let's just agree on some
consistent approach here?

See [0] for the verifier check I mentioned. I can

a) add #ifdef guards around all current gen_lookup callbacks to only
build them on 64-bit arches

or

b) add runtime checks inside gen_lookup to return -EOPNOTSUPP if the
architecture isn't 64-bit (see the sketch below).

I don't particularly care which one, I care that it's done for all
callbacks if we are doing this. Please let me know your preference.

  [0] https://github.com/torvalds/linux/blob/master/kernel/bpf/verifier.c#L19942-L19946
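
For (b), the check could be as simple as something like this at the top of
each gen_lookup callback (just a sketch):

	if (!IS_ENABLED(CONFIG_64BIT))
		return -EOPNOTSUPP;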

> for 32-bit archs while both array and htab perpu gen_lookup
> assume sizeof(void*) == 8 in few places:
> htab:
>         *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
>
> array:
>         *insn++ = BPF_ALU64_IMM(BPF_LSH, BPF_REG_0, 3);
>         *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0);
> Worth following up or I'm overthinking this?

I don't know, maybe? :) See above.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-03 18:03   ` Andrii Nakryiko
@ 2024-04-03 18:14     ` Alexei Starovoitov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2024-04-03 18:14 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team

On Wed, Apr 3, 2024 at 11:04 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Apr 3, 2024 at 10:41 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Add a new BPF instruction for resolving per-CPU memory addresses.
> > >
> > > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > > to an absolute address where per-CPU data resides for "this" CPU.
> >
> > Fixed the typo s/BPF_DW/BPF_X/ while applying.
>
> Oops, thanks!
>
> >
> > I'm still a bit worried that there is no BUILD_BUG_ON
>
> BUILD_BUG_ON is problematic, because those gen_lookup callbacks are
> compiled on 32-bit arches just fine (just not used). I'm totally fine
> adding extra checks as a follow up. Let's just agree on some
> consistent approach her?
>
> See [0] for the verifier check I mentioned. I can
>
> a) add #ifdef guards around all current gen_lookup callbacks to only
> build them on 64-bit arches
>
> or
>
> b) add runtime checks inside get_lookup to return -EOPNOTSUPP if
> architecture isn't 64-bit.
>
> I don't particularly care which one, I care that it's done for all
> callbacks if we are doing this. Please let me know your preference.
>
>   [0] https://github.com/torvalds/linux/blob/master/kernel/bpf/verifier.c#L19942-L19946

Ahh. The BITS_PER_LONG == 64 check is there for a different reason,
but that works. It won't be lifted any time soon, so I'm not worried
anymore about that assumption inside gen_lookup.
Let's keep things as-is.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-02 16:20   ` Andrii Nakryiko
@ 2024-04-03 22:01     ` John Fastabend
  0 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2024-04-03 22:01 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, kernel-team

Andrii Nakryiko wrote:
> On Mon, Apr 1, 2024 at 9:57 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > Add a new BPF instruction for resolving per-CPU memory addresses.
> > >
> > > New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_DW, with
> > > insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
> > > to an absolute address where per-CPU data resides for "this" CPU.
> > >
> > > This patch set implements support for it in x86-64 BPF JIT only.
> > >
> > > Using the new instruction, we also implement inlining for three cases:
> > >   - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
> > >     function call, saving a bit of performance and also not polluting LBR
> > >     records with unnecessary function call/return records;
> > >   - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
> > >     performance to implementing per-CPU data structures using global variables
> > >     in BPF (which is an awesome improvement, see benchmarks below);
> > >   - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
> > >     same for non-PERCPU HASH map; this still saves a bit of overhead.
> > >
> > > To validate performance benefits, I hacked together a tiny benchmark doing
> > > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > > on global variable array using bpf_get_smp_processor_id() to index array for
> > > current CPU (glob-arr-inc benchmark below).
> > >
> > > BEFORE
> > > ======
> > > glob-arr-inc   :  163.685 ± 0.092M/s
> > > arr-inc        :  138.096 ± 0.160M/s
> > > hash-inc       :   66.855 ± 0.123M/s
> > >
> > > AFTER
> > > =====
> > > glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
> > > arr-inc        :  170.729 ± 0.210M/s (+23.7%)
> > > hash-inc       :   68.673 ± 0.070M/s (+2.7%)
> > >
> > > As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
> > > array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().
> > >
> > > But what's really important is that arr-inc benchmark basically catches up
> > > with glob-arr-inc, resulting in +23.7% improvement. This means that in
> > > practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
> > > critical (e.g., high-frequent stats collection, which is often a practical use
> > > for PERCPU_ARRAY today).
> >
> > Out of curiousity did we consider exposing this instruction outside internal
> > inlining? It seems it would help compiler some to not believe its doing a
> > function call.
> 
> We decided to start as internal-only to try it out and get inlining
> benefits, without being stuck in a longer discussion to define the
> exact user-visible semantics, guarantees, etc. Given this instruction
> calculates memory addresses that are meant to be dereferenced
> directly, we'd need to be careful to make sure all the safety aspects
> are carefully considered.
> 
> Though you reminded me that we should probably also implement inlining
> of bpf_this_cpu_ptr() and maybe even bpf_per_cpu_ptr() (I'll need to
> double-check how arbitrary CPU address is calculated). Maybe as a
> follow up.
> 
> >
> > We could do some runtime rewrites to find the address for global vars for
> > example.
> >
> > FWIW I don't think one should block this necessarily perhaps as follow up?
> > Or at least worth considering if I didn't miss some reason its not
> > plausible.
> 
> No blocker in principle, but certainly we'd need to be more careful if
> we expose the instruction to users.

All sounds good, that was my understanding as well.

I see it got merged, but for what it's worth I went through and ran some
of this, and the rest of the patches LGTM,

Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: John Fastabend <john.fastabend@gmail.com>

> 
> >
> > >
> > > v1->v2:
> > >   - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
> > >   - dropped the direct per-CPU memory read instruction, it can always be added
> > >     back, if necessary;
> > >   - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
> > >   - switched all per-cpu addr casts to (unsigned long) to avoid sparse
> > >     warnings.
> > >
> > > Andrii Nakryiko (4):
> > >   bpf: add special internal-only MOV instruction to resolve per-CPU
> > >     addrs
> > >   bpf: inline bpf_get_smp_processor_id() helper
> > >   bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps
> > >   bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map
> > >
> > >  arch/x86/net/bpf_jit_comp.c | 16 ++++++++++++++++
> > >  include/linux/filter.h      | 20 ++++++++++++++++++++
> > >  kernel/bpf/arraymap.c       | 33 +++++++++++++++++++++++++++++++++
> > >  kernel/bpf/core.c           |  5 +++++
> > >  kernel/bpf/disasm.c         | 14 ++++++++++++++
> > >  kernel/bpf/hashtab.c        | 21 +++++++++++++++++++++
> > >  kernel/bpf/verifier.c       | 24 ++++++++++++++++++++++++
> > >  7 files changed, 133 insertions(+)
> > >
> > > --
> > > 2.43.0
> > >
> > >
> >
> >



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
                   ` (6 preceding siblings ...)
  2024-04-03 17:41 ` Alexei Starovoitov
@ 2024-04-04 16:12 ` Andrii Nakryiko
  2024-04-04 16:16   ` Puranjay Mohan
  2024-04-04 21:02   ` Puranjay Mohan
  7 siblings, 2 replies; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-04 16:12 UTC (permalink / raw)
  To: Andrii Nakryiko, Pu Lehui, Puranjay Mohan
  Cc: bpf, ast, daniel, martin.lau, kernel-team

On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> [...]

Puranjay, Pu,

Is this something you guys can help implement for the ARM64 and RISC-V
architectures as well? I don't know the specifics of these architectures
well enough to do this myself, but it is a nice improvement for BPF
applications, so it would be nice to get wide architecture support.

Thank you!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-04 16:12 ` Andrii Nakryiko
@ 2024-04-04 16:16   ` Puranjay Mohan
  2024-04-04 16:18     ` Andrii Nakryiko
                       ` (2 more replies)
  2024-04-04 21:02   ` Puranjay Mohan
  1 sibling, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-04-04 16:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, Pu Lehui, bpf, ast, daniel, martin.lau,
	kernel-team

On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > [...]
>
> Puranjay, Pu,
>
> Is this something you guys can help implement for the ARM64 and RISC-V
> architectures as well? I don't know the specifics of these architectures
> well enough to do this myself, but it is a nice improvement for BPF
> applications, so it would be nice to get wide architecture support.
>
> Thank you!

I have already started implementing this on ARM64.

Pu, can you do it for RISC-V? If you don't have cycles, I can do it after
finishing ARM64.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-04 16:16   ` Puranjay Mohan
@ 2024-04-04 16:18     ` Andrii Nakryiko
  2024-04-05 12:48     ` Puranjay Mohan
  2024-04-05 13:19     ` Pu Lehui
  2 siblings, 0 replies; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-04 16:18 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Andrii Nakryiko, Pu Lehui, bpf, ast, daniel, martin.lau,
	kernel-team

On Thu, Apr 4, 2024 at 9:16 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > [...]
> >
> > Puranjay, Pu,
> >
> > Is this something you guys can help implement for the ARM64 and RISC-V
> > architectures as well? I don't know the specifics of these architectures
> > well enough to do this myself, but it is a nice improvement for BPF
> > applications, so it would be nice to get wide architecture support.
> >
> > Thank you!
>
> I have already started implementing this on ARM64.

Awesome, thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-04 16:12 ` Andrii Nakryiko
  2024-04-04 16:16   ` Puranjay Mohan
@ 2024-04-04 21:02   ` Puranjay Mohan
  2024-04-04 21:21     ` Andrii Nakryiko
  1 sibling, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-04-04 21:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, Pu Lehui, bpf, ast, daniel, martin.lau,
	kernel-team

Hi Andrii,

On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > [...]
> >
> > To validate performance benefits, I hacked together a tiny benchmark doing
> > only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
> > (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
> > establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
> > on global variable array using bpf_get_smp_processor_id() to index array for
> > current CPU (glob-arr-inc benchmark below).

Can you share the code for these benchmarks? I want to use the same
benchmarks to compare performance on ARM64.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-04 21:02   ` Puranjay Mohan
@ 2024-04-04 21:21     ` Andrii Nakryiko
  0 siblings, 0 replies; 24+ messages in thread
From: Andrii Nakryiko @ 2024-04-04 21:21 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Andrii Nakryiko, Pu Lehui, bpf, ast, daniel, martin.lau,
	kernel-team

On Thu, Apr 4, 2024 at 2:03 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> Hi Andrii,
>
> On Thu, Apr 4, 2024 at 6:12 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Apr 1, 2024 at 7:13 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > [...]
>
> Can you share the code for these benchmarks? I want to use the same
> benchmarks to compare performance on ARM64.
>

Sure, see [0]. You can run:

$ ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

from the tools/testing/selftests/bpf directory to get something like:

$ ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
glob-arr-inc   :  243.196 ± 7.879M/s
arr-inc        :  218.139 ± 4.407M/s
hash-inc       :   97.727 ± 3.643M/s

  [0] https://github.com/anakryiko/linux/commit/8dec900975ef1a0308a7862154735549d6b66f64
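
For anyone reproducing this without the tree handy, the arr-inc side of
the benchmark boils down to a lookup-and-increment program along these
lines (a hypothetical sketch with made-up map/program names, not the
actual selftest source; see [0] for that):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
  } percpu_arr SEC(".maps");

  SEC("tp/syscalls/sys_enter_getpgid")
  int arr_inc(void *ctx)
  {
	__u32 key = 0;
	__u64 *val;

	/* with patch 3/4 applied, this lookup is fully inlined */
	val = bpf_map_lookup_elem(&percpu_arr, &key);
	if (val)
		(*val)++;
	return 0;
  }

  char LICENSE[] SEC("license") = "GPL";

The glob-arr-inc baseline does the same increment against a plain global
array indexed by bpf_get_smp_processor_id() instead of a map lookup.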


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-04 16:16   ` Puranjay Mohan
  2024-04-04 16:18     ` Andrii Nakryiko
@ 2024-04-05 12:48     ` Puranjay Mohan
  2024-04-05 13:19     ` Pu Lehui
  2 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-04-05 12:48 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, Pu Lehui, bpf, ast, daniel, martin.lau,
	kernel-team

>
> I have already started implementing this on ARM64.
>
> Pu, can you do it for RISC-V? If you don't have cycles, I can do it after
> finishing ARM64.

I ended up doing it for both ARM64 and RISC-V, but I don't see as high a
performance improvement as we see on x86. Maybe I missed something;
please see:

ARM64: https://lore.kernel.org/bpf/20240405091707.66675-1-puranjay12@gmail.com/
RISC-V: https://lore.kernel.org/bpf/20240405124348.27644-1-puranjay12@gmail.com/

Thanks,
Puranjay
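
For context on what each port has to emit, here is a conceptual
kernel-side model of the lowering (dst = src + this CPU's per-CPU
offset). The asm only illustrates where the per-CPU base comes from on
each architecture; this is a sketch under those assumptions, not the
JIT code from the series above, which emits the equivalent machine
instructions directly:

  /* conceptual model of BPF_MOV64_PERCPU_REG(dst, src) */
  static unsigned long resolve_percpu_addr(unsigned long percpu_off)
  {
	unsigned long base;

  #if defined(__x86_64__)
	/* per-CPU offset lives at %gs:this_cpu_off: a single
	 * gs-prefixed memory operand, which is why inlining is
	 * nearly free here
	 */
	asm("movq %%gs:this_cpu_off, %0" : "=r"(base));
  #elif defined(__aarch64__)
	/* per-CPU offset lives in TPIDR_EL1 (TPIDR_EL2 under VHE):
	 * one MRS plus an ADD
	 */
	asm("mrs %0, tpidr_el1" : "=r"(base));
  #endif
	return percpu_off + base;
  }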

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction
  2024-04-04 16:16   ` Puranjay Mohan
  2024-04-04 16:18     ` Andrii Nakryiko
  2024-04-05 12:48     ` Puranjay Mohan
@ 2024-04-05 13:19     ` Pu Lehui
  2 siblings, 0 replies; 24+ messages in thread
From: Pu Lehui @ 2024-04-05 13:19 UTC (permalink / raw)
  To: puranjay12
  Cc: andrii.nakryiko, andrii, ast, bpf, daniel, kernel-team,
	martin.lau, pulehui

Nice work, Puranjay! I am on a festival holiday.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-04-05 13:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-02  2:13 [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction Andrii Nakryiko
2024-04-02  2:13 ` [PATCH v2 bpf-next 1/4] bpf: add special internal-only MOV instruction to resolve per-CPU addrs Andrii Nakryiko
2024-04-02  4:35   ` John Fastabend
2024-04-02  2:13 ` [PATCH v2 bpf-next 2/4] bpf: inline bpf_get_smp_processor_id() helper Andrii Nakryiko
2024-04-02  4:41   ` John Fastabend
2024-04-02  2:13 ` [PATCH v2 bpf-next 3/4] bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps Andrii Nakryiko
2024-04-02  5:02   ` John Fastabend
2024-04-02 16:20     ` Andrii Nakryiko
2024-04-02  2:13 ` [PATCH v2 bpf-next 4/4] bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map Andrii Nakryiko
2024-04-02  5:04   ` John Fastabend
2024-04-02  4:57 ` [PATCH v2 bpf-next 0/4] Add internal-only BPF per-CPU instruction John Fastabend
2024-04-02 16:20   ` Andrii Nakryiko
2024-04-03 22:01     ` John Fastabend
2024-04-03 17:40 ` patchwork-bot+netdevbpf
2024-04-03 17:41 ` Alexei Starovoitov
2024-04-03 18:03   ` Andrii Nakryiko
2024-04-03 18:14     ` Alexei Starovoitov
2024-04-04 16:12 ` Andrii Nakryiko
2024-04-04 16:16   ` Puranjay Mohan
2024-04-04 16:18     ` Andrii Nakryiko
2024-04-05 12:48     ` Puranjay Mohan
2024-04-05 13:19     ` Pu Lehui
2024-04-04 21:02   ` Puranjay Mohan
2024-04-04 21:21     ` Andrii Nakryiko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox