* [PATCH bpf-next 1/4] bpf: tailcall: Introduce bpf_arch_tail_call_prologue_offset
2026-01-02 15:00 [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
@ 2026-01-02 15:00 ` Leon Hwang
2026-01-02 15:21 ` bot+bpf-ci
2026-01-02 15:00 ` [PATCH bpf-next 2/4] bpf, x64: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Leon Hwang @ 2026-01-02 15:00 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Andrew Morton,
linux-arm-kernel, linux-kernel, netdev, kernel-patches-bot,
Leon Hwang
Introduce bpf_arch_tail_call_prologue_offset() to allow architectures
to specify the offset from bpf_func to the actual program entry point
for tail calls. This offset accounts for prologue instructions that
should be skipped (e.g., fentry NOPs, TCC initialization).
When an architecture provides a non-zero prologue offset, prog arrays
allocate additional space to cache precomputed tail call targets:
array->ptrs[max_entries + index] = prog->bpf_func + prologue_offset
This cached target is updated atomically via xchg() when programs are
added or removed from the prog array, eliminating the need to compute
the target address at runtime during tail calls.
The function is exported for use by the test_bpf module.
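The update path described above can be sketched as a standalone C model (names such as `fake_prog`, `MAX_ENTRIES`, and `PROLOGUE_OFFSET` are stand-ins; the kernel code operates on `struct bpf_array` and uses `xchg()` rather than C11 atomics):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 4
#define PROLOGUE_OFFSET 16 /* stand-in for e.g. X86_TAIL_CALL_OFFSET */

struct fake_prog { void *bpf_func; };

/* First half: prog pointers; second half: cached tail call targets. */
static _Atomic uintptr_t ptrs[MAX_ENTRIES * 2];

static void tail_call_target_update(uint32_t key, struct fake_prog *new_prog)
{
	uintptr_t target = new_prog ?
		(uintptr_t)new_prog->bpf_func + PROLOGUE_OFFSET : 0;
	/* Atomic swap, analogous to the kernel's xchg(). */
	atomic_exchange(&ptrs[MAX_ENTRIES + key], target);
}
```

A removed program (NULL) clears the cached slot to 0, which the JITed code can test directly before jumping.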
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
include/linux/bpf.h | 1 +
kernel/bpf/arraymap.c | 27 ++++++++++++++++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4e7d72dfbcd4..acd85c239af9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3792,6 +3792,7 @@ int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type old_t,
void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
struct bpf_prog *new, struct bpf_prog *old);
+int bpf_arch_tail_call_prologue_offset(void);
void *bpf_arch_text_copy(void *dst, void *src, size_t len);
int bpf_arch_text_invalidate(void *dst, size_t len);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1eeb31c5b317..beedd1281c22 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -127,6 +127,9 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
array_size += (u64) max_entries * elem_size;
}
}
+ if (attr->map_type == BPF_MAP_TYPE_PROG_ARRAY && bpf_arch_tail_call_prologue_offset())
+ /* Store tailcall targets */
+ array_size += (u64) max_entries * sizeof(void *);
/* allocate all map elements and zero-initialize them */
if (attr->map_flags & BPF_F_MMAPABLE) {
@@ -1087,16 +1090,38 @@ void __weak bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
WARN_ON_ONCE(1);
}
+int __weak bpf_arch_tail_call_prologue_offset(void)
+{
+ return 0;
+}
+EXPORT_SYMBOL_GPL(bpf_arch_tail_call_prologue_offset);
+
+static void bpf_tail_call_target_update(struct bpf_array *array, u32 key, struct bpf_prog *new)
+{
+ int offset = bpf_arch_tail_call_prologue_offset();
+ void *target;
+
+ if (!offset)
+ return;
+
+ target = new ? (void *) new->bpf_func + offset : 0;
+ xchg(array->ptrs + array->map.max_entries + key, target);
+}
+
static void prog_array_map_poke_run(struct bpf_map *map, u32 key,
struct bpf_prog *old,
struct bpf_prog *new)
{
struct prog_poke_elem *elem;
struct bpf_array_aux *aux;
+ struct bpf_array *array;
- aux = container_of(map, struct bpf_array, map)->aux;
+ array = container_of(map, struct bpf_array, map);
+ aux = array->aux;
WARN_ON_ONCE(!mutex_is_locked(&aux->poke_mutex));
+ bpf_tail_call_target_update(array, key, new);
+
list_for_each_entry(elem, &aux->poke_progs, list) {
struct bpf_jit_poke_descriptor *poke;
int i;
--
2.52.0
^ permalink raw reply related [flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: tailcall: Introduce bpf_arch_tail_call_prologue_offset

2026-01-02 15:00 ` [PATCH bpf-next 1/4] bpf: tailcall: Introduce bpf_arch_tail_call_prologue_offset Leon Hwang
@ 2026-01-02 15:21 ` bot+bpf-ci
2026-01-02 15:38 ` Leon Hwang
0 siblings, 1 reply; 13+ messages in thread
From: bot+bpf-ci @ 2026-01-02 15:21 UTC (permalink / raw)
To: leon.hwang, bpf
Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, puranjay, xukuohai,
catalin.marinas, will, davem, dsahern, tglx, mingo, bp,
dave.hansen, x86, hpa, akpm, linux-arm-kernel, linux-kernel,
netdev, kernel-patches-bot, leon.hwang, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 1205 bytes --]
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 1eeb31c5b317..beedd1281c22 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -127,6 +127,9 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> array_size += (u64) max_entries * elem_size;
> }
> }
> + if (attr->map_type == BPF_MAP_TYPE_PROG_ARRAY && bpf_arch_tail_call_prologue_offset())
> + /* Store tailcall targets */
> + array_size += (u64) max_entries * sizeof(void *);
Should array_map_mem_usage() also account for this additional space?
Currently array_map_mem_usage() handles special cases for percpu arrays
and mmapable arrays, but has no corresponding check for PROG_ARRAY maps
when bpf_arch_tail_call_prologue_offset() is non-zero. This would cause
the reported memory usage to be lower than the actual allocation when
an architecture implements this function (e.g., x86_64 and arm64 in the
subsequent patches of this series).
[ ... ]
---
AI reviewed your patch. Please fix the bug, or reply by email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20660615603
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 1/4] bpf: tailcall: Introduce bpf_arch_tail_call_prologue_offset
2026-01-02 15:21 ` bot+bpf-ci
@ 2026-01-02 15:38 ` Leon Hwang
0 siblings, 0 replies; 13+ messages in thread
From: Leon Hwang @ 2026-01-02 15:38 UTC (permalink / raw)
To: bot+bpf-ci, bpf
Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, puranjay, xukuohai,
catalin.marinas, will, davem, dsahern, tglx, mingo, bp,
dave.hansen, x86, hpa, akpm, linux-arm-kernel, linux-kernel,
netdev, kernel-patches-bot, martin.lau, clm, ihor.solodrai
On 2026/1/2 23:21, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>> index 1eeb31c5b317..beedd1281c22 100644
>> --- a/kernel/bpf/arraymap.c
>> +++ b/kernel/bpf/arraymap.c
>> @@ -127,6 +127,9 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
>> array_size += (u64) max_entries * elem_size;
>> }
>> }
>> + if (attr->map_type == BPF_MAP_TYPE_PROG_ARRAY && bpf_arch_tail_call_prologue_offset())
>> + /* Store tailcall targets */
>> + array_size += (u64) max_entries * sizeof(void *);
>
> Should array_map_mem_usage() also account for this additional space?
> Currently array_map_mem_usage() handles special cases for percpu arrays
> and mmapable arrays, but has no corresponding check for PROG_ARRAY maps
> when bpf_arch_tail_call_prologue_offset() is non-zero. This would cause
> the reported memory usage to be lower than the actual allocation when
> an architecture implements this function (e.g., x86_64 and arm64 in the
> subsequent patches of this series).
>
> [ ... ]
>
You are right, array_map_mem_usage() needs to stay in sync with the
allocation logic to avoid under-reporting memory.
I will include this fix in the next revision after I’ve collected more
feedback.
Thanks,
Leon
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next 2/4] bpf, x64: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-02 15:00 [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
2026-01-02 15:00 ` [PATCH bpf-next 1/4] bpf: tailcall: Introduce bpf_arch_tail_call_prologue_offset Leon Hwang
@ 2026-01-02 15:00 ` Leon Hwang
2026-01-02 15:00 ` [PATCH bpf-next 3/4] bpf, arm64: " Leon Hwang
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Leon Hwang @ 2026-01-02 15:00 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Andrew Morton,
linux-arm-kernel, linux-kernel, netdev, kernel-patches-bot,
Leon Hwang
Optimize BPF tail calls on x86_64 by eliminating runtime memory accesses
for max_entries and prog->bpf_func when the prog array map is known at
verification time.
The verifier now encodes three fields in the tail call instruction's imm:
- bits 0-7: map index in used_maps[] (max 63)
- bits 8-15: dynamic array flag (1 if map pointer is poisoned)
- bits 16-31: poke table index + 1 for direct tail calls (max 1023)
For static tail calls (map known at verification time):
- max_entries is embedded as an immediate in the comparison instruction
- The cached target from array->ptrs[max_entries + index] is used
directly, avoiding the prog->bpf_func dereference
For dynamic tail calls (map pointer poisoned):
- Fall back to runtime lookup of max_entries and prog->bpf_func
This reduces cache misses and improves tail call performance for the
common case where the prog array is statically known.
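As a standalone illustration of the bit layout (the helper names below are invented; the kernel open-codes these shifts in do_misc_fixups() and the JIT, and the series caps the map index at 63 and the poke index at 1023):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * imm layout per this series:
 *   bits  0-7  : map index in used_maps[]
 *   bits  8-15 : dynamic array flag (map pointer poisoned)
 *   bits 16-31 : poke table index + 1 (0 = no direct poke entry)
 */
static int32_t tail_call_imm_encode(uint8_t map_index, bool poisoned,
				    uint16_t poke_index_plus_one)
{
	return (int32_t)map_index |
	       ((int32_t)poisoned << 8) |
	       ((int32_t)poke_index_plus_one << 16);
}

static void tail_call_imm_decode(int32_t imm, uint32_t *map_index,
				 bool *dyn_array, int32_t *poke)
{
	*map_index = imm & 0xFF;
	*dyn_array = (imm >> 8) & 0xFF;
	*poke      = imm >> 16; /* signed shift, matching the JIT's s32 imm16 */
}
```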
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 51 +++++++++++++++++++++++++++----------
kernel/bpf/verifier.c | 30 ++++++++++++++++++++--
2 files changed, 66 insertions(+), 15 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e3b1c4b1d550..9fd707612da5 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -733,11 +733,13 @@ static void emit_return(u8 **pprog, u8 *ip)
* out:
*/
static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
+ u32 map_index, bool dyn_array,
u8 **pprog, bool *callee_regs_used,
u32 stack_depth, u8 *ip,
struct jit_context *ctx)
{
int tcc_ptr_off = BPF_TAIL_CALL_CNT_PTR_STACK_OFF(stack_depth);
+ struct bpf_map *map = bpf_prog->aux->used_maps[map_index];
u8 *prog = *pprog, *start = *pprog;
int offset;
@@ -752,11 +754,14 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
* goto out;
*/
EMIT2(0x89, 0xD2); /* mov edx, edx */
- EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */
- offsetof(struct bpf_array, map.max_entries));
+ if (dyn_array)
+ EMIT3(0x3B, 0x56, /* cmp edx, dword ptr [rsi + 16] */
+ offsetof(struct bpf_array, map.max_entries));
+ else
+ EMIT2_off32(0x81, 0xFA, map->max_entries); /* cmp edx, imm32 (map->max_entries) */
offset = ctx->tail_call_indirect_label - (prog + 2 - start);
- EMIT2(X86_JBE, offset); /* jbe out */
+ EMIT2(X86_JAE, offset); /* jae out */
/*
* if ((*tcc_ptr)++ >= MAX_TAIL_CALL_CNT)
@@ -768,9 +773,15 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
offset = ctx->tail_call_indirect_label - (prog + 2 - start);
EMIT2(X86_JAE, offset); /* jae out */
- /* prog = array->ptrs[index]; */
- EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, /* mov rcx, [rsi + rdx * 8 + offsetof(...)] */
- offsetof(struct bpf_array, ptrs));
+ /*
+ * if (dyn_array)
+ * prog = array->ptrs[index];
+ * else
+ * tgt = array->ptrs[max_entries + index];
+ */
+ offset = offsetof(struct bpf_array, ptrs);
+ offset += dyn_array ? 0 : map->max_entries * sizeof(void *);
+ EMIT4_off32(0x48, 0x8B, 0x8C, 0xD6, offset); /* mov rcx, [rsi + rdx * 8 + offset] */
/*
* if (prog == NULL)
@@ -803,11 +814,14 @@ static void emit_bpf_tail_call_indirect(struct bpf_prog *bpf_prog,
EMIT3_off32(0x48, 0x81, 0xC4, /* add rsp, sd */
round_up(stack_depth, 8));
- /* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */
- EMIT4(0x48, 0x8B, 0x49, /* mov rcx, qword ptr [rcx + 32] */
- offsetof(struct bpf_prog, bpf_func));
- EMIT4(0x48, 0x83, 0xC1, /* add rcx, X86_TAIL_CALL_OFFSET */
- X86_TAIL_CALL_OFFSET);
+ if (dyn_array) {
+ /* goto *(prog->bpf_func + X86_TAIL_CALL_OFFSET); */
+ EMIT4(0x48, 0x8B, 0x49, /* mov rcx, qword ptr [rcx + 32] */
+ offsetof(struct bpf_prog, bpf_func));
+ EMIT4(0x48, 0x83, 0xC1, /* add rcx, X86_TAIL_CALL_OFFSET */
+ X86_TAIL_CALL_OFFSET);
+ }
+
/*
* Now we're ready to jump into next BPF program
* rdi == ctx (1st arg)
@@ -2461,15 +2475,21 @@ st: if (is_imm8(insn->off))
}
case BPF_JMP | BPF_TAIL_CALL:
- if (imm32)
+ bool dynamic_array = (imm32 >> 8) & 0xFF;
+ u32 map_index = imm32 & 0xFF;
+ s32 imm16 = imm32 >> 16;
+
+ if (imm16)
emit_bpf_tail_call_direct(bpf_prog,
- &bpf_prog->aux->poke_tab[imm32 - 1],
+ &bpf_prog->aux->poke_tab[imm16 - 1],
&prog, image + addrs[i - 1],
callee_regs_used,
stack_depth,
ctx);
else
emit_bpf_tail_call_indirect(bpf_prog,
+ map_index,
+ dynamic_array,
&prog,
callee_regs_used,
stack_depth,
@@ -4047,6 +4067,11 @@ void bpf_arch_poke_desc_update(struct bpf_jit_poke_descriptor *poke,
}
}
+int bpf_arch_tail_call_prologue_offset(void)
+{
+ return X86_TAIL_CALL_OFFSET;
+}
+
bool bpf_jit_supports_arena(void)
{
return true;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3d44c5d06623..ab9c84e76a62 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -22602,6 +22602,18 @@ static int add_hidden_subprog(struct bpf_verifier_env *env, struct bpf_insn *pat
return 0;
}
+static int tail_call_find_map_index(struct bpf_verifier_env *env, struct bpf_map *map)
+{
+ int i;
+
+ for (i = 0; i < env->used_map_cnt; i++) {
+ if (env->used_maps[i] == map)
+ return i;
+ }
+
+ return -ENOENT;
+}
+
/* Do various post-verification rewrites in a single program pass.
* These rewrites simplify JIT and interpreter implementations.
*/
@@ -22993,10 +23005,24 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
* call and to prevent accidental JITing by JIT compiler
* that doesn't support bpf_tail_call yet
*/
- insn->imm = 0;
insn->code = BPF_JMP | BPF_TAIL_CALL;
+ /*
+ * insn->imm contains 3 fields:
+ * map index(8 bits): 6 bits are enough, 63 max
+ * poisoned(8 bits): 1 bit is enough
+ * poke index(16 bits): 1023 max
+ */
+
aux = &env->insn_aux_data[i + delta];
+ insn->imm = tail_call_find_map_index(env, aux->map_ptr_state.map_ptr);
+ if (insn->imm < 0) {
+ verifier_bug(env, "index not found for prog array map\n");
+ return -EINVAL;
+ }
+
+ insn->imm |= bpf_map_ptr_poisoned(aux) << 8;
+
if (env->bpf_capable && !prog->blinding_requested &&
prog->jit_requested &&
!bpf_map_key_poisoned(aux) &&
@@ -23015,7 +23041,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
return ret;
}
- insn->imm = ret + 1;
+ insn->imm |= (ret + 1) << 16;
goto next_insn;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 13+ messages in thread

* [PATCH bpf-next 3/4] bpf, arm64: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-02 15:00 [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
2026-01-02 15:00 ` [PATCH bpf-next 1/4] bpf: tailcall: Introduce bpf_arch_tail_call_prologue_offset Leon Hwang
2026-01-02 15:00 ` [PATCH bpf-next 2/4] bpf, x64: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
@ 2026-01-02 15:00 ` Leon Hwang
2026-01-02 15:00 ` [PATCH bpf-next 4/4] bpf, lib/test_bpf: Fix broken tailcall tests Leon Hwang
2026-01-03 0:10 ` [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Alexei Starovoitov
4 siblings, 0 replies; 13+ messages in thread
From: Leon Hwang @ 2026-01-02 15:00 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Andrew Morton,
linux-arm-kernel, linux-kernel, netdev, kernel-patches-bot,
Leon Hwang
Apply the same tail call optimization to arm64 as done for x86_64.
When the prog array map is known at verification time (dyn_array=false):
- Embed max_entries as an immediate value instead of loading from memory
- Use the precomputed target from array->ptrs[max_entries + index]
- Jump directly to the cached target without dereferencing prog->bpf_func
When the map is dynamically determined (dyn_array=true):
- Load max_entries from the array at runtime
- Look up prog from array->ptrs[index] and compute the target address
Implement bpf_arch_tail_call_prologue_offset() returning
"PROLOGUE_OFFSET * 4" to convert the instruction count to bytes.
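The two lookup paths can be modeled side by side in plain C (a sketch only; `fake_prog` and the function parameterization are assumptions, since the real logic is emitted as JITed instructions):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fake_prog { void *bpf_func; };

/* ptrs[0..max_entries-1]: prog pointers (dynamic path);
 * ptrs[max_entries..2*max_entries-1]: cached entry addresses (static path).
 */
static void *tail_call_target(void **ptrs, uint32_t max_entries,
			      uint32_t index, bool dyn_array,
			      int prologue_off)
{
	if (index >= max_entries)
		return NULL; /* goto out */
	if (dyn_array) {
		/* Two dependent loads: prog pointer, then prog->bpf_func. */
		struct fake_prog *prog = ptrs[index];

		return prog ? (char *)prog->bpf_func + prologue_off : NULL;
	}
	/* One load: the precomputed target, no prog dereference. */
	return ptrs[max_entries + index];
}
```

The static path trades the extra pointer chase for a single load from the cached half of the array, which is what removes the `prog->bpf_func` access at runtime.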
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
arch/arm64/net/bpf_jit_comp.c | 71 +++++++++++++++++++++++++----------
1 file changed, 51 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 0c4d44bcfbf4..bcd890bff36a 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -620,8 +620,10 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
return 0;
}
-static int emit_bpf_tail_call(struct jit_ctx *ctx)
+static int emit_bpf_tail_call(struct jit_ctx *ctx, u32 map_index, bool dyn_array)
{
+ struct bpf_map *map = ctx->prog->aux->used_maps[map_index];
+
/* bpf_tail_call(void *prog_ctx, struct bpf_array *array, u64 index) */
const u8 r2 = bpf2a64[BPF_REG_2];
const u8 r3 = bpf2a64[BPF_REG_3];
@@ -638,9 +640,13 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
/* if (index >= array->map.max_entries)
* goto out;
*/
- off = offsetof(struct bpf_array, map.max_entries);
- emit_a64_mov_i64(tmp, off, ctx);
- emit(A64_LDR32(tmp, r2, tmp), ctx);
+ if (dyn_array) {
+ off = offsetof(struct bpf_array, map.max_entries);
+ emit_a64_mov_i64(tmp, off, ctx);
+ emit(A64_LDR32(tmp, r2, tmp), ctx);
+ } else {
+ emit_a64_mov_i64(tmp, map->max_entries, ctx);
+ }
emit(A64_MOV(0, r3, r3), ctx);
emit(A64_CMP(0, r3, tmp), ctx);
branch1 = ctx->image + ctx->idx;
@@ -659,15 +665,26 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
/* (*tail_call_cnt_ptr)++; */
emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
- /* prog = array->ptrs[index];
- * if (prog == NULL)
- * goto out;
- */
- off = offsetof(struct bpf_array, ptrs);
- emit_a64_mov_i64(tmp, off, ctx);
- emit(A64_ADD(1, tmp, r2, tmp), ctx);
- emit(A64_LSL(1, prg, r3, 3), ctx);
- emit(A64_LDR64(prg, tmp, prg), ctx);
+ if (dyn_array) {
+ /* prog = array->ptrs[index];
+ * if (prog == NULL)
+ * goto out;
+ */
+ off = offsetof(struct bpf_array, ptrs);
+ emit_a64_mov_i64(tmp, off, ctx);
+ emit(A64_ADD(1, tmp, r2, tmp), ctx);
+ emit(A64_LSL(1, prg, r3, 3), ctx);
+ emit(A64_LDR64(prg, tmp, prg), ctx);
+ } else {
+ /* tgt = array->ptrs[max_entries + index];
+ * if (tgt == 0)
+ * goto out;
+ */
+ emit(A64_LSL(1, prg, r3, 3), ctx);
+ off = offsetof(struct bpf_array, ptrs) + map->max_entries * sizeof(void *);
+ emit_a64_add_i(1, prg, prg, tmp, off, ctx);
+ emit(A64_LDR64(prg, r2, prg), ctx);
+ }
branch3 = ctx->image + ctx->idx;
emit(A64_NOP, ctx);
@@ -680,12 +697,17 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
pop_callee_regs(ctx);
- /* goto *(prog->bpf_func + prologue_offset); */
- off = offsetof(struct bpf_prog, bpf_func);
- emit_a64_mov_i64(tmp, off, ctx);
- emit(A64_LDR64(tmp, prg, tmp), ctx);
- emit(A64_ADD_I(1, tmp, tmp, sizeof(u32) * PROLOGUE_OFFSET), ctx);
- emit(A64_BR(tmp), ctx);
+ if (dyn_array) {
+ /* goto *(prog->bpf_func + prologue_offset); */
+ off = offsetof(struct bpf_prog, bpf_func);
+ emit_a64_mov_i64(tmp, off, ctx);
+ emit(A64_LDR64(tmp, prg, tmp), ctx);
+ emit(A64_ADD_I(1, tmp, tmp, sizeof(u32) * PROLOGUE_OFFSET), ctx);
+ emit(A64_BR(tmp), ctx);
+ } else {
+ /* goto *tgt; */
+ emit(A64_BR(prg), ctx);
+ }
if (ctx->image) {
off = &ctx->image[ctx->idx] - branch1;
@@ -701,6 +723,12 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
return 0;
}
+int bpf_arch_tail_call_prologue_offset(void)
+{
+ /* offset is in instructions, convert to bytes */
+ return PROLOGUE_OFFSET * 4;
+}
+
static int emit_atomic_ld_st(const struct bpf_insn *insn, struct jit_ctx *ctx)
{
const s32 imm = insn->imm;
@@ -1617,7 +1645,10 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
}
/* tail call */
case BPF_JMP | BPF_TAIL_CALL:
- if (emit_bpf_tail_call(ctx))
+ bool dynamic_array = (insn->imm >> 8) & 0xFF;
+ u32 map_index = insn->imm & 0xFF;
+
+ if (emit_bpf_tail_call(ctx, map_index, dynamic_array))
return -EFAULT;
break;
/* function return */
--
2.52.0
^ permalink raw reply related [flat|nested] 13+ messages in thread

* [PATCH bpf-next 4/4] bpf, lib/test_bpf: Fix broken tailcall tests
2026-01-02 15:00 [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
` (2 preceding siblings ...)
2026-01-02 15:00 ` [PATCH bpf-next 3/4] bpf, arm64: " Leon Hwang
@ 2026-01-02 15:00 ` Leon Hwang
2026-01-03 0:10 ` [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Alexei Starovoitov
4 siblings, 0 replies; 13+ messages in thread
From: Leon Hwang @ 2026-01-02 15:00 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Andrew Morton,
linux-arm-kernel, linux-kernel, netdev, kernel-patches-bot,
Leon Hwang
Update the tail call tests in test_bpf to work with the new tail call
optimization that requires:
1. A valid used_maps array pointing to the prog array
2. Precomputed tail call targets in array->ptrs[max_entries + index]
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
lib/test_bpf.c | 39 ++++++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index af0041df2b72..680d34d46f19 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -15448,26 +15448,45 @@ static void __init destroy_tail_call_tests(struct bpf_array *progs)
{
int i;
- for (i = 0; i < ARRAY_SIZE(tail_call_tests); i++)
- if (progs->ptrs[i])
- bpf_prog_free(progs->ptrs[i]);
+ for (i = 0; i < ARRAY_SIZE(tail_call_tests); i++) {
+ struct bpf_prog *fp = progs->ptrs[i];
+
+ if (!fp)
+ continue;
+
+ /*
+ * The used_maps points to fake maps that don't have
+ * proper ops, so clear it before bpf_prog_free to avoid
+ * bpf_free_used_maps trying to process it.
+ */
+ kfree(fp->aux->used_maps);
+ fp->aux->used_maps = NULL;
+ fp->aux->used_map_cnt = 0;
+ bpf_prog_free(fp);
+ }
kfree(progs);
}
static __init int prepare_tail_call_tests(struct bpf_array **pprogs)
{
+ int prologue_offset = bpf_arch_tail_call_prologue_offset();
int ntests = ARRAY_SIZE(tail_call_tests);
+ u32 max_entries = ntests + 1;
struct bpf_array *progs;
int which, err;
/* Allocate the table of programs to be used for tail calls */
- progs = kzalloc(struct_size(progs, ptrs, ntests + 1), GFP_KERNEL);
+ progs = kzalloc(struct_size(progs, ptrs, max_entries * 2), GFP_KERNEL);
if (!progs)
goto out_nomem;
+ /* Set max_entries before JIT, as it's used in JIT */
+ progs->map.max_entries = max_entries;
+
/* Create all eBPF programs and populate the table */
for (which = 0; which < ntests; which++) {
struct tail_call_test *test = &tail_call_tests[which];
+ struct bpf_map *map = &progs->map;
struct bpf_prog *fp;
int len, i;
@@ -15487,10 +15506,16 @@ static __init int prepare_tail_call_tests(struct bpf_array **pprogs)
if (!fp)
goto out_nomem;
+ fp->aux->used_maps = kmalloc_array(1, sizeof(map), GFP_KERNEL);
+ if (!fp->aux->used_maps)
+ goto out_nomem;
+
fp->len = len;
fp->type = BPF_PROG_TYPE_SOCKET_FILTER;
fp->aux->stack_depth = test->stack_depth;
fp->aux->tail_call_reachable = test->has_tail_call;
+ fp->aux->used_maps[0] = map;
+ fp->aux->used_map_cnt = 1;
memcpy(fp->insnsi, test->insns, len * sizeof(struct bpf_insn));
/* Relocate runtime tail call offsets and addresses */
@@ -15548,6 +15573,10 @@ static __init int prepare_tail_call_tests(struct bpf_array **pprogs)
if ((long)__bpf_call_base + insn->imm != addr)
*insn = BPF_JMP_A(0); /* Skip: NOP */
break;
+
+ case BPF_JMP | BPF_TAIL_CALL:
+ insn->imm = 0;
+ break;
}
}
@@ -15555,11 +15584,11 @@ static __init int prepare_tail_call_tests(struct bpf_array **pprogs)
if (err)
goto out_err;
+ progs->ptrs[max_entries + which] = (void *) fp->bpf_func + prologue_offset;
progs->ptrs[which] = fp;
}
/* The last entry contains a NULL program pointer */
- progs->map.max_entries = ntests + 1;
*pprogs = progs;
return 0;
--
2.52.0
^ permalink raw reply related [flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-02 15:00 [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Leon Hwang
` (3 preceding siblings ...)
2026-01-02 15:00 ` [PATCH bpf-next 4/4] bpf, lib/test_bpf: Fix broken tailcall tests Leon Hwang
@ 2026-01-03 0:10 ` Alexei Starovoitov
2026-01-14 11:28 ` Jiri Olsa
4 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2026-01-03 0:10 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
Andrew Morton, linux-arm-kernel, LKML, Network Development,
kernel-patches-bot
On Fri, Jan 2, 2026 at 7:01 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> This patch series optimizes BPF tail calls on x86_64 and arm64 by
> eliminating runtime memory accesses for max_entries and 'prog->bpf_func'
> when the prog array map is known at verification time.
>
> Currently, every tail call requires:
> 1. Loading max_entries from the prog array map
> 2. Dereferencing 'prog->bpf_func' to get the target address
>
> This series introduces a mechanism to precompute and cache the tail call
> target addresses (bpf_func + prologue_offset) in the prog array itself:
> array->ptrs[max_entries + index] = prog->bpf_func + prologue_offset
>
> When a program is added to or removed from the prog array, the cached
> target is atomically updated via xchg().
>
> The verifier now encodes additional information in the tail call
> instruction's imm field:
> - bits 0-7: map index in used_maps[]
> - bits 8-15: dynamic array flag (1 if map pointer is poisoned)
> - bits 16-31: poke table index + 1 for direct tail calls
>
> For static tail calls (map known at verification time):
> - max_entries is embedded as an immediate in the comparison instruction
> - The cached target from array->ptrs[max_entries + index] is used
> directly, avoiding the 'prog->bpf_func' dereference
>
> For dynamic tail calls (map pointer poisoned):
> - Fall back to runtime lookup of max_entries and prog->bpf_func
>
> This reduces cache misses and improves tail call performance for the
> common case where the prog array is statically known.
Sorry, I don't like this. tail_calls are complex enough and
I'd rather let them be as-is and deprecate their usage altogether
instead of trying to optimize them in certain conditions.
We have indirect jumps now. The next step is indirect calls.
When it lands there will be no need to use tail_calls.
Consider tail_calls to be legacy. No reason to improve them.
pw-bot: cr
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-03 0:10 ` [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime Alexei Starovoitov
@ 2026-01-14 11:28 ` Jiri Olsa
2026-01-14 16:04 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Jiri Olsa @ 2026-01-14 11:28 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
Andrew Morton, linux-arm-kernel, LKML, Network Development,
kernel-patches-bot
On Fri, Jan 02, 2026 at 04:10:01PM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 2, 2026 at 7:01 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >
> > This patch series optimizes BPF tail calls on x86_64 and arm64 by
> > eliminating runtime memory accesses for max_entries and 'prog->bpf_func'
> > when the prog array map is known at verification time.
> >
> > Currently, every tail call requires:
> > 1. Loading max_entries from the prog array map
> > 2. Dereferencing 'prog->bpf_func' to get the target address
> >
> > This series introduces a mechanism to precompute and cache the tail call
> > target addresses (bpf_func + prologue_offset) in the prog array itself:
> > array->ptrs[max_entries + index] = prog->bpf_func + prologue_offset
> >
> > When a program is added to or removed from the prog array, the cached
> > target is atomically updated via xchg().
> >
> > The verifier now encodes additional information in the tail call
> > instruction's imm field:
> > - bits 0-7: map index in used_maps[]
> > - bits 8-15: dynamic array flag (1 if map pointer is poisoned)
> > - bits 16-31: poke table index + 1 for direct tail calls
> >
> > For static tail calls (map known at verification time):
> > - max_entries is embedded as an immediate in the comparison instruction
> > - The cached target from array->ptrs[max_entries + index] is used
> > directly, avoiding the 'prog->bpf_func' dereference
> >
> > For dynamic tail calls (map pointer poisoned):
> > - Fall back to runtime lookup of max_entries and prog->bpf_func
> >
> > This reduces cache misses and improves tail call performance for the
> > common case where the prog array is statically known.
>
> Sorry, I don't like this. tail_calls are complex enough and
> I'd rather let them be as-is and deprecate their usage altogether
> instead of trying to optimize them in certain conditions.
> We have indirect jumps now. The next step is indirect calls.
> When it lands there will be no need to use tail_calls.
> Consider tail_calls to be legacy. No reason to improve them.
hi,
I'd like to make tail calls available in sleepable programs. I still
need to check whether there's a technical reason we don't have that, but
seeing this answer I wonder if you'd be against that anyway?
FYI, I briefly discussed this with Andrii, indicating that it might not
be worth the effort at this stage.
thanks,
jirka
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-14 11:28 ` Jiri Olsa
@ 2026-01-14 16:04 ` Alexei Starovoitov
2026-01-14 21:00 ` Jiri Olsa
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2026-01-14 16:04 UTC (permalink / raw)
To: Jiri Olsa
Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
Andrew Morton, linux-arm-kernel, LKML, Network Development,
kernel-patches-bot
On Wed, Jan 14, 2026 at 3:28 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Jan 02, 2026 at 04:10:01PM -0800, Alexei Starovoitov wrote:
> > On Fri, Jan 2, 2026 at 7:01 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> > >
> > > This patch series optimizes BPF tail calls on x86_64 and arm64 by
> > > eliminating runtime memory accesses for max_entries and 'prog->bpf_func'
> > > when the prog array map is known at verification time.
> > >
> > > Currently, every tail call requires:
> > > 1. Loading max_entries from the prog array map
> > > 2. Dereferencing 'prog->bpf_func' to get the target address
> > >
> > > This series introduces a mechanism to precompute and cache the tail call
> > > target addresses (bpf_func + prologue_offset) in the prog array itself:
> > > array->ptrs[max_entries + index] = prog->bpf_func + prologue_offset
> > >
> > > When a program is added to or removed from the prog array, the cached
> > > target is atomically updated via xchg().
> > >
> > > The verifier now encodes additional information in the tail call
> > > instruction's imm field:
> > > - bits 0-7: map index in used_maps[]
> > > - bits 8-15: dynamic array flag (1 if map pointer is poisoned)
> > > - bits 16-31: poke table index + 1 for direct tail calls
> > >
> > > For static tail calls (map known at verification time):
> > > - max_entries is embedded as an immediate in the comparison instruction
> > > - The cached target from array->ptrs[max_entries + index] is used
> > > directly, avoiding the 'prog->bpf_func' dereference
> > >
> > > For dynamic tail calls (map pointer poisoned):
> > > - Fall back to runtime lookup of max_entries and prog->bpf_func
> > >
> > > This reduces cache misses and improves tail call performance for the
> > > common case where the prog array is statically known.
> >
> > Sorry, I don't like this. tail_calls are complex enough and
> > I'd rather let them be as-is and deprecate their usage altogether
> > instead of trying to optimize them in certain conditions.
> > We have indirect jumps now. The next step is indirect calls.
> > When it lands there will be no need to use tail_calls.
> > Consider tail_calls to be legacy. No reason to improve them.
>
> hi,
> I'd like to make tail calls available in sleepable programs. I still
> need to check if there's technical reason we don't have that, but seeing
> this answer I wonder you'd be against that anyway ?
tail_calls are not allowed in sleepable progs?
I don't remember such a limitation.
What prevents it?
prog_type needs to match, so all sleepable progs should be fine.
The mix and match is problematic due to rcu vs srcu lifetimes.
> fyi I briefly discussed that with Andrii indicating that it might not
> be worth the effort at this stage.
depending on complexity of course.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-14 16:04 ` Alexei Starovoitov
@ 2026-01-14 21:00 ` Jiri Olsa
2026-01-14 21:56 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Jiri Olsa @ 2026-01-14 21:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Jiri Olsa, Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
Andrew Morton, linux-arm-kernel, LKML, Network Development,
kernel-patches-bot
On Wed, Jan 14, 2026 at 08:04:38AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 14, 2026 at 3:28 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, Jan 02, 2026 at 04:10:01PM -0800, Alexei Starovoitov wrote:
> > > On Fri, Jan 2, 2026 at 7:01 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> > > >
> > > > This patch series optimizes BPF tail calls on x86_64 and arm64 by
> > > > eliminating runtime memory accesses for max_entries and 'prog->bpf_func'
> > > > when the prog array map is known at verification time.
> > > >
> > > > Currently, every tail call requires:
> > > > 1. Loading max_entries from the prog array map
> > > > 2. Dereferencing 'prog->bpf_func' to get the target address
> > > >
> > > > This series introduces a mechanism to precompute and cache the tail call
> > > > target addresses (bpf_func + prologue_offset) in the prog array itself:
> > > > array->ptrs[max_entries + index] = prog->bpf_func + prologue_offset
> > > >
> > > > When a program is added to or removed from the prog array, the cached
> > > > target is atomically updated via xchg().
> > > >
> > > > The verifier now encodes additional information in the tail call
> > > > instruction's imm field:
> > > > - bits 0-7: map index in used_maps[]
> > > > - bits 8-15: dynamic array flag (1 if map pointer is poisoned)
> > > > - bits 16-31: poke table index + 1 for direct tail calls
> > > >
> > > > For static tail calls (map known at verification time):
> > > > - max_entries is embedded as an immediate in the comparison instruction
> > > > - The cached target from array->ptrs[max_entries + index] is used
> > > > directly, avoiding the 'prog->bpf_func' dereference
> > > >
> > > > For dynamic tail calls (map pointer poisoned):
> > > > - Fall back to runtime lookup of max_entries and prog->bpf_func
> > > >
> > > > This reduces cache misses and improves tail call performance for the
> > > > common case where the prog array is statically known.
> > >
> > > Sorry, I don't like this. tail_calls are complex enough and
> > > I'd rather let them be as-is and deprecate their usage altogether
> > > instead of trying to optimize them in certain conditions.
> > > We have indirect jumps now. The next step is indirect calls.
> > > When it lands there will be no need to use tail_calls.
> > > Consider tail_calls to be legacy. No reason to improve them.
> >
> > hi,
> > I'd like to make tail calls available in sleepable programs. I still
> > need to check if there's technical reason we don't have that, but seeing
> > this answer I wonder you'd be against that anyway ?
>
> tail_calls are not allowed in sleepable progs?
> I don't remember such a limitation.
> What prevents it?
> prog_type needs to match, so all sleepable progs should be fine.
right, that's what we have: tail-called uprobe programs that we
need to make sleepable
> The mix and match is problematic due to rcu vs srcu life times.
>
> > fyi I briefly discussed that with Andrii indicating that it might not
> > be worth the effort at this stage.
>
> depending on complexity of course.
for my tests I just had to allow the BPF_MAP_TYPE_PROG_ARRAY map
for sleepable programs
jirka
---
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index faa1ecc1fe9d..1f6fc74c7ea1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20969,6 +20969,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
case BPF_MAP_TYPE_STACK:
case BPF_MAP_TYPE_ARENA:
case BPF_MAP_TYPE_INSN_ARRAY:
+ case BPF_MAP_TYPE_PROG_ARRAY:
break;
default:
verbose(env,
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-14 21:00 ` Jiri Olsa
@ 2026-01-14 21:56 ` Alexei Starovoitov
2026-01-15 18:00 ` Jiri Olsa
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2026-01-14 21:56 UTC (permalink / raw)
To: Jiri Olsa
Cc: Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
Andrew Morton, linux-arm-kernel, LKML, Network Development,
kernel-patches-bot
On Wed, Jan 14, 2026 at 1:00 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> >
> > > fyi I briefly discussed that with Andrii indicating that it might not
> > > be worth the effort at this stage.
> >
> > depending on complexity of course.
>
> for my tests I just had to allow BPF_MAP_TYPE_PROG_ARRAY map
> for sleepable programs
>
> jirka
>
>
> ---
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index faa1ecc1fe9d..1f6fc74c7ea1 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20969,6 +20969,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> case BPF_MAP_TYPE_STACK:
> case BPF_MAP_TYPE_ARENA:
> case BPF_MAP_TYPE_INSN_ARRAY:
> + case BPF_MAP_TYPE_PROG_ARRAY:
> break;
> default:
> verbose(env,
Think it through, add selftests, ship it.
On the surface the easy part is to make
__bpf_prog_map_compatible() reject sleepable/non-sleepable combo.
Maybe there are other things.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 0/4] bpf: tailcall: Eliminate max_entries and bpf_func access at runtime
2026-01-14 21:56 ` Alexei Starovoitov
@ 2026-01-15 18:00 ` Jiri Olsa
0 siblings, 0 replies; 13+ messages in thread
From: Jiri Olsa @ 2026-01-15 18:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Jiri Olsa, Leon Hwang, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Puranjay Mohan, Xu Kuohai, Catalin Marinas, Will Deacon,
David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
Andrew Morton, linux-arm-kernel, LKML, Network Development,
kernel-patches-bot
On Wed, Jan 14, 2026 at 01:56:11PM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 14, 2026 at 1:00 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > >
> > > > fyi I briefly discussed that with Andrii indicating that it might not
> > > > be worth the effort at this stage.
> > >
> > > depending on complexity of course.
> >
> > for my tests I just had to allow BPF_MAP_TYPE_PROG_ARRAY map
> > for sleepable programs
> >
> > jirka
> >
> >
> > ---
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index faa1ecc1fe9d..1f6fc74c7ea1 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -20969,6 +20969,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> > case BPF_MAP_TYPE_STACK:
> > case BPF_MAP_TYPE_ARENA:
> > case BPF_MAP_TYPE_INSN_ARRAY:
> > + case BPF_MAP_TYPE_PROG_ARRAY:
> > break;
> > default:
> > verbose(env,
>
> Think it through, add selftests, ship it.
> On the surface the easy part is to make
> __bpf_prog_map_compatible() reject sleepable/non-sleepable combo.
> Maybe there are other things.
ok, thanks
jirka
^ permalink raw reply [flat|nested] 13+ messages in thread