[RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination

bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
@ 2025-04-20 10:55 Raj Sahu
  2025-04-20 10:55 ` [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields Raj Sahu
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Raj Sahu @ 2025-04-20 10:55 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor, Raj

From: Raj <rjsu26@gmail.com>

Motivation:
We propose an enhancement to the BPF infrastructure to address the
critical issue of long-running BPF programs [1,2,3,4]. While the verifier
ensures termination by restricting instruction count and backward edges, the
total execution time of BPF programs is not bounded. Certain helper functions
and iterators can result in extended runtimes, affecting system performance.

The existing BPF infrastructure verifies that programs will not indefinitely
loop or leak resources. However, helper functions such as `bpf_loop`,
`bpf_for_each_map_elem`, and other iterative or expensive kernel interactions
can cause runtimes that significantly degrade system performance [6]. Current
detaching mechanisms do not immediately terminate running instances,
monopolizing CPU. Therefore, a termination solution is necessary to swiftly
terminate execution while safely releasing resources.

Existing termination approach like the BPF Exception or Runtime hooks [5] have
the issue of either lack of dynamism or having runtime overheads: BPF
Exceptions: Introduces bpf_throw() and exception callbacks, requiring stack
unwinding and exception state management. Cleanup can only be done for
pre-defined cancellation points.  Runtime Hooks: Proposes watchdog timers, which
requires resource tracking, though minimal but non-zero runtime overheads.

Design:
We introduce the Fast-Path termination mechanism, leveraging the
verifier's guarantees regarding control flow and resource management. The
approach dynamically patches running BPF programs with a stripped-down version
that accelerates termination. This can be used to terminate any given instance
of a BPF execution. Key elements include:

- Implicit Lifetime Management: Utilizing the verifier’s inherent control flow
  and resource cleanup paths encoded within the BPF program structure,
  eliminating the need for explicit garbage collection or unwinding tables.

- Dynamic Program Patching: At runtime, BPF programs are atomically patched,
  replacing expensive helper calls with stubbed versions (fast fall-through
  implementations). This ensures swift termination without compromising safety
  or leaking resources.

- Helper Function Adjustments: Certain helper functions (e.g., `bpf_loop`,
  `bpf_for_each_map_elem`) include  mechanisms to facilitate early exits through
  modified return values.

TODOs:
- Termination support for nested BPF programs.
- Policy enforcements to control runtime of BPF programs in a system:
- Timer based termination (watchdog)
	- Userspace management to detect low-performing BPF program and
	  terminated them

We haven’t added any selftests in the POC as this mail is mainly to get
feedback on the design. Attaching link to sample BPF programs to
validate the POC [7].  Styling will be taken care in next iteration. 

References:
1. https://lpc.events/event/17/contributions/1610/attachments/1229/2505/LPC_BPF_termination_Raj_Sahu.pdf
2. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/f0749daa-4560-41c9-9f36-6aa618161665/content
3. https://lore.kernel.org/bpf/AM6PR03MB508011599420DB53480E8BF799F72@AM6PR03MB5080.eurprd03.prod.outlook.com/T/
4. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/7fb70c04-0736-4e2d-b48b-2d8d012bacfc/content
5. https://lwn.net/ml/all/AM6PR03MB5080513BFAEB54A93CC70D4399FE2@AM6PR03MB5080.eurprd03.prod.outlook.com/#t
6. https://people.cs.vt.edu/djwillia/papers/ebpf23-runtime.pdf
7. https://github.com/sidchintamaneni/os-dev-env/tree/main/bpf-programs-catalog/research/termination/patch_gen_testing

 arch/x86/kernel/smp.c          |   4 +-
 include/linux/bpf.h            |  18 ++
 include/linux/filter.h         |  16 ++
 include/linux/smp.h            |   2 +-
 include/uapi/linux/bpf.h       |  13 ++
 kernel/bpf/bpf_iter.c          |  65 ++++++
 kernel/bpf/core.c              |  45 ++++
 kernel/bpf/helpers.c           |   8 +
 kernel/bpf/syscall.c           | 375 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |  16 +-
 kernel/smp.c                   |  22 +-
 tools/bpf/bpftool/prog.c       |  40 ++++
 tools/include/uapi/linux/bpf.h |   5 +
 tools/lib/bpf/bpf.c            |  15 ++
 tools/lib/bpf/bpf.h            |  10 +
 tools/lib/bpf/libbpf.map       |   1 +
 16 files changed, 643 insertions(+), 12 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields
  2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
@ 2025-04-20 10:55 ` Raj Sahu
  2025-05-13  0:04   ` Eduard Zingerman
  2025-04-20 10:55 ` [RFC bpf-next 2/4] bpftool/libbpf : Introduce bpf_prog_termination to trigger termination signal Raj Sahu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Raj Sahu @ 2025-04-20 10:55 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor, sidchintamaneni, Raj

From: sidchintamaneni <sidchintamaneni@vt.edu>

Introduces the definition of struct termination_aux_states
required to support fast-path termination of BPF programs.
Adds the memory allocation and free logic for newly added
termination_states feild in struct bpf_prog.

Signed-off-by: Raj <rjsu26@gmail.com>
Signed-off-by: Siddharth <sidchintamaneni@gmail.com>
---
 include/linux/bpf.h | 14 ++++++++++++++
 kernel/bpf/core.c   | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3f0cc89c0622..5141f189b79b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -57,6 +57,7 @@ struct cgroup;
 struct bpf_token;
 struct user_namespace;
 struct super_block;
+struct termination_aux_states;
 struct inode;
 
 extern struct idr btf_idr;
@@ -1518,6 +1519,18 @@ struct btf_mod_pair {
 
 struct bpf_kfunc_desc_tab;
 
+struct cpu_aux {
+	u8 cpu_flag;
+	spinlock_t lock;
+};
+
+struct termination_aux_states {
+	struct cpu_aux *per_cpu_state;
+	struct pt_regs *pre_execution_state;
+	struct bpf_prog *patch_prog;
+	bool is_termination_prog;
+};
+
 struct bpf_prog_aux {
 	atomic64_t refcnt;
 	u32 used_map_cnt;
@@ -1656,6 +1669,7 @@ struct bpf_prog {
 					    const struct bpf_insn *insn);
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
+	struct termination_aux_states *termination_states;
 	/* Instructions for interpreter */
 	union {
 		DECLARE_FLEX_ARRAY(struct sock_filter, insns);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba6b6118cf50..27dcf59f4445 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -99,6 +99,7 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	gfp_t gfp_flags = bpf_memcg_flags(GFP_KERNEL | __GFP_ZERO | gfp_extra_flags);
 	struct bpf_prog_aux *aux;
 	struct bpf_prog *fp;
+	struct termination_aux_states *termination_states = NULL;
 
 	size = round_up(size, __PAGE_SIZE);
 	fp = __vmalloc(size, gfp_flags);
@@ -117,11 +118,35 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 		return NULL;
 	}
 
+	termination_states = kzalloc(sizeof(*termination_states),
+					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
+	if (!termination_states)
+		goto free_bpf_struct_ptr_alloc;
+
+	termination_states->per_cpu_state = kzalloc(sizeof(struct cpu_aux) * NR_CPUS,
+					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags));
+	if (!termination_states->per_cpu_state)
+		goto free_bpf_termination_states;
+
+	for (int i = 0; i < NR_CPUS; i++) {
+		termination_states->per_cpu_state[i].cpu_flag = 0;
+		spin_lock_init(&termination_states->per_cpu_state[i].lock);
+	}
+
+	termination_states->pre_execution_state = kzalloc(
+					sizeof(struct pt_regs) * NR_CPUS,
+					bpf_memcg_flags(GFP_KERNEL | gfp_extra_flags)
+					);
+	if (!termination_states->pre_execution_state)
+		goto free_per_cpu_state;
+
+	termination_states->is_termination_prog = false;
 	fp->pages = size / PAGE_SIZE;
 	fp->aux = aux;
 	fp->aux->prog = fp;
 	fp->jit_requested = ebpf_jit_enabled();
 	fp->blinding_requested = bpf_jit_blinding_enabled(fp);
+	fp->termination_states = termination_states;
 #ifdef CONFIG_CGROUP_BPF
 	aux->cgroup_atype = CGROUP_BPF_ATTACH_TYPE_INVALID;
 #endif
@@ -135,6 +160,16 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	mutex_init(&fp->aux->dst_mutex);
 
 	return fp;
+
+free_per_cpu_state:
+	kfree(termination_states->per_cpu_state);
+free_bpf_termination_states:
+	kfree(termination_states);
+free_bpf_struct_ptr_alloc:
+	free_percpu(fp->active);
+	vfree(fp);
+	kfree(aux);
+	return NULL;
 }
 
 struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
@@ -282,6 +317,13 @@ void __bpf_prog_free(struct bpf_prog *fp)
 		kfree(fp->aux->poke_tab);
 		kfree(fp->aux);
 	}
+
+	if (fp->termination_states) {
+		kfree(fp->termination_states->pre_execution_state);
+		kfree(fp->termination_states->per_cpu_state);
+		kfree(fp->termination_states);
+	}
+
 	free_percpu(fp->stats);
 	free_percpu(fp->active);
 	vfree(fp);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC bpf-next 2/4] bpftool/libbpf : Introduce bpf_prog_termination to trigger termination signal
  2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
  2025-04-20 10:55 ` [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields Raj Sahu
@ 2025-04-20 10:55 ` Raj Sahu
  2025-04-20 10:55 ` [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination Raj Sahu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Raj Sahu @ 2025-04-20 10:55 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor, sidchintamaneni, Raj

From: sidchintamaneni <sidchintamaneni@vt.edu>

Introduces bpf_prog_termination API in libbpf to trigger termination
signal from userspace. Adds do_terminate functionality to bpftool to
use cmd line interface.

cmd - bpftool prog terminate id [] cpu []

Will split this commit to two (bpftool/ libbpf) while sending the
patches

Signed-off-by: Raj <rjsu26@gmail.com>
Signed-off-by: Siddharth <sidchintamaneni@gmail.com>
---
 include/uapi/linux/bpf.h       |  5 +++++
 tools/bpf/bpftool/prog.c       | 40 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++++
 tools/lib/bpf/bpf.c            | 15 +++++++++++++
 tools/lib/bpf/bpf.h            | 10 +++++++++
 tools/lib/bpf/libbpf.map       |  1 +
 6 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 71d5ac83cf5d..9b9061b9b8e1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -961,6 +961,7 @@ enum bpf_cmd {
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
 	BPF_TOKEN_CREATE,
+	BPF_PROG_TERMINATE,
 	__MAX_BPF_CMD,
 };
 
@@ -1842,6 +1843,10 @@ union bpf_attr {
 		__u32		bpffs_fd;
 	} token_create;
 
+	struct { /* struct used by BPF_PROG_TERMINATE command */
+		__u32		prog_id;
+		__u32		term_cpu_id;
+	} prog_terminate;
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index f010295350be..77bf3fa10d46 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -1968,6 +1968,44 @@ static int do_loadall(int argc, char **argv)
 	return load_with_options(argc, argv, false);
 }
 
+static int do_terminate(int argc, char **argv)
+{
+	int prog_id, cpu_id;
+
+	if (!REQ_ARGS(4))
+		return BAD_ARG();
+
+	if (!is_prefix(*argv, "id")) {
+		p_err("expected 'id', got: %s", *argv);
+		return -1;
+	}
+	NEXT_ARG();
+
+	prog_id = atoi(argv[0]);
+	if (prog_id <= 0) {
+		p_err("Invalid prog_id: %d\n", prog_id);
+		return -1;
+	}
+	NEXT_ARG();
+
+	if (!is_prefix(*argv, "cpu")) {
+		p_err("expected 'cpu', got: %s", *argv);
+		return -1;
+	}
+	NEXT_ARG();
+
+	cpu_id = atoi(argv[0]);
+	if (cpu_id < 0) {
+		p_err("Invalid cpu_id: %d\n", cpu_id);
+		return -1;
+	}
+
+	bpf_prog_terminate(prog_id, cpu_id);
+
+	return 0;
+
+}
+
 #ifdef BPFTOOL_WITHOUT_SKELETONS
 
 static int do_profile(int argc, char **argv)
@@ -2466,6 +2504,7 @@ static int do_help(int argc, char **argv)
 
 	fprintf(stderr,
 		"Usage: %1$s %2$s { show | list } [PROG]\n"
+		"	%1$s %2$s terminate PROG CPU\n"
 		"       %1$s %2$s dump xlated PROG [{ file FILE | [opcodes] [linum] [visual] }]\n"
 		"       %1$s %2$s dump jited  PROG [{ file FILE | [opcodes] [linum] }]\n"
 		"       %1$s %2$s pin   PROG FILE\n"
@@ -2525,6 +2564,7 @@ static const struct cmd cmds[] = {
 	{ "tracelog",	do_tracelog },
 	{ "run",	do_run },
 	{ "profile",	do_profile },
+	{ "terminate",	do_terminate },
 	{ 0 }
 };
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 71d5ac83cf5d..9b9061b9b8e1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -961,6 +961,7 @@ enum bpf_cmd {
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
 	BPF_TOKEN_CREATE,
+	BPF_PROG_TERMINATE,
 	__MAX_BPF_CMD,
 };
 
@@ -1842,6 +1843,10 @@ union bpf_attr {
 		__u32		bpffs_fd;
 	} token_create;
 
+	struct { /* struct used by BPF_PROG_TERMINATE command */
+		__u32		prog_id;
+		__u32		term_cpu_id;
+	} prog_terminate;
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index a9c3e33d0f8a..0b9dc9b16e02 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -1331,3 +1331,18 @@ int bpf_token_create(int bpffs_fd, struct bpf_token_create_opts *opts)
 	fd = sys_bpf_fd(BPF_TOKEN_CREATE, &attr, attr_sz);
 	return libbpf_err_errno(fd);
 }
+
+int bpf_prog_terminate(int prog_id, int cpu_id)
+{
+	int fd;
+	union bpf_attr attr;
+	const size_t attr_sz = offsetofend(union bpf_attr, prog_terminate);
+
+	memset(&attr, 0, sizeof(attr));
+	attr.prog_terminate.prog_id = prog_id;
+	attr.prog_terminate.term_cpu_id = cpu_id;
+
+	fd = sys_bpf(BPF_PROG_TERMINATE, &attr, attr_sz);
+
+	return libbpf_err_errno(fd);
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 777627d33d25..6d09d17467b7 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -704,6 +704,16 @@ struct bpf_token_create_opts {
 LIBBPF_API int bpf_token_create(int bpffs_fd,
 				struct bpf_token_create_opts *opts);
 
+
+/**
+ * @brief **bpf_prog_terminate()** when provided with prog id and cpu id
+ * of the running prog, it terminated the running BPF program.
+ *
+ * @param BPF program file descriptor
+ * @cpu_id cpu id of the running program
+ */
+LIBBPF_API int bpf_prog_terminate(int prog_id, int cpu_id);
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 1205f9a4fe04..80793f215464 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -443,4 +443,5 @@ LIBBPF_1.6.0 {
 		bpf_program__line_info_cnt;
 		btf__add_decl_attr;
 		btf__add_type_attr;
+		bpf_prog_terminate;
 } LIBBPF_1.5.0;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
  2025-04-20 10:55 ` [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields Raj Sahu
  2025-04-20 10:55 ` [RFC bpf-next 2/4] bpftool/libbpf : Introduce bpf_prog_termination to trigger termination signal Raj Sahu
@ 2025-04-20 10:55 ` Raj Sahu
  2025-05-13  0:07   ` Eduard Zingerman
  2025-04-20 10:55 ` [RFC bpf-next 4/4] bpf: Runtime part of fast-path termination approach Raj Sahu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Raj Sahu @ 2025-04-20 10:55 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor, Raj

From: Raj <rjsu26@gmail.com>

Introduces patch_generator which generates patch for each BPF
program, iterate through all helper calls and stub them with
a lightweight dummy function.

Also introduces a new patched version of bpf_loop namely
bpf_loop_termination for early exit from the loop.
For bpf_loop inlining case, we modify the return value of
the callback function to terminate early.

Signed-off-by: Raj <rjsu26@gmail.com>
Signed-off-by: Siddharth <sidchintamaneni@gmail.com>
---
 include/linux/bpf.h      |   4 +
 include/uapi/linux/bpf.h |   8 ++
 kernel/bpf/bpf_iter.c    |  65 ++++++++++++
 kernel/bpf/core.c        |   3 +
 kernel/bpf/helpers.c     |   8 ++
 kernel/bpf/syscall.c     | 216 +++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c    |  16 ++-
 7 files changed, 318 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5141f189b79b..5a174b536190 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3400,12 +3400,16 @@ extern const struct bpf_func_proto bpf_unlocked_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_unlocked_sk_getsockopt_proto;
 extern const struct bpf_func_proto bpf_find_vma_proto;
 extern const struct bpf_func_proto bpf_loop_proto;
+extern const struct bpf_func_proto bpf_loop_termination_proto;
 extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
 extern const struct bpf_func_proto bpf_set_retval_proto;
 extern const struct bpf_func_proto bpf_get_retval_proto;
 extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
 extern const struct bpf_func_proto bpf_cgrp_storage_get_proto;
 extern const struct bpf_func_proto bpf_cgrp_storage_delete_proto;
+extern const struct bpf_func_proto bpf_dummy_void_proto;
+extern const struct bpf_func_proto bpf_dummy_int_proto;
+extern const struct bpf_func_proto bpf_dummy_ptr_to_map_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9b9061b9b8e1..d1d1795eefd4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6033,6 +6033,14 @@ union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(dummy_void, 212, ##ctx)			\
+	FN(dummy_int, 213, ##ctx)			\
+	FN(dummy_ptr_to_map, 214, ##ctx)		\
+	FN(loop_termination, 215, ##ctx)		\
+	/*
+	 * TODO: Remove these dummy helper interface because we
+	 * are not exposing them to userspace
+	 */
 	/* This helper list is effectively frozen. If you are trying to	\
 	 * add a new helper, you should add a kfunc instead which has	\
 	 * less stability guarantees. See Documentation/bpf/kfuncs.rst	\
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 380e9a7cac75..e5dceebfb302 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -6,6 +6,7 @@
 #include <linux/filter.h>
 #include <linux/bpf.h>
 #include <linux/rcupdate_trace.h>
+#include <asm/unwind.h>
 
 struct bpf_iter_target_info {
 	struct list_head list;
@@ -775,6 +776,70 @@ const struct bpf_func_proto bpf_loop_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_loop_termination, u32, nr_loops, void *, callback_fn,
+		void *, callback_ctx, u64, flags)
+{
+	/* Since a patched BPF program for termination will want
+	 * to finish as fast as possible,
+	 * we simply don't run any loop in here.
+	 */
+
+	/* Restoring the callee-saved registers and exit.
+	 * Hacky/ err prone way of restoring the registers.
+	 * We are figuring out a way to have arch independent
+	 * way to do this.
+	 */
+
+	asm volatile(
+	"pop %rbx\n\t"
+	"pop %rbp\n\t"
+	"pop %r12\n\t"
+	"pop %r13\n\t"
+	);
+
+	return 0;
+}
+
+const struct bpf_func_proto bpf_loop_termination_proto = {
+       .func           = bpf_loop_termination,
+       .gpl_only       = false,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_ANYTHING,
+       .arg2_type      = ARG_PTR_TO_FUNC,
+       .arg3_type      = ARG_PTR_TO_STACK_OR_NULL,
+       .arg4_type      = ARG_ANYTHING,
+};
+
+BPF_CALL_0(bpf_dummy_void) {
+	return 0;
+}
+
+const struct bpf_func_proto bpf_dummy_void_proto = {
+	.func           = bpf_dummy_void,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+};
+
+BPF_CALL_0(bpf_dummy_int) {
+	return -1;
+}
+
+const struct bpf_func_proto bpf_dummy_int_proto = {
+	.func		= bpf_dummy_int,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+};
+
+BPF_CALL_0(bpf_dummy_ptr_to_map) {
+	return 0;
+}
+
+const struct bpf_func_proto bpf_dummy_ptr_to_map_proto = {
+	.func		= bpf_dummy_ptr_to_map,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
+};
+
 struct bpf_iter_num_kern {
 	int cur; /* current value, inclusive */
 	int end; /* final value, exclusive */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 27dcf59f4445..acc490155b87 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3003,6 +3003,9 @@ const struct bpf_func_proto bpf_snprintf_btf_proto __weak;
 const struct bpf_func_proto bpf_seq_printf_btf_proto __weak;
 const struct bpf_func_proto bpf_set_retval_proto __weak;
 const struct bpf_func_proto bpf_get_retval_proto __weak;
+const struct bpf_func_proto bpf_dummy_void __weak;
+const struct bpf_func_proto bpf_dummy_int __weak;
+const struct bpf_func_proto bpf_dummy_ptr_to_map __weak;
 
 const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e3a2662f4e33..9106c63b9851 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1965,6 +1965,12 @@ bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_current_pid_tgid_proto;
 	case BPF_FUNC_get_ns_current_pid_tgid:
 		return &bpf_get_ns_current_pid_tgid_proto;
+	case BPF_FUNC_dummy_void:
+		return &bpf_dummy_void_proto;
+	case BPF_FUNC_dummy_int:
+		return &bpf_dummy_int_proto;
+	case BPF_FUNC_dummy_ptr_to_map:
+		return &bpf_dummy_ptr_to_map_proto;
 	default:
 		break;
 	}
@@ -1997,6 +2003,8 @@ bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_for_each_map_elem_proto;
 	case BPF_FUNC_loop:
 		return &bpf_loop_proto;
+	case BPF_FUNC_loop_termination:
+		return &bpf_loop_termination_proto;
 	case BPF_FUNC_user_ringbuf_drain:
 		return &bpf_user_ringbuf_drain_proto;
 	case BPF_FUNC_ringbuf_reserve_dynptr:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9794446bc8c6..fb54c5e948ff 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -36,6 +36,7 @@
 #include <linux/memcontrol.h>
 #include <linux/trace_events.h>
 #include <linux/tracepoint.h>
+#include <asm/unwind.h>
 
 #include <net/netfilter/nf_bpf_link.h>
 #include <net/netkit.h>
@@ -51,6 +52,17 @@
 
 #define BPF_OBJ_FLAG_MASK   (BPF_F_RDONLY | BPF_F_WRONLY)
 
+static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
+#define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
+	[_id] = & _name ## _verifier_ops,
+#define BPF_MAP_TYPE(_id, _ops)
+#define BPF_LINK_TYPE(_id, _name)
+#include <linux/bpf_types.h>
+#undef BPF_PROG_TYPE
+#undef BPF_MAP_TYPE
+#undef BPF_LINK_TYPE
+};
+
 DEFINE_PER_CPU(int, bpf_prog_active);
 static DEFINE_IDR(prog_idr);
 static DEFINE_SPINLOCK(prog_idr_lock);
@@ -2756,6 +2768,207 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_PROG_LOAD_LAST_FIELD fd_array_cnt
 
+static int clone_bpf_prog(struct bpf_prog *patch_prog, struct bpf_prog *prog)
+{
+	int err = 0;
+	patch_prog->expected_attach_type = prog->expected_attach_type;
+	patch_prog->len = prog->len;
+	patch_prog->gpl_compatible = prog->gpl_compatible;
+
+	memcpy(patch_prog->insnsi, prog->insnsi,  bpf_prog_insn_size(prog));
+
+	patch_prog->orig_prog = NULL;
+	patch_prog->jited = 0;
+	patch_prog->type = prog->type;
+
+	char *patch_prefix = "patch_";
+	strncpy(patch_prog->aux->name, patch_prefix, strlen(patch_prefix));
+	strncat(patch_prog->aux->name, prog->aux->name, sizeof(prog->aux->name));
+
+	return err;
+}
+
+static bool is_verifier_inlined_function(int func_id) {
+	switch (func_id) {
+		case BPF_FUNC_get_smp_processor_id:
+		case BPF_FUNC_jiffies64:
+		case BPF_FUNC_get_func_arg:
+		case BPF_FUNC_get_func_ret:
+		case BPF_FUNC_get_func_arg_cnt:
+		case BPF_FUNC_get_func_ip:
+		case BPF_FUNC_get_branch_snapshot:
+		case BPF_FUNC_kptr_xchg:
+		case BPF_FUNC_map_lookup_elem:
+			return true;
+		default:
+			return false;
+	}
+}
+
+static bool is_debug_function(int func_id) {
+	switch (func_id) {
+		case BPF_FUNC_trace_printk:
+			return true;
+		default:
+			return false;
+	}
+}
+
+static bool is_resource_release_function(int func_id) {
+	switch (func_id) {
+		case BPF_FUNC_spin_unlock:
+		case BPF_FUNC_ringbuf_submit:
+		case BPF_FUNC_ringbuf_discard:
+			return true;
+		default:
+			return false;
+	}
+}
+
+static bool find_in_skiplist(int func_id) {
+	return is_verifier_inlined_function(func_id) ||
+	       is_debug_function(func_id) ||
+	       is_resource_release_function(func_id);
+}
+
+static int get_replacement_helper(int func_id, enum bpf_return_type ret_type) {
+
+	switch (func_id) {
+		case BPF_FUNC_loop:
+			return BPF_FUNC_loop_termination;
+		case BPF_FUNC_for_each_map_elem:
+		case BPF_FUNC_user_ringbuf_drain:
+			return -ENOTSUPP;
+	}
+
+	switch (ret_type) {
+		case RET_VOID:
+			return BPF_FUNC_dummy_void;
+		case RET_INTEGER:
+			return BPF_FUNC_dummy_int;
+		case RET_PTR_TO_MAP_VALUE_OR_NULL:
+			return BPF_FUNC_dummy_ptr_to_map;
+		case RET_PTR_TO_SOCKET_OR_NULL:
+		case RET_PTR_TO_TCP_SOCK_OR_NULL:
+		case RET_PTR_TO_SOCK_COMMON_OR_NULL:
+		case RET_PTR_TO_RINGBUF_MEM_OR_NULL:
+		case RET_PTR_TO_DYNPTR_MEM_OR_NULL:
+		case RET_PTR_TO_BTF_ID_OR_NULL:
+		case RET_PTR_TO_BTF_ID_TRUSTED:
+		case RET_PTR_TO_MAP_VALUE:
+		case RET_PTR_TO_SOCKET:
+		case RET_PTR_TO_TCP_SOCK:
+		case RET_PTR_TO_SOCK_COMMON:
+		case RET_PTR_TO_MEM:
+		case RET_PTR_TO_MEM_OR_BTF_ID:
+		case RET_PTR_TO_BTF_ID:
+		default:
+			return -ENOTSUPP;
+	}
+}
+
+static void patch_generator(struct bpf_prog *prog)
+{
+	struct call_insn_aux{
+		int insn_idx;
+		int replacement_helper;
+	};
+
+	struct call_insn_aux *call_indices;
+	int num_calls=0;
+	call_indices = vmalloc(sizeof(call_indices) * prog->len);
+
+	/* Find all call insns */
+	for(int insn_idx =0 ;insn_idx < prog->len; insn_idx++)
+	{
+		struct bpf_insn *insn = &prog->insnsi[insn_idx] ;
+		u8 class = BPF_CLASS(insn->code);
+		if (class == BPF_JMP || class == BPF_JMP32) {
+			if (BPF_OP(insn->code) == BPF_CALL){
+				if (insn->src_reg == BPF_PSEUDO_CALL) {
+					continue;
+				}
+				if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL){ /*kfunc */
+					// TODO Need to use btf for getting proto
+					// If release function --> skip
+					// If acquire function --> find return type and add to list
+				}
+				else {
+					int func_id = insn->imm;
+					const struct bpf_func_proto *fn = NULL;
+					int new_helper_id = -1;
+
+					if (find_in_skiplist(func_id)) {
+						continue;
+					}
+
+					fn = bpf_verifier_ops[prog->type]->get_func_proto(func_id, prog);
+					if (!fn && !fn->func) {
+						continue;
+					}
+
+					new_helper_id = get_replacement_helper(func_id, fn->ret_type);
+					if (new_helper_id < 0) {
+						continue;
+					}
+
+					call_indices[num_calls].insn_idx = insn_idx;
+					call_indices[num_calls].replacement_helper= new_helper_id;
+					num_calls++;
+				}
+			}
+		}
+	}
+
+	/* Patch all call insns */
+	for(int k =0; k < num_calls; k++){
+		prog->insnsi[call_indices[k].insn_idx].imm = call_indices[k].replacement_helper;
+	}
+}
+
+static bool create_termination_prog(struct bpf_prog *prog,
+					union bpf_attr *attr,
+					bpfptr_t uattr,
+					u32 uattr_size)
+{
+	if (prog->len < 10)
+		return false;
+
+	int err;
+	struct bpf_prog *patch_prog;
+	patch_prog = bpf_prog_alloc_no_stats(bpf_prog_size(prog->len), 0);
+	if (!patch_prog) {
+		return false;
+	}
+
+	patch_prog->termination_states->is_termination_prog = true;
+
+	err = clone_bpf_prog(patch_prog, prog);
+	if (err)
+			goto free_termination_prog;
+
+	patch_generator(patch_prog);
+
+	err = bpf_check(&patch_prog, attr, uattr, uattr_size);
+	if (err) {
+		goto free_termination_prog;
+	}
+
+	patch_prog = bpf_prog_select_runtime(patch_prog, &err);
+	if (err) {
+		goto free_termination_prog;
+	}
+
+	prog->termination_states->patch_prog = patch_prog;
+	return true;
+
+free_termination_prog:
+	free_percpu(patch_prog->stats);
+	free_percpu(patch_prog->active);
+	kfree(patch_prog->aux);
+	return false;
+}
+
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 {
 	enum bpf_prog_type type = attr->prog_type;
@@ -2765,6 +2978,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	bool bpf_cap;
 	int err;
 	char license[128];
+	bool have_termination_prog = false;
 
 	if (CHECK_ATTR(BPF_PROG_LOAD))
 		return -EINVAL;
@@ -2967,6 +3181,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	if (err)
 		goto free_prog_sec;
 
+	have_termination_prog = create_termination_prog( prog, attr, uattr, uattr_size);
+
 	/* run eBPF verifier */
 	err = bpf_check(&prog, attr, uattr, uattr_size);
 	if (err < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 54c6953a8b84..57b4fd1f6a72 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -513,6 +513,7 @@ static bool is_sync_callback_calling_function(enum bpf_func_id func_id)
 	return func_id == BPF_FUNC_for_each_map_elem ||
 	       func_id == BPF_FUNC_find_vma ||
 	       func_id == BPF_FUNC_loop ||
+	       func_id == BPF_FUNC_loop_termination ||
 	       func_id == BPF_FUNC_user_ringbuf_drain;
 }
 
@@ -11424,6 +11425,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		err = check_bpf_snprintf_call(env, regs);
 		break;
 	case BPF_FUNC_loop:
+	case BPF_FUNC_loop_termination:
 		update_loop_inline_state(env, meta.subprogno);
 		/* Verifier relies on R1 value to determine if bpf_loop() iteration
 		 * is finished, thus mark it precise.
@@ -22470,10 +22472,12 @@ static struct bpf_prog *inline_bpf_loop(struct bpf_verifier_env *env,
 
 	struct bpf_insn *insn_buf = env->insn_buf;
 	struct bpf_prog *new_prog;
+	struct termination_aux_states *termination_states;
 	u32 callback_start;
 	u32 call_insn_offset;
 	s32 callback_offset;
 	u32 cnt = 0;
+	termination_states = env->prog->termination_states;
 
 	/* This represents an inlined version of bpf_iter.c:bpf_loop,
 	 * be careful to modify this code in sync.
@@ -22502,7 +22506,14 @@ static struct bpf_prog *inline_bpf_loop(struct bpf_verifier_env *env,
 	 */
 	insn_buf[cnt++] = BPF_MOV64_REG(BPF_REG_1, reg_loop_cnt);
 	insn_buf[cnt++] = BPF_MOV64_REG(BPF_REG_2, reg_loop_ctx);
-	insn_buf[cnt++] = BPF_CALL_REL(0);
+
+	if (termination_states && termination_states->is_termination_prog) {
+		/* In a termination BPF prog, we want to exit - set R0 = 1 */
+		insn_buf[cnt++] = BPF_MOV64_IMM(BPF_REG_0, 1);
+	} else {
+		insn_buf[cnt++] = BPF_CALL_REL(0);
+	}
+
 	/* increment loop counter */
 	insn_buf[cnt++] = BPF_ALU64_IMM(BPF_ADD, reg_loop_cnt, 1);
 	/* jump to loop header if callback returned 0 */
@@ -22535,7 +22546,8 @@ static bool is_bpf_loop_call(struct bpf_insn *insn)
 {
 	return insn->code == (BPF_JMP | BPF_CALL) &&
 		insn->src_reg == 0 &&
-		insn->imm == BPF_FUNC_loop;
+		(insn->imm == BPF_FUNC_loop
+		 || insn->imm == BPF_FUNC_loop_termination);
 }
 
 /* For all sub-programs in the program (including main) check
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC bpf-next 4/4] bpf: Runtime part of fast-path termination approach
  2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
                   ` (2 preceding siblings ...)
  2025-04-20 10:55 ` [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination Raj Sahu
@ 2025-04-20 10:55 ` Raj Sahu
  2025-05-05 20:13 ` [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Kumar Kartikeya Dwivedi
  2025-05-07 18:15 ` Zvi Effron
  5 siblings, 0 replies; 30+ messages in thread
From: Raj Sahu @ 2025-04-20 10:55 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor, sidchintamaneni, Raj

From: sidchintamaneni <sidchintamaneni@vt.edu>

Introduces IPI based runtime mechanism to terminate
a BPF program. When a BPF program is interrupted by
an IPI, its registers are are passed onto the bpf_die.

Inside bpf_die we perform the RIP switch and stack walk
to replace the return addresses of BPF progs/subprogs to
the patched program. In bpf_die, we are supporting non-inlined
bpf_die scenario as well, later could be extended to other
unrestricted iterators.

Signed-off-by: Raj <rjsu26@gmail.com>
Signed-off-by: Siddharth <sidchintamaneni@gmail.com>
---
 arch/x86/kernel/smp.c  |   4 +-
 include/linux/filter.h |  16 +++++
 include/linux/smp.h    |   2 +-
 kernel/bpf/syscall.c   | 159 +++++++++++++++++++++++++++++++++++++++++
 kernel/smp.c           |  22 ++++--
 5 files changed, 193 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 18266cc3d98c..aca5a97be19f 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -259,7 +259,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_call_function)
 	apic_eoi();
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	inc_irq_stat(irq_call_count);
-	generic_smp_call_function_interrupt();
+	generic_smp_call_function_interrupt(regs);
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
 }
 
@@ -268,7 +268,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_call_function_single)
 	apic_eoi();
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	inc_irq_stat(irq_call_count);
-	generic_smp_call_function_single_interrupt();
+	generic_smp_call_function_single_interrupt(regs);
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
 }
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f5cf4d35d83e..cb75f62a1357 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -689,10 +689,21 @@ extern int (*nfct_btf_struct_access)(struct bpf_verifier_log *log,
 				     const struct bpf_reg_state *reg,
 				     int off, int size);
 
+void bpf_die(void *data);
 typedef unsigned int (*bpf_dispatcher_fn)(const void *ctx,
 					  const struct bpf_insn *insnsi,
 					  unsigned int (*bpf_func)(const void *,
 								   const struct bpf_insn *));
+static void update_term_per_cpu_flag(const struct bpf_prog *prog, u8 cpu_flag)
+{
+	unsigned long flags;
+	u32 cpu_id = raw_smp_processor_id();
+	spin_lock_irqsave(&prog->termination_states->per_cpu_state[cpu_id].lock, 
+				flags);
+	prog->termination_states->per_cpu_state[cpu_id].cpu_flag = cpu_flag;
+	spin_unlock_irqrestore(&prog->termination_states->per_cpu_state[cpu_id].lock,
+				flags);
+}
 
 static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 					  const void *ctx,
@@ -701,12 +712,15 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 	u32 ret;
 
 	cant_migrate();
+
 	if (static_branch_unlikely(&bpf_stats_enabled_key)) {
 		struct bpf_prog_stats *stats;
 		u64 duration, start = sched_clock();
 		unsigned long flags;
 
+		update_term_per_cpu_flag(prog, 1);
 		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
+		update_term_per_cpu_flag(prog, 0);
 
 		duration = sched_clock() - start;
 		stats = this_cpu_ptr(prog->stats);
@@ -715,7 +729,9 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 		u64_stats_add(&stats->nsecs, duration);
 		u64_stats_update_end_irqrestore(&stats->syncp, flags);
 	} else {
+		update_term_per_cpu_flag(prog, 1);
 		ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
+		update_term_per_cpu_flag(prog, 0);
 	}
 	return ret;
 }
diff --git a/include/linux/smp.h b/include/linux/smp.h
index f1aa0952e8c3..a0d8b3263a15 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -173,7 +173,7 @@ void wake_up_all_idle_cpus(void);
  * Generic and arch helpers
  */
 void __init call_function_init(void);
-void generic_smp_call_function_single_interrupt(void);
+void generic_smp_call_function_single_interrupt(struct pt_regs *regs);
 #define generic_smp_call_function_interrupt \
 	generic_smp_call_function_single_interrupt
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fb54c5e948ff..c5911b67eb15 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -6008,6 +6008,162 @@ static int token_create(union bpf_attr *attr)
 	return bpf_token_create(attr);
 }
 
+static bool per_cpu_flag_is_true(struct termination_aux_states *term_states, int cpu_id)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&term_states->per_cpu_state[cpu_id].lock, 
+				flags);
+	if (term_states->per_cpu_state[cpu_id].cpu_flag == 1) {
+		spin_unlock_irqrestore(&term_states->per_cpu_state[cpu_id].lock,
+					flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&term_states->per_cpu_state[cpu_id].lock,
+				flags);
+	return false;
+}
+
+static int is_bpf_address(struct bpf_prog *prog, unsigned long addr) {
+
+        unsigned long bpf_func_addr = (unsigned long)prog->bpf_func;
+        if ((addr > bpf_func_addr) &&
+                        (addr < bpf_func_addr + prog->jited_len)){
+                return 1;
+        }
+
+        for (int subprog = 1; subprog < prog->aux->func_cnt; subprog++) {
+                struct bpf_prog *bpf_subprog = prog->aux->func[subprog];
+                unsigned long bpf_subprog_func_addr =
+                                        (unsigned long)bpf_subprog->bpf_func;
+                if ((addr > bpf_subprog_func_addr) && (addr < bpf_subprog_func_addr +
+                                                        bpf_subprog->jited_len)) {
+                        return 1;
+                }
+        }
+
+        return 0;
+}
+
+static unsigned long find_offset_in_patch_prog(struct bpf_prog *patch_prog,
+                struct bpf_prog *prog, unsigned long addr)
+{
+
+        unsigned long bpf_func_addr = (unsigned long)prog->bpf_func;
+        if ((addr > bpf_func_addr) &&
+                        (addr < bpf_func_addr + prog->jited_len)){
+                unsigned long offset = addr - (unsigned long)prog->bpf_func;
+                return (unsigned long)patch_prog->bpf_func + offset;
+        }
+
+        for (int subprog = 1; subprog < prog->aux->func_cnt; subprog++) {
+                struct bpf_prog *bpf_subprog = prog->aux->func[subprog];
+                unsigned long bpf_subprog_func_addr =
+                                        (unsigned long)bpf_subprog->bpf_func;
+                if ((addr > bpf_subprog_func_addr) && (addr < bpf_subprog_func_addr +
+                                                        bpf_subprog->jited_len)) {
+                        unsigned long offset = addr - (unsigned
+                                        long)prog->aux->func[subprog]->bpf_func;
+                        return (unsigned long)patch_prog->aux->func[subprog]->bpf_func + offset;
+                }
+        }
+
+	return -EINVAL;
+}
+
+
+void bpf_die(void *data)
+{
+	struct unwind_state state;
+	struct bpf_prog *prog, *patch_prog;
+	struct pt_regs *regs;
+	char str[KSYM_SYMBOL_LEN];
+	unsigned long addr, new_addr, bpf_loop_addr, bpf_loop_term_addr;
+	int cpu_id = raw_smp_processor_id();
+
+	prog = (struct bpf_prog *)data;
+	patch_prog = prog->termination_states->patch_prog;
+
+	if(!per_cpu_flag_is_true(prog->termination_states, cpu_id))
+		return;
+
+	regs = &prog->termination_states->pre_execution_state[cpu_id];
+	bpf_loop_addr = (unsigned long)bpf_loop_proto.func;
+	bpf_loop_term_addr = (unsigned long)bpf_loop_termination_proto.func;
+
+	unwind_start(&state, current, regs, NULL);
+	addr = unwind_get_return_address(&state);
+
+	/* BPF programs RIP is in bpf program context when termination
+	 * signal raises an IPI
+	 */
+	if (is_bpf_address(prog, addr)) {
+		new_addr = find_offset_in_patch_prog(patch_prog, prog, addr);
+		if (new_addr < 0)
+			return;
+		regs->ip = new_addr;
+	}
+
+	unsigned long stack_addr = regs->sp;
+	while (addr) {
+		if (is_bpf_address(prog, addr)) {
+			while (*(unsigned long *)stack_addr != addr) {
+				stack_addr += 1;
+			}
+			new_addr = find_offset_in_patch_prog(patch_prog, prog, addr);
+			if (new_addr < 0)
+				return;
+			*(unsigned long *)stack_addr = new_addr;
+		} else {
+			/* Handles termination non-inline bpf_loop scenario.
+			 * Could be modular and later extended to other iterators.
+			 */
+			const char *name = kallsyms_lookup(addr, NULL, NULL, NULL, str);
+			if (name) {
+				unsigned long lookup_addr = kallsyms_lookup_name(name);
+				if (lookup_addr && lookup_addr == bpf_loop_addr) {
+					while (*(unsigned long *)stack_addr != addr) {
+						stack_addr += 1;
+					}
+					*(unsigned long *)stack_addr = bpf_loop_term_addr;
+				}
+			}
+		}
+		unwind_next_frame(&state);
+		addr = unwind_get_return_address(&state);
+	}
+
+	atomic64_dec(&prog->aux->refcnt);
+
+	return;
+}
+
+static int bpf_prog_terminate(union bpf_attr *attr)
+{
+	struct bpf_prog *prog;
+	struct termination_aux_states *term_states;
+	int cpu_id;
+
+	prog = bpf_prog_by_id(attr->prog_id);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	term_states = prog->termination_states;
+	if (!term_states)
+		return -ENOTSUPP;
+
+	cpu_id = attr->prog_terminate.term_cpu_id;
+	if (cpu_id < 0 && cpu_id >= NR_CPUS)
+		return -EINVAL;
+
+	if (!per_cpu_flag_is_true(term_states, cpu_id))
+		return -EFAULT;
+
+	smp_call_function_single(cpu_id, bpf_die, (void *)prog, 1);
+
+	return 0;
+}
+
 static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -6144,6 +6300,9 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_TOKEN_CREATE:
 		err = token_create(&attr);
 		break;
+	case BPF_PROG_TERMINATE:
+		err = bpf_prog_terminate(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/smp.c b/kernel/smp.c
index 974f3a3962e8..f4dcc493b63f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -26,6 +26,7 @@
 #include <linux/sched/debug.h>
 #include <linux/jump_label.h>
 #include <linux/string_choices.h>
+#include <linux/filter.h>
 
 #include <trace/events/ipi.h>
 #define CREATE_TRACE_POINTS
@@ -49,7 +50,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
 
 static DEFINE_PER_CPU(atomic_t, trigger_backtrace) = ATOMIC_INIT(1);
 
-static void __flush_smp_call_function_queue(bool warn_cpu_offline);
+static void __flush_smp_call_function_queue(struct pt_regs *regs, bool warn_cpu_offline);
 
 int smpcfd_prepare_cpu(unsigned int cpu)
 {
@@ -94,7 +95,7 @@ int smpcfd_dying_cpu(unsigned int cpu)
 	 * ensure that the outgoing CPU doesn't go offline with work
 	 * still pending.
 	 */
-	__flush_smp_call_function_queue(false);
+	__flush_smp_call_function_queue(NULL, false);
 	irq_work_run();
 	return 0;
 }
@@ -452,14 +453,15 @@ static int generic_exec_single(int cpu, call_single_data_t *csd)
  * Invoked by arch to handle an IPI for call function single.
  * Must be called with interrupts disabled.
  */
-void generic_smp_call_function_single_interrupt(void)
+void generic_smp_call_function_single_interrupt(struct pt_regs *regs)
 {
-	__flush_smp_call_function_queue(true);
+	__flush_smp_call_function_queue(regs, true);
 }
 
 /**
  * __flush_smp_call_function_queue - Flush pending smp-call-function callbacks
  *
+ * @regs : register state when the interrupted the CPU
  * @warn_cpu_offline: If set to 'true', warn if callbacks were queued on an
  *		      offline CPU. Skip this check if set to 'false'.
  *
@@ -471,7 +473,7 @@ void generic_smp_call_function_single_interrupt(void)
  * Loop through the call_single_queue and run all the queued callbacks.
  * Must be called with interrupts disabled.
  */
-static void __flush_smp_call_function_queue(bool warn_cpu_offline)
+static void __flush_smp_call_function_queue(struct pt_regs *regs, bool warn_cpu_offline)
 {
 	call_single_data_t *csd, *csd_next;
 	struct llist_node *entry, *prev;
@@ -536,6 +538,12 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
 				entry = &csd_next->node.llist;
 			}
 
+			if (func == bpf_die) {
+				int cpu_id = raw_smp_processor_id();
+				struct bpf_prog *prog = (struct bpf_prog *)info;
+				prog->termination_states->
+					pre_execution_state[cpu_id] = *regs;
+			}
 			csd_lock_record(csd);
 			csd_do_func(func, info, csd);
 			csd_unlock(csd);
@@ -567,8 +575,8 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
 				void *info = csd->info;
 
 				csd_lock_record(csd);
-				csd_unlock(csd);
 				csd_do_func(func, info, csd);
+				csd_unlock(csd);
 				csd_lock_record(NULL);
 			} else if (type == CSD_TYPE_IRQ_WORK) {
 				irq_work_single(csd);
@@ -612,7 +620,7 @@ void flush_smp_call_function_queue(void)
 	local_irq_save(flags);
 	/* Get the already pending soft interrupts for RT enabled kernels */
 	was_pending = local_softirq_pending();
-	__flush_smp_call_function_queue(true);
+	__flush_smp_call_function_queue(NULL, true);
 	if (local_softirq_pending())
 		do_softirq_post_smp_call_flush(was_pending);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
                   ` (3 preceding siblings ...)
  2025-04-20 10:55 ` [RFC bpf-next 4/4] bpf: Runtime part of fast-path termination approach Raj Sahu
@ 2025-05-05 20:13 ` Kumar Kartikeya Dwivedi
  2025-05-06  5:55   ` Raj Sahu
  2025-05-07 18:15 ` Zvi Effron
  5 siblings, 1 reply; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-05-05 20:13 UTC (permalink / raw)
  To: Raj Sahu
  Cc: bpf, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	djwillia, miloc, ericts, rahult, doniaghazy, quanzhif, jinghao7,
	sidchintamaneni

On Sun, 20 Apr 2025 at 12:56, Raj Sahu <rjsu26@gmail.com> wrote:
>
> From: Raj <rjsu26@gmail.com>
>
> Motivation:
> We propose an enhancement to the BPF infrastructure to address the
> critical issue of long-running BPF programs [1,2,3,4]. While the verifier
> ensures termination by restricting instruction count and backward edges, the
> total execution time of BPF programs is not bounded. Certain helper functions
> and iterators can result in extended runtimes, affecting system performance.
>
> The existing BPF infrastructure verifies that programs will not indefinitely
> loop or leak resources. However, helper functions such as `bpf_loop`,
> `bpf_for_each_map_elem`, and other iterative or expensive kernel interactions
> can cause runtimes that significantly degrade system performance [6]. Current
> detaching mechanisms do not immediately terminate running instances,
> monopolizing CPU. Therefore, a termination solution is necessary to swiftly
> terminate execution while safely releasing resources.

Thanks for sending out the code and cover letter, much appreciated!

I will keep aside opinions on which of 'fast execute' vs
'panic/unwind' feel semantically cleaner, since it's hard to have an
objective discussion on that. One can argue about abstract concepts
like complexity vs clean design either way.

Instead just some questions/comments on the design.

>
> Existing termination approach like the BPF Exception or Runtime hooks [5] have
> the issue of either lack of dynamism or having runtime overheads: BPF
> Exceptions: Introduces bpf_throw() and exception callbacks, requiring stack
> unwinding and exception state management.

I feel there is an equal amount of state management here, it's just
that it's not apparent directly, and not reified into tables etc.
One of the (valid) concerns with unwinding is that preparing tables of
objects that need to be released requires the verifier/runtime to be
aware of each resource acquiring functions.
Every time you add a new kfunc that acquires some resource, you'd have
to update some place in the kernel to make sure it gets tracked for
clean up too.

But that would be equally true for this case:
- The verifier must know which functions acquire resources, so that it
can replace them with stubs in the cloned text.
- Thus, every time a new kfunc is added, one must introduce its noop
stub and add it to _some_ mapping to acquire kfuncs with their stubs.

> Cleanup can only be done for pre-defined cancellation points.

But can you terminate the program at any arbitrary point with this? It
doesn't seem likely to me. You still have designated points where you
stub empty calls instead of ones which return resources. You will jump
to the return address into the cloned stub on return from an interrupt
that gives you control of the CPU where the program is running. But
apart from the stub functions, everything else would be kept the same.

I think you are conflating the mechanism to clean up resources
(unwinding, this (speed-run-to-exit/fast-execute), runtime log of
resources), with the mechanism to enforce termination.

Both are mutually exclusive, and any strategy (probing a memory
location from the program with strategically placed instrumentation,
watchdog timers, probing rdtsc to do more granular and precise
accounting of time spent, etc.) can be combined with any mechanism to
perform cleanup. There is no necessity to bind one with the other.
Depending on different program types, we may need multiple strategies
to terminate them with the right amount of precision.

We may do something coarse for now (like watchdogs), but separation of
concerns keeps doors open.

> Design:
> We introduce the Fast-Path termination mechanism, leveraging the
> verifier's guarantees regarding control flow and resource management. The
> approach dynamically patches running BPF programs with a stripped-down version
> that accelerates termination. This can be used to terminate any given instance
> of a BPF execution. Key elements include:
>
> - Implicit Lifetime Management: Utilizing the verifier’s inherent control flow
>   and resource cleanup paths encoded within the BPF program structure,
>   eliminating the need for explicit garbage collection or unwinding tables.
>
> - Dynamic Program Patching: At runtime, BPF programs are atomically patched,
>   replacing expensive helper calls with stubbed versions (fast fall-through
>   implementations). This ensures swift termination without compromising safety
>   or leaking resources.
>
> - Helper Function Adjustments: Certain helper functions (e.g., `bpf_loop`,
>   `bpf_for_each_map_elem`) include  mechanisms to facilitate early exits through
>   modified return values.
>
> TODOs:
> - Termination support for nested BPF programs.

What's the plan for this?
What do you do if e.g. I attach to some kfunc that you don't stub out?
E.g. you may stub out bpf_sk_lookup, but I can attach something to
bpf_get_prandom_u32 called after it in the stub which is not stubbed
out and stall.

Will you stub out every kfunc with a noop? If so, how do we reason
about correctness when the kfunc introduces side effects on program
state?

> - Policy enforcements to control runtime of BPF programs in a system:
> - Timer based termination (watchdog)
>         - Userspace management to detect low-performing BPF program and
>           terminated them
>

I think one of the things I didn't see reasoned about so far is how
would you handle tail calls or extension programs?
Or in general, control flow being extended dynamically by program attachments?

Since you execute until the end of the program, someone could
construct a chain of 32 chained programs that individually expire the
watchdog timer, breaking your limit and inflating it by limit x 32
etc.

Perhaps you just fail direct/indirect calls? They're already something
that can be skipped because of the recursion counter, so it probably
won't break things.

Extension programs are different, most likely they don't appear as
attached when the program is loaded, so it's an empty global function
call in the original program and the stub. So I presume you don't
attach them to the stub and it's the original empty function that
executes in the stub?

It will be a more difficult question to answer once we have indirect
function calls, and possibly allow calling from the BPF program to
another as long as signatures match correctly.

Say a hierarchical BPF scheduler, where indirect function calls are
used to dispatch to leaves. Each indirect call target may be a
separate program attached into the original one (say
application-specific schedulers). By making the program continue
executing, the second program invoked from the first one could begin
to stall, and this could happen recursively, again breaching your
limit on the parent that called into them.

It doesn't have to be indirect calls, it may be a kernel function that
does this propagation down the tree (like sched-ext plans to do now).
Possibly will have to stub out these kfuncs as well. But then we have
to be mindful if the program depends on side effects vs if they are
pure.

So I think the conclusion is that we need to reason about and stub all
functions (apart from those that acquire resources) that can in turn
invoke more BPF programs, so that the parent calling into them
transitively isn't stalled while it's fast executing, which doesn't
seem easy.

It's the same problem as with nested program execution. On the path
where we are terminating, we allow yet another program to come in and
stall the kernel.

I think it's just a consequence of "keep executing until exit" vs
"stop executing and return" that such a problem would come up. It's
much easier to reason about and notrace the few bits needed to unwind
and return control to the kernel vs controlling it for every possible
suffix of the program where the stub is invoked.

But perhaps you have given thought to these question, and may have
solutions in mind.
Will it be some kind of bpf_prog_active() check that avoids invoking
more programs on the 'fast-execute' path?
It may interfere with other programs that interrupt the fast-execute
termination and try to run (e.g. XDP in an interrupt where a
fast-execute in task context was in progress) and lead to surprises.

> We haven’t added any selftests in the POC as this mail is mainly to get
> feedback on the design. Attaching link to sample BPF programs to
> validate the POC [7].  Styling will be taken care in next iteration.
>
> References:
> 1. https://lpc.events/event/17/contributions/1610/attachments/1229/2505/LPC_BPF_termination_Raj_Sahu.pdf
> 2. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/f0749daa-4560-41c9-9f36-6aa618161665/content
> 3. https://lore.kernel.org/bpf/AM6PR03MB508011599420DB53480E8BF799F72@AM6PR03MB5080.eurprd03.prod.outlook.com/T/
> 4. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/7fb70c04-0736-4e2d-b48b-2d8d012bacfc/content
> 5. https://lwn.net/ml/all/AM6PR03MB5080513BFAEB54A93CC70D4399FE2@AM6PR03MB5080.eurprd03.prod.outlook.com/#t
> 6. https://people.cs.vt.edu/djwillia/papers/ebpf23-runtime.pdf
> 7. https://github.com/sidchintamaneni/os-dev-env/tree/main/bpf-programs-catalog/research/termination/patch_gen_testing
>
>  arch/x86/kernel/smp.c          |   4 +-
>  include/linux/bpf.h            |  18 ++
>  include/linux/filter.h         |  16 ++
>  include/linux/smp.h            |   2 +-
>  include/uapi/linux/bpf.h       |  13 ++
>  kernel/bpf/bpf_iter.c          |  65 ++++++
>  kernel/bpf/core.c              |  45 ++++
>  kernel/bpf/helpers.c           |   8 +
>  kernel/bpf/syscall.c           | 375 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |  16 +-
>  kernel/smp.c                   |  22 +-
>  tools/bpf/bpftool/prog.c       |  40 ++++
>  tools/include/uapi/linux/bpf.h |   5 +
>  tools/lib/bpf/bpf.c            |  15 ++
>  tools/lib/bpf/bpf.h            |  10 +
>  tools/lib/bpf/libbpf.map       |   1 +
>  16 files changed, 643 insertions(+), 12 deletions(-)
>

All of this said, I think the patches need more work. The arch
specific bits can be moved into arch/*/net/bpf_* files and you can
dispatch to them using __weak functions in kernel/bpf/core.c. A
complete version of stubbing that handles both kfuncs and helpers
would be better.

I don't think bpftool support for termination is necessary, it should
be kicked in by the kernel automatically once a stall is detected.
So you can drop the bpf(2) syscall command being added.

For now, we can also keep the enforcement of termination bits out and
just focus on termination bits, both can land separately (the
enforcement will require more discussion, so it'd be better to keep
focus on and not mix both IMO).

> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-05 20:13 ` [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Kumar Kartikeya Dwivedi
@ 2025-05-06  5:55   ` Raj Sahu
  2025-05-06 22:45     ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Raj Sahu @ 2025-05-06  5:55 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	djwillia, miloc, ericts, rahult, doniaghazy, quanzhif, jinghao7,
	sidchintamaneni

<SNIP>
> Thanks for sending out the code and cover letter, much appreciated!
Thanks for taking out time and providing your feedback as well!

> I will keep aside opinions on which of 'fast execute' vs
> 'panic/unwind' feel semantically cleaner, since it's hard to have an
> objective discussion on that. One can argue about abstract concepts
> like complexity vs clean design either way.

> Instead just some questions/comments on the design.
>
> >
> > Existing termination approach like the BPF Exception or Runtime hooks [5] have
> > the issue of either lack of dynamism or having runtime overheads: BPF
> > Exceptions: Introduces bpf_throw() and exception callbacks, requiring stack
> > unwinding and exception state management.
>
> I feel there is an equal amount of state management here, it's just
> that it's not apparent directly, and not reified into tables etc.
> One of the (valid) concerns with unwinding is that preparing tables of
> objects that need to be released requires the verifier/runtime to be
> aware of each resource acquiring functions.
> Every time you add a new kfunc that acquires some resource, you'd have
> to update some place in the kernel to make sure it gets tracked for
> clean up too.
>
> But that would be equally true for this case:
> - The verifier must know which functions acquire resources, so that it
> can replace them with stubs in the cloned text.
> - Thus, every time a new kfunc is added, one must introduce its noop
> stub and add it to _some_ mapping to acquire kfuncs with their stubs.
>
> > Cleanup can only be done for pre-defined cancellation points.
>
> But can you terminate the program at any arbitrary point with this? It
> doesn't seem likely to me. You still have designated points where you
> stub empty calls instead of ones which return resources. You will jump
> to the return address into the cloned stub on return from an interrupt
> that gives you control of the CPU where the program is running. But
> apart from the stub functions, everything else would be kept the same.
>
> I think you are conflating the mechanism to clean up resources
> (unwinding, this (speed-run-to-exit/fast-execute), runtime log of
> resources), with the mechanism to enforce termination.
>
> Both are mutually exclusive, and any strategy (probing a memory
> location from the program with strategically placed instrumentation,
> watchdog timers, probing rdtsc to do more granular and precise
> accounting of time spent, etc.) can be combined with any mechanism to
> perform cleanup. There is no necessity to bind one with the other.
> Depending on different program types, we may need multiple strategies
> to terminate them with the right amount of precision.

We are trying to identify if the fast-path solution can serve certain use-cases
even slightly better or atleast complement the existing unwind approach.

> We may do something coarse for now (like watchdogs), but separation of
> concerns keeps doors open.

Right. That's something we have in mind too.

<SNIP>
> > TODOs:
> > - Termination support for nested BPF programs.
>
> What's the plan for this?

We explain what we currently know about nesting.
Currently, two scenarios are possible:
      1. A BPF prog got preempted by another program.
              - This shouldn't be a problem as the preempted BPF program
                 is not in running state.
      2. A BPF prog is attached to a function called by another BPF program.
             - This is the interesting case.

> What do you do if e.g. I attach to some kfunc that you don't stub out?
> E.g. you may stub out bpf_sk_lookup, but I can attach something to
> bpf_get_prandom_u32 called after it in the stub which is not stubbed
> out and stall.

We have thought of 2 components to deal with unintended nesting:
1. Have a per-CPU variable to indicate a termination is in-progress.
    If this is set, any new nesting won't occur.
2. Stub the entire nesting chain:
    For example,
    prog1
         -> prog2
                -> prog3

   Say prog3 is long-running and violates the runtime policy of prog2.
   The watchdog will be triggered for prog2, in that case we walk
through the stack
   and stub all BPF programs leading up to prog2 (In this case prog3
and prog2 will
   get stubbed).

> Will you stub out every kfunc with a noop? If so, how do we reason
> about correctness when the kfunc introduces side effects on program
> state?

Are you indicating about the sched_ext intended cgroup nesting case?

> > - Policy enforcements to control runtime of BPF programs in a system:
> > - Timer based termination (watchdog)
> >         - Userspace management to detect low-performing BPF program and
> >           terminated them
> >
>
> I think one of the things I didn't see reasoned about so far is how
> would you handle tail calls or extension programs?
> Or in general, control flow being extended dynamically by program attachments?
>
> Since you execute until the end of the program, someone could
> construct a chain of 32 chained programs that individually expire the
> watchdog timer, breaking your limit and inflating it by limit x 32
> etc.
>
> Perhaps you just fail direct/indirect calls? They're already something
> that can be skipped because of the recursion counter, so it probably
> won't break things.
>
> Extension programs are different, most likely they don't appear as
> attached when the program is loaded, so it's an empty global function
> call in the original program and the stub. So I presume you don't
> attach them to the stub and it's the original empty function that
> executes in the stub?

We did think about the tailcalls and freplace scenarios and will
address these in upcoming patches.

> It will be a more difficult question to answer once we have indirect
> function calls, and possibly allow calling from the BPF program to
> another as long as signatures match correctly.
> Say a hierarchical BPF scheduler, where indirect function calls are
> used to dispatch to leaves. Each indirect call target may be a
> separate program attached into the original one (say
> application-specific schedulers). By making the program continue
> executing, the second program invoked from the first one could begin
> to stall, and this could happen recursively, again breaching your
> limit on the parent that called into them.
>
> It doesn't have to be indirect calls, it may be a kernel function that
> does this propagation down the tree (like sched-ext plans to do now).
> Possibly will have to stub out these kfuncs as well. But then we have
> to be mindful if the program depends on side effects vs if they are
> pure.

Indirect calls are something which we didn't consider until now.
Thanks for pointing this out. We'll play around with the cgroup-based
hierarchical BPF scheduler.

> So I think the conclusion is that we need to reason about and stub all
> functions (apart from those that acquire resources) that can in turn
> invoke more BPF programs, so that the parent calling into them
> transitively isn't stalled while it's fast executing, which doesn't
> seem easy.
> It's the same problem as with nested program execution. On the path
> where we are terminating, we allow yet another program to come in and
> stall the kernel.
>
> I think it's just a consequence of "keep executing until exit" vs
> "stop executing and return" that such a problem would come up. It's
> much easier to reason about and notrace the few bits needed to unwind
> and return control to the kernel vs controlling it for every possible
> suffix of the program where the stub is invoked.
>
> But perhaps you have given thought to these question, and may have
> solutions in mind.
> Will it be some kind of bpf_prog_active() check that avoids invoking
> more programs on the 'fast-execute' path?
> It may interfere with other programs that interrupt the fast-execute
> termination and try to run (e.g. XDP in an interrupt where a
> fast-execute in task context was in progress) and lead to surprises.

True. This is infact a concern we have from the solution described above.
Will try to look around and see if something smarter and cleaner is possible.

<SNIP>
> All of this said, I think the patches need more work. The arch
> specific bits can be moved into arch/*/net/bpf_* files and you can
> dispatch to them using __weak functions in kernel/bpf/core.c. A
> complete version of stubbing that handles both kfuncs and helpers
> would be better.
Sure.

> I don't think bpftool support for termination is necessary, it should
> be kicked in by the kernel automatically once a stall is detected.
> So you can drop the bpf(2) syscall command being added.
Agreed. Since we were using it for testing our termination, we thought
of exposing it to userspace as well.

> For now, we can also keep the enforcement of termination bits out and
> just focus on termination bits, both can land separately (the
> enforcement will require more discussion, so it'd be better to keep
> focus on and not mix both IMO).
Makes sense. Will remove the bpftool part from patches for now.

Since some implementations need experimenting, it might take some time
before sending it out here.
Again, thanks for taking time to review and give feedback. We will push
new set of patches with the suggestions and future work.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-06  5:55   ` Raj Sahu
@ 2025-05-06 22:45     ` Alexei Starovoitov
  2025-05-06 22:59       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-06 22:45 UTC (permalink / raw)
  To: Raj Sahu
  Cc: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni

On Mon, May 5, 2025 at 10:55 PM Raj Sahu <rjsu26@gmail.com> wrote:
>
>       2. A BPF prog is attached to a function called by another BPF program.
>              - This is the interesting case.
>
> > What do you do if e.g. I attach to some kfunc that you don't stub out?
> > E.g. you may stub out bpf_sk_lookup, but I can attach something to
> > bpf_get_prandom_u32 called after it in the stub which is not stubbed
> > out and stall.
>
> We have thought of 2 components to deal with unintended nesting:
> 1. Have a per-CPU variable to indicate a termination is in-progress.
>     If this is set, any new nesting won't occur.
> 2. Stub the entire nesting chain:
>     For example,
>     prog1
>          -> prog2
>                 -> prog3
>
>    Say prog3 is long-running and violates the runtime policy of prog2.
>    The watchdog will be triggered for prog2, in that case we walk
> through the stack
>    and stub all BPF programs leading up to prog2 (In this case prog3
> and prog2 will
>    get stubbed).

I feel that concerns about fentry/freplace consuming too much
while parent prog is fast-exiting are overblown.
If child prog is slow it will get flagged by the watchdog sooner or later.
But fentry and tailcall cases are good examples that highlight
that fast-execute is a better path forward.
Manual stack unwind through bpf trampoline and tail call logic
is going to be quite complex, error prone, architecture specific
and hard to keep consistent with changes.
We have complicated lifetime rules for bpf trampoline.
See comment in bpf_tramp_image_put().
Doing that manually in stack unwinder is not practical.
iirc bpf_throw() stops at the first non-bpf_prog frame including
bpf trampoline.
But if we want to, the fast execute approach can unwind through fentry.
Say hw watchdog tells us that CPU is stuck in:
bpf_prog_A
   bpf_subprog_1
     kfunc
       fentry
          bpf_prog_B

since every bpf prog in the system will be cloned and prepared
for fast execute we can replace return addresses everywhere
in the above and fast execute will take us all the way to kernel proper.

Re: prog cloning.
I think the mechanism in this patch is incorrect.
It clones the prog before bpf_check(). That will break when
the verifier does different logic for stubbed helpers/kfuncs
vs original. The bpf progs will not be identical and JITed code
will be different.
Instead the verifier should collect all code points where
kfunc/helper should be replaced, then clone and patch
after the verifier is done, and then compare JIT artifacts for both.
JITed sizes must be the same, and prog->aux->jit_data should
be equivalent. We will need an arch specific function to
compare jit_data.

imo that is the next step in terms of code.
watchdog/trigger_of_abort details are secondary.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-06 22:45     ` Alexei Starovoitov
@ 2025-05-06 22:59       ` Kumar Kartikeya Dwivedi
  2025-05-07  0:32         ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-05-06 22:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni

On Wed, 7 May 2025 at 00:45, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 5, 2025 at 10:55 PM Raj Sahu <rjsu26@gmail.com> wrote:
> >
> >       2. A BPF prog is attached to a function called by another BPF program.
> >              - This is the interesting case.
> >
> > > What do you do if e.g. I attach to some kfunc that you don't stub out?
> > > E.g. you may stub out bpf_sk_lookup, but I can attach something to
> > > bpf_get_prandom_u32 called after it in the stub which is not stubbed
> > > out and stall.
> >
> > We have thought of 2 components to deal with unintended nesting:
> > 1. Have a per-CPU variable to indicate a termination is in-progress.
> >     If this is set, any new nesting won't occur.
> > 2. Stub the entire nesting chain:
> >     For example,
> >     prog1
> >          -> prog2
> >                 -> prog3
> >
> >    Say prog3 is long-running and violates the runtime policy of prog2.
> >    The watchdog will be triggered for prog2, in that case we walk
> > through the stack
> >    and stub all BPF programs leading up to prog2 (In this case prog3
> > and prog2 will
> >    get stubbed).
>
> I feel that concerns about fentry/freplace consuming too much
> while parent prog is fast-exiting are overblown.
> If child prog is slow it will get flagged by the watchdog sooner or later.

No, there's some difference:
It becomes more common because we're continuing to execute the suffix
of the program in the stub.
Compared to stopping executing and returning control immediately.

> But fentry and tailcall cases are good examples that highlight
> that fast-execute is a better path forward.
> Manual stack unwind through bpf trampoline and tail call logic
> is going to be quite complex, error prone, architecture specific
> and hard to keep consistent with changes.
> We have complicated lifetime rules for bpf trampoline.
> See comment in bpf_tramp_image_put().
> Doing that manually in stack unwinder is not practical.
> iirc bpf_throw() stops at the first non-bpf_prog frame including
> bpf trampoline.
> But if we want to, the fast execute approach can unwind through fentry.
> Say hw watchdog tells us that CPU is stuck in:
> bpf_prog_A
>    bpf_subprog_1
>      kfunc
>        fentry
>           bpf_prog_B
>
> since every bpf prog in the system will be cloned and prepared
> for fast execute we can replace return addresses everywhere
> in the above and fast execute will take us all the way to kernel proper.

The same will be true for unwinding, we can just unwind all the way to
the top of the stack trace in case of cancellation-triggered
unwinding.
If trampoline calls some program, it will see it as a return from the
called BPF program just like stubs would return.

You're essentially going to replace return addresses to jump control
to stubs, we can do that same for jumping into some unwinding code
that can continue the process.
Be it unwinding or stubs, once control goes back to kernel and clean
up must continue, we will have to pass control back to code for both.

Conceptually, both approaches are going to do something to clean up resources,
that can be executing stub code or calling release handlers for
objects on stack.

In the end, it's invoking some kernel code which will act upon the
program stack, then return control back to the kernel.
All lifetime related concerns for objects on stack and trampolines
will be the same for both approaches.

The stub lifetime will be the same as whatever table is built for
releasing resources.

The boundary is checked for bpf_throw(), ofcourse, because there, the
program is asserting some condition and requesting to be cleaned up.
There, we really shouldn't keep unwinding across program boundaries.

There is nothing architecture specific about the rest, it is an if
(cond) check in the arch_bpf_stack_walk callback.

>
> [...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-06 22:59       ` Kumar Kartikeya Dwivedi
@ 2025-05-07  0:32         ` Alexei Starovoitov
  2025-05-07  0:38           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  0:32 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni

On Tue, May 6, 2025 at 4:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Wed, 7 May 2025 at 00:45, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, May 5, 2025 at 10:55 PM Raj Sahu <rjsu26@gmail.com> wrote:
> > >
> > >       2. A BPF prog is attached to a function called by another BPF program.
> > >              - This is the interesting case.
> > >
> > > > What do you do if e.g. I attach to some kfunc that you don't stub out?
> > > > E.g. you may stub out bpf_sk_lookup, but I can attach something to
> > > > bpf_get_prandom_u32 called after it in the stub which is not stubbed
> > > > out and stall.
> > >
> > > We have thought of 2 components to deal with unintended nesting:
> > > 1. Have a per-CPU variable to indicate a termination is in-progress.
> > >     If this is set, any new nesting won't occur.
> > > 2. Stub the entire nesting chain:
> > >     For example,
> > >     prog1
> > >          -> prog2
> > >                 -> prog3
> > >
> > >    Say prog3 is long-running and violates the runtime policy of prog2.
> > >    The watchdog will be triggered for prog2, in that case we walk
> > > through the stack
> > >    and stub all BPF programs leading up to prog2 (In this case prog3
> > > and prog2 will
> > >    get stubbed).
> >
> > I feel that concerns about fentry/freplace consuming too much
> > while parent prog is fast-exiting are overblown.
> > If child prog is slow it will get flagged by the watchdog sooner or later.
>
> No, there's some difference:
> It becomes more common because we're continuing to execute the suffix
> of the program in the stub.
> Compared to stopping executing and returning control immediately.
>
> > But fentry and tailcall cases are good examples that highlight
> > that fast-execute is a better path forward.
> > Manual stack unwind through bpf trampoline and tail call logic
> > is going to be quite complex, error prone, architecture specific
> > and hard to keep consistent with changes.
> > We have complicated lifetime rules for bpf trampoline.
> > See comment in bpf_tramp_image_put().
> > Doing that manually in stack unwinder is not practical.
> > iirc bpf_throw() stops at the first non-bpf_prog frame including
> > bpf trampoline.
> > But if we want to, the fast execute approach can unwind through fentry.
> > Say hw watchdog tells us that CPU is stuck in:
> > bpf_prog_A
> >    bpf_subprog_1
> >      kfunc
> >        fentry
> >           bpf_prog_B
> >
> > since every bpf prog in the system will be cloned and prepared
> > for fast execute we can replace return addresses everywhere
> > in the above and fast execute will take us all the way to kernel proper.
>
> The same will be true for unwinding, we can just unwind all the way to
> the top of the stack trace in case of cancellation-triggered
> unwinding.
> If trampoline calls some program, it will see it as a return from the
> called BPF program just like stubs would return.
>
> You're essentially going to replace return addresses to jump control
> to stubs, we can do that same for jumping into some unwinding code
> that can continue the process.
> Be it unwinding or stubs, once control goes back to kernel and clean
> up must continue, we will have to pass control back to code for both.
>
> Conceptually, both approaches are going to do something to clean up resources,
> that can be executing stub code or calling release handlers for
> objects on stack.

I feel you're missing my point.
For unwinding to work through bpf trampoline the unwinder needs
to execute very specific trampoline _exit_ procedure that is arch
specific and hard coded during the generation of trampoline.
That's a ton of extra complexity in unwinder.
It's not just calling destructors for objects.
Depending on trampoline and the progs in there it's one or more
__bpf_prog_exit*, percpu_ref_put(),
and extra headache due to bpf_stats_enabled_key(),
and who knows what else that I'm forgetting.

So, no, unwind-it-all is not comparable to fast-execute
in terms of complexity. bpf_throw approach is a dead end.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-07  0:32         ` Alexei Starovoitov
@ 2025-05-07  0:38           ` Kumar Kartikeya Dwivedi
  2025-05-07  1:15             ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-05-07  0:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni

On Wed, 7 May 2025 at 02:33, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 6, 2025 at 4:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Wed, 7 May 2025 at 00:45, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, May 5, 2025 at 10:55 PM Raj Sahu <rjsu26@gmail.com> wrote:
> > > >
> > > >       2. A BPF prog is attached to a function called by another BPF program.
> > > >              - This is the interesting case.
> > > >
> > > > > What do you do if e.g. I attach to some kfunc that you don't stub out?
> > > > > E.g. you may stub out bpf_sk_lookup, but I can attach something to
> > > > > bpf_get_prandom_u32 called after it in the stub which is not stubbed
> > > > > out and stall.
> > > >
> > > > We have thought of 2 components to deal with unintended nesting:
> > > > 1. Have a per-CPU variable to indicate a termination is in-progress.
> > > >     If this is set, any new nesting won't occur.
> > > > 2. Stub the entire nesting chain:
> > > >     For example,
> > > >     prog1
> > > >          -> prog2
> > > >                 -> prog3
> > > >
> > > >    Say prog3 is long-running and violates the runtime policy of prog2.
> > > >    The watchdog will be triggered for prog2, in that case we walk
> > > > through the stack
> > > >    and stub all BPF programs leading up to prog2 (In this case prog3
> > > > and prog2 will
> > > >    get stubbed).
> > >
> > > I feel that concerns about fentry/freplace consuming too much
> > > while parent prog is fast-exiting are overblown.
> > > If child prog is slow it will get flagged by the watchdog sooner or later.
> >
> > No, there's some difference:
> > It becomes more common because we're continuing to execute the suffix
> > of the program in the stub.
> > Compared to stopping executing and returning control immediately.
> >
> > > But fentry and tailcall cases are good examples that highlight
> > > that fast-execute is a better path forward.
> > > Manual stack unwind through bpf trampoline and tail call logic
> > > is going to be quite complex, error prone, architecture specific
> > > and hard to keep consistent with changes.
> > > We have complicated lifetime rules for bpf trampoline.
> > > See comment in bpf_tramp_image_put().
> > > Doing that manually in stack unwinder is not practical.
> > > iirc bpf_throw() stops at the first non-bpf_prog frame including
> > > bpf trampoline.
> > > But if we want to, the fast execute approach can unwind through fentry.
> > > Say hw watchdog tells us that CPU is stuck in:
> > > bpf_prog_A
> > >    bpf_subprog_1
> > >      kfunc
> > >        fentry
> > >           bpf_prog_B
> > >
> > > since every bpf prog in the system will be cloned and prepared
> > > for fast execute we can replace return addresses everywhere
> > > in the above and fast execute will take us all the way to kernel proper.
> >
> > The same will be true for unwinding, we can just unwind all the way to
> > the top of the stack trace in case of cancellation-triggered
> > unwinding.
> > If trampoline calls some program, it will see it as a return from the
> > called BPF program just like stubs would return.
> >
> > You're essentially going to replace return addresses to jump control
> > to stubs, we can do that same for jumping into some unwinding code
> > that can continue the process.
> > Be it unwinding or stubs, once control goes back to kernel and clean
> > up must continue, we will have to pass control back to code for both.
> >
> > Conceptually, both approaches are going to do something to clean up resources,
> > that can be executing stub code or calling release handlers for
> > objects on stack.
>
> I feel you're missing my point.
> For unwinding to work through bpf trampoline the unwinder needs
> to execute very specific trampoline _exit_ procedure that is arch
> specific and hard coded during the generation of trampoline.
> That's a ton of extra complexity in unwinder.
> It's not just calling destructors for objects.
> Depending on trampoline and the progs in there it's one or more
> __bpf_prog_exit*, percpu_ref_put(),
> and extra headache due to bpf_stats_enabled_key(),
> and who knows what else that I'm forgetting.

What I'm saying is that we don't have to do all that.
It's just overcomplication for the sake of it.

The trampoline works by invoking a BPF program.
From it's perspective, it will just get back a return from the call
and continue doing its stuff.
The unwinding happens on BPF frames and then returns control to the trampoline.
No unwinding needs to happen on it.
It will be no different from setting return address to stub and
letting the program execute rest of the logic.

trampoline -> call bpf_prog -> hit cancellation -> (either unwind all
BPF frames, or execute stub program that does the same by following
execution) -> return to trampoline.

If we want to continue it for programs calling into trampoline, we'll
do something similar for both cases, modify return addresses that
either continue executing stub or unwinding BPF frames, then return to
the kernel context calling them.

Hence failing to see what is difference.
Only difference to me is executing BPF code cloned from the program vs
some kernel function that walks the program stack and calls kfuncs.

>
> [...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-07  0:38           ` Kumar Kartikeya Dwivedi
@ 2025-05-07  1:15             ` Alexei Starovoitov
  2025-05-07  2:10               ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  1:15 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni

On Tue, May 6, 2025 at 5:39 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Wed, 7 May 2025 at 02:33, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, May 6, 2025 at 4:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > On Wed, 7 May 2025 at 00:45, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, May 5, 2025 at 10:55 PM Raj Sahu <rjsu26@gmail.com> wrote:
> > > > >
> > > > >       2. A BPF prog is attached to a function called by another BPF program.
> > > > >              - This is the interesting case.
> > > > >
> > > > > > What do you do if e.g. I attach to some kfunc that you don't stub out?
> > > > > > E.g. you may stub out bpf_sk_lookup, but I can attach something to
> > > > > > bpf_get_prandom_u32 called after it in the stub which is not stubbed
> > > > > > out and stall.
> > > > >
> > > > > We have thought of 2 components to deal with unintended nesting:
> > > > > 1. Have a per-CPU variable to indicate a termination is in-progress.
> > > > >     If this is set, any new nesting won't occur.
> > > > > 2. Stub the entire nesting chain:
> > > > >     For example,
> > > > >     prog1
> > > > >          -> prog2
> > > > >                 -> prog3
> > > > >
> > > > >    Say prog3 is long-running and violates the runtime policy of prog2.
> > > > >    The watchdog will be triggered for prog2, in that case we walk
> > > > > through the stack
> > > > >    and stub all BPF programs leading up to prog2 (In this case prog3
> > > > > and prog2 will
> > > > >    get stubbed).
> > > >
> > > > I feel that concerns about fentry/freplace consuming too much
> > > > while parent prog is fast-exiting are overblown.
> > > > If child prog is slow it will get flagged by the watchdog sooner or later.
> > >
> > > No, there's some difference:
> > > It becomes more common because we're continuing to execute the suffix
> > > of the program in the stub.
> > > Compared to stopping executing and returning control immediately.
> > >
> > > > But fentry and tailcall cases are good examples that highlight
> > > > that fast-execute is a better path forward.
> > > > Manual stack unwind through bpf trampoline and tail call logic
> > > > is going to be quite complex, error prone, architecture specific
> > > > and hard to keep consistent with changes.
> > > > We have complicated lifetime rules for bpf trampoline.
> > > > See comment in bpf_tramp_image_put().
> > > > Doing that manually in stack unwinder is not practical.
> > > > iirc bpf_throw() stops at the first non-bpf_prog frame including
> > > > bpf trampoline.
> > > > But if we want to, the fast execute approach can unwind through fentry.
> > > > Say hw watchdog tells us that CPU is stuck in:
> > > > bpf_prog_A
> > > >    bpf_subprog_1
> > > >      kfunc
> > > >        fentry
> > > >           bpf_prog_B
> > > >
> > > > since every bpf prog in the system will be cloned and prepared
> > > > for fast execute we can replace return addresses everywhere
> > > > in the above and fast execute will take us all the way to kernel proper.
> > >
> > > The same will be true for unwinding, we can just unwind all the way to
> > > the top of the stack trace in case of cancellation-triggered
> > > unwinding.
> > > If trampoline calls some program, it will see it as a return from the
> > > called BPF program just like stubs would return.
> > >
> > > You're essentially going to replace return addresses to jump control
> > > to stubs, we can do that same for jumping into some unwinding code
> > > that can continue the process.
> > > Be it unwinding or stubs, once control goes back to kernel and clean
> > > up must continue, we will have to pass control back to code for both.
> > >
> > > Conceptually, both approaches are going to do something to clean up resources,
> > > that can be executing stub code or calling release handlers for
> > > objects on stack.
> >
> > I feel you're missing my point.
> > For unwinding to work through bpf trampoline the unwinder needs
> > to execute very specific trampoline _exit_ procedure that is arch
> > specific and hard coded during the generation of trampoline.
> > That's a ton of extra complexity in unwinder.
> > It's not just calling destructors for objects.
> > Depending on trampoline and the progs in there it's one or more
> > __bpf_prog_exit*, percpu_ref_put(),
> > and extra headache due to bpf_stats_enabled_key(),
> > and who knows what else that I'm forgetting.
>
> What I'm saying is that we don't have to do all that.
> It's just overcomplication for the sake of it.

So you discarded this use case because the unwind approach
cannot deal with it?
And then claim that within this limited scope they're equivalent?
:)
prog -> trampoline -> prog was just one example.
unwinder is helpless when there is any kernel code between progs.
sched-ext will have hierarchical scheds eventually.
prog->kfunc->prog
Unwinding inner prog only is imo broken, since we need to abort
the whole thing.
Even simpler case:
prog->for_each->subprog_callback.
That callback is likely tiny and unwinding into for_each kfunc
doesn't really abort the prog.
Unwind approach works in higher level languages because the compiler
sees the whole control flow from main entry till the leaf.
The verifier is blind to kernel code, so unwind is limited
to progs only and doing it poorly. We still don't have support
for bpf_throw with object cleanup and at this point we should
align all efforts into fast execute, since it's clearly the winner.
bpf_throw may stay as a kfunc with fast execute underneath or
we may remove it, since it has no users anyway.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-07  1:15             ` Alexei Starovoitov
@ 2025-05-07  2:10               ` Kumar Kartikeya Dwivedi
  2025-05-07  3:36                 ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-05-07  2:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni,
	Tejun Heo

On Wed, 7 May 2025 at 03:15, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 6, 2025 at 5:39 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Wed, 7 May 2025 at 02:33, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, May 6, 2025 at 4:00 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > On Wed, 7 May 2025 at 00:45, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Mon, May 5, 2025 at 10:55 PM Raj Sahu <rjsu26@gmail.com> wrote:
> > > > > >
> > > > > >       2. A BPF prog is attached to a function called by another BPF program.
> > > > > >              - This is the interesting case.
> > > > > >
> > > > > > > What do you do if e.g. I attach to some kfunc that you don't stub out?
> > > > > > > E.g. you may stub out bpf_sk_lookup, but I can attach something to
> > > > > > > bpf_get_prandom_u32 called after it in the stub which is not stubbed
> > > > > > > out and stall.
> > > > > >
> > > > > > We have thought of 2 components to deal with unintended nesting:
> > > > > > 1. Have a per-CPU variable to indicate a termination is in-progress.
> > > > > >     If this is set, any new nesting won't occur.
> > > > > > 2. Stub the entire nesting chain:
> > > > > >     For example,
> > > > > >     prog1
> > > > > >          -> prog2
> > > > > >                 -> prog3
> > > > > >
> > > > > >    Say prog3 is long-running and violates the runtime policy of prog2.
> > > > > >    The watchdog will be triggered for prog2, in that case we walk
> > > > > > through the stack
> > > > > >    and stub all BPF programs leading up to prog2 (In this case prog3
> > > > > > and prog2 will
> > > > > >    get stubbed).
> > > > >
> > > > > I feel that concerns about fentry/freplace consuming too much
> > > > > while parent prog is fast-exiting are overblown.
> > > > > If child prog is slow it will get flagged by the watchdog sooner or later.
> > > >
> > > > No, there's some difference:
> > > > It becomes more common because we're continuing to execute the suffix
> > > > of the program in the stub.
> > > > Compared to stopping executing and returning control immediately.
> > > >
> > > > > But fentry and tailcall cases are good examples that highlight
> > > > > that fast-execute is a better path forward.
> > > > > Manual stack unwind through bpf trampoline and tail call logic
> > > > > is going to be quite complex, error prone, architecture specific
> > > > > and hard to keep consistent with changes.
> > > > > We have complicated lifetime rules for bpf trampoline.
> > > > > See comment in bpf_tramp_image_put().
> > > > > Doing that manually in stack unwinder is not practical.
> > > > > iirc bpf_throw() stops at the first non-bpf_prog frame including
> > > > > bpf trampoline.
> > > > > But if we want to, the fast execute approach can unwind through fentry.
> > > > > Say hw watchdog tells us that CPU is stuck in:
> > > > > bpf_prog_A
> > > > >    bpf_subprog_1
> > > > >      kfunc
> > > > >        fentry
> > > > >           bpf_prog_B
> > > > >
> > > > > since every bpf prog in the system will be cloned and prepared
> > > > > for fast execute we can replace return addresses everywhere
> > > > > in the above and fast execute will take us all the way to kernel proper.
> > > >
> > > > The same will be true for unwinding, we can just unwind all the way to
> > > > the top of the stack trace in case of cancellation-triggered
> > > > unwinding.
> > > > If trampoline calls some program, it will see it as a return from the
> > > > called BPF program just like stubs would return.
> > > >
> > > > You're essentially going to replace return addresses to jump control
> > > > to stubs, we can do that same for jumping into some unwinding code
> > > > that can continue the process.
> > > > Be it unwinding or stubs, once control goes back to kernel and clean
> > > > up must continue, we will have to pass control back to code for both.
> > > >
> > > > Conceptually, both approaches are going to do something to clean up resources,
> > > > that can be executing stub code or calling release handlers for
> > > > objects on stack.
> > >
> > > I feel you're missing my point.
> > > For unwinding to work through bpf trampoline the unwinder needs
> > > to execute very specific trampoline _exit_ procedure that is arch
> > > specific and hard coded during the generation of trampoline.
> > > That's a ton of extra complexity in unwinder.
> > > It's not just calling destructors for objects.
> > > Depending on trampoline and the progs in there it's one or more
> > > __bpf_prog_exit*, percpu_ref_put(),
> > > and extra headache due to bpf_stats_enabled_key(),
> > > and who knows what else that I'm forgetting.
> >
> > What I'm saying is that we don't have to do all that.
> > It's just overcomplication for the sake of it.
>
> So you discarded this use case because the unwind approach
> cannot deal with it?

At this point, I think we're talking past each other.
I haven't discarded anything.
I just rephrased what I said earlier.

When you have a stack trace interspersed with BPF and kernel frames,
control should bounce between BPF cleanup and kernel side (which will
be steered to return back to BPF side, eventually returning control to
kernel proper).

Be it stubs, be it unwinding.
It needs to work, but it won't be very different for both cases.
No support exists yet, so solution space must be explored, but it
won't be very different for both.
You're insisting "unwinding needs to unwind through kernel frames so
it's complex" but it's not.

When control goes back to the kernel, be it stubs or unwinding, it
will go back to the caller BPF program and then we continue the
cleanup, either through stubs or unwinding.

The flow of control from BPF to the kernel is through return from the
function call. Inside the program, we can destruct frames by just
letting it fast-execute or unwind.

> And then claim that within this limited scope they're equivalent?
> :)
> prog -> trampoline -> prog was just one example.
> unwinder is helpless when there is any kernel code between progs.
> sched-ext will have hierarchical scheds eventually.
> prog->kfunc->prog
> Unwinding inner prog only is imo broken, since we need to abort
> the whole thing.
> Even simpler case:
> prog->for_each->subprog_callback.
> That callback is likely tiny and unwinding into for_each kfunc
> doesn't really abort the prog.

I didn't say we'll only unwind the callback.

We will unwind the callback, return control into kfunc, it will clean
itself up and return to prog, and then we continue unwinding when it
returns into caller BPF program.
And we'll do exactly this with stubs as well.

Program stub callback will return to kfunc, kfunc will return to
program, and replaced return address causes jump into stub again.

> Unwind approach works in higher level languages because the compiler
> sees the whole control flow from main entry till the leaf.
> The verifier is blind to kernel code, so unwind is limited
> to progs only and doing it poorly.

We still model callbacks as direct calls in symbolic execution.
From the perspective of the virtual machine, kernel functions in the
middle are just setting up more context and hidden.
So we will still see the whole CFG and all possible reachable program paths.

Unwinding or stubs will work the same way.
Unwind / return using stubs from sequential contiguous sets of BPF
frames until the point where the kernel has called into them.
Kernel returns back into caller BPF context. Then continue if we need
to clean up the entire stacktrace until we get to kernel proper.

> We still don't have support
> for bpf_throw with object cleanup and at this point we should
> align all efforts into fast execute, since it's clearly the winner.

Well, it's not because of lack of trying.
I've implemented it, I've addressed feedback and concerns, I've even used it in
interesting ways and hopefully demonstrated value.

But until we agree on the implementation, we cannot land it.
It's a chicken and egg issue.
If we add it, sched-ext will replace their MEMBER_VPTR and other
associated macro hacks.
https://github.com/sched-ext/scx/blob/main/scheds/include/scx/common.bpf.h#L227

bpf_printk() in bpf_arena_spin_lock implementation is similar.

> bpf_throw may stay as a kfunc with fast execute underneath or

Yes, by having users write clean up code which is what they are doing today.
Which defeats the whole point of supporting assertions.
Then it's a kfunc with a misleading name, it's
bpf_fail_everything_from_this_point().

> we may remove it, since it has no users anyway.

I think you misunderstand what bpf_throw/exceptions were supposed to achieve.
The whole point of bpf_throw() was to support assertions, where one
does not write clean up code.

The path containing bpf_throw is _not_ explored.

You do bpf_assert(i < N) and don't write anything to handle the case
where i >= N, and arr[i] is not well-formed.
This is the benefit to program writers: they can assert something as
true and don't have to "appease" the verifier by writing alternative
paths in the program.
The program is aborted in case the assertion does not hold.

---

Just to summarize the discussion:

There are two classes of termination:
- Cooperative (bpf_throw()/assertions/panic).
- Non-cooperative (program is stuck and kernel needs to abort it).

BPF_EXIT is cooperative, but we statically enforce resource cleanup.

Both cases need to do something similar, perform program cleanup, so
infra for bpf_throw is reusable to do non-cooperative terminations.

Fast-execute and unwinding both work for the second case, as I've
described above.

They can work even when we have BPF -> kernel -> BPF -> kernel in the
stack trace and go all the way back to the kernel calling the first
BPF context.
Nature of the solution for both cases is the same, how they destruct
BPF frames is different.
If we're using stubs, we will execute stubs for the BPF portions and
return control into kernel and eventually get it back in caller BPF
prog.
With unwinding, we do the same, the frames are destructed and control
returns into kernel (e.g. a callback will return some value).
Eventually, we repeat it all the way up the stack trace and clean up everything.

But we can only unwind for assertions, because by definition, the
program path taken when the assertion fails cannot be explored.
Does Rust keep executing the rest of the program when assert_eq!(x, y) fails?
Whether we use tables to do it or have the compiler emit them or do
anything else (how) is immaterial to the basic requirement (what).

The virtual machine must stop/halt, ergo, when it is mapped to a real
architecture, we need to stop program execution and perform resource
cleanup.
Verifier won't explore the path where it hits bpf_throw further.

So we cannot use fast-execute to do assertions, because users are
forced to handle the case where bpf_assert() fails.
Then it is no better than an if condition, the verifier still explores
the supposedly. untaken branch.

Anyway, I don't think I can explain this more clearly.

I can only imagine three scenarios from this discussion.

- Assertions are unnecessary.
 - Then we don't have to continue discussing further, and we can just
focus on fast-execute, and phase out bpf_throw().
- They are useful, but not necessary now.
 - Then it is very likely we'll end up with two ways of doing
essentially the same thing.
- Fast-execute can do assertions, so above doesn't apply.
  - Then we just disagree about what "assertion" means.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-07  2:10               ` Kumar Kartikeya Dwivedi
@ 2025-05-07  3:36                 ` Alexei Starovoitov
  2025-05-07  5:04                   ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  3:36 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni,
	Tejun Heo

On Tue, May 6, 2025 at 7:11 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> We will unwind the callback, return control into kfunc, it will clean
> itself up and return to prog, and then we continue unwinding when it
> returns into caller BPF program.
> And we'll do exactly this with stubs as well.
>
> Program stub callback will return to kfunc, kfunc will return to
> program, and replaced return address causes jump into stub again.

Ok, so this is a new proposal where unwind would
jump back into kfunc, but it will also replace the address
where kfunc would have returned with a jump to 2nd phase
of unwinder.
So for
prog A -> kern -> prog B -> kern -> prog C.
In the first phase the unwinder will deal with C,
then let kern continue as normal code, but the kernel
will 'ret' back into unwinder machinery, so it can
continue to unwind B, then another jump back to unwinder
machinery to unwind A.

Yes, it can be made to work, but this is quite complex,
and it's really a combination of unwind and fast-execute
through kernel bits.
When the whole thing is fast executed there is only one
step to adjust all return addresses on the stack, and then
everything just flows to the kernel proper.
Much simpler than going back to unwind machinery
and without a seesaw pattern of stack usage.

> If we add it, sched-ext will replace their MEMBER_VPTR and other
> associated macro hacks.
> https://github.com/sched-ext/scx/blob/main/scheds/include/scx/common.bpf.h#L227

I really doubt it. It's fighting the verifier because the verifier
needs to see that bounds check. That's why it's macro with asm.

> > bpf_throw may stay as a kfunc with fast execute underneath or
>
> Yes, by having users write clean up code which is what they are doing today.
> Which defeats the whole point of supporting assertions.
> Then it's a kfunc with a misleading name, it's
> bpf_fail_everything_from_this_point().

Indeed. That's a better name. 'throw' has an incorrect analogy.

> You do bpf_assert(i < N) and don't write anything to handle the case
> where i >= N, and arr[i] is not well-formed.
> This is the benefit to program writers: they can assert something as
> true and don't have to "appease" the verifier by writing alternative
> paths in the program.

That indeed was the goal of assertions, but I think it was explored
and it failed.
#define bpf_assert(cond) if (!(cond)) bpf_throw(0);

should have worked, but the compiler doesn't care about keeping
"if (!(cond))" like that.
All the asm volatile ("if .. were added
in __bpf_assert() form and in bpf_cmp_likely().
And they didn't succeed.
They're too unnatural for C code.
All uses remained in selftests.

> Does Rust keep executing the rest of the program when assert_eq!(x, y) fails?
> Whether we use tables to do it or have the compiler emit them or do
> anything else (how) is immaterial to the basic requirement (what).

The high level language can do it, because it's done at a compiler level.
rust->bpf compiler can be made to support it, but we deal with C and
bpf_assert(i, <, 100);
arr[i] = ..;
didn't fly.

> - Assertions are unnecessary.
>  - Then we don't have to continue discussing further, and we can just
> focus on fast-execute, and phase out bpf_throw().
> - They are useful, but not necessary now.

They could have been useful, but I think we already explored
and concluded that it's very hard to do true C++like throw()
in the verifier alone. Too many corner cases.

I also doubt that we can go with a compiler assisted approach where
the compiler will generate cleanup code for us.
Ideally the users would write
assert(i < 100);
and the compiler will generate cleanup, but
it cannot do it without explicit __attribute__((cleanup(..)
And if the users have to write it then
if (i < 100) return;
is just more natural without any complexity.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-07  3:36                 ` Alexei Starovoitov
@ 2025-05-07  5:04                   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-05-07  5:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni,
	Tejun Heo

On Wed, 7 May 2025 at 05:36, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 6, 2025 at 7:11 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > We will unwind the callback, return control into kfunc, it will clean
> > itself up and return to prog, and then we continue unwinding when it
> > returns into caller BPF program.
> > And we'll do exactly this with stubs as well.
> >
> > Program stub callback will return to kfunc, kfunc will return to
> > program, and replaced return address causes jump into stub again.
>
> Ok, so this is a new proposal where unwind would
> jump back into kfunc, but it will also replace the address
> where kfunc would have returned with a jump to 2nd phase
> of unwinder.

I don't see why it's new, but fine.
Even today, when we throw, we eventually return to the kernel with a retq.
In my mind it's the same thing we'll do intermediate unwinds.

Yes, we'll modify the return addresses to jump back into unwinder, but
it has to be done either way.
Even for stubs, you're rewriting the return address.

Back when this was added, we didn't add support for callback unwinding
because we were having a similar discussion.
Back then, the proposal was similar, a return value to let kernel
callback loops detect when to stop iterating and start returning.
And then you'd continue unwinding once in.
We decided we can table it for the time being since callbacks were
potentially less common users of whatever was being added.

We can argue about details but on a high level it's the same idea.

> So for
> prog A -> kern -> prog B -> kern -> prog C.
> In the first phase the unwinder will deal with C,
> then let kern continue as normal code, but the kernel
> will 'ret' back into unwinder machinery, so it can
> continue to unwind B, then another jump back to unwinder
> machinery to unwind A.

Yes, and stubs would work the same way.
In both cases you have the unwinder entry / stub entry, and modify all
return addresses in each BPF frame.

Then it finds and executes logic for that pc and unwinds frames until
it hits a boundary.
Or executes the rest of the program and returns.

Code that does this return address rewriting can be oblivious to the
underlying mechanism.
It does rewrite of return addresses in a loop and just relies on logic
being invoked.
Kernel will simply execute the rest of the logic and return, all the way up.

>
> Yes, it can be made to work, but this is quite complex,
> and it's really a combination of unwind and fast-execute
> through kernel bits.
>
> When the whole thing is fast executed there is only one
> step to adjust all return addresses on the stack, and then
> everything just flows to the kernel proper.
> Much simpler than going back to unwind machinery
> and without a seesaw pattern of stack usage.
>

Fine, I think I've said what I wanted to say.

> > If we add it, sched-ext will replace their MEMBER_VPTR and other
> > associated macro hacks.
> > https://github.com/sched-ext/scx/blob/main/scheds/include/scx/common.bpf.h#L227
>
> I really doubt it. It's fighting the verifier because the verifier
> needs to see that bounds check. That's why it's macro with asm.
>

Sure, it needs to see it. But that was the whole point.
That's the part we'll assert and not handle the other case, which
verifier forces us to handle.
Any complex program is littered with such cases.

> > > bpf_throw may stay as a kfunc with fast execute underneath or
> >
> > Yes, by having users write clean up code which is what they are doing today.
> > Which defeats the whole point of supporting assertions.
> > Then it's a kfunc with a misleading name, it's
> > bpf_fail_everything_from_this_point().
>
> Indeed. That's a better name. 'throw' has an incorrect analogy.
>
> > You do bpf_assert(i < N) and don't write anything to handle the case
> > where i >= N, and arr[i] is not well-formed.
> > This is the benefit to program writers: they can assert something as
> > true and don't have to "appease" the verifier by writing alternative
> > paths in the program.
>
> That indeed was the goal of assertions, but I think it was explored
> and it failed.
> #define bpf_assert(cond) if (!(cond)) bpf_throw(0);
>
> should have worked, but the compiler doesn't care about keeping
> "if (!(cond))" like that.
> All the asm volatile ("if .. were added
> in __bpf_assert() form and in bpf_cmp_likely().
> And they didn't succeed.
> They're too unnatural for C code.
> All uses remained in selftests.

Again, this is chicken and egg.
Resource cleanup doesn't happen so it cannot be used by anyone.
The verifier just rejects the program.

We have all other variants for integers doing things in assembly so
the compiler doesn't optimize things away.
And they work reliably, there are tests ensuring that.

"Didn't succeed" is the correct verdict when something was implemented
fully, put into the hands of people, and failed to gain traction.

The same goes for bpf_cmp_likely, success and failure is not intrinsic
to the concept alone.
Most people don't even know about the macro.
It's tucked somewhere deep in selftests with no visibility.
Most people don't know they need to copy bpf_experimental.h as part of
their BPF workflow to gain access to useful primitives.
We have no stdlib, so things go unnoticed.

There is no documentation or blog post describing when it's useful.
If they did and were told when it's useful, I'm sure people would make
use of it when they hit a related problem.
Unless you're an expert at following verifier output you're out of luck.
Most people are not even reading the mailing list, so they have no way
of catching up with all the new developments.

>
> > Does Rust keep executing the rest of the program when assert_eq!(x, y) fails?
> > Whether we use tables to do it or have the compiler emit them or do
> > anything else (how) is immaterial to the basic requirement (what).
>
> The high level language can do it, because it's done at a compiler level.
> rust->bpf compiler can be made to support it, but we deal with C and
> bpf_assert(i, <, 100);
> arr[i] = ..;
> didn't fly.
>
> > - Assertions are unnecessary.
> >  - Then we don't have to continue discussing further, and we can just
> > focus on fast-execute, and phase out bpf_throw().
> > - They are useful, but not necessary now.
>
> They could have been useful, but I think we already explored
> and concluded that it's very hard to do true C++like throw()
> in the verifier alone. Too many corner cases.

Sure, there are tradeoffs.
It's the same tradeoff of performing program analysis on compiled
low-level bytecode that the verifier deals with in general, much of
the program structure and semantics are lost.

The main problem was non-converging paths, but there were ways to fix
it without "merging" tables.

When Eduard and I discussed a compiler solution long back, the
conclusion was to do:
1) prep tables simply by tracking which resources are where, so map
verifier_state to a distilled struct.
2) fix non-convergence by spilling to unique locations in the stack.
No table merging complexity.
This works even when the compiler provides no support, so we make it
work for all existing programs.
And replace 1) with a table of compiler supplied landing pads when
provided by it.
The verifier will just explore if the landing pad did the right thing
(series of release kfunc calls) before terminating path exploration.

>
> I also doubt that we can go with a compiler assisted approach where
> the compiler will generate cleanup code for us.
> Ideally the users would write
> assert(i < 100);
> and the compiler will generate cleanup, but
> it cannot do it without explicit __attribute__((cleanup(..)
> And if the users have to write it then
> if (i < 100) return;
> is just more natural without any complexity.

But both statements are not the same.
It has no complexity only when you don't have to release anything.
So it's a misnomer.

If you have the compiler's assistance, you can make it do it for you.
I don't why explicit tagging is needed if we're lifting the compiler.

Resources which need to be released are initialized to a valid state before use.
Objects on the stack which are acquired across a throw will have
themselves cleaned up (and initialized accordingly if necessary).
And we'll just generate a landing pad corresponding to the throw where
they are destroyed.
This landing pad is never reachable by normal code.
Just that we won't emit "destructor" for the normal code unless
explicitly tagged with attribute((cleanup(...))).

I think every other bytecode VM managed to figure it out, including
WASM (and they did end up adding support to llvm).
Many other languages do a more powerful shift/reset and layer exceptions on top.
The nature of the problem is different for all of them than us, but
there's nothing special about our case that precludes a compiler
solution.

WG14 will sort of bake all this in the standard with panic and defer.
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2589.pdf (slide 19)
When the thread panic's, everything tagged with defer statement will
have its clean up logic done.

Anyhow, I've made my point and I think the thread has run its course.
Let's just continue with this approach, and remove the half-baked
support we have right now for exceptions, while we're at it.
It's just dead code since we don't plan on supporting the feature.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
                   ` (4 preceding siblings ...)
  2025-05-05 20:13 ` [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Kumar Kartikeya Dwivedi
@ 2025-05-07 18:15 ` Zvi Effron
  2025-05-07 20:01   ` Alexei Starovoitov
  5 siblings, 1 reply; 30+ messages in thread
From: Zvi Effron @ 2025-05-07 18:15 UTC (permalink / raw)
  To: Raj Sahu
  Cc: bpf, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
	djwillia, miloc, ericts, rahult, doniaghazy, quanzhif, jinghao7,
	sidchintamaneni, memxor

On Sun, Apr 20, 2025 at 3:56 AM Raj Sahu <rjsu26@gmail.com> wrote:
>
> From: Raj <rjsu26@gmail.com>
>
> Motivation:
> We propose an enhancement to the BPF infrastructure to address the
> critical issue of long-running BPF programs [1,2,3,4]. While the verifier
> ensures termination by restricting instruction count and backward edges, the
> total execution time of BPF programs is not bounded. Certain helper functions
> and iterators can result in extended runtimes, affecting system performance.
>
> The existing BPF infrastructure verifies that programs will not indefinitely
> loop or leak resources. However, helper functions such as `bpf_loop`,
> `bpf_for_each_map_elem`, and other iterative or expensive kernel interactions
> can cause runtimes that significantly degrade system performance [6]. Current
> detaching mechanisms do not immediately terminate running instances,
> monopolizing CPU. Therefore, a termination solution is necessary to swiftly
> terminate execution while safely releasing resources.
>
> Existing termination approach like the BPF Exception or Runtime hooks [5] have
> the issue of either lack of dynamism or having runtime overheads: BPF
> Exceptions: Introduces bpf_throw() and exception callbacks, requiring stack
> unwinding and exception state management. Cleanup can only be done for
> pre-defined cancellation points.  Runtime Hooks: Proposes watchdog timers, which
> requires resource tracking, though minimal but non-zero runtime overheads.
>
> Design:
> We introduce the Fast-Path termination mechanism, leveraging the
> verifier's guarantees regarding control flow and resource management. The
> approach dynamically patches running BPF programs with a stripped-down version
> that accelerates termination. This can be used to terminate any given instance
> of a BPF execution. Key elements include:
>
> - Implicit Lifetime Management: Utilizing the verifier’s inherent control flow
>   and resource cleanup paths encoded within the BPF program structure,
>   eliminating the need for explicit garbage collection or unwinding tables.
>
> - Dynamic Program Patching: At runtime, BPF programs are atomically patched,
>   replacing expensive helper calls with stubbed versions (fast fall-through
>   implementations). This ensures swift termination without compromising safety
>   or leaking resources.
>
> - Helper Function Adjustments: Certain helper functions (e.g., `bpf_loop`,
>   `bpf_for_each_map_elem`) include  mechanisms to facilitate early exits through
>   modified return values.
>

I understand that the motivation for this proposal is kernel safety, so perhaps
my concern simply doesn't matter in that context, but I'm concerned about the
possibility of data corruption, specifically data stored in maps.

There are many ways to write data into maps that (especially with JIT) do not
end up going through any helper functions. For example, once a pointer to a map
value has been obtained, it can simply be written to update the map. That means
there is no helper to be patched to prevent the write.

My understanding is that with the Fast-Path termination mechanism, those write
instructions will still be executed and will still update the map. But if the
values they are writing are dependent on the results of any patched function
calls, the values will not be the intended ones which will result in data
corruption. This corruption would not impact the safety of the kernel, but
could cause problems for userspace applications relying on the map data.

Is that a correct understanding? If so, is that a concern that should be
addressed/mitigated?

> TODOs:
> - Termination support for nested BPF programs.
> - Policy enforcements to control runtime of BPF programs in a system:
> - Timer based termination (watchdog)
>         - Userspace management to detect low-performing BPF program and
>           terminated them
>
> We haven’t added any selftests in the POC as this mail is mainly to get
> feedback on the design. Attaching link to sample BPF programs to
> validate the POC [7].  Styling will be taken care in next iteration.
>
> References:
> 1. https://lpc.events/event/17/contributions/1610/attachments/1229/2505/LPC_BPF_termination_Raj_Sahu.pdf
> 2. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/f0749daa-4560-41c9-9f36-6aa618161665/content
> 3. https://lore.kernel.org/bpf/AM6PR03MB508011599420DB53480E8BF799F72@AM6PR03MB5080.eurprd03.prod.outlook.com/T/
> 4. https://vtechworks.lib.vt.edu/server/api/core/bitstreams/7fb70c04-0736-4e2d-b48b-2d8d012bacfc/content
> 5. https://lwn.net/ml/all/AM6PR03MB5080513BFAEB54A93CC70D4399FE2@AM6PR03MB5080.eurprd03.prod.outlook.com/#t
> 6. https://people.cs.vt.edu/djwillia/papers/ebpf23-runtime.pdf
> 7. https://github.com/sidchintamaneni/os-dev-env/tree/main/bpf-programs-catalog/research/termination/patch_gen_testing
>
>  arch/x86/kernel/smp.c          |   4 +-
>  include/linux/bpf.h            |  18 ++
>  include/linux/filter.h         |  16 ++
>  include/linux/smp.h            |   2 +-
>  include/uapi/linux/bpf.h       |  13 ++
>  kernel/bpf/bpf_iter.c          |  65 ++++++
>  kernel/bpf/core.c              |  45 ++++
>  kernel/bpf/helpers.c           |   8 +
>  kernel/bpf/syscall.c           | 375 +++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c          |  16 +-
>  kernel/smp.c                   |  22 +-
>  tools/bpf/bpftool/prog.c       |  40 ++++
>  tools/include/uapi/linux/bpf.h |   5 +
>  tools/lib/bpf/bpf.c            |  15 ++
>  tools/lib/bpf/bpf.h            |  10 +
>  tools/lib/bpf/libbpf.map       |   1 +
>  16 files changed, 643 insertions(+), 12 deletions(-)
>
> --
> 2.43.0
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination
  2025-05-07 18:15 ` Zvi Effron
@ 2025-05-07 20:01   ` Alexei Starovoitov
  0 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-07 20:01 UTC (permalink / raw)
  To: Zvi Effron
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Siddharth Chintamaneni,
	Kumar Kartikeya Dwivedi

On Wed, May 7, 2025 at 11:15 AM Zvi Effron <zeffron@riotgames.com> wrote:
>
>
> My understanding is that with the Fast-Path termination mechanism, those write
> instructions will still be executed and will still update the map. But if the
> values they are writing are dependent on the results of any patched function
> calls, the values will not be the intended ones which will result in data
> corruption. This corruption would not impact the safety of the kernel, but
> could cause problems for userspace applications relying on the map data.
>
> Is that a correct understanding? If so, is that a concern that should be
> addressed/mitigated?

In broad strokes it's correct.
The fast execute approach will not be stubbing out unconditional calls.
Like all bpf_rcu_read_lock, bpf_spin_lock, etc will still be executed.
Anything that returns OR_NULL will be replaced with NULL.
Like bpf_obj_new() will return NULL.
Think of it as forced ENOMEM situation where everything that can fail
is artificially failing.
There surely will be logical bugs in the program, since programmers
rarely test the logic in cases of "impossible" failures.
The kernel itself is an example. syzbot reports due to failure
injection are here to stay.
if (kmalloc() == NULL) // just exit quickly
is a typical pattern for kernel code and for bpf progs.
In the latter case the verifier checks that the kernel is safe,
but the code may leave user space in an inconsistent state.
The fast execute approach will exacerbate this issue.
And I think it's an acceptable trade-off.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields
  2025-04-20 10:55 ` [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields Raj Sahu
@ 2025-05-13  0:04   ` Eduard Zingerman
  0 siblings, 0 replies; 30+ messages in thread
From: Eduard Zingerman @ 2025-05-13  0:04 UTC (permalink / raw)
  To: Raj Sahu
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor, sidchintamaneni

Raj Sahu <rjsu26@gmail.com> writes:

Hi Raj,

Sorry for delayed response, finally got to read through this series.
Please find a few comments below and in patch #3.
I understand that things are in an incomplete state atm.

[...]

> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index ba6b6118cf50..27dcf59f4445 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c

[...]

> @@ -135,6 +160,16 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
>  	mutex_init(&fp->aux->dst_mutex);
>  
>  	return fp;
> +
> +free_per_cpu_state:
> +	kfree(termination_states->per_cpu_state);
> +free_bpf_termination_states:
> +	kfree(termination_states);
> +free_bpf_struct_ptr_alloc:

Nit: In verifier code base such exit labels are usually collapsed as one,
     as free() functions can handle NULL arguments.

> +	free_percpu(fp->active);
> +	vfree(fp);
> +	kfree(aux);
> +	return NULL;
>  }
>  
>  struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
> @@ -282,6 +317,13 @@ void __bpf_prog_free(struct bpf_prog *fp)
>  		kfree(fp->aux->poke_tab);
>  		kfree(fp->aux);
>  	}
> +
> +	if (fp->termination_states) {
> +		kfree(fp->termination_states->pre_execution_state);
> +		kfree(fp->termination_states->per_cpu_state);
> +		kfree(fp->termination_states);
> +	}
> +

Does this need special handling in core.c:bpf_prog_realloc ?
Also, is it possible to use alloc_percpu_gfp()/free_percpu() functions
for these fields?

>  	free_percpu(fp->stats);
>  	free_percpu(fp->active);
>  	vfree(fp);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-04-20 10:55 ` [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination Raj Sahu
@ 2025-05-13  0:07   ` Eduard Zingerman
  2025-05-13  0:20     ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Eduard Zingerman @ 2025-05-13  0:07 UTC (permalink / raw)
  To: Raj Sahu
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, djwillia, miloc,
	ericts, rahult, doniaghazy, quanzhif, jinghao7, sidchintamaneni,
	memxor

Raj Sahu <rjsu26@gmail.com> writes:

[...]

> diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
> index 380e9a7cac75..e5dceebfb302 100644
> --- a/kernel/bpf/bpf_iter.c
> +++ b/kernel/bpf/bpf_iter.c
> @@ -6,6 +6,7 @@
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>  #include <linux/rcupdate_trace.h>
> +#include <asm/unwind.h>
>  
>  struct bpf_iter_target_info {
>  	struct list_head list;
> @@ -775,6 +776,70 @@ const struct bpf_func_proto bpf_loop_proto = {
>  	.arg4_type	= ARG_ANYTHING,
>  };
>  
> +BPF_CALL_4(bpf_loop_termination, u32, nr_loops, void *, callback_fn,
> +		void *, callback_ctx, u64, flags)
> +{
> +	/* Since a patched BPF program for termination will want
> +	 * to finish as fast as possible,
> +	 * we simply don't run any loop in here.
> +	 */
> +
> +	/* Restoring the callee-saved registers and exit.
> +	 * Hacky/ err prone way of restoring the registers.
> +	 * We are figuring out a way to have arch independent
> +	 * way to do this.
> +	 */
> +
> +	asm volatile(
> +	"pop %rbx\n\t"
> +	"pop %rbp\n\t"
> +	"pop %r12\n\t"
> +	"pop %r13\n\t"
> +	);

Why do anything special here?
I'd expect a bpf_loop() version simplified to a single return to work.

> +
> +	return 0;
> +}

[...]

> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c

[...]

> +
> +static bool find_in_skiplist(int func_id) {

Nit: "skip list" is a name of an (unrelated) data structure,
     maybe disambiguate the name here?

> +	return is_verifier_inlined_function(func_id) ||
> +	       is_debug_function(func_id) ||
> +	       is_resource_release_function(func_id);
> +}
> +
> +static int get_replacement_helper(int func_id, enum bpf_return_type ret_type) {
> +
> +	switch (func_id) {
> +		case BPF_FUNC_loop:
> +			return BPF_FUNC_loop_termination;
> +		case BPF_FUNC_for_each_map_elem:
> +		case BPF_FUNC_user_ringbuf_drain:
> +			return -ENOTSUPP;
> +	}
> +
> +	switch (ret_type) {
> +		case RET_VOID:
> +			return BPF_FUNC_dummy_void;
> +		case RET_INTEGER:
> +			return BPF_FUNC_dummy_int;
> +		case RET_PTR_TO_MAP_VALUE_OR_NULL:
> +			return BPF_FUNC_dummy_ptr_to_map;
> +		case RET_PTR_TO_SOCKET_OR_NULL:
> +		case RET_PTR_TO_TCP_SOCK_OR_NULL:
> +		case RET_PTR_TO_SOCK_COMMON_OR_NULL:
> +		case RET_PTR_TO_RINGBUF_MEM_OR_NULL:
> +		case RET_PTR_TO_DYNPTR_MEM_OR_NULL:
> +		case RET_PTR_TO_BTF_ID_OR_NULL:
> +		case RET_PTR_TO_BTF_ID_TRUSTED:
> +		case RET_PTR_TO_MAP_VALUE:
> +		case RET_PTR_TO_SOCKET:
> +		case RET_PTR_TO_TCP_SOCK:
> +		case RET_PTR_TO_SOCK_COMMON:
> +		case RET_PTR_TO_MEM:
> +		case RET_PTR_TO_MEM_OR_BTF_ID:
> +		case RET_PTR_TO_BTF_ID:

I'm curious what's the plan for RET_PTR_TO_BTF_ID?
verifier.c:check_ptr_tp_btf_access() has the following comment:

  /* By default any pointer obtained from walking a trusted pointer is no
   * longer trusted, unless the field being accessed has explicitly been
   * marked as inheriting its parent's state of trust (either full or RCU).
   * For example:
   * 'cgroups' pointer is untrusted if task->cgroups dereference
   * happened in a sleepable program outside of bpf_rcu_read_lock()
   * section. In a non-sleepable program it's trusted while in RCU CS (aka MEM_RCU).
   * Note bpf_rcu_read_unlock() converts MEM_RCU pointers to PTR_UNTRUSTED.
   *
   * A regular RCU-protected pointer with __rcu tag can also be deemed
   * trusted if we are in an RCU CS. Such pointer can be NULL.
   */

> +		default:
> +			return -ENOTSUPP;
> +	}
> +}
> +
> +static void patch_generator(struct bpf_prog *prog)
> +{
> +	struct call_insn_aux{
> +		int insn_idx;
> +		int replacement_helper;
> +	};
> +
> +	struct call_insn_aux *call_indices;
> +	int num_calls=0;
> +	call_indices = vmalloc(sizeof(call_indices) * prog->len);

clang-tidy spotted a warning here, the expression should be `sizeof(*call_indices)`.

> +
> +	/* Find all call insns */
> +	for(int insn_idx =0 ;insn_idx < prog->len; insn_idx++)
> +	{
> +		struct bpf_insn *insn = &prog->insnsi[insn_idx] ;
> +		u8 class = BPF_CLASS(insn->code);
> +		if (class == BPF_JMP || class == BPF_JMP32) {
> +			if (BPF_OP(insn->code) == BPF_CALL){
> +				if (insn->src_reg == BPF_PSEUDO_CALL) {
> +					continue;
> +				}
> +				if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL){ /*kfunc */
> +					// TODO Need to use btf for getting proto
> +					// If release function --> skip
> +					// If acquire function --> find return type and add to list
> +				}
> +				else {
> +					int func_id = insn->imm;
> +					const struct bpf_func_proto *fn = NULL;
> +					int new_helper_id = -1;
> +
> +					if (find_in_skiplist(func_id)) {
> +						continue;
> +					}
> +
> +					fn = bpf_verifier_ops[prog->type]->get_func_proto(func_id, prog);

Nit: please reuse verifier.c:get_helper_proto() utility function here.

> +					if (!fn && !fn->func) {
> +						continue;
> +					}
> +
> +					new_helper_id = get_replacement_helper(func_id, fn->ret_type);
> +					if (new_helper_id < 0) {
> +						continue;

Nit: propagate error?

> +					}
> +
> +					call_indices[num_calls].insn_idx = insn_idx;
> +					call_indices[num_calls].replacement_helper= new_helper_id;
> +					num_calls++;
> +				}
> +			}
> +		}
> +	}
> +
> +	/* Patch all call insns */
> +	for(int k =0; k < num_calls; k++){
> +		prog->insnsi[call_indices[k].insn_idx].imm = call_indices[k].replacement_helper;

Nit: each instruction is visited only once, so it appears that patching
     can be done in-place w/o need for call_indices array.

> +	}
> +}
> +
> +static bool create_termination_prog(struct bpf_prog *prog,
> +					union bpf_attr *attr,
> +					bpfptr_t uattr,
> +					u32 uattr_size)
> +{
> +	if (prog->len < 10)
> +		return false;
> +
> +	int err;
> +	struct bpf_prog *patch_prog;
> +	patch_prog = bpf_prog_alloc_no_stats(bpf_prog_size(prog->len), 0);
> +	if (!patch_prog) {
> +		return false;
> +	}
> +
> +	patch_prog->termination_states->is_termination_prog = true;
> +
> +	err = clone_bpf_prog(patch_prog, prog);
> +	if (err)
> +			goto free_termination_prog;
> +
> +	patch_generator(patch_prog);
> +
> +	err = bpf_check(&patch_prog, attr, uattr, uattr_size);

There might be issues with program verification if bpf_check is called
for a patched program. For example, for a full implementation you'd need
to handle kfuncs, replacing loops like:

  while (bpf_iter_next()) {}

with loops like:

  while (dummy_return_null()) {}

won't work as verifier would assume that the loop is infinite.
In general, check_helper_call, check_kfunc_call and is_state_visited
have logic for specific helpers/kfuncs that modifies verifier state
basing not only on the helper return value type.

What do you plan to do with functions that unconditionally take
resources, e.g. bpf_spin_lock()?

- From verification point of view:
  this function is RET_VOID and is not in
  find_in_skiplist(), patch_generator() would replace its call with a
  dummy. However, a corresponding bpf_spin_unlock() would remain and thus
  bpf_check() will exit with error.
  So, you would need some special version of bpf_check, that collects
  all resources needed for program translation (e.g. maps), but does
  not perform semantic checks.
  Or patch_generator() has to be called for a program that is already
  verified.

- From termination point of view:
  this function cannot be replaced with a dummy w/o removing
  corresponding unlock. Do you plan to leave these as-is?

> +	if (err) {
> +		goto free_termination_prog;
> +	}
> +
> +	patch_prog = bpf_prog_select_runtime(patch_prog, &err);
> +	if (err) {
> +		goto free_termination_prog;
> +	}
> +
> +	prog->termination_states->patch_prog = patch_prog;
> +	return true;
> +
> +free_termination_prog:
> +	free_percpu(patch_prog->stats);
> +	free_percpu(patch_prog->active);
> +	kfree(patch_prog->aux);
> +	return false;


In case of a load failure of the main program bpf_prog_load_load() does
much more cleanup.

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 54c6953a8b84..57b4fd1f6a72 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[...]

> @@ -22502,7 +22506,14 @@ static struct bpf_prog *inline_bpf_loop(struct bpf_verifier_env *env,
>  	 */
>  	insn_buf[cnt++] = BPF_MOV64_REG(BPF_REG_1, reg_loop_cnt);
>  	insn_buf[cnt++] = BPF_MOV64_REG(BPF_REG_2, reg_loop_ctx);
> -	insn_buf[cnt++] = BPF_CALL_REL(0);
> +
> +	if (termination_states && termination_states->is_termination_prog) {
> +		/* In a termination BPF prog, we want to exit - set R0 = 1 */
> +		insn_buf[cnt++] = BPF_MOV64_IMM(BPF_REG_0, 1);
> +	} else {
> +		insn_buf[cnt++] = BPF_CALL_REL(0);
> +	}
> +

Q: Do you need 1:1 instruction pointer correspondence between original
   and patched programs?
   Since you patch inlined bpf_loop instead of disallowing inlining in
   the patched program I assume you do. In this case, there is no
   guarantee that instructions above have the same size after jit.

>  	/* increment loop counter */
>  	insn_buf[cnt++] = BPF_ALU64_IMM(BPF_ADD, reg_loop_cnt, 1);
>  	/* jump to loop header if callback returned 0 */

[...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-13  0:07   ` Eduard Zingerman
@ 2025-05-13  0:20     ` Alexei Starovoitov
  2025-05-13  4:40       ` Eduard Zingerman
  2025-05-13  5:16       ` Siddharth Chintamaneni
  0 siblings, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-13  0:20 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Dan Williams, miloc, ericts, rahult, doniaghazy, quanzhif,
	Jinghao Jia, Siddharth Chintamaneni, Kumar Kartikeya Dwivedi

On Mon, May 12, 2025 at 5:07 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
>
> - From verification point of view:
>   this function is RET_VOID and is not in
>   find_in_skiplist(), patch_generator() would replace its call with a
>   dummy. However, a corresponding bpf_spin_unlock() would remain and thus
>   bpf_check() will exit with error.
>   So, you would need some special version of bpf_check, that collects
>   all resources needed for program translation (e.g. maps), but does
>   not perform semantic checks.
>   Or patch_generator() has to be called for a program that is already
>   verified.

No. let's not parametrize bpf_check.

Here is what I proposed earlier in the thread:

the verifier should just remember all places where kfuncs
and helpers return _OR_NULL,
then when the verification is complete, copy the prog,
replaces 'call kfunc/help' with 'call stub',
run two JITs, and compare JIT artifacts
to make sure IPs match.

But thinking about it more...
I'm not sure any more that it's a good idea to fast execute
the program on one cpu and let it continue running as-is on
all other cpus including future invocations on this cpu.
So far the reasons to terminate bpf program:
- timeout in rqspinlock
- fault in arena
- some future watchdog

In all cases the program is buggy, so it's safer
from kernel pov and from data integrity pov to stop
all instances now and prevent future invocations.
So I think we should patch the prog text in run-time
without cloning.

The verifier should prepare an array of patches in
text_poke_bp_batch() format and when timeout/fault detected
do one call to text_poke_bp_batch() to stub out the whole prog.

At least on x86 we always emit nop5 in the prologue,
so we can patch it with goto exit as well.
Then the prog will be completely gutted.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-13  0:20     ` Alexei Starovoitov
@ 2025-05-13  4:40       ` Eduard Zingerman
  2025-05-13  5:16       ` Siddharth Chintamaneni
  1 sibling, 0 replies; 30+ messages in thread
From: Eduard Zingerman @ 2025-05-13  4:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Raj Sahu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Dan Williams, miloc, ericts, rahult, doniaghazy, quanzhif,
	Jinghao Jia, Siddharth Chintamaneni, Kumar Kartikeya Dwivedi

On Mon, 2025-05-12 at 17:20 -0700, Alexei Starovoitov wrote:
> On Mon, May 12, 2025 at 5:07 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > 
> > 
> > - From verification point of view:
> >   this function is RET_VOID and is not in
> >   find_in_skiplist(), patch_generator() would replace its call with a
> >   dummy. However, a corresponding bpf_spin_unlock() would remain and thus
> >   bpf_check() will exit with error.
> >   So, you would need some special version of bpf_check, that collects
> >   all resources needed for program translation (e.g. maps), but does
> >   not perform semantic checks.
> >   Or patch_generator() has to be called for a program that is already
> >   verified.
> 
> No. let's not parametrize bpf_check.
> 
> Here is what I proposed earlier in the thread:
> 
> the verifier should just remember all places where kfuncs
> and helpers return _OR_NULL,
> then when the verification is complete, copy the prog,
> replaces 'call kfunc/help' with 'call stub',
> run two JITs, and compare JIT artifacts
> to make sure IPs match.

Makes sense, much cleaner compared to special version of bpf_check().

> But thinking about it more...
> I'm not sure any more that it's a good idea to fast execute
> the program on one cpu and let it continue running as-is on
> all other cpus including future invocations on this cpu.
> So far the reasons to terminate bpf program:
> - timeout in rqspinlock
> - fault in arena
> - some future watchdog
> 
> In all cases the program is buggy, so it's safer
> from kernel pov and from data integrity pov to stop
> all instances now and prevent future invocations.
> So I think we should patch the prog text in run-time
> without cloning.

Stopping all program instances makes sense.

However, this is orthogonal to preparing dummy version of the program.
Atomically converting program to dummy might be useful from the same
data integrity pov.

Also, patching for dummy might be an effort to make program side
effect free, e.g. by masking any map access.

Still curious about PTR_TO_BTF_ID and bpf_spin_lock().

[...]



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-13  0:20     ` Alexei Starovoitov
  2025-05-13  4:40       ` Eduard Zingerman
@ 2025-05-13  5:16       ` Siddharth Chintamaneni
  2025-05-15  1:59         ` Alexei Starovoitov
  1 sibling, 1 reply; 30+ messages in thread
From: Siddharth Chintamaneni @ 2025-05-13  5:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eduard Zingerman, Raj Sahu, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

On Mon, 12 May 2025 at 17:20, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 12, 2025 at 5:07 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> >
> > - From verification point of view:
> >   this function is RET_VOID and is not in
> >   find_in_skiplist(), patch_generator() would replace its call with a
> >   dummy. However, a corresponding bpf_spin_unlock() would remain and thus
> >   bpf_check() will exit with error.
> >   So, you would need some special version of bpf_check, that collects
> >   all resources needed for program translation (e.g. maps), but does
> >   not perform semantic checks.
> >   Or patch_generator() has to be called for a program that is already
> >   verified.
>
> No. let's not parametrize bpf_check.
>
> Here is what I proposed earlier in the thread:
>
> the verifier should just remember all places where kfuncs
> and helpers return _OR_NULL,
> then when the verification is complete, copy the prog,
> replaces 'call kfunc/help' with 'call stub',
> run two JITs, and compare JIT artifacts
> to make sure IPs match.
>

This is something that we've experimented with last week
https://github.com/sidchintamaneni/bpf/commits/bpf_term/v2_exploration/.
We did the cloning part after `do_misc_fixups` and before
`fixup_call_args` inside
bpf_check().

> But thinking about it more...
> I'm not sure any more that it's a good idea to fast execute
> the program on one cpu and let it continue running as-is on
> all other cpus including future invocations on this cpu.
> So far the reasons to terminate bpf program:
> - timeout in rqspinlock
> - fault in arena
> - some future watchdog
>

Also long running interators, for example -
https://github.com/sidchintamaneni/os-dev-env/blob/main/bpf-programs-catalog/research/termination/patch_gen_testing/bpf_loop_lr.kern.c
Eventhough this is just an example, this could be possible when
iterating through a map which grows unconditionally.

> In all cases the program is buggy, so it's safer
> from kernel pov and from data integrity pov to stop
> all instances now and prevent future invocations.
> So I think we should patch the prog text in run-time
> without cloning.
>

Yes, this is something that we had in mind:
1. Terminate the program on a single CPU
2. Terminate the program on all CPUs and de-link it

Single CPU termination could be useful when a BPF program is using a
per-CPU map and the map on a single CPU grows, causing the iterator to
take a lot of time.

> The verifier should prepare an array of patches in
> text_poke_bp_batch() format and when timeout/fault detected
> do one call to text_poke_bp_batch() to stub out the whole prog.
>
Do you mean creating different versions of patches and applying one of
them based on the current execution state?

> At least on x86 we always emit nop5 in the prologue,
> so we can patch it with goto exit as well.
> Then the prog will be completely gutted.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-13  5:16       ` Siddharth Chintamaneni
@ 2025-05-15  1:59         ` Alexei Starovoitov
  2025-05-15 17:43           ` Andrii Nakryiko
  2025-05-17  5:01           ` Raj Sahu
  0 siblings, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-15  1:59 UTC (permalink / raw)
  To: Siddharth Chintamaneni
  Cc: Eduard Zingerman, Raj Sahu, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

On Mon, May 12, 2025 at 10:16 PM Siddharth Chintamaneni
<sidchintamaneni@gmail.com> wrote:
>
> On Mon, 12 May 2025 at 17:20, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, May 12, 2025 at 5:07 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > >
> > >
> > > - From verification point of view:
> > >   this function is RET_VOID and is not in
> > >   find_in_skiplist(), patch_generator() would replace its call with a
> > >   dummy. However, a corresponding bpf_spin_unlock() would remain and thus
> > >   bpf_check() will exit with error.
> > >   So, you would need some special version of bpf_check, that collects
> > >   all resources needed for program translation (e.g. maps), but does
> > >   not perform semantic checks.
> > >   Or patch_generator() has to be called for a program that is already
> > >   verified.
> >
> > No. let's not parametrize bpf_check.
> >
> > Here is what I proposed earlier in the thread:
> >
> > the verifier should just remember all places where kfuncs
> > and helpers return _OR_NULL,
> > then when the verification is complete, copy the prog,
> > replaces 'call kfunc/help' with 'call stub',
> > run two JITs, and compare JIT artifacts
> > to make sure IPs match.
> >
>
> This is something that we've experimented with last week
> https://github.com/sidchintamaneni/bpf/commits/bpf_term/v2_exploration/.
> We did the cloning part after `do_misc_fixups` and before
> `fixup_call_args` inside
> bpf_check().

Something like that but it needs to handle both helpers and kfuncs,
and you don't need new fake helpers.
text_poke_bp_batch() doesn't care what insn is being patched.

> > But thinking about it more...
> > I'm not sure any more that it's a good idea to fast execute
> > the program on one cpu and let it continue running as-is on
> > all other cpus including future invocations on this cpu.
> > So far the reasons to terminate bpf program:
> > - timeout in rqspinlock
> > - fault in arena
> > - some future watchdog
> >
>
> Also long running interators, for example -
> https://github.com/sidchintamaneni/os-dev-env/blob/main/bpf-programs-catalog/research/termination/patch_gen_testing/bpf_loop_lr.kern.c
> Eventhough this is just an example, this could be possible when
> iterating through a map which grows unconditionally.

In terms of detection this is a subset of watchdog.
In terms of termination we still need to figure out the best
path forward.
bpf_loop() may or may not be inlined.
If it's still a helper call then we can have per-prog "stop_me" flag,
but it will penalize run-time, and won't really work for
inlined (unless we force inlining logic to consult that flag
as well).
One option is to patch the callback subprog to return 1,
but the callback might not have a branch that returns 1.
Another option is to remember the insn that does:
        /* loop header,
         * if reg_loop_cnt >= reg_loop_max skip the loop body
         */
        insn_buf[cnt++] = BPF_JMP_REG(BPF_JGE, reg_loop_cnt, reg_loop_max, 5);

in inlined bpf_loop() and patch that insn with 'goto +5',
so inlined bpf_loop will terminate quickly.

> > In all cases the program is buggy, so it's safer
> > from kernel pov and from data integrity pov to stop
> > all instances now and prevent future invocations.
> > So I think we should patch the prog text in run-time
> > without cloning.
> >
>
> Yes, this is something that we had in mind:
> 1. Terminate the program on a single CPU
> 2. Terminate the program on all CPUs and de-link it
>
> Single CPU termination could be useful when a BPF program is using a
> per-CPU map and the map on a single CPU grows, causing the iterator to
> take a lot of time.

I think de-link (and detach) is difficult.
The context where an abnormal condition is detected (like watchdog)
may not allow detaching.
So I think replacing nop5 in the prologue with 'goto out'
is better.

>
> > The verifier should prepare an array of patches in
> > text_poke_bp_batch() format and when timeout/fault detected
> > do one call to text_poke_bp_batch() to stub out the whole prog.
> >
> Do you mean creating different versions of patches and applying one of
> them based on the current execution state?

I mean the verifier should prepare the whole batch with all insns
that needs to be patched and apply the whole thing at once
with one call to text_poke_bp_batch() that will affect all cpus.
It doesn't matter where the program was running on this cpu.
It might be running somewhere else on a different cpu.
text_poke_bp_batch() won't be atomic, but it doesn't have to be.
Every cpu might have a different fast-execute path to exit.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-15  1:59         ` Alexei Starovoitov
@ 2025-05-15 17:43           ` Andrii Nakryiko
  2025-05-17  5:01           ` Raj Sahu
  1 sibling, 0 replies; 30+ messages in thread
From: Andrii Nakryiko @ 2025-05-15 17:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Siddharth Chintamaneni, Eduard Zingerman, Raj Sahu, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Dan Williams,
	miloc, ericts, rahult, doniaghazy, quanzhif, Jinghao Jia,
	Kumar Kartikeya Dwivedi

On Wed, May 14, 2025 at 6:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 12, 2025 at 10:16 PM Siddharth Chintamaneni
> <sidchintamaneni@gmail.com> wrote:
> >
> > On Mon, 12 May 2025 at 17:20, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, May 12, 2025 at 5:07 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > >
> > > >
> > > > - From verification point of view:
> > > >   this function is RET_VOID and is not in
> > > >   find_in_skiplist(), patch_generator() would replace its call with a
> > > >   dummy. However, a corresponding bpf_spin_unlock() would remain and thus
> > > >   bpf_check() will exit with error.
> > > >   So, you would need some special version of bpf_check, that collects
> > > >   all resources needed for program translation (e.g. maps), but does
> > > >   not perform semantic checks.
> > > >   Or patch_generator() has to be called for a program that is already
> > > >   verified.
> > >
> > > No. let's not parametrize bpf_check.
> > >
> > > Here is what I proposed earlier in the thread:
> > >
> > > the verifier should just remember all places where kfuncs
> > > and helpers return _OR_NULL,
> > > then when the verification is complete, copy the prog,
> > > replaces 'call kfunc/help' with 'call stub',
> > > run two JITs, and compare JIT artifacts
> > > to make sure IPs match.
> > >
> >
> > This is something that we've experimented with last week
> > https://github.com/sidchintamaneni/bpf/commits/bpf_term/v2_exploration/.
> > We did the cloning part after `do_misc_fixups` and before
> > `fixup_call_args` inside
> > bpf_check().
>
> Something like that but it needs to handle both helpers and kfuncs,
> and you don't need new fake helpers.
> text_poke_bp_batch() doesn't care what insn is being patched.
>
> > > But thinking about it more...
> > > I'm not sure any more that it's a good idea to fast execute
> > > the program on one cpu and let it continue running as-is on
> > > all other cpus including future invocations on this cpu.
> > > So far the reasons to terminate bpf program:
> > > - timeout in rqspinlock
> > > - fault in arena
> > > - some future watchdog
> > >
> >
> > Also long running interators, for example -
> > https://github.com/sidchintamaneni/os-dev-env/blob/main/bpf-programs-catalog/research/termination/patch_gen_testing/bpf_loop_lr.kern.c
> > Eventhough this is just an example, this could be possible when
> > iterating through a map which grows unconditionally.
>
> In terms of detection this is a subset of watchdog.
> In terms of termination we still need to figure out the best
> path forward.
> bpf_loop() may or may not be inlined.
> If it's still a helper call then we can have per-prog "stop_me" flag,
> but it will penalize run-time, and won't really work for
> inlined (unless we force inlining logic to consult that flag
> as well).
> One option is to patch the callback subprog to return 1,
> but the callback might not have a branch that returns 1.
> Another option is to remember the insn that does:
>         /* loop header,
>          * if reg_loop_cnt >= reg_loop_max skip the loop body
>          */
>         insn_buf[cnt++] = BPF_JMP_REG(BPF_JGE, reg_loop_cnt, reg_loop_max, 5);
>
> in inlined bpf_loop() and patch that insn with 'goto +5',
> so inlined bpf_loop will terminate quickly.
>
> > > In all cases the program is buggy, so it's safer
> > > from kernel pov and from data integrity pov to stop
> > > all instances now and prevent future invocations.
> > > So I think we should patch the prog text in run-time
> > > without cloning.
> > >
> >
> > Yes, this is something that we had in mind:
> > 1. Terminate the program on a single CPU
> > 2. Terminate the program on all CPUs and de-link it
> >
> > Single CPU termination could be useful when a BPF program is using a
> > per-CPU map and the map on a single CPU grows, causing the iterator to
> > take a lot of time.
>
> I think de-link (and detach) is difficult.
> The context where an abnormal condition is detected (like watchdog)
> may not allow detaching.
> So I think replacing nop5 in the prologue with 'goto out'
> is better.
>

We do have a "defunct link" state for some BPF link types, like XDP
and a bunch of others, for example. If the interface to which XDP
program was attached gets removed, you'll still have a BPF link
remaining, but it won't really be attached anymore. We also have
BPF_LINK_DETACH command (meant to be used by admins to force-detach
links), which is basically the same concept. Original application will
still have link FD, and that link object will exist in the kernel, but
it will be "defunc" with no underlying BPF program and no attachment
to its original BPF hook.

That's just to say that we can extend that to other BPF link types, if
necessary.

> >
> > > The verifier should prepare an array of patches in
> > > text_poke_bp_batch() format and when timeout/fault detected
> > > do one call to text_poke_bp_batch() to stub out the whole prog.
> > >
> > Do you mean creating different versions of patches and applying one of
> > them based on the current execution state?
>
> I mean the verifier should prepare the whole batch with all insns
> that needs to be patched and apply the whole thing at once
> with one call to text_poke_bp_batch() that will affect all cpus.
> It doesn't matter where the program was running on this cpu.
> It might be running somewhere else on a different cpu.
> text_poke_bp_batch() won't be atomic, but it doesn't have to be.
> Every cpu might have a different fast-execute path to exit.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-15  1:59         ` Alexei Starovoitov
  2025-05-15 17:43           ` Andrii Nakryiko
@ 2025-05-17  5:01           ` Raj Sahu
  2025-05-19  1:20             ` Alexei Starovoitov
  1 sibling, 1 reply; 30+ messages in thread
From: Raj Sahu @ 2025-05-17  5:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Siddharth Chintamaneni, Eduard Zingerman, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

> In terms of detection this is a subset of watchdog.
> In terms of termination we still need to figure out the best
> path forward.
> bpf_loop() may or may not be inlined.
> If it's still a helper call then we can have per-prog "stop_me" flag,
> but it will penalize run-time, and won't really work for
> inlined (unless we force inlining logic to consult that flag
> as well).
> One option is to patch the callback subprog to return 1,
> but the callback might not have a branch that returns 1.
> Another option is to remember the insn that does:
>         /* loop header,
>          * if reg_loop_cnt >= reg_loop_max skip the loop body
>          */
>         insn_buf[cnt++] = BPF_JMP_REG(BPF_JGE, reg_loop_cnt, reg_loop_max, 5);
>
> in inlined bpf_loop() and patch that insn with 'goto +5',
> so inlined bpf_loop will terminate quickly.
>
We indeed thought about modifying the subprog to return 1 but realised
it would be complex. Currently the design does 2 things:
   1. For Inlining case we set the R0 = 1 which is later checked and forces
       the for loop to stop.

@@ -22502,7 +22506,14 @@ static struct bpf_prog
*inline_bpf_loop(struct bpf_verifier_env *env,

+ if (termination_states && termination_states->is_termination_prog) {
+                 /* In a termination BPF prog, we want to exit - set R0 = 1 */
+                 insn_buf[cnt++] = BPF_MOV64_IMM(BPF_REG_0, 1);
+ } else {
+                 insn_buf[cnt++] = BPF_CALL_REL(0);
+ }
+

   2. For non-inlining cases we just modified the bpf_loop helper.
This implies 2 things:
            2.1 If bpf_loop is yet to be executed, then bingo. We just return.
            2.2 If we are currently inside a bpf_loop, we must be
iteratively calling the callback function
                  (which btw is also stubbed out of all helpers and bpf_loop).
                  So the already running bpf_loop won't be doing any
expensive helper calls or issuing
                  further loop calls and the long-running BPF
program's execution time will be brought down
                  significantly.

> I mean the verifier should prepare the whole batch with all insns
> that needs to be patched and apply the whole thing at once
> with one call to text_poke_bp_batch() that will affect all cpus.
> It doesn't matter where the program was running on this cpu.
> It might be running somewhere else on a different cpu.
> text_poke_bp_batch() won't be atomic, but it doesn't have to be.
> Every cpu might have a different fast-execute path to exit.

While I was going through the definition of text_poke_bp_batch, I am
worried if, say, a running
instruction ends up getting modified midway with another instruction,
making the CPU read
half of insn1 and half of insn2. That would end up faulting !?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-17  5:01           ` Raj Sahu
@ 2025-05-19  1:20             ` Alexei Starovoitov
  2025-05-28  7:10               ` Raj Sahu
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-05-19  1:20 UTC (permalink / raw)
  To: Raj Sahu
  Cc: Siddharth Chintamaneni, Eduard Zingerman, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

On Fri, May 16, 2025 at 10:01 PM Raj Sahu <rjsu26@gmail.com> wrote:
>
> > In terms of detection this is a subset of watchdog.
> > In terms of termination we still need to figure out the best
> > path forward.
> > bpf_loop() may or may not be inlined.
> > If it's still a helper call then we can have per-prog "stop_me" flag,
> > but it will penalize run-time, and won't really work for
> > inlined (unless we force inlining logic to consult that flag
> > as well).
> > One option is to patch the callback subprog to return 1,
> > but the callback might not have a branch that returns 1.
> > Another option is to remember the insn that does:
> >         /* loop header,
> >          * if reg_loop_cnt >= reg_loop_max skip the loop body
> >          */
> >         insn_buf[cnt++] = BPF_JMP_REG(BPF_JGE, reg_loop_cnt, reg_loop_max, 5);
> >
> > in inlined bpf_loop() and patch that insn with 'goto +5',
> > so inlined bpf_loop will terminate quickly.
> >
> We indeed thought about modifying the subprog to return 1 but realised
> it would be complex. Currently the design does 2 things:
>    1. For Inlining case we set the R0 = 1 which is later checked and forces
>        the for loop to stop.
>
> @@ -22502,7 +22506,14 @@ static struct bpf_prog
> *inline_bpf_loop(struct bpf_verifier_env *env,
>
> + if (termination_states && termination_states->is_termination_prog) {
> +                 /* In a termination BPF prog, we want to exit - set R0 = 1 */
> +                 insn_buf[cnt++] = BPF_MOV64_IMM(BPF_REG_0, 1);
> + } else {
> +                 insn_buf[cnt++] = BPF_CALL_REL(0);
> + }
> +

That's exactly the case I was concerned about earlier.
These two insns might generate different JIT images,
so to prepare patches for text_poke_bp_batch()
the verifier needs to coordinate with JIT.
Replacing call with mov is a danger zone.
iirc "mov %eax, 1" is 5 bytes, just like "call foo" is 5 bytes.
But this is pure luck.
We should only replace a call with a call or a jmp or a nop.
See map_poke_track/untrack/run logic.
Maybe we can reuse some of it.

>
>    2. For non-inlining cases we just modified the bpf_loop helper.
> This implies 2 things:
>             2.1 If bpf_loop is yet to be executed, then bingo. We just return.
>             2.2 If we are currently inside a bpf_loop, we must be
> iteratively calling the callback function
>                   (which btw is also stubbed out of all helpers and bpf_loop).
>                   So the already running bpf_loop won't be doing any
> expensive helper calls or issuing
>                   further loop calls and the long-running BPF
> program's execution time will be brought down
>                   significantly.
>
> > I mean the verifier should prepare the whole batch with all insns
> > that needs to be patched and apply the whole thing at once
> > with one call to text_poke_bp_batch() that will affect all cpus.
> > It doesn't matter where the program was running on this cpu.
> > It might be running somewhere else on a different cpu.
> > text_poke_bp_batch() won't be atomic, but it doesn't have to be.
> > Every cpu might have a different fast-execute path to exit.
>
> While I was going through the definition of text_poke_bp_batch, I am
> worried if, say, a running
> instruction ends up getting modified midway with another instruction,
> making the CPU read
> half of insn1 and half of insn2. That would end up faulting !?

text_poke_bp() takes care of that.
That's what the "_bp" suffix signifies. It's modifying live text
via 'bp' (breakpoint). It has a multistep process to make it safe.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-19  1:20             ` Alexei Starovoitov
@ 2025-05-28  7:10               ` Raj Sahu
  2025-06-04 23:02                 ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Raj Sahu @ 2025-05-28  7:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Siddharth Chintamaneni, Eduard Zingerman, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

> That's exactly the case I was concerned about earlier.
> These two insns might generate different JIT images,
> so to prepare patches for text_poke_bp_batch()
> the verifier needs to coordinate with JIT.
> Replacing call with mov is a danger zone.
> iirc "mov %eax, 1" is 5 bytes, just like "call foo" is 5 bytes.
> But this is pure luck.
> We should only replace a call with a call or a jmp or a nop.
> See map_poke_track/untrack/run logic.
> Maybe we can reuse some of it.
.
.
> text_poke_bp() takes care of that.
> That's what the "_bp" suffix signifies. It's modifying live text
> via 'bp' (breakpoint). It has a multistep process to make it safe.

We were exploring the suggested design on a high-level.
The idea of moving patch generation after verification is really
simplifying the code.  We are also exploring whether we can
move it all the way after JIT.
There is a clarification needed.
If we need to support global termination, we need the call sites
(prepared during load-time) and pass it to text_poke_queue() which
will take care of patching the call instructions in-memory.
In this case, the only information we need is the call sites and not
the patched program (which we can free-up towards the end of program
load once the call_site information is obtained).

However, if we want to support per-CPU termination (for the case of a
super large per-CPU map causing large runtime, etc), we will need the
patch to stay around.
In this case, the termination handler (bpf_die in code) will perform
the `rip` change and stack modifications.

So, are we looking to support both, or just global termination?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-05-28  7:10               ` Raj Sahu
@ 2025-06-04 23:02                 ` Alexei Starovoitov
  2025-06-10 22:06                   ` Siddharth Chintamaneni
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-06-04 23:02 UTC (permalink / raw)
  To: Raj Sahu
  Cc: Siddharth Chintamaneni, Eduard Zingerman, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

On Wed, May 28, 2025 at 12:11 AM Raj Sahu <rjsu26@gmail.com> wrote:
>
> > That's exactly the case I was concerned about earlier.
> > These two insns might generate different JIT images,
> > so to prepare patches for text_poke_bp_batch()
> > the verifier needs to coordinate with JIT.
> > Replacing call with mov is a danger zone.
> > iirc "mov %eax, 1" is 5 bytes, just like "call foo" is 5 bytes.
> > But this is pure luck.
> > We should only replace a call with a call or a jmp or a nop.
> > See map_poke_track/untrack/run logic.
> > Maybe we can reuse some of it.
> .
> .
> > text_poke_bp() takes care of that.
> > That's what the "_bp" suffix signifies. It's modifying live text
> > via 'bp' (breakpoint). It has a multistep process to make it safe.
>
> We were exploring the suggested design on a high-level.
> The idea of moving patch generation after verification is really
> simplifying the code.  We are also exploring whether we can
> move it all the way after JIT.
> There is a clarification needed.
> If we need to support global termination, we need the call sites
> (prepared during load-time) and pass it to text_poke_queue() which
> will take care of patching the call instructions in-memory.
> In this case, the only information we need is the call sites and not
> the patched program (which we can free-up towards the end of program
> load once the call_site information is obtained).
>
> However, if we want to support per-CPU termination (for the case of a
> super large per-CPU map causing large runtime, etc), we will need the
> patch to stay around.
> In this case, the termination handler (bpf_die in code) will perform
> the `rip` change and stack modifications.
>
> So, are we looking to support both, or just global termination?

I would keep things simple and support global termination only.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-06-04 23:02                 ` Alexei Starovoitov
@ 2025-06-10 22:06                   ` Siddharth Chintamaneni
  2025-06-11 15:59                     ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Siddharth Chintamaneni @ 2025-06-10 22:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Raj Sahu, Eduard Zingerman, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

<SNIP>

> > > text_poke_bp() takes care of that.
> > > That's what the "_bp" suffix signifies. It's modifying live text
> > > via 'bp' (breakpoint). It has a multistep process to make it safe.
> >

We are almost there with our next iteration, but we are stuck with a
WARNING that is getting triggered when calling
smp_text_poke_batch_patch which internally call


smp_call_function_many_cond during a watchdog callback trigger.
https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815

Please let us know if you have any suggestions around this?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination
  2025-06-10 22:06                   ` Siddharth Chintamaneni
@ 2025-06-11 15:59                     ` Alexei Starovoitov
  0 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-06-11 15:59 UTC (permalink / raw)
  To: Siddharth Chintamaneni
  Cc: Raj Sahu, Eduard Zingerman, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Dan Williams, miloc, ericts, rahult,
	doniaghazy, quanzhif, Jinghao Jia, Kumar Kartikeya Dwivedi

On Tue, Jun 10, 2025 at 3:07 PM Siddharth Chintamaneni
<sidchintamaneni@gmail.com> wrote:
>
> <SNIP>
>
> > > > text_poke_bp() takes care of that.
> > > > That's what the "_bp" suffix signifies. It's modifying live text
> > > > via 'bp' (breakpoint). It has a multistep process to make it safe.
> > >
>
> We are almost there with our next iteration, but we are stuck with a
> WARNING that is getting triggered when calling
> smp_text_poke_batch_patch which internally call
>
>
> smp_call_function_many_cond during a watchdog callback trigger.
> https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815
>
> Please let us know if you have any suggestions around this?

That warn is correct. You're doing something wrong, but it's
impossible to suggest the fix without seeing patches.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-06-11 15:59 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-20 10:55 [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Raj Sahu
2025-04-20 10:55 ` [RFC bpf-next 1/4] bpf: Introduce new structs and struct fields Raj Sahu
2025-05-13  0:04   ` Eduard Zingerman
2025-04-20 10:55 ` [RFC bpf-next 2/4] bpftool/libbpf : Introduce bpf_prog_termination to trigger termination signal Raj Sahu
2025-04-20 10:55 ` [RFC bpf-next 3/4] bpf: Generating a stubbed version of BPF program for termination Raj Sahu
2025-05-13  0:07   ` Eduard Zingerman
2025-05-13  0:20     ` Alexei Starovoitov
2025-05-13  4:40       ` Eduard Zingerman
2025-05-13  5:16       ` Siddharth Chintamaneni
2025-05-15  1:59         ` Alexei Starovoitov
2025-05-15 17:43           ` Andrii Nakryiko
2025-05-17  5:01           ` Raj Sahu
2025-05-19  1:20             ` Alexei Starovoitov
2025-05-28  7:10               ` Raj Sahu
2025-06-04 23:02                 ` Alexei Starovoitov
2025-06-10 22:06                   ` Siddharth Chintamaneni
2025-06-11 15:59                     ` Alexei Starovoitov
2025-04-20 10:55 ` [RFC bpf-next 4/4] bpf: Runtime part of fast-path termination approach Raj Sahu
2025-05-05 20:13 ` [RFC bpf-next 0/4] bpf: Fast-Path approach for BPF program Termination Kumar Kartikeya Dwivedi
2025-05-06  5:55   ` Raj Sahu
2025-05-06 22:45     ` Alexei Starovoitov
2025-05-06 22:59       ` Kumar Kartikeya Dwivedi
2025-05-07  0:32         ` Alexei Starovoitov
2025-05-07  0:38           ` Kumar Kartikeya Dwivedi
2025-05-07  1:15             ` Alexei Starovoitov
2025-05-07  2:10               ` Kumar Kartikeya Dwivedi
2025-05-07  3:36                 ` Alexei Starovoitov
2025-05-07  5:04                   ` Kumar Kartikeya Dwivedi
2025-05-07 18:15 ` Zvi Effron
2025-05-07 20:01   ` Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).