* [PATCH bpf-next 0/2] Fix release of struct_ops map
@ 2024-11-08  8:26 Xu Kuohai
  2024-11-08  8:26 ` [PATCH bpf-next 1/2] bpf: " Xu Kuohai
  2024-11-08  8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release Xu Kuohai
  0 siblings, 2 replies; 8+ messages in thread
From: Xu Kuohai @ 2024-11-08  8:26 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Kui-Feng Lee

From: Xu Kuohai <xukuohai@huawei.com>

This series fixes a bug I found while doing RCU waiting cleanup for the
struct_ops map. When there is a sleepable prog in a struct_ops map, the map
risks being released while the prog is still running.

Xu Kuohai (2):
  bpf: Fix release of struct_ops map
  selftests/bpf: Add test for struct_ops map release

 kernel/bpf/bpf_struct_ops.c                   |  37 +++--
 kernel/bpf/syscall.c                          |   7 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  78 ++++++---
 .../bpf/bpf_testmod/bpf_testmod_kfunc.h       |   2 +-
 .../bpf/prog_tests/test_struct_ops_module.c   | 154 ++++++++++++++++++
 .../bpf/progs/struct_ops_map_release.c        |  30 ++++
 6 files changed, 267 insertions(+), 41 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_map_release.c

-- 
2.39.5



* [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
  2024-11-08  8:26 [PATCH bpf-next 0/2] Fix release of struct_ops map Xu Kuohai
@ 2024-11-08  8:26 ` Xu Kuohai
  2024-11-08 17:00   ` Alexei Starovoitov
  2024-11-08  8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release Xu Kuohai
  1 sibling, 1 reply; 8+ messages in thread
From: Xu Kuohai @ 2024-11-08  8:26 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Kui-Feng Lee

From: Xu Kuohai <xukuohai@huawei.com>

The test in the follow-up patch triggers the following kernel panic:

 Oops: int3: 0000 [#1] PREEMPT SMP PTI
 CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G           OE      6.12.0-rc4-gd1d187515
 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4
 RIP: 0010:0xffffffffc0015041
 Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc
 RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246
 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000
 RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000
 R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0
 FS:  00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? show_regs+0x68/0x80
  ? die+0x3b/0x90
  ? exc_int3+0xca/0xe0
  ? asm_exc_int3+0x3e/0x50
  run_struct_ops+0x58/0xb0 [bpf_testmod]
  param_attr_store+0xa2/0x100
  module_attr_store+0x25/0x40
  sysfs_kf_write+0x50/0x70
  kernfs_fop_write_iter+0x146/0x1f0
  vfs_write+0x27e/0x530
  ksys_write+0x75/0x100
  __x64_sys_write+0x1d/0x30
  x64_sys_call+0x1d30/0x1f30
  do_syscall_64+0x68/0x140
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f051831727f
 Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108
 RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f
 RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004
 RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b
 R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30
 R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000
  </TASK>

This happens because the sleepable prog is still running when the
struct_ops map is released.

To fix it, follow the approach used in bpf_tramp_image_put. First, before
releasing the struct_ops map, wait for a rcu_tasks_trace grace period so
that sleepable progs can finish. Then, wait for a rcu_tasks grace period so
that non-sleepable progs and the remaining trampoline insns can finish.

Additionally, switch to call_rcu to remove the synchronous waiting, as
suggested by Alexei and Martin.
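
In condensed form, the resulting free path chains two RCU callbacks. A
rough sketch (relying on the existing bpf_map_put() handling of
free_after_mult_rcu_gp, not new code):

  /* bpf_map_put(): free_after_mult_rcu_gp was set at map_alloc time */
  call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);

  /* runs after a rcu_tasks_trace gp: sleepable progs have finished */
  /* bpf_map_free_mult_rcu_gp(), for BPF_MAP_TYPE_STRUCT_OPS maps: */
  call_rcu_tasks(&map->rcu, bpf_map_free_rcu_gp);

  /* runs after a rcu_tasks gp: non-sleepable progs and the last
   * trampoline insns have finished, bpf_map_free_rcu_gp() frees the map
   */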

Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------
 kernel/bpf/syscall.c        |  7 ++++++-
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index fda3dd2ee984..dd5f9bf12791 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	 */
 	if (btf_is_module(st_map->btf))
 		module_put(st_map->st_ops_desc->st_ops->owner);
-
-	/* The struct_ops's function may switch to another struct_ops.
-	 *
-	 * For example, bpf_tcp_cc_x->init() may switch to
-	 * another tcp_cc_y by calling
-	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
-	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
-	 * and its refcount may reach 0 which then free its
-	 * trampoline image while tcp_cc_x is still running.
-	 *
-	 * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
-	 * to finish. bpf-tcp-cc prog is non sleepable.
-	 * A rcu_tasks gp is to wait for the last few insn
-	 * in the tramopline image to finish before releasing
-	 * the trampoline image.
-	 */
-	synchronize_rcu_mult(call_rcu, call_rcu_tasks);
-
 	__bpf_struct_ops_map_free(map);
 }
 
@@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	mutex_init(&st_map->lock);
 	bpf_map_init_from_attr(map, attr);
 
+	/* The struct_ops's function may switch to another struct_ops.
+	 *
+	 * For example, bpf_tcp_cc_x->init() may switch to
+	 * another tcp_cc_y by calling
+	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
+	 * and its refcount may reach 0 which then free its
+	 * trampoline image while tcp_cc_x is still running.
+	 *
+	 * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp
+	 * is to wait for sleepable progs in the map to finish. Then a
+	 * rcu_tasks gp is to wait for the normal progs and the last few
+	 * insns in the trampoline image to finish before releasing the
+	 * trampoline image.
+	 *
+	 * Also see the comment in function bpf_tramp_image_put.
+	 */
+	WRITE_ONCE(map->free_after_mult_rcu_gp, true);
+
 	return map;
 
 errout_free:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8254b2973157..ae927f087f04 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
 
 static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
 {
-	if (rcu_trace_implies_rcu_gp())
+	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
+
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
+		/* See comment in the end of bpf_struct_ops_map_alloc */
+		call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
+	else if (rcu_trace_implies_rcu_gp())
 		bpf_map_free_rcu_gp(rcu);
 	else
 		call_rcu(rcu, bpf_map_free_rcu_gp);
-- 
2.39.5



* [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-08  8:26 [PATCH bpf-next 0/2] Fix release of struct_ops map Xu Kuohai
  2024-11-08  8:26 ` [PATCH bpf-next 1/2] bpf: " Xu Kuohai
@ 2024-11-08  8:26 ` Xu Kuohai
  2024-11-08 19:39   ` Martin KaFai Lau
  1 sibling, 1 reply; 8+ messages in thread
From: Xu Kuohai @ 2024-11-08  8:26 UTC (permalink / raw)
  To: bpf, netdev
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Kui-Feng Lee

From: Xu Kuohai <xukuohai@huawei.com>

Add a test to ensure that a struct_ops map is released safely. Without the
previous fix, this test triggers a kernel panic due to a use-after-free.

The test runs multiple threads concurrently. Some threads create and
destroy struct_ops maps, while others execute the progs within the maps.
Each map has two progs: one sleepable and the other non-sleepable.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  78 ++++++---
 .../bpf/bpf_testmod/bpf_testmod_kfunc.h       |   2 +-
 .../bpf/prog_tests/test_struct_ops_module.c   | 154 ++++++++++++++++++
 .../bpf/progs/struct_ops_map_release.c        |  30 ++++
 4 files changed, 242 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_map_release.c

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 987d41af71d2..72d21e8ba5d4 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -1016,6 +1016,11 @@ __bpf_kfunc int bpf_kfunc_st_ops_inc10(struct st_ops_args *args)
 	return args->a;
 }
 
+__bpf_kfunc void bpf_kfunc_msleep(u32 msecs)
+{
+	msleep(msecs);
+}
+
 BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
 BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -1056,6 +1061,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_prologue, KF_TRUSTED_ARGS | KF_SLEEPABL
 BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_pro_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_kfunc_st_ops_inc10, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_kfunc_msleep, KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
 
 static int bpf_testmod_ops_init(struct btf *btf)
@@ -1096,6 +1102,29 @@ static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
 	.is_valid_access = bpf_testmod_ops_is_valid_access,
 };
 
+static int bpf_testmod_test_1(void)
+{
+	return 0;
+}
+
+static void bpf_testmod_test_2(int a, int b)
+{
+}
+
+static int bpf_testmod_ops__test_maybe_null(int dummy,
+					    struct task_struct *task__nullable)
+{
+	return 0;
+}
+
+static struct bpf_testmod_ops __bpf_testmod_ops = {
+	.test_1 = bpf_testmod_test_1,
+	.test_2 = bpf_testmod_test_2,
+	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
+};
+
+static struct bpf_testmod_ops *__bpf_dummy_ops = &__bpf_testmod_ops;
+
 static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
 {
 	struct bpf_testmod_ops *ops = kdata;
@@ -1108,20 +1137,14 @@ static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
 	if (ops->test_2)
 		ops->test_2(4, ops->data);
 
-	return 0;
-}
-
-static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
-{
-}
+	WRITE_ONCE(__bpf_dummy_ops, ops);
 
-static int bpf_testmod_test_1(void)
-{
 	return 0;
 }
 
-static void bpf_testmod_test_2(int a, int b)
+static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
 {
+	WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
 }
 
 static int bpf_testmod_tramp(int value)
@@ -1129,18 +1152,6 @@ static int bpf_testmod_tramp(int value)
 	return 0;
 }
 
-static int bpf_testmod_ops__test_maybe_null(int dummy,
-					    struct task_struct *task__nullable)
-{
-	return 0;
-}
-
-static struct bpf_testmod_ops __bpf_testmod_ops = {
-	.test_1 = bpf_testmod_test_1,
-	.test_2 = bpf_testmod_test_2,
-	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
-};
-
 struct bpf_struct_ops bpf_bpf_testmod_ops = {
 	.verifier_ops = &bpf_testmod_verifier_ops,
 	.init = bpf_testmod_ops_init,
@@ -1375,6 +1386,31 @@ static void bpf_testmod_exit(void)
 	unregister_bpf_testmod_uprobe();
 }
 
+static int run_struct_ops(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+	unsigned int repeat;
+	struct bpf_testmod_ops *ops;
+
+	ret = kstrtouint(val, 10, &repeat);
+	if (ret)
+		return ret;
+
+	if (repeat > 10000)
+		return -ERANGE;
+
+	while (repeat-- > 0) {
+		ops = READ_ONCE(__bpf_dummy_ops);
+		if (ops->test_1)
+			ops->test_1();
+		if (ops->test_2)
+			ops->test_2(0, 0);
+	}
+
+	return 0;
+}
+
+module_param_call(run_struct_ops, run_struct_ops, NULL, NULL, 0200);
 module_init(bpf_testmod_init);
 module_exit(bpf_testmod_exit);
 
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
index b58817938deb..260faebd5633 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -158,5 +158,5 @@ void bpf_kfunc_trusted_vma_test(struct vm_area_struct *ptr) __ksym;
 void bpf_kfunc_trusted_task_test(struct task_struct *ptr) __ksym;
 void bpf_kfunc_trusted_num_test(int *ptr) __ksym;
 void bpf_kfunc_rcu_task_test(struct task_struct *ptr) __ksym;
-
+void bpf_kfunc_msleep(__u32 msecs) __ksym;
 #endif /* _BPF_TESTMOD_KFUNC_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
index 75a0dea511b3..df744d51cade 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
@@ -9,6 +9,7 @@
 #include "struct_ops_nulled_out_cb.skel.h"
 #include "struct_ops_forgotten_cb.skel.h"
 #include "struct_ops_detach.skel.h"
+#include "struct_ops_map_release.skel.h"
 #include "unsupported_ops.skel.h"
 
 static void check_map_info(struct bpf_map_info *info)
@@ -246,6 +247,157 @@ static void test_struct_ops_forgotten_cb(void)
 	struct_ops_forgotten_cb__destroy(skel);
 }
 
+struct test_context {
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+	int total_threads;
+	int wait_threads;
+	int dead_threads;
+	int repeat;
+	int loop;
+};
+
+static int wait_others(struct test_context *ctx)
+{
+	int ret = 0;
+
+	pthread_mutex_lock(&ctx->mutex);
+
+	if (ctx->dead_threads) {
+		pthread_cond_broadcast(&ctx->cond);
+		pthread_mutex_unlock(&ctx->mutex);
+		return -1;
+	}
+
+	++ctx->wait_threads;
+	if (ctx->wait_threads >= ctx->total_threads) {
+		pthread_cond_broadcast(&ctx->cond);
+		ctx->wait_threads = 0;
+	} else {
+		pthread_cond_wait(&ctx->cond, &ctx->mutex);
+		if (ctx->dead_threads)
+			ret = -1;
+	}
+
+	pthread_mutex_unlock(&ctx->mutex);
+
+	return ret;
+}
+
+static void mark_dead(struct test_context *ctx)
+{
+	pthread_mutex_lock(&ctx->mutex);
+	ctx->dead_threads++;
+	pthread_cond_broadcast(&ctx->cond);
+	pthread_mutex_unlock(&ctx->mutex);
+}
+
+static int load_release(struct test_context *ctx)
+{
+	int ret = 0;
+	struct bpf_link *link = NULL;
+	struct struct_ops_map_release *skel = NULL;
+
+	skel = struct_ops_map_release__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load")) {
+		ret = -1;
+		mark_dead(ctx);
+		goto out;
+	}
+
+	link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
+	if (!ASSERT_OK_PTR(link, "attach_struct_ops")) {
+		ret = -1;
+		mark_dead(ctx);
+		goto out;
+	}
+
+	if (wait_others(ctx)) {
+		ret = -1;
+		goto out;
+	}
+
+out:
+	bpf_link__destroy(link);
+	struct_ops_map_release__destroy(skel);
+	return ret;
+}
+
+static void *thread_load_release(void *arg)
+{
+	struct test_context *ctx = (struct test_context *)arg;
+
+	for (int i = 0; i < ctx->loop; i++)
+		if (load_release(ctx))
+			break;
+	return NULL;
+}
+
+static void *thread_run_prog(void *arg)
+{
+	int fd;
+	int len;
+	char buf[8];
+	struct test_context *ctx = (struct test_context *)arg;
+
+	fd = open("/sys/module/bpf_testmod/parameters/run_struct_ops", O_WRONLY);
+	if (!ASSERT_OK_FD(fd, "open run_struct_ops for write")) {
+		mark_dead(ctx);
+		return NULL;
+	}
+
+	len = snprintf(buf, sizeof(buf), "%d", ctx->repeat);
+	if (!ASSERT_GT(len, 0, "snprintf repeat number")) {
+		mark_dead(ctx);
+		goto out;
+	}
+
+	for (int i = 0; i < ctx->loop; i++) {
+		if (wait_others(ctx))
+			goto out;
+		if (!ASSERT_EQ(write(fd, buf, len), len, "write file")) {
+			mark_dead(ctx);
+			goto out;
+		}
+	}
+
+out:
+	close(fd);
+	return NULL;
+}
+
+#define NR_REL_THREAD	2
+#define NR_RUN_THREAD	8
+#define NR_THREAD	(NR_REL_THREAD + NR_RUN_THREAD)
+#define NR_REPEAT	4
+#define NR_LOOP		5
+
+static void test_struct_ops_map_release(void)
+{
+	int i, j;
+	pthread_t t[NR_THREAD];
+	struct test_context ctx = {
+		.loop = NR_LOOP,
+		.repeat = NR_REPEAT,
+		.total_threads = NR_THREAD,
+		.wait_threads = 0,
+		.dead_threads = 0,
+	};
+
+	pthread_mutex_init(&ctx.mutex, NULL);
+	pthread_cond_init(&ctx.cond, NULL);
+
+	j = 0;
+	for (i = 0; i < NR_REL_THREAD; i++)
+		pthread_create(&t[j++], NULL, thread_load_release, &ctx);
+
+	for (i = 0; i < NR_RUN_THREAD; i++)
+		pthread_create(&t[j++], NULL, thread_run_prog, &ctx);
+
+	for (i = 0; i < NR_THREAD; i++)
+		pthread_join(t[i], NULL);
+}
+
 /* Detach a link from a user space program */
 static void test_detach_link(void)
 {
@@ -310,6 +462,8 @@ void serial_test_struct_ops_module(void)
 		test_struct_ops_nulled_out_cb();
 	if (test__start_subtest("struct_ops_forgotten_cb"))
 		test_struct_ops_forgotten_cb();
+	if (test__start_subtest("struct_ops_map_release"))
+		test_struct_ops_map_release();
 	if (test__start_subtest("test_detach_link"))
 		test_detach_link();
 	RUN_TESTS(unsupported_ops);
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_map_release.c b/tools/testing/selftests/bpf/progs/struct_ops_map_release.c
new file mode 100644
index 000000000000..78aa5e1875b6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_map_release.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024 Huawei Technologies Co., Ltd */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "../bpf_testmod/bpf_testmod.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops.s/test_1")
+int BPF_PROG(test_1_prog)
+{
+	bpf_kfunc_msleep(100);
+	return 0;
+}
+
+SEC("struct_ops/test_2")
+int BPF_PROG(test_2_prog, int a, int b)
+{
+	return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops testmod_ops = {
+	.test_1 = (void *)test_1_prog,
+	.test_2 = (void *)test_2_prog
+};
-- 
2.39.5



* Re: [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
  2024-11-08  8:26 ` [PATCH bpf-next 1/2] bpf: " Xu Kuohai
@ 2024-11-08 17:00   ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2024-11-08 17:00 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Yonghong Song, Kui-Feng Lee

On Fri, Nov 8, 2024 at 12:15 AM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@huawei.com>
>
> The test in the follow-up patch triggers the following kernel panic:
>
>  Oops: int3: 0000 [#1] PREEMPT SMP PTI
>  CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G           OE      6.12.0-rc4-gd1d187515
>  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4
>  RIP: 0010:0xffffffffc0015041
>  Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc
>  RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246
>  RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000
>  RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000
>  R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000
>  R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0
>  FS:  00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0
>  Call Trace:
>   <TASK>
>   ? show_regs+0x68/0x80
>   ? die+0x3b/0x90
>   ? exc_int3+0xca/0xe0
>   ? asm_exc_int3+0x3e/0x50
>   run_struct_ops+0x58/0xb0 [bpf_testmod]
>   param_attr_store+0xa2/0x100
>   module_attr_store+0x25/0x40
>   sysfs_kf_write+0x50/0x70
>   kernfs_fop_write_iter+0x146/0x1f0
>   vfs_write+0x27e/0x530
>   ksys_write+0x75/0x100
>   __x64_sys_write+0x1d/0x30
>   x64_sys_call+0x1d30/0x1f30
>   do_syscall_64+0x68/0x140
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>  RIP: 0033:0x7f051831727f
>  Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108
>  RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
>  RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f
>  RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004
>  RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b
>  R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30
>  R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000
>   </TASK>
>
> It's because the sleepable prog is still running when the struct_ops
> map is released.
>
> To fix it, follow the approach used in bpf_tramp_image_put. First,
> before release struct_ops map, wait a rcu_tasks_trace gp for sleepable
> progs to finish. Then, wait a rcu_tasks gp for normal progs and the
> rest trampoline insns to finish.
>
> Additionally, switch to call_rcu to remove the synchronous waiting,
> as suggested by Alexei and Martin.
>
> Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> ---
>  kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------
>  kernel/bpf/syscall.c        |  7 ++++++-
>  2 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index fda3dd2ee984..dd5f9bf12791 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>          */
>         if (btf_is_module(st_map->btf))
>                 module_put(st_map->st_ops_desc->st_ops->owner);
> -
> -       /* The struct_ops's function may switch to another struct_ops.
> -        *
> -        * For example, bpf_tcp_cc_x->init() may switch to
> -        * another tcp_cc_y by calling
> -        * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> -        * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
> -        * and its refcount may reach 0 which then free its
> -        * trampoline image while tcp_cc_x is still running.
> -        *
> -        * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
> -        * to finish. bpf-tcp-cc prog is non sleepable.
> -        * A rcu_tasks gp is to wait for the last few insn
> -        * in the tramopline image to finish before releasing
> -        * the trampoline image.
> -        */
> -       synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> -
>         __bpf_struct_ops_map_free(map);
>  }
>
> @@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
>         mutex_init(&st_map->lock);
>         bpf_map_init_from_attr(map, attr);
>
> +       /* The struct_ops's function may switch to another struct_ops.
> +        *
> +        * For example, bpf_tcp_cc_x->init() may switch to
> +        * another tcp_cc_y by calling
> +        * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> +        * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
> +        * and its refcount may reach 0 which then free its
> +        * trampoline image while tcp_cc_x is still running.
> +        *
> +        * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp
> +        * is to wait for sleepable progs in the map to finish. Then a
> +        * rcu_tasks gp is to wait for the normal progs and the last few
> +        * insns in the trampoline image to finish before releasing the
> +        * trampoline image.
> +        *
> +        * Also see the comment in function bpf_tramp_image_put.
> +        */
> +       WRITE_ONCE(map->free_after_mult_rcu_gp, true);
> +
>         return map;
>
>  errout_free:
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8254b2973157..ae927f087f04 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
>
>  static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
>  {
> -       if (rcu_trace_implies_rcu_gp())
> +       struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
> +
> +       if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
> +               /* See comment in the end of bpf_struct_ops_map_alloc */

The fix makes sense, but pls copy paste a sentence here instead
of the above comment. Like:
"
rcu_tasks gp is necessary to wait for struct_ops bpf trampoline to finish.
Unlike all other bpf trampolines struct_ops trampoline is not
protected by percpu_ref.
"
> +               call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
> +       else if (rcu_trace_implies_rcu_gp())

pw-bot: cr


* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-08  8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release Xu Kuohai
@ 2024-11-08 19:39   ` Martin KaFai Lau
  2024-11-09  8:40     ` Xu Kuohai
  0 siblings, 1 reply; 8+ messages in thread
From: Martin KaFai Lau @ 2024-11-08 19:39 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/8/24 12:26 AM, Xu Kuohai wrote:
> -static void bpf_testmod_test_2(int a, int b)
> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>   {
> +	WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>   }

[ ... ]

> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
> +{
> +	int ret;
> +	unsigned int repeat;
> +	struct bpf_testmod_ops *ops;
> +
> +	ret = kstrtouint(val, 10, &repeat);
> +	if (ret)
> +		return ret;
> +
> +	if (repeat > 10000)
> +		return -ERANGE;
> +
> +	while (repeat-- > 0) {
> +		ops = READ_ONCE(__bpf_dummy_ops);

I don't think the usual bpf_struct_ops implementation protects the registered
ops with only READ_ONCE and WRITE_ONCE. tcp-cc uses a refcnt+rcu. It seems hid
uses synchronize_srcu(). sched_ext seems to also use kthread_flush_work() to
wait for all ops calls to finish. Meaning I don't think
the current bpf_struct_ops unreg implementation will run into this issue for 
sleepable ops.
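
For example, a subsystem-side srcu scheme looks roughly like this (a sketch
of the general pattern with made-up names, not the actual hid code):

  static struct my_ops __rcu *registered_ops;
  DEFINE_STATIC_SRCU(ops_srcu);

  /* caller side: an srcu read section may sleep, so sleepable ops are fine */
  int idx = srcu_read_lock(&ops_srcu);
  struct my_ops *ops = srcu_dereference(registered_ops, &ops_srcu);

  if (ops && ops->some_op)
          ops->some_op();
  srcu_read_unlock(&ops_srcu, idx);

  /* unreg side: unpublish the ops, then wait for in-flight calls to return */
  rcu_assign_pointer(registered_ops, NULL);
  synchronize_srcu(&ops_srcu);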

The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for 
the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its own 
refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should have 
been introduced instead to return a new tcp-cc-ops to be used.

> +		if (ops->test_1)
> +			ops->test_1();
> +		if (ops->test_2)
> +			ops->test_2(0, 0);
> +	}
> +
> +	return 0;
> +}


* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-08 19:39   ` Martin KaFai Lau
@ 2024-11-09  8:40     ` Xu Kuohai
  2024-11-11 21:30       ` Martin KaFai Lau
  0 siblings, 1 reply; 8+ messages in thread
From: Xu Kuohai @ 2024-11-09  8:40 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>> -static void bpf_testmod_test_2(int a, int b)
>> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>>   {
>> +    WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>>   }
> 
> [ ... ]
> 
>> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
>> +{
>> +    int ret;
>> +    unsigned int repeat;
>> +    struct bpf_testmod_ops *ops;
>> +
>> +    ret = kstrtouint(val, 10, &repeat);
>> +    if (ret)
>> +        return ret;
>> +
>> +    if (repeat > 10000)
>> +        return -ERANGE;
>> +
>> +    while (repeat-- > 0) {
>> +        ops = READ_ONCE(__bpf_dummy_ops);
> 
> I don't think it is the usual bpf_struct_ops implementation which only uses READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use kthread_flush_work() to wait for 
> all ops calling finished. Meaning I don't think the current bpf_struct_ops unreg implementation will run into this issue for sleepable ops.
> 

Thanks for the explanation.

Are you saying that it's not the struct_ops framework's
responsibility to ensure the struct_ops map is not
released while it may still be in use? And that the "bug" in
this series should be "fixed" in the test, namely this
patch?

> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should have been introduced 
> instead to return a new tcp-cc-ops to be used.

I'm not quite clear on this, but from the description it seems that
the synchronize_rcu_mult(call_rcu, call_rcu_tasks) could
simply be removed in some way, with no need for a cleanup that
switches it to call_rcu.

> 
>> +        if (ops->test_1)
>> +            ops->test_1();
>> +        if (ops->test_2)
>> +            ops->test_2(0, 0);
>> +    }
>> +
>> +    return 0;
>> +}



* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-09  8:40     ` Xu Kuohai
@ 2024-11-11 21:30       ` Martin KaFai Lau
  2024-11-12 12:22         ` Xu Kuohai
  0 siblings, 1 reply; 8+ messages in thread
From: Martin KaFai Lau @ 2024-11-11 21:30 UTC (permalink / raw)
  To: Xu Kuohai
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/9/24 12:40 AM, Xu Kuohai wrote:
> On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
>> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>>> -static void bpf_testmod_test_2(int a, int b)
>>> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>>>   {
>>> +    WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>>>   }
>>
>> [ ... ]
>>
>>> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
>>> +{
>>> +    int ret;
>>> +    unsigned int repeat;
>>> +    struct bpf_testmod_ops *ops;
>>> +
>>> +    ret = kstrtouint(val, 10, &repeat);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    if (repeat > 10000)
>>> +        return -ERANGE;
>>> +
>>> +    while (repeat-- > 0) {
>>> +        ops = READ_ONCE(__bpf_dummy_ops);
>>
>> I don't think it is the usual bpf_struct_ops implementation which only uses 
>> READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a 
>> refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use 
>> kthread_flush_work() to wait for all ops calling finished. Meaning I don't 
>> think the current bpf_struct_ops unreg implementation will run into this issue 
>> for sleepable ops.
>>
> 
> Thanks for the explanation.
> 
> Are you saying that it's not the struct_ops framework's
> responsibility to ensure the struct_ops map is not
> released while it may be still in use? And the "bug" in
> this series should be "fixed" in the test, namely this
> patch?

Yeah, that is what I was trying to say. I don't think there is anything to fix.
Think about extending a subsystem with a kernel module: the subsystem also has
to do the needed protection itself during the unreg process. There is already a
bpf_try_module_get() to help the subsystem.
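
For a struct_ops-backed ops, bpf_try_module_get() also takes a refcnt on the
struct_ops map behind it, so the map cannot be freed while a user still holds
it. Roughly what tcp-cc does (simplified, not the exact tcp_cong.c code):

  /* when a socket starts using a cc ops */
  if (!bpf_try_module_get(ca, ca->owner))
          return -EBUSY;

  /* ... ops calls ... */

  /* when the socket is done with the cc ops */
  bpf_module_put(ca, ca->owner);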

>> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for 
>> the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its 
>> own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should 
>> have been introduced instead to return a new tcp-cc-ops to be used.
> 
> Not quite clear, but from the description, it seems that
> the synchronize_rcu_mult(call_rcu, call_rcu_tasks) could

This synchronize_rcu_mult is only needed for the tcp_congestion_ops
(bpf_tcp_ca.c). Maybe it is cleaner to just make a special case for
"tcp_congestion_ops" in st_ops->name in map_alloc and only set
free_after_mult_rcu_gp to true for this one case, so it won't slow down the
freeing of other struct_ops maps.
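
i.e. something along these lines in bpf_struct_ops_map_alloc (just a sketch
of the idea, exact check and placement to be worked out):

  /* only bpf-tcp-cc ops may drop their own refcnt from inside the ops,
   * so only this case still needs the extra grace periods before free
   */
  if (!strcmp(st_ops_desc->st_ops->name, "tcp_congestion_ops"))
          WRITE_ONCE(map->free_after_mult_rcu_gp, true);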

imo, the test in this patch is also not needed in its current form, since it is
not how kernel subsystems implement unreg in struct_ops.

> be just removed in some way, no need to do a cleanup to
> switch it to call_rcu.
> 
>>
>>> +        if (ops->test_1)
>>> +            ops->test_1();
>>> +        if (ops->test_2)
>>> +            ops->test_2(0, 0);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
> 



* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-11 21:30       ` Martin KaFai Lau
@ 2024-11-12 12:22         ` Xu Kuohai
  0 siblings, 0 replies; 8+ messages in thread
From: Xu Kuohai @ 2024-11-12 12:22 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/12/2024 5:30 AM, Martin KaFai Lau wrote:
> On 11/9/24 12:40 AM, Xu Kuohai wrote:
>> On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
>>> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>>>> -static void bpf_testmod_test_2(int a, int b)
>>>> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>>>>   {
>>>> +    WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>>>>   }
>>>
>>> [ ... ]
>>>
>>>> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
>>>> +{
>>>> +    int ret;
>>>> +    unsigned int repeat;
>>>> +    struct bpf_testmod_ops *ops;
>>>> +
>>>> +    ret = kstrtouint(val, 10, &repeat);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    if (repeat > 10000)
>>>> +        return -ERANGE;
>>>> +
>>>> +    while (repeat-- > 0) {
>>>> +        ops = READ_ONCE(__bpf_dummy_ops);
>>>
>>> I don't think it is the usual bpf_struct_ops implementation which only uses READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use kthread_flush_work() to wait 
>>> for all ops calling finished. Meaning I don't think the current bpf_struct_ops unreg implementation will run into this issue for sleepable ops.
>>>
>>
>> Thanks for the explanation.
>>
>> Are you saying that it's not the struct_ops framework's
>> responsibility to ensure the struct_ops map is not
>> released while it may be still in use? And the "bug" in
>> this series should be "fixed" in the test, namely this
>> patch?
> 
> Yeah, it is what I was trying to say. I don't think there is thing to fix. Think about extending a subsystem by a kernel module. The subsystem will also do the needed protection itself during the unreg process. There is already a bpf_try_module_get() to 
> help the subsystem.
>

Got it

>>> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should have been introduced 
>>> instead to return a new tcp-cc-ops to be used.
>>
>> Not quite clear, but from the description, it seems that
>> the synchronize_rcu_mult(call_rcu, call_rcu_tasks) could
> 
> This synchronize_rcu_mult is only need for the tcp_congestion_ops (bpf_tcp_ca.c). May be it is cleaner to just make a special case for "tcp_congestion_ops" in st_ops->name in map_alloc and only set free_after_mult_rcu_gp to TRUE for this one case, then it 
> won't slow down other struct_ops map freeing also.
>

OK, will give it a try, thanks.

> imo, the test in this patch is not needed in its current form also since it is not how the kernel subsystem implements unreg in struct_ops.
> 
>> be just removed in some way, no need to do a cleanup to
>> switch it to call_rcu.
>>
>>>
>>>> +        if (ops->test_1)
>>>> +            ops->test_1();
>>>> +        if (ops->test_2)
>>>> +            ops->test_2(0, 0);
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>

