* [PATCH bpf-next 0/2] Fix release of struct_ops map
@ 2024-11-08 8:26 Xu Kuohai
2024-11-08 8:26 ` [PATCH bpf-next 1/2] bpf: " Xu Kuohai
2024-11-08 8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release Xu Kuohai
0 siblings, 2 replies; 8+ messages in thread
From: Xu Kuohai @ 2024-11-08 8:26 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Kui-Feng Lee
From: Xu Kuohai <xukuohai@huawei.com>
This series fixes a bug I found while doing RCU waiting cleanup for the
struct_ops map. When a struct_ops map contains a sleepable prog, the map
risks being released while the prog is still running.
Xu Kuohai (2):
bpf: Fix release of struct_ops map
selftests/bpf: Add test for struct_ops map release
kernel/bpf/bpf_struct_ops.c | 37 +++--
kernel/bpf/syscall.c | 7 +-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 78 ++++++---
.../bpf/bpf_testmod/bpf_testmod_kfunc.h | 2 +-
.../bpf/prog_tests/test_struct_ops_module.c | 154 ++++++++++++++++++
.../bpf/progs/struct_ops_map_release.c | 30 ++++
6 files changed, 267 insertions(+), 41 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_map_release.c
--
2.39.5
* [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
2024-11-08 8:26 [PATCH bpf-next 0/2] Fix release of struct_ops map Xu Kuohai
@ 2024-11-08 8:26 ` Xu Kuohai
2024-11-08 17:00 ` Alexei Starovoitov
2024-11-08 8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release Xu Kuohai
1 sibling, 1 reply; 8+ messages in thread
From: Xu Kuohai @ 2024-11-08 8:26 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Kui-Feng Lee
From: Xu Kuohai <xukuohai@huawei.com>
The test in the follow-up patch triggers the following kernel panic:
Oops: int3: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G OE 6.12.0-rc4-gd1d187515
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4
RIP: 0010:0xffffffffc0015041
Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc
RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000
RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000
R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0
FS: 00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0
Call Trace:
<TASK>
? show_regs+0x68/0x80
? die+0x3b/0x90
? exc_int3+0xca/0xe0
? asm_exc_int3+0x3e/0x50
run_struct_ops+0x58/0xb0 [bpf_testmod]
param_attr_store+0xa2/0x100
module_attr_store+0x25/0x40
sysfs_kf_write+0x50/0x70
kernfs_fop_write_iter+0x146/0x1f0
vfs_write+0x27e/0x530
ksys_write+0x75/0x100
__x64_sys_write+0x1d/0x30
x64_sys_call+0x1d30/0x1f30
do_syscall_64+0x68/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f051831727f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108
RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f
RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004
RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b
R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30
R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000
</TASK>
This happens because the sleepable prog is still running when the
struct_ops map is released.
To fix it, follow the approach used in bpf_tramp_image_put. First, before
releasing the struct_ops map, wait for an rcu_tasks_trace grace period so
that sleepable progs can finish. Then, wait for an rcu_tasks grace period
so that normal progs and the remaining trampoline insns can finish.
Additionally, switch to call_rcu to remove the synchronous waiting, as
suggested by Alexei and Martin.
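For reference, the resulting asynchronous free path is roughly the
following call flow (the callback names are those used in this patch and
the existing bpf_map_free_* helpers; the bpf_map_put() call site is
paraphrased, not quoted):
  bpf_map_put()
    -> call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp)
       /* rcu_tasks_trace gp: sleepable struct_ops progs have finished */
  bpf_map_free_mult_rcu_gp()
    -> call_rcu_tasks(&map->rcu, bpf_map_free_rcu_gp)
       /* rcu_tasks gp: non-sleepable progs and the last trampoline
        * insns have finished */
  bpf_map_free_rcu_gp()
    -> actual map free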
Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------
kernel/bpf/syscall.c | 7 ++++++-
2 files changed, 25 insertions(+), 19 deletions(-)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index fda3dd2ee984..dd5f9bf12791 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
*/
if (btf_is_module(st_map->btf))
module_put(st_map->st_ops_desc->st_ops->owner);
-
- /* The struct_ops's function may switch to another struct_ops.
- *
- * For example, bpf_tcp_cc_x->init() may switch to
- * another tcp_cc_y by calling
- * setsockopt(TCP_CONGESTION, "tcp_cc_y").
- * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
- * and its refcount may reach 0 which then free its
- * trampoline image while tcp_cc_x is still running.
- *
- * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
- * to finish. bpf-tcp-cc prog is non sleepable.
- * A rcu_tasks gp is to wait for the last few insn
- * in the tramopline image to finish before releasing
- * the trampoline image.
- */
- synchronize_rcu_mult(call_rcu, call_rcu_tasks);
-
__bpf_struct_ops_map_free(map);
}
@@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
mutex_init(&st_map->lock);
bpf_map_init_from_attr(map, attr);
+ /* The struct_ops's function may switch to another struct_ops.
+ *
+ * For example, bpf_tcp_cc_x->init() may switch to
+ * another tcp_cc_y by calling
+ * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+ * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
+ * and its refcount may reach 0 which then free its
+ * trampoline image while tcp_cc_x is still running.
+ *
+ * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp
+ * is to wait for sleepable progs in the map to finish. Then a
+ * rcu_tasks gp is to wait for the normal progs and the last few
+ * insns in the tramopline image to finish before releasing the
+ * trampoline image.
+ *
+ * Also see the comment in function bpf_tramp_image_put.
+ */
+ WRITE_ONCE(map->free_after_mult_rcu_gp, true);
+
return map;
errout_free:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8254b2973157..ae927f087f04 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
{
- if (rcu_trace_implies_rcu_gp())
+ struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
+
+ if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
+ /* See comment in the end of bpf_struct_ops_map_alloc */
+ call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
+ else if (rcu_trace_implies_rcu_gp())
bpf_map_free_rcu_gp(rcu);
else
call_rcu(rcu, bpf_map_free_rcu_gp);
--
2.39.5
* [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
2024-11-08 8:26 [PATCH bpf-next 0/2] Fix release of struct_ops map Xu Kuohai
2024-11-08 8:26 ` [PATCH bpf-next 1/2] bpf: " Xu Kuohai
@ 2024-11-08 8:26 ` Xu Kuohai
2024-11-08 19:39 ` Martin KaFai Lau
1 sibling, 1 reply; 8+ messages in thread
From: Xu Kuohai @ 2024-11-08 8:26 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Yonghong Song, Kui-Feng Lee
From: Xu Kuohai <xukuohai@huawei.com>
Add a test to ensure that the struct_ops map is released safely. Without
the previous fix, this test triggers a kernel panic due to a use-after-free.
The test runs multiple threads concurrently. Some threads create and
destroy struct_ops maps, while others execute progs within the maps.
Each map has two progs: one sleepable and the other non-sleepable.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 78 ++++++---
.../bpf/bpf_testmod/bpf_testmod_kfunc.h | 2 +-
.../bpf/prog_tests/test_struct_ops_module.c | 154 ++++++++++++++++++
.../bpf/progs/struct_ops_map_release.c | 30 ++++
4 files changed, 242 insertions(+), 22 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_map_release.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 987d41af71d2..72d21e8ba5d4 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -1016,6 +1016,11 @@ __bpf_kfunc int bpf_kfunc_st_ops_inc10(struct st_ops_args *args)
return args->a;
}
+__bpf_kfunc void bpf_kfunc_msleep(u32 msecs)
+{
+ msleep(msecs);
+}
+
BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -1056,6 +1061,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_prologue, KF_TRUSTED_ARGS | KF_SLEEPABL
BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE)
BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_pro_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE)
BTF_ID_FLAGS(func, bpf_kfunc_st_ops_inc10, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_kfunc_msleep, KF_SLEEPABLE)
BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf)
@@ -1096,6 +1102,29 @@ static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
.is_valid_access = bpf_testmod_ops_is_valid_access,
};
+static int bpf_testmod_test_1(void)
+{
+ return 0;
+}
+
+static void bpf_testmod_test_2(int a, int b)
+{
+}
+
+static int bpf_testmod_ops__test_maybe_null(int dummy,
+ struct task_struct *task__nullable)
+{
+ return 0;
+}
+
+static struct bpf_testmod_ops __bpf_testmod_ops = {
+ .test_1 = bpf_testmod_test_1,
+ .test_2 = bpf_testmod_test_2,
+ .test_maybe_null = bpf_testmod_ops__test_maybe_null,
+};
+
+static struct bpf_testmod_ops *__bpf_dummy_ops = &__bpf_testmod_ops;
+
static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
{
struct bpf_testmod_ops *ops = kdata;
@@ -1108,20 +1137,14 @@ static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
if (ops->test_2)
ops->test_2(4, ops->data);
- return 0;
-}
-
-static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
-{
-}
+ WRITE_ONCE(__bpf_dummy_ops, ops);
-static int bpf_testmod_test_1(void)
-{
return 0;
}
-static void bpf_testmod_test_2(int a, int b)
+static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
{
+ WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
}
static int bpf_testmod_tramp(int value)
@@ -1129,18 +1152,6 @@ static int bpf_testmod_tramp(int value)
return 0;
}
-static int bpf_testmod_ops__test_maybe_null(int dummy,
- struct task_struct *task__nullable)
-{
- return 0;
-}
-
-static struct bpf_testmod_ops __bpf_testmod_ops = {
- .test_1 = bpf_testmod_test_1,
- .test_2 = bpf_testmod_test_2,
- .test_maybe_null = bpf_testmod_ops__test_maybe_null,
-};
-
struct bpf_struct_ops bpf_bpf_testmod_ops = {
.verifier_ops = &bpf_testmod_verifier_ops,
.init = bpf_testmod_ops_init,
@@ -1375,6 +1386,31 @@ static void bpf_testmod_exit(void)
unregister_bpf_testmod_uprobe();
}
+static int run_struct_ops(const char *val, const struct kernel_param *kp)
+{
+ int ret;
+ unsigned int repeat;
+ struct bpf_testmod_ops *ops;
+
+ ret = kstrtouint(val, 10, &repeat);
+ if (ret)
+ return ret;
+
+ if (repeat > 10000)
+ return -ERANGE;
+
+ while (repeat-- > 0) {
+ ops = READ_ONCE(__bpf_dummy_ops);
+ if (ops->test_1)
+ ops->test_1();
+ if (ops->test_2)
+ ops->test_2(0, 0);
+ }
+
+ return 0;
+}
+
+module_param_call(run_struct_ops, run_struct_ops, NULL, NULL, 0200);
module_init(bpf_testmod_init);
module_exit(bpf_testmod_exit);
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
index b58817938deb..260faebd5633 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -158,5 +158,5 @@ void bpf_kfunc_trusted_vma_test(struct vm_area_struct *ptr) __ksym;
void bpf_kfunc_trusted_task_test(struct task_struct *ptr) __ksym;
void bpf_kfunc_trusted_num_test(int *ptr) __ksym;
void bpf_kfunc_rcu_task_test(struct task_struct *ptr) __ksym;
-
+void bpf_kfunc_msleep(__u32 msecs) __ksym;
#endif /* _BPF_TESTMOD_KFUNC_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
index 75a0dea511b3..df744d51cade 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
@@ -9,6 +9,7 @@
#include "struct_ops_nulled_out_cb.skel.h"
#include "struct_ops_forgotten_cb.skel.h"
#include "struct_ops_detach.skel.h"
+#include "struct_ops_map_release.skel.h"
#include "unsupported_ops.skel.h"
static void check_map_info(struct bpf_map_info *info)
@@ -246,6 +247,157 @@ static void test_struct_ops_forgotten_cb(void)
struct_ops_forgotten_cb__destroy(skel);
}
+struct test_context {
+ pthread_mutex_t mutex;
+ pthread_cond_t cond;
+ int total_threads;
+ int wait_threads;
+ int dead_threads;
+ int repeat;
+ int loop;
+};
+
+static int wait_others(struct test_context *ctx)
+{
+ int ret = 0;
+
+ pthread_mutex_lock(&ctx->mutex);
+
+ if (ctx->dead_threads) {
+ pthread_cond_broadcast(&ctx->cond);
+ pthread_mutex_unlock(&ctx->mutex);
+ return -1;
+ }
+
+ ++ctx->wait_threads;
+ if (ctx->wait_threads >= ctx->total_threads) {
+ pthread_cond_broadcast(&ctx->cond);
+ ctx->wait_threads = 0;
+ } else {
+ pthread_cond_wait(&ctx->cond, &ctx->mutex);
+ if (ctx->dead_threads)
+ ret = -1;
+ }
+
+ pthread_mutex_unlock(&ctx->mutex);
+
+ return ret;
+}
+
+static void mark_dead(struct test_context *ctx)
+{
+ pthread_mutex_lock(&ctx->mutex);
+ ctx->dead_threads++;
+ pthread_cond_broadcast(&ctx->cond);
+ pthread_mutex_unlock(&ctx->mutex);
+}
+
+static int load_release(struct test_context *ctx)
+{
+ int ret = 0;
+ struct bpf_link *link = NULL;
+ struct struct_ops_map_release *skel = NULL;
+
+ skel = struct_ops_map_release__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load")) {
+ ret = -1;
+ mark_dead(ctx);
+ goto out;
+ }
+
+ link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
+ if (!ASSERT_OK_PTR(link, "attach_struct_ops")) {
+ ret = -1;
+ mark_dead(ctx);
+ goto out;
+ }
+
+ if (wait_others(ctx)) {
+ ret = -1;
+ goto out;
+ }
+
+out:
+ bpf_link__destroy(link);
+ struct_ops_map_release__destroy(skel);
+ return ret;
+}
+
+static void *thread_load_release(void *arg)
+{
+ struct test_context *ctx = (struct test_context *)arg;
+
+ for (int i = 0; i < ctx->loop; i++)
+ if (load_release(ctx))
+ break;
+ return NULL;
+}
+
+static void *thread_run_prog(void *arg)
+{
+ int fd;
+ int len;
+ char buf[8];
+ struct test_context *ctx = (struct test_context *)arg;
+
+ fd = open("/sys/module/bpf_testmod/parameters/run_struct_ops", O_WRONLY);
+ if (!ASSERT_OK_FD(fd, "open run_struct_ops for write")) {
+ mark_dead(ctx);
+ return NULL;
+ }
+
+ len = snprintf(buf, sizeof(buf), "%d", ctx->repeat);
+ if (!ASSERT_GT(len, 0, "snprintf repeat number")) {
+ mark_dead(ctx);
+ goto out;
+ }
+
+ for (int i = 0; i < ctx->loop; i++) {
+ if (wait_others(ctx))
+ goto out;
+ if (!ASSERT_EQ(write(fd, buf, len), len, "write file")) {
+ mark_dead(ctx);
+ goto out;
+ }
+ }
+
+out:
+ close(fd);
+ return NULL;
+}
+
+#define NR_REL_THREAD 2
+#define NR_RUN_THREAD 8
+#define NR_THREAD (NR_REL_THREAD + NR_RUN_THREAD)
+#define NR_REPEAT 4
+#define NR_LOOP 5
+
+static void test_struct_ops_map_release(void)
+{
+ int i, j;
+ pthread_t t[NR_THREAD];
+ struct test_context ctx = {
+ .loop = NR_LOOP,
+ .repeat = NR_REPEAT,
+ .total_threads = NR_THREAD,
+ .wait_threads = 0,
+ .dead_threads = 0,
+ };
+
+ pthread_mutex_init(&ctx.mutex, NULL);
+ pthread_cond_init(&ctx.cond, NULL);
+
+ j = 0;
+ for (i = 0; i < NR_REL_THREAD; i++)
+ pthread_create(&t[j++], NULL, thread_load_release, &ctx);
+
+ for (i = 0; i < NR_RUN_THREAD; i++)
+ pthread_create(&t[j++], NULL, thread_run_prog, &ctx);
+
+ for (i = 0; i < NR_THREAD; i++)
+ pthread_join(t[i], NULL);
+}
+
/* Detach a link from a user space program */
static void test_detach_link(void)
{
@@ -310,6 +462,8 @@ void serial_test_struct_ops_module(void)
test_struct_ops_nulled_out_cb();
if (test__start_subtest("struct_ops_forgotten_cb"))
test_struct_ops_forgotten_cb();
+ if (test__start_subtest("struct_ops_map_release"))
+ test_struct_ops_map_release();
if (test__start_subtest("test_detach_link"))
test_detach_link();
RUN_TESTS(unsupported_ops);
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_map_release.c b/tools/testing/selftests/bpf/progs/struct_ops_map_release.c
new file mode 100644
index 000000000000..78aa5e1875b6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_map_release.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024 Huawei Technologies Co., Ltd */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "../bpf_testmod/bpf_testmod.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops.s/test_1")
+int BPF_PROG(test_1_prog)
+{
+ bpf_kfunc_msleep(100);
+ return 0;
+}
+
+SEC("struct_ops/test_2")
+int BPF_PROG(test_2_prog, int a, int b)
+{
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops testmod_ops = {
+ .test_1 = (void *)test_1_prog,
+ .test_2 = (void *)test_2_prog
+};
--
2.39.5
* Re: [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
2024-11-08 8:26 ` [PATCH bpf-next 1/2] bpf: " Xu Kuohai
@ 2024-11-08 17:00 ` Alexei Starovoitov
0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2024-11-08 17:00 UTC (permalink / raw)
To: Xu Kuohai
Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Yonghong Song, Kui-Feng Lee
On Fri, Nov 8, 2024 at 12:15 AM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@huawei.com>
>
> The test in the follow-up patch triggers the following kernel panic:
>
> Oops: int3: 0000 [#1] PREEMPT SMP PTI
> CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G OE 6.12.0-rc4-gd1d187515
> Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4
> RIP: 0010:0xffffffffc0015041
> Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc
> RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000
> RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000
> R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0
> FS: 00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0
> Call Trace:
> <TASK>
> ? show_regs+0x68/0x80
> ? die+0x3b/0x90
> ? exc_int3+0xca/0xe0
> ? asm_exc_int3+0x3e/0x50
> run_struct_ops+0x58/0xb0 [bpf_testmod]
> param_attr_store+0xa2/0x100
> module_attr_store+0x25/0x40
> sysfs_kf_write+0x50/0x70
> kernfs_fop_write_iter+0x146/0x1f0
> vfs_write+0x27e/0x530
> ksys_write+0x75/0x100
> __x64_sys_write+0x1d/0x30
> x64_sys_call+0x1d30/0x1f30
> do_syscall_64+0x68/0x140
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f051831727f
> Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108
> RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f
> RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004
> RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b
> R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30
> R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000
> </TASK>
>
> This happens because the sleepable prog is still running when the
> struct_ops map is released.
>
> To fix it, follow the approach used in bpf_tramp_image_put. First, before
> releasing the struct_ops map, wait for an rcu_tasks_trace grace period so
> that sleepable progs can finish. Then, wait for an rcu_tasks grace period
> so that normal progs and the remaining trampoline insns can finish.
>
> Additionally, switch to call_rcu to remove the synchronous waiting, as
> suggested by Alexei and Martin.
>
> Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> ---
> kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------
> kernel/bpf/syscall.c | 7 ++++++-
> 2 files changed, 25 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index fda3dd2ee984..dd5f9bf12791 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
> */
> if (btf_is_module(st_map->btf))
> module_put(st_map->st_ops_desc->st_ops->owner);
> -
> - /* The struct_ops's function may switch to another struct_ops.
> - *
> - * For example, bpf_tcp_cc_x->init() may switch to
> - * another tcp_cc_y by calling
> - * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> - * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
> - * and its refcount may reach 0 which then free its
> - * trampoline image while tcp_cc_x is still running.
> - *
> - * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
> - * to finish. bpf-tcp-cc prog is non sleepable.
> - * A rcu_tasks gp is to wait for the last few insn
> - * in the tramopline image to finish before releasing
> - * the trampoline image.
> - */
> - synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> -
> __bpf_struct_ops_map_free(map);
> }
>
> @@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
> mutex_init(&st_map->lock);
> bpf_map_init_from_attr(map, attr);
>
> + /* The struct_ops's function may switch to another struct_ops.
> + *
> + * For example, bpf_tcp_cc_x->init() may switch to
> + * another tcp_cc_y by calling
> + * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> + * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
> + * and its refcount may reach 0 which then free its
> + * trampoline image while tcp_cc_x is still running.
> + *
> + * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp
> + * is to wait for sleepable progs in the map to finish. Then a
> + * rcu_tasks gp is to wait for the normal progs and the last few
> + * insns in the tramopline image to finish before releasing the
> + * trampoline image.
> + *
> + * Also see the comment in function bpf_tramp_image_put.
> + */
> + WRITE_ONCE(map->free_after_mult_rcu_gp, true);
> +
> return map;
>
> errout_free:
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8254b2973157..ae927f087f04 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
>
> static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
> {
> - if (rcu_trace_implies_rcu_gp())
> + struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
> +
> + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
> + /* See comment in the end of bpf_struct_ops_map_alloc */
The fix makes sense, but pls copy paste a sentence here instead
of the above comment. Like:
"
rcu_tasks gp is necessary to wait for struct_ops bpf trampoline to finish.
Unlike all other bpf trampolines struct_ops trampoline is not
protected by percpu_ref.
"
> + call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
> + else if (rcu_trace_implies_rcu_gp())
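Put together, the hunk with the suggested comment would read roughly as
follows (illustrative only; the comment wording is taken from the reply
above and the code from the patch):

	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
		/* rcu_tasks gp is necessary to wait for struct_ops bpf
		 * trampoline to finish. Unlike all other bpf trampolines,
		 * struct_ops trampoline is not protected by percpu_ref.
		 */
		call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
	else if (rcu_trace_implies_rcu_gp())
		bpf_map_free_rcu_gp(rcu);
	else
		call_rcu(rcu, bpf_map_free_rcu_gp);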
pw-bot: cr
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
2024-11-08 8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release Xu Kuohai
@ 2024-11-08 19:39 ` Martin KaFai Lau
2024-11-09 8:40 ` Xu Kuohai
0 siblings, 1 reply; 8+ messages in thread
From: Martin KaFai Lau @ 2024-11-08 19:39 UTC (permalink / raw)
To: Xu Kuohai
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Yonghong Song, Kui-Feng Lee
On 11/8/24 12:26 AM, Xu Kuohai wrote:
> -static void bpf_testmod_test_2(int a, int b)
> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
> {
> + WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
> }
[ ... ]
> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
> +{
> + int ret;
> + unsigned int repeat;
> + struct bpf_testmod_ops *ops;
> +
> + ret = kstrtouint(val, 10, &repeat);
> + if (ret)
> + return ret;
> +
> + if (repeat > 10000)
> + return -ERANGE;
> +
> + while (repeat-- > 0) {
> + ops = READ_ONCE(__bpf_dummy_ops);
I don't think it is the usual bpf_struct_ops implementation which only uses
READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a
refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use
kthread_flush_work() to wait for all ops calls to finish. Meaning I don't think
the current bpf_struct_ops unreg implementation will run into this issue for
sleepable ops.
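For illustration, a minimal sketch of the kind of subsystem-side protection
meant here, loosely modeled on the SRCU approach mentioned above; all names
(dummy_srcu, dummy_ops, dummy_reg, dummy_unreg, dummy_call_test_1) are
hypothetical and not taken from the kernel or this patch:

	DEFINE_STATIC_SRCU(dummy_srcu);
	static struct bpf_testmod_ops __rcu *dummy_ops;

	static int dummy_reg(void *kdata, struct bpf_link *link)
	{
		rcu_assign_pointer(dummy_ops, kdata);
		return 0;
	}

	static void dummy_call_test_1(void)
	{
		struct bpf_testmod_ops *ops;
		int idx;

		idx = srcu_read_lock(&dummy_srcu);
		ops = srcu_dereference(dummy_ops, &dummy_srcu);
		/* srcu read-side sections may sleep, so a sleepable
		 * ops->test_1() is fine here
		 */
		if (ops && ops->test_1)
			ops->test_1();
		srcu_read_unlock(&dummy_srcu, idx);
	}

	static void dummy_unreg(void *kdata, struct bpf_link *link)
	{
		rcu_assign_pointer(dummy_ops, NULL);
		/* wait for all in-flight calls, including sleepable ones,
		 * before the map and its trampoline can go away
		 */
		synchronize_srcu(&dummy_srcu);
	}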
The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for
the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its own
refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should have
been introduced instead to return a new tcp-cc-ops to be used.
> + if (ops->test_1)
> + ops->test_1();
> + if (ops->test_2)
> + ops->test_2(0, 0);
> + }
> +
> + return 0;
> +}
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
2024-11-08 19:39 ` Martin KaFai Lau
@ 2024-11-09 8:40 ` Xu Kuohai
2024-11-11 21:30 ` Martin KaFai Lau
0 siblings, 1 reply; 8+ messages in thread
From: Xu Kuohai @ 2024-11-09 8:40 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Yonghong Song, Kui-Feng Lee
On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>> -static void bpf_testmod_test_2(int a, int b)
>> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>> {
>> + WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>> }
>
> [ ... ]
>
>> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
>> +{
>> + int ret;
>> + unsigned int repeat;
>> + struct bpf_testmod_ops *ops;
>> +
>> + ret = kstrtouint(val, 10, &repeat);
>> + if (ret)
>> + return ret;
>> +
>> + if (repeat > 10000)
>> + return -ERANGE;
>> +
>> + while (repeat-- > 0) {
>> + ops = READ_ONCE(__bpf_dummy_ops);
>
> I don't think it is the usual bpf_struct_ops implementation which only uses READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use kthread_flush_work() to wait for
> all ops calls to finish. Meaning I don't think the current bpf_struct_ops unreg implementation will run into this issue for sleepable ops.
>
Thanks for the explanation.
Are you saying that it's not the struct_ops framework's
responsibility to ensure the struct_ops map is not
released while it may be still in use? And the "bug" in
this series should be "fixed" in the test, namely this
patch?
> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should have been introduced
> instead to return a new tcp-cc-ops to be used.
Not quite clear, but from the description, it seems that
the synchronize_rcu_mult(call_rcu, call_rcu_tasks) could
be just removed in some way, no need to do a cleanup to
switch it to call_rcu.
>
>> + if (ops->test_1)
>> + ops->test_1();
>> + if (ops->test_2)
>> + ops->test_2(0, 0);
>> + }
>> +
>> + return 0;
>> +}
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
2024-11-09 8:40 ` Xu Kuohai
@ 2024-11-11 21:30 ` Martin KaFai Lau
2024-11-12 12:22 ` Xu Kuohai
0 siblings, 1 reply; 8+ messages in thread
From: Martin KaFai Lau @ 2024-11-11 21:30 UTC (permalink / raw)
To: Xu Kuohai
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Yonghong Song, Kui-Feng Lee
On 11/9/24 12:40 AM, Xu Kuohai wrote:
> On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
>> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>>> -static void bpf_testmod_test_2(int a, int b)
>>> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>>> {
>>> + WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>>> }
>>
>> [ ... ]
>>
>>> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
>>> +{
>>> + int ret;
>>> + unsigned int repeat;
>>> + struct bpf_testmod_ops *ops;
>>> +
>>> + ret = kstrtouint(val, 10, &repeat);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + if (repeat > 10000)
>>> + return -ERANGE;
>>> +
>>> + while (repeat-- > 0) {
>>> + ops = READ_ONCE(__bpf_dummy_ops);
>>
>> I don't think it is the usual bpf_struct_ops implementation which only uses
>> READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a
>> refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use
>> kthread_flush_work() to wait for all ops calls to finish. Meaning I don't
>> think the current bpf_struct_ops unreg implementation will run into this issue
>> for sleepable ops.
>>
>
> Thanks for the explanation.
>
> Are you saying that it's not the struct_ops framework's
> responsibility to ensure the struct_ops map is not
> released while it may be still in use? And the "bug" in
> this series should be "fixed" in the test, namely this
> patch?
Yeah, it is what I was trying to say. I don't think there is anything to fix. Think
about extending a subsystem by a kernel module. The subsystem will also do the
needed protection itself during the unreg process. There is already a
bpf_try_module_get() to help the subsystem.
>> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for
>> the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its
>> own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should
>> have been introduced instead to return a new tcp-cc-ops to be used.
>
> Not quite clear, but from the description, it seems that
> the synchronize_rcu_mult(call_rcu, call_rcu_tasks) could
This synchronize_rcu_mult is only needed for the tcp_congestion_ops
(bpf_tcp_ca.c). Maybe it is cleaner to just make a special case for
"tcp_congestion_ops" in st_ops->name in map_alloc and only set
free_after_mult_rcu_gp to TRUE for this one case, so it won't slow down
the freeing of other struct_ops maps.
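For illustration, that special case might look roughly like this in
bpf_struct_ops_map_alloc (untested sketch; the strcmp check, its exact
placement, and the assumption that an st_ops_desc local is in scope are
mine, while free_after_mult_rcu_gp and the "tcp_congestion_ops" name come
from the patch and the suggestion above):

	/* Only tcp-cc ops may drop their own refcnt from inside an ops
	 * call, so only tcp_congestion_ops maps need the multi-gp wait
	 * on free.
	 */
	if (!strcmp(st_ops_desc->st_ops->name, "tcp_congestion_ops"))
		WRITE_ONCE(map->free_after_mult_rcu_gp, true);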
imo, the test in this patch is also not needed in its current form, since it
is not how kernel subsystems implement unreg in struct_ops.
> be just removed in some way, no need to do a cleanup to
> switch it to call_rcu.
>
>>
>>> + if (ops->test_1)
>>> + ops->test_1();
>>> + if (ops->test_2)
>>> + ops->test_2(0, 0);
>>> + }
>>> +
>>> + return 0;
>>> +}
>
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
2024-11-11 21:30 ` Martin KaFai Lau
@ 2024-11-12 12:22 ` Xu Kuohai
0 siblings, 0 replies; 8+ messages in thread
From: Xu Kuohai @ 2024-11-12 12:22 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Yonghong Song, Kui-Feng Lee
On 11/12/2024 5:30 AM, Martin KaFai Lau wrote:
> On 11/9/24 12:40 AM, Xu Kuohai wrote:
>> On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
>>> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>>>> -static void bpf_testmod_test_2(int a, int b)
>>>> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>>>> {
>>>> + WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>>>> }
>>>
>>> [ ... ]
>>>
>>>> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
>>>> +{
>>>> + int ret;
>>>> + unsigned int repeat;
>>>> + struct bpf_testmod_ops *ops;
>>>> +
>>>> + ret = kstrtouint(val, 10, &repeat);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + if (repeat > 10000)
>>>> + return -ERANGE;
>>>> +
>>>> + while (repeat-- > 0) {
>>>> + ops = READ_ONCE(__bpf_dummy_ops);
>>>
>>> I don't think it is the usual bpf_struct_ops implementation which only uses READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use kthread_flush_work() to wait
>>> for all ops calls to finish. Meaning I don't think the current bpf_struct_ops unreg implementation will run into this issue for sleepable ops.
>>>
>>
>> Thanks for the explanation.
>>
>> Are you saying that it's not the struct_ops framework's
>> responsibility to ensure the struct_ops map is not
>> released while it may be still in use? And the "bug" in
>> this series should be "fixed" in the test, namely this
>> patch?
>
> Yeah, it is what I was trying to say. I don't think there is anything to fix. Think about extending a subsystem by a kernel module. The subsystem will also do the needed protection itself during the unreg process. There is already a bpf_try_module_get() to
> help the subsystem.
>
Got it
>>> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should have been introduced
>>> instead to return a new tcp-cc-ops to be used.
>>
>> Not quite clear, but from the description, it seems that
>> the synchronize_rcu_mult(call_rcu, call_rcu_tasks) could
>
> This synchronize_rcu_mult is only needed for the tcp_congestion_ops (bpf_tcp_ca.c). Maybe it is cleaner to just make a special case for "tcp_congestion_ops" in st_ops->name in map_alloc and only set free_after_mult_rcu_gp to TRUE for this one case, so it
> won't slow down the freeing of other struct_ops maps.
>
OK, will give it a try, thanks.
> imo, the test in this patch is also not needed in its current form, since it is not how kernel subsystems implement unreg in struct_ops.
>
>> be just removed in some way, no need to do a cleanup to
>> switch it to call_rcu.
>>
>>>
>>>> + if (ops->test_1)
>>>> + ops->test_1();
>>>> + if (ops->test_2)
>>>> + ops->test_2(0, 0);
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>