* [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog
@ 2022-09-23 22:44 Martin KaFai Lau
2022-09-23 22:44 ` [PATCH v2 bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline Martin KaFai Lau
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 22:44 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
Paolo Abeni
From: Martin KaFai Lau <martin.lau@kernel.org>
The struct_ops is sharing the tracing trampoline's enter/exit
functions, which track prog->active to avoid recursion. It turns
out a struct_ops bpf prog can hit this prog->active check and be
unnecessarily skipped. eg. The '.ssthresh' may run in_task() and
then be interrupted by a softirq that runs the same '.ssthresh'.
The kernel does not call the tcp-cc's ops in a recursive way,
so this set removes the recursion check for struct_ops progs.
v2:
v1 [0] turned into a long discussion on a few cases and also
whether it needs to follow the bpf_run_ctx chain if there is a
tracing bpf_run_ctx (kprobe/trace/trampoline) running in between.
It is a good signal that it is not obvious enough to reason
about and needs a tradeoff for a more straightforward approach.
This revision uses one bit out of an existing 1-byte hole
in the tcp_sock. It is in Patch 4.
[0]: https://lore.kernel.org/bpf/20220922225616.3054840-1-kafai@fb.com/T/#md98d40ac5ec295fdadef476c227a3401b2b6b911
Martin KaFai Lau (5):
bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another
function
bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur
itself
selftests/bpf: Check -EBUSY for the recurred
bpf_setsockopt(TCP_CONGESTION)
arch/x86/net/bpf_jit_comp.c | 3 +
include/linux/bpf.h | 4 ++
include/linux/tcp.h | 6 ++
kernel/bpf/trampoline.c | 23 ++++++
net/core/filter.c | 70 ++++++++++++++-----
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 4 ++
tools/testing/selftests/bpf/progs/bpf_dctcp.c | 25 ++++---
7 files changed, 111 insertions(+), 24 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
2022-09-23 22:44 [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
@ 2022-09-23 22:44 ` Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() Martin KaFai Lau
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 22:44 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
Paolo Abeni, Martin KaFai Lau
From: Martin KaFai Lau <martin.lau@kernel.org>
The struct_ops prog allows using bpf to implement the functions in
a struct (eg. a kernel module). The current usage is to implement
tcp congestion control. The kernel does not call the tcp-cc's ops (ie.
the bpf prog) in a recursive way.
The struct_ops is sharing the tracing trampoline's enter/exit
functions, which track prog->active to avoid recursion. It is
needed for tracing progs. However, it turns out a struct_ops
bpf prog can hit this prog->active check and be unnecessarily
skipped. eg. The '.ssthresh' may run in_task() and then be
interrupted by a softirq that runs the same '.ssthresh'.
Skipping the '.ssthresh' run ends up returning a random value
to the caller.
This patch adds __bpf_prog_{enter,exit}_struct_ops for the
struct_ops trampoline. They do not track prog->active
to detect recursion.
One exception is when the tcp_congestion's '.init' ops does
bpf_setsockopt(TCP_CONGESTION) and then recurses into the same
'.init' ops. This will be addressed in the following patches.
Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
arch/x86/net/bpf_jit_comp.c | 3 +++
include/linux/bpf.h | 4 ++++
kernel/bpf/trampoline.c | 23 +++++++++++++++++++++++
3 files changed, 30 insertions(+)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ae89f4143eb4..58a131dec954 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1836,6 +1836,9 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog,
if (p->aux->sleepable) {
enter = __bpf_prog_enter_sleepable;
exit = __bpf_prog_exit_sleepable;
+ } else if (p->type == BPF_PROG_TYPE_STRUCT_OPS) {
+ enter = __bpf_prog_enter_struct_ops;
+ exit = __bpf_prog_exit_struct_ops;
} else if (p->expected_attach_type == BPF_LSM_CGROUP) {
enter = __bpf_prog_enter_lsm_cgroup;
exit = __bpf_prog_exit_lsm_cgroup;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edd43edb27d6..6fdbc1398b8a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -864,6 +864,10 @@ u64 notrace __bpf_prog_enter_lsm_cgroup(struct bpf_prog *prog,
struct bpf_tramp_run_ctx *run_ctx);
void notrace __bpf_prog_exit_lsm_cgroup(struct bpf_prog *prog, u64 start,
struct bpf_tramp_run_ctx *run_ctx);
+u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
+ struct bpf_tramp_run_ctx *run_ctx);
+void notrace __bpf_prog_exit_struct_ops(struct bpf_prog *prog, u64 start,
+ struct bpf_tramp_run_ctx *run_ctx);
void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr);
void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 41b67eb83ab3..e6551e4a6064 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -976,6 +976,29 @@ void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start,
rcu_read_unlock_trace();
}
+u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
+ struct bpf_tramp_run_ctx *run_ctx)
+ __acquires(RCU)
+{
+ rcu_read_lock();
+ migrate_disable();
+
+ run_ctx->saved_run_ctx = bpf_set_run_ctx(&run_ctx->run_ctx);
+
+ return bpf_prog_start_time();
+}
+
+void notrace __bpf_prog_exit_struct_ops(struct bpf_prog *prog, u64 start,
+ struct bpf_tramp_run_ctx *run_ctx)
+ __releases(RCU)
+{
+ bpf_reset_run_ctx(run_ctx->saved_run_ctx);
+
+ update_prog_stats(prog, start);
+ migrate_enable();
+ rcu_read_unlock();
+}
+
void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr)
{
percpu_ref_get(&tr->pcref);
--
2.30.2
* [PATCH v2 bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
2022-09-23 22:44 [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
2022-09-23 22:44 ` [PATCH v2 bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline Martin KaFai Lau
@ 2022-09-23 22:45 ` Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 3/5] bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function Martin KaFai Lau
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 22:45 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
Paolo Abeni, Martin KaFai Lau
From: Martin KaFai Lau <martin.lau@kernel.org>
The check on the tcp-cc "cdg" is done in bpf_sk_setsockopt, which is
used by the bpf_tcp_ca, bpf_lsm, cg_sockopt, and tcp_iter hooks.
However, it is not done for cg sock_addr, cg sockops, and some of
the bpf_lsm_cgroup hooks.
The tcp-cc "cdg" should have very limited usage. This patch
moves the "cdg" check to the common sol_tcp_sockopt() so that all
hooks have a consistent behavior. The motivation to make
this check consistent now is that a later patch will
refactor the bpf_setsockopt(TCP_CONGESTION) handling into another
function, so it is better to take this chance to refactor this
piece also.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
net/core/filter.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 2fd9449026aa..f4cea3ff994a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5127,6 +5127,13 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
case TCP_CONGESTION:
if (*optlen < 2)
return -EINVAL;
+ /* "cdg" is the only cc that alloc a ptr
+ * in inet_csk_ca area. The bpf-tcp-cc may
+ * overwrite this ptr after switching to cdg.
+ */
+ if (!getopt && *optlen >= sizeof("cdg") - 1 &&
+ !strncmp("cdg", optval, *optlen))
+ return -ENOTSUPP;
break;
case TCP_SAVED_SYN:
if (*optlen < 1)
@@ -5285,12 +5292,6 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
BPF_CALL_5(bpf_sk_setsockopt, struct sock *, sk, int, level,
int, optname, char *, optval, int, optlen)
{
- if (level == SOL_TCP && optname == TCP_CONGESTION) {
- if (optlen >= sizeof("cdg") - 1 &&
- !strncmp("cdg", optval, optlen))
- return -ENOTSUPP;
- }
-
return _bpf_setsockopt(sk, level, optname, optval, optlen);
}
--
2.30.2
* [PATCH v2 bpf-next 3/5] bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function
2022-09-23 22:44 [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
2022-09-23 22:44 ` [PATCH v2 bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() Martin KaFai Lau
@ 2022-09-23 22:45 ` Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) Martin KaFai Lau
4 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 22:45 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
Paolo Abeni, Martin KaFai Lau
From: Martin KaFai Lau <martin.lau@kernel.org>
This patch moves the bpf_setsockopt(TCP_CONGESTION) logic into
another function. The next patch will add extra logic to avoid
recursion, and this refactoring will make that patch easier to follow.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
net/core/filter.c | 45 ++++++++++++++++++++++++++++-----------------
1 file changed, 28 insertions(+), 17 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index f4cea3ff994a..96f2f7a65e65 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5102,6 +5102,33 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
return 0;
}
+static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
+ int *optlen, bool getopt)
+{
+ if (*optlen < 2)
+ return -EINVAL;
+
+ if (getopt) {
+ if (!inet_csk(sk)->icsk_ca_ops)
+ return -EINVAL;
+ /* BPF expects NULL-terminated tcp-cc string */
+ optval[--(*optlen)] = '\0';
+ return do_tcp_getsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ KERNEL_SOCKPTR(optval),
+ KERNEL_SOCKPTR(optlen));
+ }
+
+ /* "cdg" is the only cc that alloc a ptr
+ * in inet_csk_ca area. The bpf-tcp-cc may
+ * overwrite this ptr after switching to cdg.
+ */
+ if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
+ return -ENOTSUPP;
+
+ return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ KERNEL_SOCKPTR(optval), *optlen);
+}
+
static int sol_tcp_sockopt(struct sock *sk, int optname,
char *optval, int *optlen,
bool getopt)
@@ -5125,16 +5152,7 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
return -EINVAL;
break;
case TCP_CONGESTION:
- if (*optlen < 2)
- return -EINVAL;
- /* "cdg" is the only cc that alloc a ptr
- * in inet_csk_ca area. The bpf-tcp-cc may
- * overwrite this ptr after switching to cdg.
- */
- if (!getopt && *optlen >= sizeof("cdg") - 1 &&
- !strncmp("cdg", optval, *optlen))
- return -ENOTSUPP;
- break;
+ return sol_tcp_sockopt_congestion(sk, optval, optlen, getopt);
case TCP_SAVED_SYN:
if (*optlen < 1)
return -EINVAL;
@@ -5159,13 +5177,6 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
return 0;
}
- if (optname == TCP_CONGESTION) {
- if (!inet_csk(sk)->icsk_ca_ops)
- return -EINVAL;
- /* BPF expects NULL-terminated tcp-cc string */
- optval[--(*optlen)] = '\0';
- }
-
return do_tcp_getsockopt(sk, SOL_TCP, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
--
2.30.2
* [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-23 22:44 [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
` (2 preceding siblings ...)
2022-09-23 22:45 ` [PATCH v2 bpf-next 3/5] bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function Martin KaFai Lau
@ 2022-09-23 22:45 ` Martin KaFai Lau
2022-09-27 3:34 ` Alexei Starovoitov
2022-09-29 2:04 ` Eric Dumazet
2022-09-23 22:45 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) Martin KaFai Lau
4 siblings, 2 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 22:45 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
Paolo Abeni, Martin KaFai Lau
From: Martin KaFai Lau <martin.lau@kernel.org>
When a bad bpf prog '.init' calls
bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
.init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
... => .init => bpf_setsockopt(tcp_cc).
It was prevented by the prog->active counter before, but the prog->active
detection cannot be used in struct_ops, as explained in the earlier
patch of this set.
In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
in order to break the loop. This is done by using a bit of
an existing 1-byte hole in tcp_sock to check if there is an
on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
Note that this essentially means only the first '.init' can
call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. when the
peer does not support ECN); the second '.init' cannot fall back to
another cc. This applies even when the second
bpf_setsockopt(TCP_CONGESTION) would not cause a loop.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
include/linux/tcp.h | 6 ++++++
net/core/filter.c | 28 +++++++++++++++++++++++++++-
2 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a9fbe22732c3..3bdf687e2fb3 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -388,6 +388,12 @@ struct tcp_sock {
u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs
* values defined in uapi/linux/tcp.h
*/
+ u8 bpf_chg_cc_inprogress:1; /* In the middle of
+ * bpf_setsockopt(TCP_CONGESTION),
+ * it is to avoid the bpf_tcp_cc->init()
+ * to recur itself by calling
+ * bpf_setsockopt(TCP_CONGESTION, "itself").
+ */
#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
#else
#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
diff --git a/net/core/filter.c b/net/core/filter.c
index 96f2f7a65e65..ac4c45c02da5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
int *optlen, bool getopt)
{
+ struct tcp_sock *tp;
+ int ret;
+
if (*optlen < 2)
return -EINVAL;
@@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
return -ENOTSUPP;
- return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ /* It stops this looping
+ *
+ * .init => bpf_setsockopt(tcp_cc) => .init =>
+ * bpf_setsockopt(tcp_cc)" => .init => ....
+ *
+ * The second bpf_setsockopt(tcp_cc) is not allowed
+ * in order to break the loop when both .init
+ * are the same bpf prog.
+ *
+ * This applies even the second bpf_setsockopt(tcp_cc)
+ * does not cause a loop. This limits only the first
+ * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
+ * pick a fallback cc (eg. peer does not support ECN)
+ * and the second '.init' cannot fallback to
+ * another.
+ */
+ tp = tcp_sk(sk);
+ if (tp->bpf_chg_cc_inprogress)
+ return -EBUSY;
+
+ tp->bpf_chg_cc_inprogress = 1;
+ ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
KERNEL_SOCKPTR(optval), *optlen);
+ tp->bpf_chg_cc_inprogress = 0;
+ return ret;
}
static int sol_tcp_sockopt(struct sock *sk, int optname,
--
2.30.2
* [PATCH v2 bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
2022-09-23 22:44 [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
` (3 preceding siblings ...)
2022-09-23 22:45 ` [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
@ 2022-09-23 22:45 ` Martin KaFai Lau
4 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-23 22:45 UTC (permalink / raw)
To: bpf, netdev
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Eric Dumazet, Jakub Kicinski, kernel-team,
Paolo Abeni, Martin KaFai Lau
From: Martin KaFai Lau <martin.lau@kernel.org>
This patch changes the bpf_dctcp test to ensure the recursively
called bpf_setsockopt(TCP_CONGESTION) returns -EBUSY.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 4 +++
tools/testing/selftests/bpf/progs/bpf_dctcp.c | 25 +++++++++++++------
2 files changed, 21 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index 2959a52ced06..e980188d4124 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -290,6 +290,10 @@ static void test_dctcp_fallback(void)
goto done;
ASSERT_STREQ(dctcp_skel->bss->cc_res, "cubic", "cc_res");
ASSERT_EQ(dctcp_skel->bss->tcp_cdg_res, -ENOTSUPP, "tcp_cdg_res");
+ /* All setsockopt(TCP_CONGESTION) in the recurred
+ * bpf_dctcp->init() should fail with -EBUSY.
+ */
+ ASSERT_EQ(dctcp_skel->bss->ebusy_cnt, 3, "ebusy_cnt");
err = getsockopt(srv_fd, SOL_TCP, TCP_CONGESTION, srv_cc, &cc_len);
if (!ASSERT_OK(err, "getsockopt(srv_fd, TCP_CONGESTION)"))
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
index 9573be6122be..460682759aed 100644
--- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -11,6 +11,7 @@
#include <linux/types.h>
#include <linux/stddef.h>
#include <linux/tcp.h>
+#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_tcp_helpers.h"
@@ -23,6 +24,7 @@ const char tcp_cdg[] = "cdg";
char cc_res[TCP_CA_NAME_MAX];
int tcp_cdg_res = 0;
int stg_result = 0;
+int ebusy_cnt = 0;
struct {
__uint(type, BPF_MAP_TYPE_SK_STORAGE);
@@ -64,16 +66,23 @@ void BPF_PROG(dctcp_init, struct sock *sk)
if (!(tp->ecn_flags & TCP_ECN_OK) && fallback[0]) {
/* Switch to fallback */
- bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
- (void *)fallback, sizeof(fallback));
- /* Switch back to myself which the bpf trampoline
- * stopped calling dctcp_init recursively.
+ if (bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)fallback, sizeof(fallback)) == -EBUSY)
+ ebusy_cnt++;
+
+ /* Switch back to myself and the recurred dctcp_init()
+ * will get -EBUSY for all bpf_setsockopt(TCP_CONGESTION),
+ * except the last "cdg" one.
*/
- bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
- (void *)bpf_dctcp, sizeof(bpf_dctcp));
+ if (bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)bpf_dctcp, sizeof(bpf_dctcp)) == -EBUSY)
+ ebusy_cnt++;
+
/* Switch back to fallback */
- bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
- (void *)fallback, sizeof(fallback));
+ if (bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
+ (void *)fallback, sizeof(fallback)) == -EBUSY)
+ ebusy_cnt++;
+
/* Expecting -ENOTSUPP for tcp_cdg_res */
tcp_cdg_res = bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
(void *)tcp_cdg, sizeof(tcp_cdg));
--
2.30.2
* Re: [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-23 22:45 ` [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
@ 2022-09-27 3:34 ` Alexei Starovoitov
2022-09-29 1:12 ` Alexei Starovoitov
2022-09-29 2:04 ` Eric Dumazet
1 sibling, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2022-09-27 3:34 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, Network Development, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
Kernel Team, Paolo Abeni, Martin KaFai Lau
On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> From: Martin KaFai Lau <martin.lau@kernel.org>
>
> When a bad bpf prog '.init' calls
> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>
> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> ... => .init => bpf_setsockopt(tcp_cc).
>
> It was prevented by the prog->active counter before but the prog->active
> detection cannot be used in struct_ops as explained in the earlier
> patch of the set.
>
> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> in order to break the loop. This is done by using a bit of
> an existing 1 byte hole in tcp_sock to check if there is
> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
>
> Note that this essentially limits only the first '.init' can
> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> does not support ECN) and the second '.init' cannot fallback to
> another cc. This applies even the second
> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> ---
> include/linux/tcp.h | 6 ++++++
> net/core/filter.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index a9fbe22732c3..3bdf687e2fb3 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -388,6 +388,12 @@ struct tcp_sock {
> u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs
> * values defined in uapi/linux/tcp.h
> */
> + u8 bpf_chg_cc_inprogress:1; /* In the middle of
> + * bpf_setsockopt(TCP_CONGESTION),
> + * it is to avoid the bpf_tcp_cc->init()
> + * to recur itself by calling
> + * bpf_setsockopt(TCP_CONGESTION, "itself").
> + */
> #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
> #else
> #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 96f2f7a65e65..ac4c45c02da5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> int *optlen, bool getopt)
> {
> + struct tcp_sock *tp;
> + int ret;
> +
> if (*optlen < 2)
> return -EINVAL;
>
> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
> return -ENOTSUPP;
>
> - return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + /* It stops this looping
> + *
> + * .init => bpf_setsockopt(tcp_cc) => .init =>
> + * bpf_setsockopt(tcp_cc)" => .init => ....
> + *
> + * The second bpf_setsockopt(tcp_cc) is not allowed
> + * in order to break the loop when both .init
> + * are the same bpf prog.
> + *
> + * This applies even the second bpf_setsockopt(tcp_cc)
> + * does not cause a loop. This limits only the first
> + * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> + * pick a fallback cc (eg. peer does not support ECN)
> + * and the second '.init' cannot fallback to
> + * another.
> + */
> + tp = tcp_sk(sk);
> + if (tp->bpf_chg_cc_inprogress)
> + return -EBUSY;
> +
> + tp->bpf_chg_cc_inprogress = 1;
> + ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> KERNEL_SOCKPTR(optval), *optlen);
> + tp->bpf_chg_cc_inprogress = 0;
> + return ret;
Eric,
Could you please ack this patch?
* Re: [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-27 3:34 ` Alexei Starovoitov
@ 2022-09-29 1:12 ` Alexei Starovoitov
0 siblings, 0 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2022-09-29 1:12 UTC (permalink / raw)
To: Martin KaFai Lau, Eric Dumazet
Cc: bpf, Network Development, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, David Miller, Eric Dumazet, Jakub Kicinski,
Kernel Team, Paolo Abeni, Martin KaFai Lau
Eric,
Ping! This is an important fix for anyone using bpf-based tcp-cc.
On Mon, Sep 26, 2022 at 8:34 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > From: Martin KaFai Lau <martin.lau@kernel.org>
> >
> > When a bad bpf prog '.init' calls
> > bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >
> > .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> > ... => .init => bpf_setsockopt(tcp_cc).
> >
> > It was prevented by the prog->active counter before but the prog->active
> > detection cannot be used in struct_ops as explained in the earlier
> > patch of the set.
> >
> > In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> > in order to break the loop. This is done by using a bit of
> > an existing 1 byte hole in tcp_sock to check if there is
> > on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
> >
> > Note that this essentially limits only the first '.init' can
> > call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> > does not support ECN) and the second '.init' cannot fallback to
> > another cc. This applies even the second
> > bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
> >
> > Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> > ---
> > include/linux/tcp.h | 6 ++++++
> > net/core/filter.c | 28 +++++++++++++++++++++++++++-
> > 2 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index a9fbe22732c3..3bdf687e2fb3 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -388,6 +388,12 @@ struct tcp_sock {
> > u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs
> > * values defined in uapi/linux/tcp.h
> > */
> > + u8 bpf_chg_cc_inprogress:1; /* In the middle of
> > + * bpf_setsockopt(TCP_CONGESTION),
> > + * it is to avoid the bpf_tcp_cc->init()
> > + * to recur itself by calling
> > + * bpf_setsockopt(TCP_CONGESTION, "itself").
> > + */
> > #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
> > #else
> > #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 96f2f7a65e65..ac4c45c02da5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> > static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> > int *optlen, bool getopt)
> > {
> > + struct tcp_sock *tp;
> > + int ret;
> > +
> > if (*optlen < 2)
> > return -EINVAL;
> >
> > @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> > if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
> > return -ENOTSUPP;
> >
> > - return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > + /* It stops this looping
> > + *
> > + * .init => bpf_setsockopt(tcp_cc) => .init =>
> > + * bpf_setsockopt(tcp_cc)" => .init => ....
> > + *
> > + * The second bpf_setsockopt(tcp_cc) is not allowed
> > + * in order to break the loop when both .init
> > + * are the same bpf prog.
> > + *
> > + * This applies even the second bpf_setsockopt(tcp_cc)
> > + * does not cause a loop. This limits only the first
> > + * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> > + * pick a fallback cc (eg. peer does not support ECN)
> > + * and the second '.init' cannot fallback to
> > + * another.
> > + */
> > + tp = tcp_sk(sk);
> > + if (tp->bpf_chg_cc_inprogress)
> > + return -EBUSY;
> > +
> > + tp->bpf_chg_cc_inprogress = 1;
> > + ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> > KERNEL_SOCKPTR(optval), *optlen);
> > + tp->bpf_chg_cc_inprogress = 0;
> > + return ret;
>
> Eric,
>
> Could you please ack this patch?
* Re: [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-23 22:45 ` [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
2022-09-27 3:34 ` Alexei Starovoitov
@ 2022-09-29 2:04 ` Eric Dumazet
2022-09-29 5:31 ` Martin KaFai Lau
1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-09-29 2:04 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Jakub Kicinski, kernel-team, Paolo Abeni,
Martin KaFai Lau
On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> From: Martin KaFai Lau <martin.lau@kernel.org>
>
> When a bad bpf prog '.init' calls
> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
>
> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> ... => .init => bpf_setsockopt(tcp_cc).
>
> It was prevented by the prog->active counter before but the prog->active
> detection cannot be used in struct_ops as explained in the earlier
> patch of the set.
>
> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> in order to break the loop. This is done by using a bit of
> an existing 1 byte hole in tcp_sock to check if there is
> on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock.
>
> Note that this essentially limits only the first '.init' can
> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> does not support ECN) and the second '.init' cannot fallback to
> another cc. This applies even the second
> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> ---
> include/linux/tcp.h | 6 ++++++
> net/core/filter.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index a9fbe22732c3..3bdf687e2fb3 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -388,6 +388,12 @@ struct tcp_sock {
> u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs
> * values defined in uapi/linux/tcp.h
> */
> + u8 bpf_chg_cc_inprogress:1; /* In the middle of
> + * bpf_setsockopt(TCP_CONGESTION),
> + * it is to avoid the bpf_tcp_cc->init()
> + * to recur itself by calling
> + * bpf_setsockopt(TCP_CONGESTION, "itself").
> + */
> #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
> #else
> #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 96f2f7a65e65..ac4c45c02da5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5105,6 +5105,9 @@ static int bpf_sol_tcp_setsockopt(struct sock *sk, int optname,
> static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> int *optlen, bool getopt)
> {
> + struct tcp_sock *tp;
> + int ret;
> +
> if (*optlen < 2)
> return -EINVAL;
>
> @@ -5125,8 +5128,31 @@ static int sol_tcp_sockopt_congestion(struct sock *sk, char *optval,
> if (*optlen >= sizeof("cdg") - 1 && !strncmp("cdg", optval, *optlen))
> return -ENOTSUPP;
>
> - return do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> + /* Prevent this recursion:
> + *
> + * .init => bpf_setsockopt(tcp_cc) => .init =>
> + * bpf_setsockopt(tcp_cc) => .init => ....
> + *
> + * The second bpf_setsockopt(tcp_cc) is not allowed,
> + * in order to break the loop when both .init
> + * are the same bpf prog.
> + *
> + * This applies even if the second bpf_setsockopt(tcp_cc)
> + * does not cause a loop: only the first
> + * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> + * pick a fallback cc (eg. the peer does not support ECN)
> + * and the second '.init' cannot fall back to
> + * another.
> + */
> + tp = tcp_sk(sk);
> + if (tp->bpf_chg_cc_inprogress)
> + return -EBUSY;
> +
Is the socket locked (and owned by current thread) at this point ?
If not, changing bpf_chg_cc_inprogress would be racy.
> + tp->bpf_chg_cc_inprogress = 1;
> + ret = do_tcp_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
> KERNEL_SOCKPTR(optval), *optlen);
> + tp->bpf_chg_cc_inprogress = 0;
> + return ret;
> }
>
> static int sol_tcp_sockopt(struct sock *sk, int optname,
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-29 2:04 ` Eric Dumazet
@ 2022-09-29 5:31 ` Martin KaFai Lau
2022-09-29 5:37 ` Eric Dumazet
0 siblings, 1 reply; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-29 5:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Jakub Kicinski, kernel-team, Paolo Abeni,
Martin KaFai Lau
On 9/28/22 7:04 PM, Eric Dumazet wrote:
> On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>
>> [ patch body quoted in full above; snipped down to the hunk in question: ]
>> + tp = tcp_sk(sk);
>> + if (tp->bpf_chg_cc_inprogress)
>> + return -EBUSY;
>> +
>
> Is the socket locked (and owned by current thread) at this point ?
> If not, changing bpf_chg_cc_inprogress would be racy.
Yes, the socket is locked and owned. There is a sock_owned_by_me check earlier
in _bpf_setsockopt().
* Re: [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-29 5:31 ` Martin KaFai Lau
@ 2022-09-29 5:37 ` Eric Dumazet
2022-09-29 6:17 ` Martin KaFai Lau
0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-09-29 5:37 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Jakub Kicinski, kernel-team, Paolo Abeni,
Martin KaFai Lau
On Wed, Sep 28, 2022 at 10:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/28/22 7:04 PM, Eric Dumazet wrote:
> > On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >>
> >> From: Martin KaFai Lau <martin.lau@kernel.org>
> >>
> >> [ patch body quoted in full above; snipped down to the hunk in question: ]
> >> + tp = tcp_sk(sk);
> >> + if (tp->bpf_chg_cc_inprogress)
> >> + return -EBUSY;
> >> +
> >
> > Is the socket locked (and owned by current thread) at this point ?
> > If not, changing bpf_chg_cc_inprogress would be racy.
>
> Yes, the socket is locked and owned. There is a sock_owned_by_me check earlier
> in _bpf_setsockopt().
Good to know. Note a listener can be cloned without socket lock being held.
In order to avoid surprises, I would clear bpf_chg_cc_inprogress in
tcp_create_openreq_child()
* Re: [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
2022-09-29 5:37 ` Eric Dumazet
@ 2022-09-29 6:17 ` Martin KaFai Lau
0 siblings, 0 replies; 12+ messages in thread
From: Martin KaFai Lau @ 2022-09-29 6:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: bpf, netdev, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
David Miller, Jakub Kicinski, kernel-team, Paolo Abeni
On 9/28/22 10:37 PM, Eric Dumazet wrote:
> On Wed, Sep 28, 2022 at 10:31 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 9/28/22 7:04 PM, Eric Dumazet wrote:
>>> On Fri, Sep 23, 2022 at 3:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>>>
>>>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>>>
>>>> [ patch body quoted in full above; snipped down to the hunk in question: ]
>>>> + tp = tcp_sk(sk);
>>>> + if (tp->bpf_chg_cc_inprogress)
>>>> + return -EBUSY;
>>>> +
>>>
>>> Is the socket locked (and owned by current thread) at this point ?
>>> If not, changing bpf_chg_cc_inprogress would be racy.
>>
>> Yes, the socket is locked and owned. There is a sock_owned_by_me check earlier
>> in _bpf_setsockopt().
>
> Good to know. Note a listener can be cloned without socket lock being held.
>
> In order to avoid surprises, I would clear bpf_chg_cc_inprogress in
> tcp_create_openreq_child()
Ah, makes sense. I will re-spin.
end of thread, other threads:[~2022-09-29 6:18 UTC | newest]
Thread overview: 12+ messages
2022-09-23 22:44 [PATCH v2 bpf-next 0/5] bpf: Remove recursion check for struct_ops prog Martin KaFai Lau
2022-09-23 22:44 ` [PATCH v2 bpf-next 1/5] bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 2/5] bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 3/5] bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 4/5] bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself Martin KaFai Lau
2022-09-27 3:34 ` Alexei Starovoitov
2022-09-29 1:12 ` Alexei Starovoitov
2022-09-29 2:04 ` Eric Dumazet
2022-09-29 5:31 ` Martin KaFai Lau
2022-09-29 5:37 ` Eric Dumazet
2022-09-29 6:17 ` Martin KaFai Lau
2022-09-23 22:45 ` [PATCH v2 bpf-next 5/5] selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) Martin KaFai Lau