public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
@ 2026-04-14 10:57 Jiayuan Chen
  2026-04-14 14:33 ` Alexei Starovoitov
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jiayuan Chen @ 2026-04-14 10:57 UTC (permalink / raw)
  To: bpf
  Cc: Jiayuan Chen, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu,
	Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Shuah Khan, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel

A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
to inject custom TCP header options. When the kernel builds a TCP packet,
it calls tcp_established_options() to calculate the header size, which
invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
callback.

If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
__tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
tcp_current_mss(), which calls tcp_established_options() again,
re-triggering the same BPF callback. This creates an infinite recursion
that exhausts the kernel stack and causes a panic.

BPF_SOCK_OPS_HDR_OPT_LEN_CB
  -> bpf_setsockopt(TCP_NODELAY)
	-> tcp_push_pending_frames()
	  -> tcp_current_mss()
		-> tcp_established_options()
		  -> bpf_skops_hdr_opt_len()
                           /* infinite recursion */
			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB

A similar reentrancy issue exists for TCP congestion control, which is
guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
bpf_setsockopt(TCP_NODELAY) calls that would trigger
tcp_push_pending_frames() and cause the recursion.

Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
 include/linux/tcp.h                                  | 11 ++++++++++-
 net/core/filter.c                                    |  4 ++++
 net/ipv4/tcp_minisocks.c                             |  1 +
 net/ipv4/tcp_output.c                                |  3 +++
 5 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index 563daea10d6c..07d3226d90cc 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
 int                           linger2
 u8                            bpf_sock_ops_cb_flags
 u8:1                          bpf_chg_cc_inprogress
+u8:1                          bpf_hdr_opt_len_cb_inprogress
 u16                           timeout_rehash
 u32                           rcv_ooopack
 u32                           rcv_rtt_last_tsecr
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index f72eef31fa23..2bfb73cf922e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -475,12 +475,21 @@ struct tcp_sock {
 	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
 					 * values defined in uapi/linux/tcp.h
 					 */
-	u8	bpf_chg_cc_inprogress:1; /* In the middle of
+	u8	bpf_chg_cc_inprogress:1, /* In the middle of
 					  * bpf_setsockopt(TCP_CONGESTION),
 					  * it is to avoid the bpf_tcp_cc->init()
 					  * to recur itself by calling
 					  * bpf_setsockopt(TCP_CONGESTION, "itself").
 					  */
+		bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
+						  * callback so that a nested
+						  * bpf_setsockopt(TCP_NODELAY) or
+						  * bpf_setsockopt(TCP_CORK) cannot
+						  * trigger tcp_push_pending_frames(),
+						  * which would call tcp_current_mss()
+						  * -> bpf_skops_hdr_opt_len(), causing
+						  * infinite recursion.
+						  */
 #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
 #else
 #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
diff --git a/net/core/filter.c b/net/core/filter.c
index 78b548158fb0..518699429a7a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5483,6 +5483,10 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
 	if (sk->sk_protocol != IPPROTO_TCP)
 		return -EINVAL;
 
+	if ((optname == TCP_NODELAY || optname == TCP_CORK) &&
+	    tcp_sk(sk)->bpf_hdr_opt_len_cb_inprogress)
+		return -EBUSY;
+
 	switch (optname) {
 	case TCP_NODELAY:
 	case TCP_MAXSEG:
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index dafb63b923d0..fb06c464ac16 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -663,6 +663,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
 	newtp->bpf_chg_cc_inprogress = 0;
+	newtp->bpf_hdr_opt_len_cb_inprogress = 0;
 	tcp_bpf_clone(sk, newsk);
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..c9654e690e1a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -475,6 +475,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
 				  unsigned int *remaining)
 {
 	struct bpf_sock_ops_kern sock_ops;
+	struct tcp_sock *tp = tcp_sk(sk);
 	int err;
 
 	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
@@ -519,7 +520,9 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
 	if (skb)
 		bpf_skops_init_skb(&sock_ops, skb, 0);
 
+	tp->bpf_hdr_opt_len_cb_inprogress = 1;
 	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+	tp->bpf_hdr_opt_len_cb_inprogress = 0;
 
 	if (err || sock_ops.remaining_opt_len == *remaining)
 		return;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
  2026-04-14 10:57 [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB Jiayuan Chen
@ 2026-04-14 14:33 ` Alexei Starovoitov
  2026-04-14 15:37 ` mkf
  2026-04-15 18:55 ` Martin KaFai Lau
  2 siblings, 0 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2026-04-14 14:33 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, Network Development, open list:DOCUMENTATION, LKML

On Tue, Apr 14, 2026 at 3:57 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
>
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
>
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
>         -> tcp_push_pending_frames()
>           -> tcp_current_mss()
>                 -> tcp_established_options()
>                   -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
>                         -> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>                                          * values defined in uapi/linux/tcp.h
>                                          */
> -       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> +       u8      bpf_chg_cc_inprogress:1, /* In the middle of
>                                           * bpf_setsockopt(TCP_CONGESTION),
>                                           * it is to avoid the bpf_tcp_cc->init()
>                                           * to recur itself by calling
>                                           * bpf_setsockopt(TCP_CONGESTION, "itself").
>                                           */
> +               bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +                                                 * callback so that a nested
> +                                                 * bpf_setsockopt(TCP_NODELAY) or
> +                                                 * bpf_setsockopt(TCP_CORK) cannot
> +                                                 * trigger tcp_push_pending_frames(),
> +                                                 * which would call tcp_current_mss()
> +                                                 * -> bpf_skops_hdr_opt_len(), causing
> +                                                 * infinite recursion.

Let's not add new bits.
Reuse existing and test/check all in one place,
like commit 061ff040710e9 did.

pw-bot: cr

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
  2026-04-14 10:57 [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB Jiayuan Chen
  2026-04-14 14:33 ` Alexei Starovoitov
@ 2026-04-14 15:37 ` mkf
  2026-04-15  1:47   ` Jiayuan Chen
  2026-04-15 18:55 ` Martin KaFai Lau
  2 siblings, 1 reply; 7+ messages in thread
From: mkf @ 2026-04-14 15:37 UTC (permalink / raw)
  To: Jiayuan Chen, bpf
  Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel

On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
> 
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
> 
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
> 	-> tcp_push_pending_frames()
> 	  -> tcp_current_mss()
> 		-> tcp_established_options()
> 		  -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
> 			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> 
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
> 
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst
> b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>  	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>  					 * values defined in uapi/linux/tcp.h
>  					 */
> -	u8	bpf_chg_cc_inprogress:1; /* In the middle of
> +	u8	bpf_chg_cc_inprogress:1, /* In the middle of
>  					  * bpf_setsockopt(TCP_CONGESTION),
>  					  * it is to avoid the bpf_tcp_cc->init()
>  					  * to recur itself by calling
>  					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>  					  */
> +		bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +						  * callback so that a nested
> +						  * bpf_setsockopt(TCP_NODELAY) or
> +						  * bpf_setsockopt(TCP_CORK) cannot
> +						  * trigger tcp_push_pending_frames(),
> +						  * which would call tcp_current_mss()
> +						  * -> bpf_skops_hdr_opt_len(), causing
> +						  * infinite recursion.
> +						  */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 78b548158fb0..518699429a7a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5483,6 +5483,10 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
>  	if (sk->sk_protocol != IPPROTO_TCP)
>  		return -EINVAL;
>  
> +	if ((optname == TCP_NODELAY || optname == TCP_CORK) &&
> +	    tcp_sk(sk)->bpf_hdr_opt_len_cb_inprogress)
> +		return -EBUSY;
> +
TCP_CORK is not support in sol_tcp_sockopt(), return -EINVAL by default. and put the check here
could also prevent us from calling getsockopt(TCP_NODELAY) below.

>  	switch (optname) {
>  	case TCP_NODELAY:
>  	case TCP_MAXSEG:
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index dafb63b923d0..fb06c464ac16 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -663,6 +663,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>  	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
>  
>  	newtp->bpf_chg_cc_inprogress = 0;
> +	newtp->bpf_hdr_opt_len_cb_inprogress = 0;
>  	tcp_bpf_clone(sk, newsk);
>  
>  	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118..c9654e690e1a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -475,6 +475,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  				  unsigned int *remaining)
>  {
>  	struct bpf_sock_ops_kern sock_ops;
> +	struct tcp_sock *tp = tcp_sk(sk);
>  	int err;
>  
>  	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> @@ -519,7 +520,9 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  	if (skb)
>  		bpf_skops_init_skb(&sock_ops, skb, 0);
>  
> +	tp->bpf_hdr_opt_len_cb_inprogress = 1;
we check the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG before calling BPF_CGROUP_RUN_PROG_SOCK_OPS_SK,
could this flag use for the same purpose? so we don't need to add an extra field.

	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
	    !*remaining)
		return;
>  	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> +	tp->bpf_hdr_opt_len_cb_inprogress = 0;
>  
>  	if (err || sock_ops.remaining_opt_len == *remaining)
>  		return;

-- 
Thanks,
KaFai


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
  2026-04-14 15:37 ` mkf
@ 2026-04-15  1:47   ` Jiayuan Chen
  2026-04-15 12:52     ` KaFai Wan
  0 siblings, 1 reply; 7+ messages in thread
From: Jiayuan Chen @ 2026-04-15  1:47 UTC (permalink / raw)
  To: mkf, bpf
  Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel


On 4/14/26 11:37 PM, mkf wrote:
> On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:


[...]

> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>   	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>   					 * values defined in uapi/linux/tcp.h
>   					 */
> -	u8	bpf_chg_cc_inprogress:1; /* In the middle of
> +	u8	bpf_chg_cc_inprogress:1, /* In the middle of
>   					  * bpf_setsockopt(TCP_CONGESTION),
>   					  * it is to avoid the bpf_tcp_cc->init()
>   					  * to recur itself by calling
>   					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>   					  */
> +		bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +						  * callback so that a nested
> +						  * bpf_setsockopt(TCP_NODELAY) or
> +						  * bpf_setsockopt(TCP_CORK) cannot
> +						  * trigger tcp_push_pending_frames(),
> +						  * which would call tcp_current_mss()
> +						  * -> bpf_skops_hdr_opt_len(), causing
> +						  * infinite recursion.
> +						  */
>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>   #else
>   #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 78b548158fb0..518699429a7a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5483,6 +5483,10 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
>   	if (sk->sk_protocol != IPPROTO_TCP)
>   		return -EINVAL;
>   
> +	if ((optname == TCP_NODELAY || optname == TCP_CORK) &&
> +	    tcp_sk(sk)->bpf_hdr_opt_len_cb_inprogress)
> +		return -EBUSY;
> +
> TCP_CORK is not support in sol_tcp_sockopt(), return -EINVAL by default. and put the check here
> could also prevent us from calling getsockopt(TCP_NODELAY) below.
>
>>   	switch (optname) {
>>   	case TCP_NODELAY:
>>   	case TCP_MAXSEG:
>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>> index dafb63b923d0..fb06c464ac16 100644
>> --- a/net/ipv4/tcp_minisocks.c
>> +++ b/net/ipv4/tcp_minisocks.c
>> @@ -663,6 +663,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>>   	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
>>   
>>   	newtp->bpf_chg_cc_inprogress = 0;
>> +	newtp->bpf_hdr_opt_len_cb_inprogress = 0;
>>   	tcp_bpf_clone(sk, newsk);
>>   
>>   	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 326b58ff1118..c9654e690e1a 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -475,6 +475,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>>   				  unsigned int *remaining)
>>   {
>>   	struct bpf_sock_ops_kern sock_ops;
>> +	struct tcp_sock *tp = tcp_sk(sk);
>>   	int err;
>>   
>>   	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
>> @@ -519,7 +520,9 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>>   	if (skb)
>>   		bpf_skops_init_skb(&sock_ops, skb, 0);
>>   
>> +	tp->bpf_hdr_opt_len_cb_inprogress = 1;
> we check the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG before calling BPF_CGROUP_RUN_PROG_SOCK_OPS_SK,
> could this flag use for the same purpose? so we don't need to add an extra field.
>
> 	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> 					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
> 	    !*remaining)
> 		return;


Hi Martin, I saw your patch. Your solution is better, please ignore mine :)




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
  2026-04-15  1:47   ` Jiayuan Chen
@ 2026-04-15 12:52     ` KaFai Wan
  0 siblings, 0 replies; 7+ messages in thread
From: KaFai Wan @ 2026-04-15 12:52 UTC (permalink / raw)
  To: Jiayuan Chen, bpf
  Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel

On Wed, 2026-04-15 at 09:47 +0800, Jiayuan Chen wrote:
> 
> On 4/14/26 11:37 PM, mkf wrote:
> > On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:
> 
> Hi Martin, I saw your patch. Your solution is better, please ignore mine :)
> 
I'm not Martin, just same first name :). Ok, I'll continue.
> 
> 

-- 
Thanks,
KaFai

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
  2026-04-14 10:57 [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB Jiayuan Chen
  2026-04-14 14:33 ` Alexei Starovoitov
  2026-04-14 15:37 ` mkf
@ 2026-04-15 18:55 ` Martin KaFai Lau
  2026-04-15 20:47   ` KaFai Wan
  2 siblings, 1 reply; 7+ messages in thread
From: Martin KaFai Lau @ 2026-04-15 18:55 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	netdev, linux-doc, linux-kernel

On Tue, Apr 14, 2026 at 06:57:00PM +0800, Jiayuan Chen wrote:
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
> 
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
> 
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
> 	-> tcp_push_pending_frames()
> 	  -> tcp_current_mss()
> 		-> tcp_established_options()
> 		  -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
> 			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> 
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
> 
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/

Thanks for the report and fixes suggested across different threads.

Using has_current_bpf_ctx() to avoid tcp_push_pending_frames() should
work but it may change the expectation for bpf_setsockopt(TCP_NODELAY).
e.g. A bpf_tcp_iter does bpf_setsockopt(TCP_NODELAY).

Adding another bit in the tcp_sock is not ideal either. I agree with
Alexei that it is better to reuse the existing bit if we go down this path.
We also need to audit more closely if there are cases that two different
type of bpf progs may call bpf_setsockopt(). e.g.
bpf_tcp_iter does bpf_setsockopt(TCP_CONGESTION) to switch
to a bpf_tcp_cc and the new bpf_tcp_cc->init() will also do
bpf_setsockopt(xxx) which then will be rejected.

Another fix could be, the bpf_setsockopt(TCP_NODELAY) is always broken
for BPF_SOCK_OPS_HDR_OPT_LEN_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB unless
the bpf prog is doing some maneuver to avoid the recursion. Thus,
this use case is basically broken as is and I don't see a use case
for bpf_setsockopt(TCP_NODELAY) when writing header also.
How about checking the bpf_sock->op, level, and optname in
bpf_sock_ops_setsockopt() and return -EOPNOTSUPP?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
  2026-04-15 18:55 ` Martin KaFai Lau
@ 2026-04-15 20:47   ` KaFai Wan
  0 siblings, 0 replies; 7+ messages in thread
From: KaFai Wan @ 2026-04-15 20:47 UTC (permalink / raw)
  To: Martin KaFai Lau, Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David Ahern,
	netdev, linux-doc, linux-kernel

On Wed, 2026-04-15 at 11:55 -0700, Martin KaFai Lau wrote:
> On Tue, Apr 14, 2026 at 06:57:00PM +0800, Jiayuan Chen wrote:
> > A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> > to inject custom TCP header options. When the kernel builds a TCP packet,
> > it calls tcp_established_options() to calculate the header size, which
> > invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> > callback.
> > 
> > If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> > __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> > tcp_current_mss(), which calls tcp_established_options() again,
> > re-triggering the same BPF callback. This creates an infinite recursion
> > that exhausts the kernel stack and causes a panic.
> > 
> > BPF_SOCK_OPS_HDR_OPT_LEN_CB
> >   -> bpf_setsockopt(TCP_NODELAY)
> > 	-> tcp_push_pending_frames()
> > 	  -> tcp_current_mss()
> > 		-> tcp_established_options()
> > 		  -> bpf_skops_hdr_opt_len()
> >                            /* infinite recursion */
> > 			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> > 
> > A similar reentrancy issue exists for TCP congestion control, which is
> > guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> > tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> > bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> > bpf_setsockopt(TCP_NODELAY) calls that would trigger
> > tcp_push_pending_frames() and cause the recursion.
> > 
> > Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> > Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> > Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> > Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> > Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> 
> Thanks for the report and fixes suggested across different threads.
> 
> Using has_current_bpf_ctx() to avoid tcp_push_pending_frames() should
> work but it may change the expectation for bpf_setsockopt(TCP_NODELAY).
> e.g. A bpf_tcp_iter does bpf_setsockopt(TCP_NODELAY).
> 
> Adding another bit in the tcp_sock is not ideal either. I agree with
> Alexei that it is better to reuse the existing bit if we go down this path.
> We also need to audit more closely if there are cases that two different
> type of bpf progs may call bpf_setsockopt(). e.g.
> bpf_tcp_iter does bpf_setsockopt(TCP_CONGESTION) to switch
> to a bpf_tcp_cc and the new bpf_tcp_cc->init() will also do
> bpf_setsockopt(xxx) which then will be rejected.
> 
> Another fix could be, the bpf_setsockopt(TCP_NODELAY) is always broken
> for BPF_SOCK_OPS_HDR_OPT_LEN_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB unless
> the bpf prog is doing some maneuver to avoid the recursion. Thus,
> this use case is basically broken as is and I don't see a use case
> for bpf_setsockopt(TCP_NODELAY) when writing header also.
> How about checking the bpf_sock->op, level, and optname in
> bpf_sock_ops_setsockopt() and return -EOPNOTSUPP?

Hi Martin, thanks for the review.

I'm working whit return -EOPNOTSUPP. I've completed whit the code of fix and test, and will send the
patch later.

The fix is:

diff --git a/net/core/filter.c b/net/core/filter.c
index fcfcb72663ca..911ff04bca5a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5833,6 +5833,11 @@ BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	if (!is_locked_tcp_sock_ops(bpf_sock))
 		return -EOPNOTSUPP;
 
+	if ((bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB ||
+	     bpf_sock->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB) &&
+	    IS_ENABLED(CONFIG_INET) && level == SOL_TCP && optname == TCP_NODELAY)
+		return -EOPNOTSUPP;
+
 	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }

-- 
Thanks,
KaFai

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-04-15 20:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-14 10:57 [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB Jiayuan Chen
2026-04-14 14:33 ` Alexei Starovoitov
2026-04-14 15:37 ` mkf
2026-04-15  1:47   ` Jiayuan Chen
2026-04-15 12:52     ` KaFai Wan
2026-04-15 18:55 ` Martin KaFai Lau
2026-04-15 20:47   ` KaFai Wan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox