Re: [PATCH bpf v3 1/2] bpf: Reject TCP_NODELAY in TCP header option callbacks

Sashiko discussions

* Re: [PATCH bpf v3 1/2] bpf: Reject TCP_NODELAY in TCP header option callbacks
       [not found] ` <20260418092235.98444C19424@smtp.kernel.org>
@ 2026-04-20 13:41   ` KaFai Wan
  2026-04-20 18:12     ` Martin KaFai Lau
  0 siblings, 1 reply; 3+ messages in thread
From: KaFai Wan @ 2026-04-20 13:41 UTC (permalink / raw)
  To: sashiko; +Cc: bpf

On Sat, 2026-04-18 at 09:22 +0000, sashiko-bot@kernel.org wrote:
> Sashiko AI review found 1 potential issue(s):
> - [High] The patch prevents an infinite recursion causing a kernel stack overflow by rejecting
> TCP_NODELAY in specific BPF_SOCK_OPS callbacks. However, it fails to address the exact same
> recursion vulnerability achievable via bpf_sk_setsockopt in a BPF TCP congestion control algorithm
> (BPF_PROG_TYPE_STRUCT_OPS).
> --
> 
> commit 4ff4d98f6b4e3c3d90f9268761de84573f49fd86
> Author: KaFai Wan <kafai.wan@linux.dev>
> 
> bpf: Reject TCP_NODELAY in TCP header option callbacks
> 
> This patch rejects bpf_setsockopt(TCP_NODELAY) when called from
> BPF_SOCK_OPS_HDR_OPT_LEN_CB or BPF_SOCK_OPS_WRITE_HDR_OPT_CB to prevent
> unbounded recursion and kernel stack overflow.
> 
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index fcfcb72663ca3..eae2f2c542716 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5833,6 +5833,12 @@ BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> >  	if (!is_locked_tcp_sock_ops(bpf_sock))
> >  		return -EOPNOTSUPP;
> >  
> > +	/* TCP_NODELAY triggers tcp_push_pending_frames() and re-enters these callbacks. */
> > +	if ((bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB ||
> > +	     bpf_sock->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB) &&
> > +	    level == SOL_TCP && optname == TCP_NODELAY)
> > +		return -EOPNOTSUPP;
> 
> Does this same recursion vulnerability exist for BPF TCP congestion control
> algorithms using BPF_PROG_TYPE_STRUCT_OPS?
> 
> If a BPF congestion control algorithm invokes bpf_setsockopt(TCP_NODELAY)
> from its cwnd_event callback when handling CA_EVENT_TX_START, could it
> trigger the same unbounded recursion?
> 
> When the kernel transmits the first packet of a data train via
> tcp_transmit_skb(), it invokes tcp_event_data_sent(). Because
> tp->packets_out is not incremented until later, tcp_packets_in_flight(tp)
> evaluates to 0, triggering tcp_ca_event(sk, CA_EVENT_TX_START).
> 
> If the BPF program then calls bpf_setsockopt(TCP_NODELAY), it would result
> in this call chain:
> 
> tcp_transmit_skb()
>   tcp_event_data_sent() -> invokes CA_EVENT_TX_START
>     cwnd_event()
>       bpf_setsockopt(TCP_NODELAY)
>         tcp_push_pending_frames()
>           tcp_write_xmit()
> 
> Since the outer tcp_transmit_skb() hasn't finished, the send head hasn't
> advanced. Wouldn't tcp_write_xmit() see the same SKB, attempt to transmit
> it again, and re-enter tcp_transmit_skb() causing an infinite recursion?
> 
You are right. I can reproduce this. 
 
> Should the restriction on TCP_NODELAY be enforced at a broader level, such
> as inside _bpf_setsockopt(), to protect contexts holding the socket lock
> during TX paths?
> 
We can check in sol_tcp_sockopt().

> > +
> >  	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
> >  }
> > 
> 

-- 
Thanks,
KaFai


* Re: [PATCH bpf v3 1/2] bpf: Reject TCP_NODELAY in TCP header option callbacks
  2026-04-20 13:41   ` [PATCH bpf v3 1/2] bpf: Reject TCP_NODELAY in TCP header option callbacks KaFai Wan
@ 2026-04-20 18:12     ` Martin KaFai Lau
  2026-04-21 15:50       ` KaFai Wan
  0 siblings, 1 reply; 3+ messages in thread
From: Martin KaFai Lau @ 2026-04-20 18:12 UTC (permalink / raw)
  To: KaFai Wan; +Cc: sashiko, bpf

On Mon, Apr 20, 2026 at 09:41:06PM +0800, KaFai Wan wrote:
> > Does this same recursion vulnerability exist for BPF TCP congestion control
> > algorithms using BPF_PROG_TYPE_STRUCT_OPS?
> > 
> > If a BPF congestion control algorithm invokes bpf_setsockopt(TCP_NODELAY)
> > from its cwnd_event callback when handling CA_EVENT_TX_START, could it
> > trigger the same unbounded recursion?
> > 
> > When the kernel transmits the first packet of a data train via
> > tcp_transmit_skb(), it invokes tcp_event_data_sent(). Because
> > tp->packets_out is not incremented until later, tcp_packets_in_flight(tp)
> > evaluates to 0, triggering tcp_ca_event(sk, CA_EVENT_TX_START).
> > 
> > If the BPF program then calls bpf_setsockopt(TCP_NODELAY), it would result
> > in this call chain:
> > 
> > tcp_transmit_skb()
> >   tcp_event_data_sent() -> invokes CA_EVENT_TX_START
> >     cwnd_event()
> >       bpf_setsockopt(TCP_NODELAY)
> >         tcp_push_pending_frames()
> >           tcp_write_xmit()
> > 
> > Since the outer tcp_transmit_skb() hasn't finished, the send head hasn't
> > advanced. Wouldn't tcp_write_xmit() see the same SKB, attempt to transmit
> > it again, and re-enter tcp_transmit_skb() causing an infinite recursion?
> > 
> You are right. I can reproduce this. 
>  
> > Should the restriction on TCP_NODELAY be enforced at a broader level, such
> > as inside _bpf_setsockopt(), to protect contexts holding the socket lock
> > during TX paths?
> > 
> We can check in sol_tcp_sockopt().

I don't know how it can use the socket lock to single out this case.
All bpf programs that are allowed to call bpf_setsockopt should
have the sock lock held. Maybe I am missing something obvious.

bpf_tcp_ca_get_func_proto() decides which ops are allowed to use
bpf_sk_setsockopt_proto. Right now, it rejects the "release" op. One
option is to create a new func_proto, bpf_sk_setsockopt_nodelay_proto,
that rejects TCP_NODELAY.
Instead of checking cwnd_event[_tx_start] in bpf_tcp_ca_get_func_proto,
I would return bpf_sk_setsockopt_nodelay_proto for all ops. We can revisit
and be more selective in the future if the hammer turns out to be too big.
"release" ops will remain disallowed from calling bpf_setsockopt.


* Re: [PATCH bpf v3 1/2] bpf: Reject TCP_NODELAY in TCP header option callbacks
  2026-04-20 18:12     ` Martin KaFai Lau
@ 2026-04-21 15:50       ` KaFai Wan
  0 siblings, 0 replies; 3+ messages in thread
From: KaFai Wan @ 2026-04-21 15:50 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: sashiko, bpf

On Mon, 2026-04-20 at 11:12 -0700, Martin KaFai Lau wrote:
> On Mon, Apr 20, 2026 at 09:41:06PM +0800, KaFai Wan wrote:
> > > Does this same recursion vulnerability exist for BPF TCP congestion control
> > > algorithms using BPF_PROG_TYPE_STRUCT_OPS?
> > > 
> > > If a BPF congestion control algorithm invokes bpf_setsockopt(TCP_NODELAY)
> > > from its cwnd_event callback when handling CA_EVENT_TX_START, could it
> > > trigger the same unbounded recursion?
> > > 
> > > When the kernel transmits the first packet of a data train via
> > > tcp_transmit_skb(), it invokes tcp_event_data_sent(). Because
> > > tp->packets_out is not incremented until later, tcp_packets_in_flight(tp)
> > > evaluates to 0, triggering tcp_ca_event(sk, CA_EVENT_TX_START).
> > > 
> > > If the BPF program then calls bpf_setsockopt(TCP_NODELAY), it would result
> > > in this call chain:
> > > 
> > > tcp_transmit_skb()
> > >   tcp_event_data_sent() -> invokes CA_EVENT_TX_START
> > >     cwnd_event()
> > >       bpf_setsockopt(TCP_NODELAY)
> > >         tcp_push_pending_frames()
> > >           tcp_write_xmit()
> > > 
> > > Since the outer tcp_transmit_skb() hasn't finished, the send head hasn't
> > > advanced. Wouldn't tcp_write_xmit() see the same SKB, attempt to transmit
> > > it again, and re-enter tcp_transmit_skb() causing an infinite recursion?
> > > 
> > You are right. I can reproduce this. 
> >  
> > > Should the restriction on TCP_NODELAY be enforced at a broader level, such
> > > as inside _bpf_setsockopt(), to protect contexts holding the socket lock
> > > during TX paths?
> > > 
> > We can check in sol_tcp_sockopt().
> 
> I don't know how it can use the socket lock to single out this case.
> All bpf programs that are allowed to call bpf_setsockopt should
> have the sock lock held. Maybe I am missing something obvious.

I tried to find a way to determine whether the sk is currently inside
tcp_transmit_skb() (i.e., in the TX path), but didn't find one.

> 
> bpf_tcp_ca_get_func_proto() decides which ops are allowed to use
> bpf_sk_setsockopt_proto. Right now, it rejects the "release" op. One
> option is to create a new func_proto, bpf_sk_setsockopt_nodelay_proto,
> that rejects TCP_NODELAY.
> Instead of checking cwnd_event[_tx_start] in bpf_tcp_ca_get_func_proto,
> I would return bpf_sk_setsockopt_nodelay_proto for all ops. We can revisit
> and be more selective in the future if the hammer turns out to be too big.
> "release" ops will remain disallowed from calling bpf_setsockopt.

Great, I'll try this approach. bpf_getsockopt(TCP_NODELAY) cannot trigger
the recursion, so I will keep it as is.

-- 
Thanks,
KaFai

