* [PATCH net-next 0/2] tcp: a couple of fixes
@ 2025-07-18 17:25 Paolo Abeni
2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni
2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni
0 siblings, 2 replies; 18+ messages in thread
From: Paolo Abeni @ 2025-07-18 17:25 UTC (permalink / raw)
To: netdev
Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
This series includes a couple of follow-ups for the recent TCP receiver
changes, addressing issues uncovered by the nipa CI and the mptcp
self-tests.
Note that while the affected self-tests are MPTCP ones, the issues are
really in the TCP code; see patch 1 for the details.
Paolo Abeni (2):
tcp: do not set a zero size receive buffer
tcp: do not increment BeyondWindow MIB for old seq
net/ipv4/tcp_input.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
--
2.50.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-18 17:25 [PATCH net-next 0/2] tcp: a couple of fixes Paolo Abeni
@ 2025-07-18 17:25 ` Paolo Abeni
2025-07-18 22:15 ` Matthieu Baerts
2025-07-21 8:04 ` Eric Dumazet
2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni
1 sibling, 2 replies; 18+ messages in thread
From: Paolo Abeni @ 2025-07-18 17:25 UTC (permalink / raw)
To: netdev
Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
The nipa CI is reporting frequent failures in the mptcp_connect
self-tests.
In the failing scenarios (TCP -> MPTCP) the involved sockets are
actually plain TCP ones, as the fallback for the passive socket at 2whs
time causes the MPTCP listener to create a plain TCP socket.
The transfer is stuck due to the receiver buffer being zero.
With the stronger check in place, tcp_clamp_window() can be invoked
while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
will be zeroed, too.
Additionally pass the current skb truesize to tcp_clamp_window(), so
that the helper can compute and use the actual limit enforced by the
stack.
Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv4/tcp_input.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 672cbfbdcec1..c98de02a3c57 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
}
/* 4. Recalculate window clamp after socket hit its memory bounds. */
-static void tcp_clamp_window(struct sock *sk)
+static void tcp_clamp_window(struct sock *sk, int truesize)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
struct net *net = sock_net(sk);
- int rmem2;
+ int rmem2, needed;
icsk->icsk_ack.quick = 0;
rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
+ needed = atomic_read(&sk->sk_rmem_alloc) + truesize;
if (sk->sk_rcvbuf < rmem2 &&
!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
!tcp_under_memory_pressure(sk) &&
sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
- WRITE_ONCE(sk->sk_rcvbuf,
- min(atomic_read(&sk->sk_rmem_alloc), rmem2));
+ WRITE_ONCE(sk->sk_rcvbuf, min(needed, rmem2));
}
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+ if (needed > sk->sk_rcvbuf)
tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
}
@@ -5552,7 +5552,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
if (!tcp_can_ingest(sk, in_skb))
- tcp_clamp_window(sk);
+ tcp_clamp_window(sk, in_skb->truesize);
else if (tcp_under_memory_pressure(sk))
tcp_adjust_rcv_ssthresh(sk);
--
2.50.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
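To make the failure mode concrete: with sk_rmem_alloc == 0, the old code
computes min(0, rmem2) and zeroes sk_rcvbuf, and a zero receive buffer can
never accept another byte. A minimal, self-contained sketch of the clamp
arithmetic (illustrative values, not kernel code):

#include <stdio.h>

static int min_int(int a, int b) { return a < b ? a : b; }

int main(void)
{
	int rmem2 = 6291456;    /* stand-in for sysctl_tcp_rmem[2] */
	int sk_rmem_alloc = 0;  /* nothing queued: the problematic case */
	int skb_truesize = 72320;

	/* Old clamp: collapses rcvbuf to 0 and stalls the transfer. */
	int rcvbuf_old = min_int(sk_rmem_alloc, rmem2);

	/* New clamp: accounting for the incoming skb's truesize keeps
	 * rcvbuf large enough to ingest the packet. */
	int rcvbuf_new = min_int(sk_rmem_alloc + skb_truesize, rmem2);

	printf("old: %d, new: %d\n", rcvbuf_old, rcvbuf_new);
	return 0;
}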
* [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq
2025-07-18 17:25 [PATCH net-next 0/2] tcp: a couple of fixes Paolo Abeni
2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni
@ 2025-07-18 17:25 ` Paolo Abeni
2025-07-18 22:16 ` Matthieu Baerts
2025-07-21 8:28 ` Eric Dumazet
1 sibling, 2 replies; 18+ messages in thread
From: Paolo Abeni @ 2025-07-18 17:25 UTC (permalink / raw)
To: netdev
Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
The mentioned MIB is currently incremented even when a packet
with an old sequence number (i.e. a zero window probe) is received,
which is IMHO misleading.
Explicitly restrict the MIB increment to the relevant events.
Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/ipv4/tcp_input.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c98de02a3c57..b708287df9b7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5911,7 +5911,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (!th->rst) {
if (th->syn)
goto syn_challenge;
- NET_INC_STATS(sock_net(sk), LINUX_MIB_BEYOND_WINDOW);
+
+ if (reason == SKB_DROP_REASON_TCP_INVALID_SEQUENCE ||
+ reason == SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE)
+ NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_BEYOND_WINDOW);
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSEQ,
&tp->last_oow_ack_time))
--
2.50.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
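The distinction the patch relies on can be modeled outside the kernel: a
segment lying entirely below the already-acknowledged window edge is 'old'
(e.g. a zero window probe retransmission) and should not count as
BeyondWindow; only segments reaching past the advertised window should.
A hedged, self-contained sketch (the enum and helper names are
illustrative, not the kernel's):

#include <stdint.h>
#include <stdio.h>

/* Wraparound-safe sequence comparisons, mirroring before()/after(). */
static int seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }
static int seq_after(uint32_t a, uint32_t b) { return seq_before(b, a); }

enum seq_class { SEQ_IN_WINDOW, SEQ_OLD, SEQ_BEYOND_WINDOW };

static enum seq_class classify(uint32_t seq, uint32_t end_seq,
			       uint32_t rcv_wup, uint32_t rcv_nxt,
			       uint32_t rcv_wnd)
{
	if (seq_before(end_seq, rcv_wup))
		return SEQ_OLD;           /* old data: no BeyondWindow bump */
	if (seq_after(seq, rcv_nxt + rcv_wnd))
		return SEQ_BEYOND_WINDOW; /* truly past the advertised window */
	return SEQ_IN_WINDOW;
}

int main(void)
{
	/* Segment entirely below rcv_wup, e.g. a retransmitted probe: old. */
	printf("old: %d\n", classify(1000, 1500, 2000, 2000, 0) == SEQ_OLD);
	/* Segment starting past rcv_nxt + rcv_wnd: counts as BeyondWindow. */
	printf("beyond: %d\n",
	       classify(70000, 71000, 2000, 2000, 4096) == SEQ_BEYOND_WINDOW);
	return 0;
}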
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni
@ 2025-07-18 22:15 ` Matthieu Baerts
2025-07-21 8:04 ` Eric Dumazet
1 sibling, 0 replies; 18+ messages in thread
From: Matthieu Baerts @ 2025-07-18 22:15 UTC (permalink / raw)
To: Paolo Abeni, Jakub Kicinski
Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Simon Horman, netdev
Hi Paolo,
On 18/07/2025 19:25, Paolo Abeni wrote:
> The nipa CI is reporting frequent failures in the mptcp_connect
> self-tests.
>
> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> actually plain TCP ones, as the fallback for the passive socket at
> 2whs time causes the MPTCP listener to create a plain TCP socket.
>
> The transfer is stuck due to the receiver buffer being zero.
> With the stronger check in place, tcp_clamp_window() can be invoked
> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> will be zeroed, too.
>
> Additionally pass the current skb truesize to tcp_clamp_window(), so
> that the helper can compute and use the actual limit enforced by the
> stack.
Thank you for this fix!
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
I guess we can drop "mptcp-connect-sh" from the ignored tests then :)
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq
2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni
@ 2025-07-18 22:16 ` Matthieu Baerts
2025-07-21 8:28 ` Eric Dumazet
1 sibling, 0 replies; 18+ messages in thread
From: Matthieu Baerts @ 2025-07-18 22:16 UTC (permalink / raw)
To: Paolo Abeni, netdev
Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman
Hi Paolo,
On 18/07/2025 19:25, Paolo Abeni wrote:
> The mentioned MIB is currently incremented even when a packet
> with an old sequence number (i.e. a zero window probe) is received,
> which is IMHO misleading.
>
> Explicitly restrict the MIB increment to the relevant events.
Good idea, it is less confusing!
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni
2025-07-18 22:15 ` Matthieu Baerts
@ 2025-07-21 8:04 ` Eric Dumazet
2025-07-21 10:50 ` Paolo Abeni
1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 8:04 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> The nipa CI is reporting frequent failures in the mptcp_connect
> self-tests.
>
> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> actually plain TCP ones, as the fallback for the passive socket at
> 2whs time causes the MPTCP listener to create a plain TCP socket.
>
> The transfer is stuck due to the receiver buffer being zero.
> With the stronger check in place, tcp_clamp_window() can be invoked
> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> will be zeroed, too.
>
> Additionally pass the current skb truesize to tcp_clamp_window(), so
> that the helper can compute and use the actual limit enforced by the
> stack.
>
> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> net/ipv4/tcp_input.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 672cbfbdcec1..c98de02a3c57 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> }
>
> /* 4. Recalculate window clamp after socket hit its memory bounds. */
> -static void tcp_clamp_window(struct sock *sk)
> +static void tcp_clamp_window(struct sock *sk, int truesize)
I am unsure about this one. truesize can be 1MB here, do we want that
in general ?
I am unsure why MPTCP ends up in this path.
LINUX_MIB_PRUNECALLED being called in normal MPTCP operations seems
strange to me.
> {
> struct tcp_sock *tp = tcp_sk(sk);
> struct inet_connection_sock *icsk = inet_csk(sk);
> struct net *net = sock_net(sk);
> - int rmem2;
> + int rmem2, needed;
>
> icsk->icsk_ack.quick = 0;
> rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
> + needed = atomic_read(&sk->sk_rmem_alloc) + truesize;
>
> if (sk->sk_rcvbuf < rmem2 &&
> !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
> !tcp_under_memory_pressure(sk) &&
> sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
> - WRITE_ONCE(sk->sk_rcvbuf,
> - min(atomic_read(&sk->sk_rmem_alloc), rmem2));
> + WRITE_ONCE(sk->sk_rcvbuf, min(needed, rmem2));
> }
> - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> + if (needed > sk->sk_rcvbuf)
> tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
> }
>
> @@ -5552,7 +5552,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
> NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
>
> if (!tcp_can_ingest(sk, in_skb))
> - tcp_clamp_window(sk);
> + tcp_clamp_window(sk, in_skb->truesize);
> else if (tcp_under_memory_pressure(sk))
> tcp_adjust_rcv_ssthresh(sk);
>
> --
> 2.50.0
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq
2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni
2025-07-18 22:16 ` Matthieu Baerts
@ 2025-07-21 8:28 ` Eric Dumazet
1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 8:28 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> The mentioned MIB is currently incremented even when a packet
> with an old sequence number (i.e. a zero window probe) is received,
> which is IMHO misleading.
>
> Explicitly restrict the MIB increment to the relevant events.
>
> Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Thanks !
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 8:04 ` Eric Dumazet
@ 2025-07-21 10:50 ` Paolo Abeni
2025-07-21 12:30 ` Eric Dumazet
0 siblings, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2025-07-21 10:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On 7/21/25 10:04 AM, Eric Dumazet wrote:
> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> The nipa CI is reporting frequent failures in the mptcp_connect
>> self-tests.
>>
>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
>> actually plain TCP ones, as the fallback for the passive socket at
>> 2whs time causes the MPTCP listener to create a plain TCP socket.
>>
>> The transfer is stuck due to the receiver buffer being zero.
>> With the stronger check in place, tcp_clamp_window() can be invoked
>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
>> will be zeroed, too.
>>
>> Additionally pass the current skb truesize to tcp_clamp_window(), so
>> that the helper can compute and use the actual limit enforced by the
>> stack.
>>
>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> net/ipv4/tcp_input.c | 12 ++++++------
>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 672cbfbdcec1..c98de02a3c57 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
>> }
>>
>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
>> -static void tcp_clamp_window(struct sock *sk)
>> +static void tcp_clamp_window(struct sock *sk, int truesize)
>
>
> I am unsure about this one. truesize can be 1MB here, do we want that
> in general ?
I'm unsure either. But I can't think of a different approach?!? If the
incoming truesize is 1M the socket should allow for at least 1M rcvbuf
size to accept it, right?
> I am unsure why MPTCP ends up in this path.
>
> LINUX_MIB_PRUNECALLED being called in normal MPTCP operations seems
> strange to me.
The code path hits:
!tcp_can_ingest(sk, skb)
in tcp_try_rmem_schedule(). Perhaps the scaling_ratio is a bit
optimistic due to unfortunate packet layouts?
Let me check if I can grab more data.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 10:50 ` Paolo Abeni
@ 2025-07-21 12:30 ` Eric Dumazet
2025-07-21 12:41 ` Eric Dumazet
2025-07-21 13:32 ` Paolo Abeni
0 siblings, 2 replies; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 12:30 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/21/25 10:04 AM, Eric Dumazet wrote:
> > On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >> The nipa CI is reporting frequent failures in the mptcp_connect
> >> self-tests.
> >>
> >> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> >> actually plain TCP ones, as the fallback for the passive socket at
> >> 2whs time causes the MPTCP listener to create a plain TCP socket.
> >>
> >> The transfer is stuck due to the receiver buffer being zero.
> >> With the stronger check in place, tcp_clamp_window() can be invoked
> >> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> >> will be zeroed, too.
> >>
> >> Additionally pass the current skb truesize to tcp_clamp_window(), so
> >> that the helper can compute and use the actual limit enforced by the
> >> stack.
> >>
> >> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> >> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> >> ---
> >> net/ipv4/tcp_input.c | 12 ++++++------
> >> 1 file changed, 6 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 672cbfbdcec1..c98de02a3c57 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> >> }
> >>
> >> /* 4. Recalculate window clamp after socket hit its memory bounds. */
> >> -static void tcp_clamp_window(struct sock *sk)
> >> +static void tcp_clamp_window(struct sock *sk, int truesize)
> >
> >
> > I am unsure about this one. truesize can be 1MB here, do we want that
> > in general ?
>
> I'm unsure either. But I can't think of a different approach?!? If the
> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> size to accept it, right?
What I meant was :
This is the generic point, accepting skb->truesize as additional input
here would make us more vulnerable, or we could risk other
regressions.
The question is : why does MPTCP end up here in the first place.
Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
Or maybe the test about the receive queue being empty (currently done
in tcp_data_queue()) should be moved to a more strategic place.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 12:30 ` Eric Dumazet
@ 2025-07-21 12:41 ` Eric Dumazet
2025-07-21 13:32 ` Paolo Abeni
1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 12:41 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On Mon, Jul 21, 2025 at 5:30 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 7/21/25 10:04 AM, Eric Dumazet wrote:
> > > On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >>
> > >> The nipa CI is reporting frequent failures in the mptcp_connect
> > >> self-tests.
> > >>
> > >> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> > >> actually plain TCP ones, as the fallback for the passive socket at
> > >> 2whs time causes the MPTCP listener to create a plain TCP socket.
> > >>
> > >> The transfer is stuck due to the receiver buffer being zero.
> > >> With the stronger check in place, tcp_clamp_window() can be invoked
> > >> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> > >> will be zeroed, too.
> > >>
> > >> Additionally pass the current skb truesize to tcp_clamp_window(), so
> > >> that the helper can compute and use the actual limit enforced by the
> > >> stack.
> > >>
> > >> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> > >> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > >> ---
> > >> net/ipv4/tcp_input.c | 12 ++++++------
> > >> 1 file changed, 6 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > >> index 672cbfbdcec1..c98de02a3c57 100644
> > >> --- a/net/ipv4/tcp_input.c
> > >> +++ b/net/ipv4/tcp_input.c
> > >> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> > >> }
> > >>
> > >> /* 4. Recalculate window clamp after socket hit its memory bounds. */
> > >> -static void tcp_clamp_window(struct sock *sk)
> > >> +static void tcp_clamp_window(struct sock *sk, int truesize)
> > >
> > >
> > > I am unsure about this one. truesize can be 1MB here, do we want that
> > > in general ?
> >
> > I'm unsure either. But I can't think of a different approach?!? If the
> > incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> > size to accept it, right?
>
> What I meant was :
>
> This is the generic point, accepting skb->truesize as additional input
> here would make us more vulnerable, or we could risk other
> regressions.
>
> The question is : why does MPTCP end up here in the first place.
> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
>
> Or maybe the test about the receive queue being empty (currently done
> in tcp_data_queue()) should be moved to a more strategic place.
If the MPTCP part can not be easily resolved, perhaps :
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 672cbfbdcec1de22a5b1494d365863303271d222..81b6d37708120632d16a50892442ea04779cc3a4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5549,6 +5549,10 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ /* Do nothing if our queues are empty. */
+ if (!atomic_read(&sk->sk_rmem_alloc))
+ return -1;
+
NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
if (!tcp_can_ingest(sk, in_skb))
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 12:30 ` Eric Dumazet
2025-07-21 12:41 ` Eric Dumazet
@ 2025-07-21 13:32 ` Paolo Abeni
2025-07-21 13:52 ` Eric Dumazet
1 sibling, 1 reply; 18+ messages in thread
From: Paolo Abeni @ 2025-07-21 13:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On 7/21/25 2:30 PM, Eric Dumazet wrote:
> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>
>>>> The nipa CI is reporting frequent failures in the mptcp_connect
>>>> self-tests.
>>>>
>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
>>>> actually plain TCP ones, as the fallback for the passive socket at
>>>> 2whs time causes the MPTCP listener to create a plain TCP socket.
>>>>
>>>> The transfer is stuck due to the receiver buffer being zero.
>>>> With the stronger check in place, tcp_clamp_window() can be invoked
>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
>>>> will be zeroed, too.
>>>>
>>>> Additionally pass the current skb truesize to tcp_clamp_window(), so
>>>> that the helper can compute and use the actual limit enforced by the
>>>> stack.
>>>>
>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>> ---
>>>> net/ipv4/tcp_input.c | 12 ++++++------
>>>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>> index 672cbfbdcec1..c98de02a3c57 100644
>>>> --- a/net/ipv4/tcp_input.c
>>>> +++ b/net/ipv4/tcp_input.c
>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
>>>> }
>>>>
>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
>>>> -static void tcp_clamp_window(struct sock *sk)
>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
>>>
>>>
>>> I am unsure about this one. truesize can be 1MB here, do we want that
>>> in general ?
>>
>> I'm unsure either. But I can't think of a different approach?!? If the
>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
>> size to accept it, right?
>
> What I meant was :
>
> This is the generic point, accepting skb->truesize as additional input
> here would make us more vulnerable, or we could risk other
> regressions.
Understood, thanks for the clarification.
> The question is : why does MPTCP end up here in the first place.
> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
I collected some more data. The issue happens even with plain TCP
sockets[1].
The relevant transfer is on top of the loopback device. The scaling_ratio
rapidly grows to 254 - that is, `truesize` and `len` are very close.
The stall happens when the receiver gets a packet with a slightly less
'efficient' layout (in the experiment at hand, len is 71424 and
truesize is 72320), (almost) filling the receive window.
On such input, tcp_clamp_window() shrinks the receiver buffer to the
current rmem usage. The same happens on retransmissions until rcvbuf
becomes 0.
I *think* that catching only the !sk_rmem_alloc case would avoid the
stall, but I think it's a bit 'late'. I'm unsure if we could
prevent/forbid 'too high' values of scaling_ratio (also, I'm unsure
where to draw the line exactly).
Cheers,
Paolo
[1] You can run the relevant test by adding '-t' to the mptcp_connect.sh
command line, but it takes a lot of time to run the 10-20 iterations
I need to observe the issue. To make it faster I manually trimmed the
irrelevant test-cases.
^ permalink raw reply [flat|nested] 18+ messages in thread
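Plugging the numbers above into the window arithmetic shows the mechanism;
a rough model (the shift by 8 matches TCP_RMEM_TO_WIN_SCALE, the values
come from the reproduction Paolo reported):

#include <stdint.h>
#include <stdio.h>

/* Rough model of tcp_win_from_space(): buffer space scaled by
 * scaling_ratio/256 gives the window the receiver may advertise. */
static int win_from_space(uint8_t scaling_ratio, int space)
{
	return (int)(((int64_t)space * scaling_ratio) >> 8);
}

int main(void)
{
	int rcvbuf = 174650, rmem = 110592;
	uint8_t scaling_ratio = 253;
	int len = 71424, truesize = 72320;

	/* Advertised window derived from rcvbuf: ~172k bytes. */
	printf("window: %d\n", win_from_space(scaling_ratio, rcvbuf));

	/* The less 'efficient' packet carries 71424 payload bytes in 72320
	 * truesize bytes, a ratio of ~252/256, just under the estimate. */
	printf("pkt ratio: %lld/256\n",
	       (long long)(((int64_t)len << 8) / truesize));

	/* rmem + truesize exceeds rcvbuf, so tcp_prune_queue() kicks in,
	 * and the old clamp then shrinks rcvbuf to the rmem usage. */
	printf("needed: %d vs rcvbuf: %d\n", rmem + truesize, rcvbuf);
	return 0;
}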
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 13:32 ` Paolo Abeni
@ 2025-07-21 13:52 ` Eric Dumazet
2025-07-21 14:56 ` Paolo Abeni
0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 13:52 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/21/25 2:30 PM, Eric Dumazet wrote:
> > On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >> On 7/21/25 10:04 AM, Eric Dumazet wrote:
> >>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>>
> >>>> The nipa CI is reporting frequent failures in the mptcp_connect
> >>>> self-tests.
> >>>>
> >>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> >>>> actually plain TCP ones, as the fallback for the passive socket at
> >>>> 2whs time causes the MPTCP listener to create a plain TCP socket.
> >>>>
> >>>> The transfer is stuck due to the receiver buffer being zero.
> >>>> With the stronger check in place, tcp_clamp_window() can be invoked
> >>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> >>>> will be zeroed, too.
> >>>>
> >>>> Additionally pass the current skb truesize to tcp_clamp_window(), so
> >>>> that the helper can compute and use the actual limit enforced by the
> >>>> stack.
> >>>>
> >>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> >>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> >>>> ---
> >>>> net/ipv4/tcp_input.c | 12 ++++++------
> >>>> 1 file changed, 6 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>> index 672cbfbdcec1..c98de02a3c57 100644
> >>>> --- a/net/ipv4/tcp_input.c
> >>>> +++ b/net/ipv4/tcp_input.c
> >>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> >>>> }
> >>>>
> >>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
> >>>> -static void tcp_clamp_window(struct sock *sk)
> >>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
> >>>
> >>>
> >>> I am unsure about this one. truesize can be 1MB here, do we want that
> >>> in general ?
> >>
> >> I'm unsure either. But I can't think of a different approach?!? If the
> >> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> >> size to accept it, right?
> >
> > What I meant was :
> >
> > This is the generic point, accepting skb->truesize as additional input
> > here would make us more vulnerable, or we could risk other
> > regressions.
>
> Understood, thanks for the clarification.
>
> > The question is : why does MPTCP end up here in the first place.
> > Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
>
> I collected some more data. The issue happens even with plain TCP
> sockets[1].
>
> The relevant transfer is on top of the loopback device. The scaling_ratio
> rapidly grows to 254 - that is, `truesize` and `len` are very close.
>
> The stall happens when the receiver gets a packet with a slightly less
> 'efficient' layout (in the experiment at hand, len is 71424 and
> truesize is 72320), (almost) filling the receive window.
>
> On such input, tcp_clamp_window() shrinks the receiver buffer to the
> current rmem usage. The same happens on retransmissions until rcvbuf
> becomes 0.
>
> I *think* that catching only the !sk_rmem_alloc case would avoid the
> stall, but I think it's a bit 'late'.
A packetdrill test here would help understanding your concern.
> I'm unsure if we could
> preventing/forbidding 'too high' values of scaling_rate? (also I'm
> unsure where to draw the line exactly.
Indeed we need to account for a possible variation (i.e. reduction) of
the skb->len/skb->truesize ratio in future packets.
Note that whatever conservative change we make, it will always be
possible to feed packets until RWIN 0 is sent back, possibly after a
bump caused by a prune operation.
Imagine wscale is 8, and a prior RWIN 1 was sent.
Normally we should be able to receive a packet with 256 bytes of
payload, but typical skb->truesize
will be 256+512+4096 (for a NIC driver using 4K pages).
This is one of the reasons we force wscale to 12 in Google DC :)
>
> Cheers,
>
> Paolo
>
>
> [1] You can run the relevant test by adding '-t' to the mptcp_connect.sh
> command line, but it takes a lot of time to run the 10-20 iterations
> I need to observe the issue. To make it faster I manually trimmed the
> irrelevant test-cases.
>
^ permalink raw reply [flat|nested] 18+ messages in thread
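Eric's wscale arithmetic, spelled out (the truesize breakdown is the
typical skb head plus overhead plus one 4K page from his example):

#include <stdio.h>

int main(void)
{
	int wscale = 8;
	int rwin_field = 1;                /* raw window field in the ACK */
	int window = rwin_field << wscale; /* sender may push 256 bytes   */

	/* Receiver-side memory cost of that 256-byte packet. */
	int truesize = 256 + 512 + 4096;

	printf("advertised window: %d bytes\n", window);
	printf("packet truesize:   %d bytes (len/truesize ~ %.2f)\n",
	       truesize, (double)window / truesize);
	return 0;
}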
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 13:52 ` Eric Dumazet
@ 2025-07-21 14:56 ` Paolo Abeni
2025-07-21 15:21 ` Eric Dumazet
2025-07-21 15:27 ` Jakub Kicinski
0 siblings, 2 replies; 18+ messages in thread
From: Paolo Abeni @ 2025-07-21 14:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On 7/21/25 3:52 PM, Eric Dumazet wrote:
> On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 7/21/25 2:30 PM, Eric Dumazet wrote:
>>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
>>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>>>
>>>>>> The nipa CI is reporting frequent failures in the mptcp_connect
>>>>>> self-tests.
>>>>>>
>>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
>>>>>> actually plain TCP ones, as the fallback for the passive socket at
>>>>>> 2whs time causes the MPTCP listener to create a plain TCP socket.
>>>>>>
>>>>>> The transfer is stuck due to the receiver buffer being zero.
>>>>>> With the stronger check in place, tcp_clamp_window() can be invoked
>>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
>>>>>> will be zeroed, too.
>>>>>>
>>>>>> Additionally pass the current skb truesize to tcp_clamp_window(), so
>>>>>> that the helper can compute and use the actual limit enforced by the
>>>>>> stack.
>>>>>>
>>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
>>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>>>> ---
>>>>>> net/ipv4/tcp_input.c | 12 ++++++------
>>>>>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>>> index 672cbfbdcec1..c98de02a3c57 100644
>>>>>> --- a/net/ipv4/tcp_input.c
>>>>>> +++ b/net/ipv4/tcp_input.c
>>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
>>>>>> }
>>>>>>
>>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
>>>>>> -static void tcp_clamp_window(struct sock *sk)
>>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
>>>>>
>>>>>
>>>>> I am unsure about this one. truesize can be 1MB here, do we want that
>>>>> in general ?
>>>>
>>>> I'm unsure either. But I can't think of a different approach?!? If the
>>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
>>>> size to accept it, right?
>>>
>>> What I meant was :
>>>
>>> This is the generic point, accepting skb->truesize as additional input
>>> here would make us more vulnerable, or we could risk other
>>> regressions.
>>
>> Understood, thanks for the clarification.
>>
>>> The question is : why does MPTCP end up here in the first place.
>>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
>>
>> I collected some more data. The issue happens even with plain TCP
>> sockets[1].
>>
>> The relevant transfer is on top of the loopback device. The scaling_ratio
>> rapidly grows to 254 - that is, `truesize` and `len` are very close.
>>
>> The stall happens when the receiver gets a packet with a slightly less
>> 'efficient' layout (in the experiment at hand, len is 71424 and
>> truesize is 72320), (almost) filling the receive window.
>>
>> On such input, tcp_clamp_window() shrinks the receiver buffer to the
>> current rmem usage. The same happens on retransmissions until rcvbuf
>> becomes 0.
>>
>> I *think* that catching only the !sk_rmem_alloc case would avoid the
>> stall, but I think it's a bit 'late'.
>
> A packetdrill test here would help understanding your concern.
I fear a complete working script would take a lot of time; let me
try to sketch just the relevant part:
# receiver state is:
# rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
# no OoO data, no memory pressure,
# the incoming packet is in sequence
+0 > P. 109297:172528(63232) ack 1
With just the 0 rmem check in tcp_prune_queue(), that function will
still invoke tcp_clamp_window(), which will shrink the receive buffer to
110592.
tcp_collapse() can't make enough room and the incoming packet will be
dropped. I think we should instead accept such a packet.
Side note: the above data are taken from an actual reproduction of the
issue.
Please LMK if the above clarifies my doubt a bit, or if a full
packetdrill is needed.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 14:56 ` Paolo Abeni
@ 2025-07-21 15:21 ` Eric Dumazet
2025-07-21 16:17 ` Paolo Abeni
2025-07-21 15:27 ` Jakub Kicinski
1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 15:21 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/21/25 3:52 PM, Eric Dumazet wrote:
> > On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >> On 7/21/25 2:30 PM, Eric Dumazet wrote:
> >>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
> >>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>>>>
> >>>>>> The nipa CI is reporting frequent failures in the mptcp_connect
> >>>>>> self-tests.
> >>>>>>
> >>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> >>>>>> actually plain TCP ones, as the fallback for the passive socket at
> >>>>>> 2whs time causes the MPTCP listener to create a plain TCP socket.
> >>>>>>
> >>>>>> The transfer is stuck due to the receiver buffer being zero.
> >>>>>> With the stronger check in place, tcp_clamp_window() can be invoked
> >>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> >>>>>> will be zeroed, too.
> >>>>>>
> >>>>>> Additionally pass the current skb truesize to tcp_clamp_window(), so
> >>>>>> that the helper can compute and use the actual limit enforced by the
> >>>>>> stack.
> >>>>>>
> >>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> >>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> >>>>>> ---
> >>>>>> net/ipv4/tcp_input.c | 12 ++++++------
> >>>>>> 1 file changed, 6 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>>>> index 672cbfbdcec1..c98de02a3c57 100644
> >>>>>> --- a/net/ipv4/tcp_input.c
> >>>>>> +++ b/net/ipv4/tcp_input.c
> >>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> >>>>>> }
> >>>>>>
> >>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
> >>>>>> -static void tcp_clamp_window(struct sock *sk)
> >>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
> >>>>>
> >>>>>
> >>>>> I am unsure about this one. truesize can be 1MB here, do we want that
> >>>>> in general ?
> >>>>
> >>>> I'm unsure either. But I can't think of a different approach?!? If the
> >>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> >>>> size to accept it, right?
> >>>
> >>> What I meant was :
> >>>
> >>> This is the generic point, accepting skb->truesize as additional input
> >>> here would make us more vulnerable, or we could risk other
> >>> regressions.
> >>
> >> Understood, thanks for the clarification.
> >>
> >>> The question is : why does MPTCP end up here in the first place.
> >>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
> >>
> >> I collected some more data. The issue happens even with plain TCP
> >> sockets[1].
> >>
> >> The relevant transfer is on top of the loopback device. The scaling_ratio
> >> rapidly grows to 254 - that is, `truesize` and `len` are very close.
> >>
> >> The stall happens when the receiver gets a packet with a slightly less
> >> 'efficient' layout (in the experiment at hand, len is 71424 and
> >> truesize is 72320), (almost) filling the receive window.
> >>
> >> On such input, tcp_clamp_window() shrinks the receiver buffer to the
> >> current rmem usage. The same happens on retransmissions until rcvbuf
> >> becomes 0.
> >>
> >> I *think* that catching only the !sk_rmem_alloc case would avoid the
> >> stall, but I think it's a bit 'late'.
> >
> > A packetdrill test here would help understanding your concern.
>
> I fear a complete working script would take a lot of time; let me
> try to sketch just the relevant part:
>
> # receiver state is:
> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
> # no OoO data, no memory pressure,
>
> # the incoming packet is in sequence
> +0 > P. 109297:172528(63232) ack 1
>
> With just the 0 rmem check in tcp_prune_queue(), that function will
> still invoke tcp_clamp_window(), which will shrink the receive buffer to
> 110592.
As long as an ACK is sent back with a smaller RWIN, I think this would
be reasonable in this case.
> tcp_collapse() can't make enough room and the incoming packet will be
> dropped. I think we should instead accept such a packet.
Only if it's not completely off-limits...
packetdrill test :
cat eric.pkt
// Test the calculation of receive window values by a bulk data receiver.
--mss=1000
// Set up config.
// Need to set tcp_rmem[1] to what tcp_fixup_rcvbuf() would have set to
// make sure the same kernel behavior after removing tcp_fixup_rcvbuf()
`../common/defaults.sh
../common/set_sysctls.py /proc/sys/net/ipv4/tcp_rmem="4096 131072 15728640"
`
// Create a socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
// Verify that the receive buffer is the tcp_rmem default.
+0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
// Establish a connection.
+.01 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
+0 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.01 < . 1:1(0) ack 1 win 457
+0 accept(3, ..., ...) = 4
// Verify that the receive buffer is the tcp_rmem default.
+0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0
// Check first outgoing window after SYN
+.01 write(4, ..., 1000) = 1000
+0 > P. 1:1001(1000) ack 1 win 251
// Phase 1: Data arrives but app doesn't read from the socket buffer.
+.01 < . 1:60001(60000) ack 1001 win 457
+0 > . 1001:1001(0) ack 60001 win 263
+.01 < . 60001:120001(60000) ack 1001 win 457
+0~+.04 > . 1001:1001(0) ack 120001 win 29
// Incoming packet has a too big skb->truesize, lets send a lower RWIN
+.01 < . 120001:127425(7424) ack 1001 win 457
+0~+.04 > . 1001:1001(0) ack 120001 win 0
// Reset to sysctls defaults
`/tmp/sysctl_restore_${PPID}.sh`
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 14:56 ` Paolo Abeni
2025-07-21 15:21 ` Eric Dumazet
@ 2025-07-21 15:27 ` Jakub Kicinski
2025-07-21 15:30 ` Jakub Kicinski
2025-07-21 16:50 ` Eric Dumazet
1 sibling, 2 replies; 18+ messages in thread
From: Jakub Kicinski @ 2025-07-21 15:27 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, netdev, Neal Cardwell, Kuniyuki Iwashima,
David S. Miller, David Ahern, Simon Horman, Matthieu Baerts
On Mon, 21 Jul 2025 16:56:06 +0200 Paolo Abeni wrote:
> >> I *think* that catching only the !sk_rmem_alloc case would avoid the
> >> stall, but I think it's a bit 'late'.
> >
> > A packetdrill test here would help understanding your concern.
>
> I fear a complete working script would take a lot of time; let me
> try to sketch just the relevant part:
>
> # receiver state is:
> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
> # no OoO data, no memory pressure,
>
> # the incoming packet is in sequence
> +0 > P. 109297:172528(63232) ack 1
>
> With just the 0 rmem check in tcp_prune_queue(), that function will
> still invoke tcp_clamp_window(), which will shrink the receive buffer to
> 110592.
> tcp_collapse() can't make enough room and the incoming packet will be
> dropped. I think we should instead accept such a packet.
>
> Side note: the above data are taken from an actual reproduction of the issue
>
> Please LMK if the above clarifies my doubt a bit, or if a full
> packetdrill is needed.
Not the first time we stumble on packetdrills for scaling ratio.
Solving it is probably outside the scope of this discussion but
I wonder what would be the best way to do it. My go-to is to
integrate packetdrill with netdevsim and have an option for netdevsim
to inflate truesize on demand. But perhaps there's a clever way we can
force something like tap to give us the ratio we desire. Other ideas?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 15:27 ` Jakub Kicinski
@ 2025-07-21 15:30 ` Jakub Kicinski
2025-07-21 16:50 ` Eric Dumazet
1 sibling, 0 replies; 18+ messages in thread
From: Jakub Kicinski @ 2025-07-21 15:30 UTC (permalink / raw)
To: Paolo Abeni
Cc: Eric Dumazet, netdev, Neal Cardwell, Kuniyuki Iwashima,
David S. Miller, David Ahern, Simon Horman, Matthieu Baerts
On Mon, 21 Jul 2025 08:27:28 -0700 Jakub Kicinski wrote:
> > With just the 0 rmem check in tcp_prune_queue(), that function will
> > still invoke tcp_clamp_window(), which will shrink the receive buffer to
> > 110592.
> > tcp_collapse() can't make enough room and the incoming packet will be
> > dropped. I think we should instead accept such a packet.
> >
> > Side note: the above data are taken from an actual reproduction of the issue
> >
> > Please LMK if the above clarifies my doubt a bit, or if a full
> > packetdrill is needed.
>
> Not the first time we stumble on packetdrills for scaling ratio.
> Solving it is probably outside the scope of this discussion but
> I wonder what would be the best way to do it. My go-to is to
> integrate packetdrill with netdevsim and have an option for netdevsim
> to inflate truesize on demand. But perhaps there's a clever way we can
> force something like tap to give us the ratio we desire. Other ideas?
FWIW didn't see Eric's reply before hitting send..
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 15:21 ` Eric Dumazet
@ 2025-07-21 16:17 ` Paolo Abeni
0 siblings, 0 replies; 18+ messages in thread
From: Paolo Abeni @ 2025-07-21 16:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts
On 7/21/25 5:21 PM, Eric Dumazet wrote:
> On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 7/21/25 3:52 PM, Eric Dumazet wrote:
>>> On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>> On 7/21/25 2:30 PM, Eric Dumazet wrote:
>>>>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
>>>>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>>>>>
>>>>>>>> The nipa CI is reporting frequent failures in the mptcp_connect
>>>>>>>> self-tests.
>>>>>>>>
>>>>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
>>>>>>>> actually plain TCP ones, as the fallback for the passive socket at
>>>>>>>> 2whs time causes the MPTCP listener to create a plain TCP socket.
>>>>>>>>
>>>>>>>> The transfer is stuck due to the receiver buffer being zero.
>>>>>>>> With the stronger check in place, tcp_clamp_window() can be invoked
>>>>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
>>>>>>>> will be zeroed, too.
>>>>>>>>
>>>>>>>> Additionally pass the current skb truesize to tcp_clamp_window(), so
>>>>>>>> that the helper can compute and use the actual limit enforced by the
>>>>>>>> stack.
>>>>>>>>
>>>>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
>>>>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>>>>>>> ---
>>>>>>>> net/ipv4/tcp_input.c | 12 ++++++------
>>>>>>>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>>>>> index 672cbfbdcec1..c98de02a3c57 100644
>>>>>>>> --- a/net/ipv4/tcp_input.c
>>>>>>>> +++ b/net/ipv4/tcp_input.c
>>>>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
>>>>>>>> }
>>>>>>>>
>>>>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
>>>>>>>> -static void tcp_clamp_window(struct sock *sk)
>>>>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
>>>>>>>
>>>>>>>
>>>>>>> I am unsure about this one. truesize can be 1MB here, do we want that
>>>>>>> in general ?
>>>>>>
>>>>>> I'm unsure either. But I can't think of a different approach?!? If the
>>>>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
>>>>>> size to accept it, right?
>>>>>
>>>>> What I meant was :
>>>>>
>>>>> This is the generic point, accepting skb->truesize as additional input
>>>>> here would make us more vulnerable, or we could risk other
>>>>> regressions.
>>>>
>>>> Understood, thanks for the clarification.
>>>>
>>>>> The question is : why does MPTCP end up here in the first place.
>>>>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
>>>>
>>>> I collected some more data. The issue happens even with plain TCP
>>>> sockets[1].
>>>>
>>>> The relevant transfer is on top of the loopback device. The scaling_ratio
>>>> rapidly grows to 254 - that is, `truesize` and `len` are very close.
>>>>
>>>> The stall happens when the receiver gets a packet with a slightly less
>>>> 'efficient' layout (in the experiment at hand, len is 71424 and
>>>> truesize is 72320), (almost) filling the receive window.
>>>>
>>>> On such input, tcp_clamp_window() shrinks the receiver buffer to the
>>>> current rmem usage. The same happens on retransmissions until rcvbuf
>>>> becomes 0.
>>>>
>>>> I *think* that catching only the !sk_rmem_alloc case would avoid the
>>>> stall, but I think it's a bit 'late'.
>>>
>>> A packetdrill test here would help understanding your concern.
>>
>> I fear a complete working script would take a lot of time; let me
>> try to sketch just the relevant part:
>>
>> # receiver state is:
>> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
>> # no OoO data, no memory pressure,
>>
>> # the incoming packet is in sequence
>> +0 > P. 109297:172528(63232) ack 1
>>
>> With just the 0 rmem check in tcp_prune_queue(), that function will
>> still invoke tcp_clamp_window(), which will shrink the receive buffer to
>> 110592.
>
> As long as an ACK is sent back with a smaller RWIN, I think this would
> be reasonable in this case.
I fear some possible regressions, as the sender will see some unexpected
drops, even on loopback and while not misbehaving.
But I tested your proposed code here, and AFAICS it solves the issue; I
could not spot anything suspicious so far.
So I'll send a v2 using that.
Thanks!
Paolo
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer
2025-07-21 15:27 ` Jakub Kicinski
2025-07-21 15:30 ` Jakub Kicinski
@ 2025-07-21 16:50 ` Eric Dumazet
1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2025-07-21 16:50 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Paolo Abeni, netdev, Neal Cardwell, Kuniyuki Iwashima,
David S. Miller, David Ahern, Simon Horman, Matthieu Baerts
On Mon, Jul 21, 2025 at 8:27 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 21 Jul 2025 16:56:06 +0200 Paolo Abeni wrote:
> > >> I *think* that catching only the !sk_rmem_alloc case would avoid the
> > >> stall, but I think it's a bit 'late'.
> > >
> > > A packetdrill test here would help understanding your concern.
> >
> > I fear a complete working script would take a lot of time; let me
> > try to sketch just the relevant part:
> >
> > # receiver state is:
> > # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
> > # no OoO data, no memory pressure,
> >
> > # the incoming packet is in sequence
> > +0 > P. 109297:172528(63232) ack 1
> >
> > With just the 0 rmem check in tcp_prune_queue(), that function will
> > still invoke tcp_clamp_window(), which will shrink the receive buffer to
> > 110592.
> > tcp_collapse() can't make enough room and the incoming packet will be
> > dropped. I think we should instead accept such a packet.
> >
> > Side note: the above data are taken from an actual reproduction of the issue
> >
> > Please LMK if the above clarifies my doubt a bit, or if a full
> > packetdrill is needed.
>
> Not the first time we stumble on packetdrills for scaling ratio.
> Solving it is probably outside the scope of this discussion but
> I wonder what would be the best way to do it. My go-to is to
> integrate packetdrill with netdevsim and have an option for netdevsim
> to inflate truesize on demand. But perhaps there's a clever way we can
> force something like tap to give us the ratio we desire. Other ideas?
Adding a TUN option (ioctl) to be able to increase skb->truesize by a
given amount / percentage is doable I think.
Then in packetdrill we could add the ability to use this option for a
given scenario or packet.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread
Thread overview: 18+ messages
2025-07-18 17:25 [PATCH net-next 0/2] tcp: a couple of fixes Paolo Abeni
2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni
2025-07-18 22:15 ` Matthieu Baerts
2025-07-21 8:04 ` Eric Dumazet
2025-07-21 10:50 ` Paolo Abeni
2025-07-21 12:30 ` Eric Dumazet
2025-07-21 12:41 ` Eric Dumazet
2025-07-21 13:32 ` Paolo Abeni
2025-07-21 13:52 ` Eric Dumazet
2025-07-21 14:56 ` Paolo Abeni
2025-07-21 15:21 ` Eric Dumazet
2025-07-21 16:17 ` Paolo Abeni
2025-07-21 15:27 ` Jakub Kicinski
2025-07-21 15:30 ` Jakub Kicinski
2025-07-21 16:50 ` Eric Dumazet
2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni
2025-07-18 22:16 ` Matthieu Baerts
2025-07-21 8:28 ` Eric Dumazet