* [PATCH net-next 0/2] tcp: a couple of fixes @ 2025-07-18 17:25 Paolo Abeni 2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni 2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni 0 siblings, 2 replies; 18+ messages in thread From: Paolo Abeni @ 2025-07-18 17:25 UTC (permalink / raw) To: netdev Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts This series includes a couple of follow-ups to the recent TCP receiver changes, addressing issues outlined by the nipa CI and the mptcp self-tests. Note that although the affected self-tests are MPTCP ones, the issues are really in the TCP code; see patch 1 for the details. Paolo Abeni (2): tcp: do not set a zero size receive buffer tcp: do not increment BeyondWindow MIB for old seq net/ipv4/tcp_input.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) -- 2.50.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-18 17:25 [PATCH net-next 0/2] tcp: a couple of fixes Paolo Abeni @ 2025-07-18 17:25 ` Paolo Abeni 2025-07-18 22:15 ` Matthieu Baerts 2025-07-21 8:04 ` Eric Dumazet 2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni 1 sibling, 2 replies; 18+ messages in thread From: Paolo Abeni @ 2025-07-18 17:25 UTC (permalink / raw) To: netdev Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts The nipa CI is reporting frequent failures in the mptcp_connect self-tests. In the failing scenarios (TCP -> MPTCP) the involved sockets are actually plain TCP ones, as the fallback for passive sockets at 2whs time causes the MPTCP listener to actually create a plain TCP socket. The transfer is stuck due to the receive buffer being zero. With the stronger check in place, tcp_clamp_window() can be invoked while the TCP socket has sk_rmem_alloc == 0, and the receive buffer will be zeroed, too. Also pass the current skb truesize to tcp_clamp_window(), so that the helper can compute and use the actual limit enforced by the stack. Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- net/ipv4/tcp_input.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 672cbfbdcec1..c98de02a3c57 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) } /* 4. Recalculate window clamp after socket hit its memory bounds. */ -static void tcp_clamp_window(struct sock *sk) +static void tcp_clamp_window(struct sock *sk, int truesize) { struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); struct net *net = sock_net(sk); - int rmem2; + int rmem2, needed; icsk->icsk_ack.quick = 0; rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]); + needed = atomic_read(&sk->sk_rmem_alloc) + truesize; if (sk->sk_rcvbuf < rmem2 && !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) && !tcp_under_memory_pressure(sk) && sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) { - WRITE_ONCE(sk->sk_rcvbuf, - min(atomic_read(&sk->sk_rmem_alloc), rmem2)); + WRITE_ONCE(sk->sk_rcvbuf, min(needed, rmem2)); } - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + if (needed > sk->sk_rcvbuf) tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss); } @@ -5552,7 +5552,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb) NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED); if (!tcp_can_ingest(sk, in_skb)) - tcp_clamp_window(sk); + tcp_clamp_window(sk, in_skb->truesize); else if (tcp_under_memory_pressure(sk)) tcp_adjust_rcv_ssthresh(sk); -- 2.50.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
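For illustration, the failure mode described in the patch can be reduced to a few lines of arithmetic. The sketch below is a minimal userspace model, not kernel code: the sysctl value, the helper names and the 72320 truesize are simplified stand-ins taken from the data reported later in this thread.

/* Minimal userspace model of the tcp_clamp_window() change; not kernel code. */
#include <stdio.h>

#define RMEM2 131072	/* stand-in for net.ipv4.tcp_rmem[2] */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Before the patch: clamp sk_rcvbuf to the current queue usage only. */
static int clamp_before(int rmem_alloc)
{
	return min_int(rmem_alloc, RMEM2);
}

/* After the patch: also account for the skb currently being ingested. */
static int clamp_after(int rmem_alloc, int truesize)
{
	return min_int(rmem_alloc + truesize, RMEM2);
}

int main(void)
{
	/* The fallback TCP socket is pruned while its receive queue is empty. */
	printf("before: rcvbuf = %d\n", clamp_before(0));	/* 0 -> stall */
	printf("after : rcvbuf = %d\n", clamp_after(0, 72320));	/* 72320 */
	return 0;
}

With sk_rmem_alloc == 0 the old clamp computes min(0, rmem2) == 0, which is exactly the zero-sized receive buffer the commit message describes; adding the in-flight skb truesize keeps the buffer large enough to ever accept that skb.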
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni @ 2025-07-18 22:15 ` Matthieu Baerts 2025-07-21 8:04 ` Eric Dumazet 1 sibling, 0 replies; 18+ messages in thread From: Matthieu Baerts @ 2025-07-18 22:15 UTC (permalink / raw) To: Paolo Abeni, Jakub Kicinski Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Simon Horman, netdev Hi Paolo, On 18/07/2025 19:25, Paolo Abeni wrote: > The nipa CI is reporting frequent failures in the mptcp_connect > self-tests. > > In the failing scenarios (TCP -> MPTCP) the involved sockets are > actually plain TCP ones, as fallback for passive socket at 2whs > time cause the MPTCP listener to actually create a TCP socket. > > The transfer is stuck due to the receiver buffer being zero. > With the stronger check in place, tcp_clamp_window() can be invoked > while the TCP socket has sk_rmem_alloc == 0, and the receive buffer > will be zeroed, too. > > Pass to tcp_clamp_window() even the current skb truesize, so that > such helper could compute and use the actual limit enforced by > the stack. Thank you for this fix! Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> I guess we can drop "mptcp-connect-sh" from the ignored tests then :) Cheers, Matt -- Sponsored by the NGI0 Core fund. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni 2025-07-18 22:15 ` Matthieu Baerts @ 2025-07-21 8:04 ` Eric Dumazet 2025-07-21 10:50 ` Paolo Abeni 1 sibling, 1 reply; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 8:04 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: > > The nipa CI is reporting frequent failures in the mptcp_connect > self-tests. > > In the failing scenarios (TCP -> MPTCP) the involved sockets are > actually plain TCP ones, as fallback for passive socket at 2whs > time cause the MPTCP listener to actually create a TCP socket. > > The transfer is stuck due to the receiver buffer being zero. > With the stronger check in place, tcp_clamp_window() can be invoked > while the TCP socket has sk_rmem_alloc == 0, and the receive buffer > will be zeroed, too. > > Pass to tcp_clamp_window() even the current skb truesize, so that > such helper could compute and use the actual limit enforced by > the stack. > > Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > net/ipv4/tcp_input.c | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 672cbfbdcec1..c98de02a3c57 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) > } > > /* 4. Recalculate window clamp after socket hit its memory bounds. */ > -static void tcp_clamp_window(struct sock *sk) > +static void tcp_clamp_window(struct sock *sk, int truesize) I am unsure about this one. truesize can be 1MB here, do we want that in general ? I am unsure why MPTCP ends up with this path. LINUX_MIB_PRUNECALLED being called in normal MPTCP operations seems strange to me. > { > struct tcp_sock *tp = tcp_sk(sk); > struct inet_connection_sock *icsk = inet_csk(sk); > struct net *net = sock_net(sk); > - int rmem2; > + int rmem2, needed; > > icsk->icsk_ack.quick = 0; > rmem2 = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]); > + needed = atomic_read(&sk->sk_rmem_alloc) + truesize; > > if (sk->sk_rcvbuf < rmem2 && > !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) && > !tcp_under_memory_pressure(sk) && > sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) { > - WRITE_ONCE(sk->sk_rcvbuf, > - min(atomic_read(&sk->sk_rmem_alloc), rmem2)); > + WRITE_ONCE(sk->sk_rcvbuf, min(needed, rmem2)); > } > - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > + if (needed > sk->sk_rcvbuf) > tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss); > } > > @@ -5552,7 +5552,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb) > NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED); > > if (!tcp_can_ingest(sk, in_skb)) > - tcp_clamp_window(sk); > + tcp_clamp_window(sk, in_skb->truesize); > else if (tcp_under_memory_pressure(sk)) > tcp_adjust_rcv_ssthresh(sk); > > -- > 2.50.0 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 8:04 ` Eric Dumazet @ 2025-07-21 10:50 ` Paolo Abeni 2025-07-21 12:30 ` Eric Dumazet 0 siblings, 1 reply; 18+ messages in thread From: Paolo Abeni @ 2025-07-21 10:50 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On 7/21/25 10:04 AM, Eric Dumazet wrote: > On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: >> >> The nipa CI is reporting frequent failures in the mptcp_connect >> self-tests. >> >> In the failing scenarios (TCP -> MPTCP) the involved sockets are >> actually plain TCP ones, as fallback for passive socket at 2whs >> time cause the MPTCP listener to actually create a TCP socket. >> >> The transfer is stuck due to the receiver buffer being zero. >> With the stronger check in place, tcp_clamp_window() can be invoked >> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer >> will be zeroed, too. >> >> Pass to tcp_clamp_window() even the current skb truesize, so that >> such helper could compute and use the actual limit enforced by >> the stack. >> >> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") >> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >> --- >> net/ipv4/tcp_input.c | 12 ++++++------ >> 1 file changed, 6 insertions(+), 6 deletions(-) >> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >> index 672cbfbdcec1..c98de02a3c57 100644 >> --- a/net/ipv4/tcp_input.c >> +++ b/net/ipv4/tcp_input.c >> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) >> } >> >> /* 4. Recalculate window clamp after socket hit its memory bounds. */ >> -static void tcp_clamp_window(struct sock *sk) >> +static void tcp_clamp_window(struct sock *sk, int truesize) > > > I am unsure about this one. truesize can be 1MB here, do we want that > in general ? I'm unsure either. But I can't think of a different approach?!? If the incoming truesize is 1M the socket should allow for at least 1M rcvbuf size to accept it, right? > I am unsure why MPTCP ends up with this path. > > LINUX_MIB_PRUNECALLED being called in normal MPTCP operations seems > strange to me. The code path hit: !tcp_can_ingest(sk, skb) in tcp_try_rmem_schedule(). Perhaps the scaling_ratio is a bit optimistic due to unfortunate packet layouts? Let me check if I can grab more data. Thanks, Paolo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 10:50 ` Paolo Abeni @ 2025-07-21 12:30 ` Eric Dumazet 2025-07-21 12:41 ` Eric Dumazet 2025-07-21 13:32 ` Paolo Abeni 0 siblings, 2 replies; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 12:30 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On 7/21/25 10:04 AM, Eric Dumazet wrote: > > On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: > >> > >> The nipa CI is reporting frequent failures in the mptcp_connect > >> self-tests. > >> > >> In the failing scenarios (TCP -> MPTCP) the involved sockets are > >> actually plain TCP ones, as fallback for passive socket at 2whs > >> time cause the MPTCP listener to actually create a TCP socket. > >> > >> The transfer is stuck due to the receiver buffer being zero. > >> With the stronger check in place, tcp_clamp_window() can be invoked > >> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer > >> will be zeroed, too. > >> > >> Pass to tcp_clamp_window() even the current skb truesize, so that > >> such helper could compute and use the actual limit enforced by > >> the stack. > >> > >> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > >> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > >> --- > >> net/ipv4/tcp_input.c | 12 ++++++------ > >> 1 file changed, 6 insertions(+), 6 deletions(-) > >> > >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > >> index 672cbfbdcec1..c98de02a3c57 100644 > >> --- a/net/ipv4/tcp_input.c > >> +++ b/net/ipv4/tcp_input.c > >> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) > >> } > >> > >> /* 4. Recalculate window clamp after socket hit its memory bounds. */ > >> -static void tcp_clamp_window(struct sock *sk) > >> +static void tcp_clamp_window(struct sock *sk, int truesize) > > > > > > I am unsure about this one. truesize can be 1MB here, do we want that > > in general ? > > I'm unsure either. But I can't think of a different approach?!? If the > incoming truesize is 1M the socket should allow for at least 1M rcvbuf > size to accept it, right? What I meant was : This is the generic point, accepting skb->truesize as additional input here would make us more vulnerable, or we could risk other regressions. The question is : why does MPTCP end up here in the first place. Perhaps an older issue with an incorrectly sized sk_rcvbuf ? Or maybe the test about the receive queue being empty (currently done in tcp_data_queue()) should be moved to a more strategic place. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 12:30 ` Eric Dumazet @ 2025-07-21 12:41 ` Eric Dumazet 2025-07-21 13:32 ` Paolo Abeni 1 sibling, 0 replies; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 12:41 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On Mon, Jul 21, 2025 at 5:30 AM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: > > > > On 7/21/25 10:04 AM, Eric Dumazet wrote: > > > On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: > > >> > > >> The nipa CI is reporting frequent failures in the mptcp_connect > > >> self-tests. > > >> > > >> In the failing scenarios (TCP -> MPTCP) the involved sockets are > > >> actually plain TCP ones, as fallback for passive socket at 2whs > > >> time cause the MPTCP listener to actually create a TCP socket. > > >> > > >> The transfer is stuck due to the receiver buffer being zero. > > >> With the stronger check in place, tcp_clamp_window() can be invoked > > >> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer > > >> will be zeroed, too. > > >> > > >> Pass to tcp_clamp_window() even the current skb truesize, so that > > >> such helper could compute and use the actual limit enforced by > > >> the stack. > > >> > > >> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > > >> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > > >> --- > > >> net/ipv4/tcp_input.c | 12 ++++++------ > > >> 1 file changed, 6 insertions(+), 6 deletions(-) > > >> > > >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > >> index 672cbfbdcec1..c98de02a3c57 100644 > > >> --- a/net/ipv4/tcp_input.c > > >> +++ b/net/ipv4/tcp_input.c > > >> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) > > >> } > > >> > > >> /* 4. Recalculate window clamp after socket hit its memory bounds. */ > > >> -static void tcp_clamp_window(struct sock *sk) > > >> +static void tcp_clamp_window(struct sock *sk, int truesize) > > > > > > > > > I am unsure about this one. truesize can be 1MB here, do we want that > > > in general ? > > > > I'm unsure either. But I can't think of a different approach?!? If the > > incoming truesize is 1M the socket should allow for at least 1M rcvbuf > > size to accept it, right? > > What I meant was : > > This is the generic point, accepting skb->truesize as additional input > here would make us more vulnerable, or we could risk other > regressions. > > The question is : why does MPTCP end up here in the first place. > Perhaps an older issue with an incorrectly sized sk_rcvbuf ? > > Or maybe the test about the receive queue being empty (currently done > in tcp_data_queue()) should be moved to a more strategic place. If the MPTCP part can not be easily resolved, perhaps : diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 672cbfbdcec1de22a5b1494d365863303271d222..81b6d37708120632d16a50892442ea04779cc3a4 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5549,6 +5549,10 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb) { struct tcp_sock *tp = tcp_sk(sk); + /* Do nothing if our queues are empty. */ + if (!atomic_read(&sk->sk_rmem_alloc)) + return -1; + NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED); if (!tcp_can_ingest(sk, in_skb)) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 12:30 ` Eric Dumazet 2025-07-21 12:41 ` Eric Dumazet @ 2025-07-21 13:32 ` Paolo Abeni 2025-07-21 13:52 ` Eric Dumazet 1 sibling, 1 reply; 18+ messages in thread From: Paolo Abeni @ 2025-07-21 13:32 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On 7/21/25 2:30 PM, Eric Dumazet wrote: > On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: >> On 7/21/25 10:04 AM, Eric Dumazet wrote: >>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: >>>> >>>> The nipa CI is reporting frequent failures in the mptcp_connect >>>> self-tests. >>>> >>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are >>>> actually plain TCP ones, as fallback for passive socket at 2whs >>>> time cause the MPTCP listener to actually create a TCP socket. >>>> >>>> The transfer is stuck due to the receiver buffer being zero. >>>> With the stronger check in place, tcp_clamp_window() can be invoked >>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer >>>> will be zeroed, too. >>>> >>>> Pass to tcp_clamp_window() even the current skb truesize, so that >>>> such helper could compute and use the actual limit enforced by >>>> the stack. >>>> >>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") >>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >>>> --- >>>> net/ipv4/tcp_input.c | 12 ++++++------ >>>> 1 file changed, 6 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>>> index 672cbfbdcec1..c98de02a3c57 100644 >>>> --- a/net/ipv4/tcp_input.c >>>> +++ b/net/ipv4/tcp_input.c >>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) >>>> } >>>> >>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */ >>>> -static void tcp_clamp_window(struct sock *sk) >>>> +static void tcp_clamp_window(struct sock *sk, int truesize) >>> >>> >>> I am unsure about this one. truesize can be 1MB here, do we want that >>> in general ? >> >> I'm unsure either. But I can't think of a different approach?!? If the >> incoming truesize is 1M the socket should allow for at least 1M rcvbuf >> size to accept it, right? > > What I meant was : > > This is the generic point, accepting skb->truesize as additional input > here would make us more vulnerable, or we could risk other > regressions. Understood, thanks for the clarification. > The question is : why does MPTCP end up here in the first place. > Perhaps an older issue with an incorrectly sized sk_rcvbuf ? I collected some more data. The issue happens even with plain TCP sockets[1]. The relevant transfer is on top of the loopback device. The scaling_ratio rapidly grows to 254 - that is, `truesize` and `len` are very close. The stall happens when the receiver gets a packet with a slightly less 'efficient' layout (in the experiment I have handy, len is 71424, truesize 72320), (almost) filling the receiver window. On such input, tcp_clamp_window() shrinks the receive buffer to the current rmem usage. The same happens on retransmissions, until rcvbuf becomes 0. I *think* that catching only the !sk_rmem_alloc case would avoid the stall, but I think it's a bit 'late'. I'm unsure if we could prevent/forbid 'too high' values of scaling_ratio? (also I'm unsure where to draw the line exactly.)
Cheers, Paolo [1] You can run the relevant test by adding '-t' to the mptcp_connect.sh command line, but it will take a lot of time to run the 10-20 iterations I need to observe the issue. To make it faster I manually trimmed the test-cases that are not relevant. ^ permalink raw reply [flat|nested] 18+ messages in thread
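The numbers reported above fit together as follows. This is a simplified userspace model of the window-from-space arithmetic: the shift constant is intended to match TCP_RMEM_TO_WIN_SCALE in the kernel sources, everything else is plugged in from the reported data, and the exact ratio depends on rounding and on the skbs actually received.

/* Simplified model of the scaling_ratio arithmetic; not kernel code. */
#include <stdio.h>

#define WIN_SCALE 8	/* assumed to match TCP_RMEM_TO_WIN_SCALE */

static unsigned int win_from_space(unsigned int space, unsigned int ratio)
{
	return ((unsigned long long)space * ratio) >> WIN_SCALE;
}

int main(void)
{
	unsigned int len = 71424, truesize = 72320;	/* reported skb layout */
	unsigned int rcvbuf = 174650, rmem = 110592;	/* reported socket state */
	unsigned int ratio = (len << WIN_SCALE) / truesize;

	printf("ratio  : %u\n", ratio);			/* ~252-254 */
	printf("window : %u\n", win_from_space(rcvbuf, ratio));
	/* The queued data plus the new skb no longer fit in rcvbuf: */
	printf("needed : %u > rcvbuf %u\n", rmem + truesize, rcvbuf);
	return 0;
}

Once the prune path fires, the old clamp shrinks rcvbuf to rmem (110592); every retransmission of the same large skb repeats the shrink, which is the walk down to zero described above.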
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 13:32 ` Paolo Abeni @ 2025-07-21 13:52 ` Eric Dumazet 2025-07-21 14:56 ` Paolo Abeni 0 siblings, 1 reply; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 13:52 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On 7/21/25 2:30 PM, Eric Dumazet wrote: > > On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: > >> On 7/21/25 10:04 AM, Eric Dumazet wrote: > >>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: > >>>> > >>>> The nipa CI is reporting frequent failures in the mptcp_connect > >>>> self-tests. > >>>> > >>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are > >>>> actually plain TCP ones, as fallback for passive socket at 2whs > >>>> time cause the MPTCP listener to actually create a TCP socket. > >>>> > >>>> The transfer is stuck due to the receiver buffer being zero. > >>>> With the stronger check in place, tcp_clamp_window() can be invoked > >>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer > >>>> will be zeroed, too. > >>>> > >>>> Pass to tcp_clamp_window() even the current skb truesize, so that > >>>> such helper could compute and use the actual limit enforced by > >>>> the stack. > >>>> > >>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > >>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > >>>> --- > >>>> net/ipv4/tcp_input.c | 12 ++++++------ > >>>> 1 file changed, 6 insertions(+), 6 deletions(-) > >>>> > >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > >>>> index 672cbfbdcec1..c98de02a3c57 100644 > >>>> --- a/net/ipv4/tcp_input.c > >>>> +++ b/net/ipv4/tcp_input.c > >>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) > >>>> } > >>>> > >>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */ > >>>> -static void tcp_clamp_window(struct sock *sk) > >>>> +static void tcp_clamp_window(struct sock *sk, int truesize) > >>> > >>> > >>> I am unsure about this one. truesize can be 1MB here, do we want that > >>> in general ? > >> > >> I'm unsure either. But I can't think of a different approach?!? If the > >> incoming truesize is 1M the socket should allow for at least 1M rcvbuf > >> size to accept it, right? > > > > What I meant was : > > > > This is the generic point, accepting skb->truesize as additional input > > here would make us more vulnerable, or we could risk other > > regressions. > > Understood, thanks for the clarification. > > > The question is : why does MPTCP end up here in the first place. > > Perhaps an older issue with an incorrectly sized sk_rcvbuf ? > > I collected a few more data. The issue happens even with plain TCP > sockets[1]. > > The relevant transfer is on top of the loopback device. The scaling_rate > rapidly grows to 254 - that is `truesize` and `len` are very near. > > The stall happens when the received get in a packet with a slightly less > 'efficient' layout (in the experiment I have handy len is 71424, > truesize 72320) (almost) filling the receiver window. > > On such input, tcp_clamp_window() shrinks the receiver buffer to the > current rmem usage. The same happens on retransmissions until rcvbuf > becomes 0. > > I *think* that catching only the !sk_rmem_alloc case would avoid the > stall, but I think it's a bit 'late'. 
A packetdrill test here would help understanding your concern. > I'm unsure if we could > preventing/forbidding 'too high' values of scaling_rate? (also I'm > unsure where to draw the line exactly. Indeed we need to account for a possible variation (ie reduction) of skb->len/skb->truesize ratio in future packets. Note that whatever conservative change we make, it will always be possible to feed packets until RWIN 0 is sent back, eventually after a bump caused by a prune operation. Imagine wscale is 8, and a prior RWIN 1 was sent. Normally we should be able to receive a packet with 256 bytes of payload, but typical skb->truesize will be 256+512+4096 (for a NIC driver using 4K pages). This is one of the reasons we force wscale to 12 in Google DC :) > > Cheers, > > Paolo > > > [1] You can run the relevant test by adding '-t' on the mptcp_connect.sh > command line, but it will take a lot of time to run the 10-20 iterations > I need to observe the issue. To make it faster I manually trimmed the > not relevant test-cases. > ^ permalink raw reply [flat|nested] 18+ messages in thread
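Eric's wscale point, in numbers. A trivial sketch: the truesize breakdown 256 + 512 + 4096 is the illustrative figure from the message above (payload, skb overhead, one 4K driver page), not a fixed property of any driver.

/* Why a tiny advertised window does not bound the memory charged. */
#include <stdio.h>

int main(void)
{
	unsigned int wscale = 8;
	unsigned int rwin_field = 1;			/* RWIN 1 on the wire */
	unsigned int window = rwin_field << wscale;	/* 256 payload bytes  */
	unsigned int truesize = 256 + 512 + 4096;	/* illustrative total */

	printf("advertised window: %u bytes\n", window);
	printf("memory charged   : %u bytes (~%ux the window)\n",
	       truesize, truesize / window);
	return 0;
}

So even a fully conservative receiver can be made to queue roughly 19 times more memory than it advertised, which is why no clamp heuristic can fully prevent reaching a zero advertised window.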
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 13:52 ` Eric Dumazet @ 2025-07-21 14:56 ` Paolo Abeni 2025-07-21 15:21 ` Eric Dumazet 2025-07-21 15:27 ` Jakub Kicinski 0 siblings, 2 replies; 18+ messages in thread From: Paolo Abeni @ 2025-07-21 14:56 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On 7/21/25 3:52 PM, Eric Dumazet wrote: > On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote: >> On 7/21/25 2:30 PM, Eric Dumazet wrote: >>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: >>>> On 7/21/25 10:04 AM, Eric Dumazet wrote: >>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: >>>>>> >>>>>> The nipa CI is reporting frequent failures in the mptcp_connect >>>>>> self-tests. >>>>>> >>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are >>>>>> actually plain TCP ones, as fallback for passive socket at 2whs >>>>>> time cause the MPTCP listener to actually create a TCP socket. >>>>>> >>>>>> The transfer is stuck due to the receiver buffer being zero. >>>>>> With the stronger check in place, tcp_clamp_window() can be invoked >>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer >>>>>> will be zeroed, too. >>>>>> >>>>>> Pass to tcp_clamp_window() even the current skb truesize, so that >>>>>> such helper could compute and use the actual limit enforced by >>>>>> the stack. >>>>>> >>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") >>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >>>>>> --- >>>>>> net/ipv4/tcp_input.c | 12 ++++++------ >>>>>> 1 file changed, 6 insertions(+), 6 deletions(-) >>>>>> >>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>>>>> index 672cbfbdcec1..c98de02a3c57 100644 >>>>>> --- a/net/ipv4/tcp_input.c >>>>>> +++ b/net/ipv4/tcp_input.c >>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) >>>>>> } >>>>>> >>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */ >>>>>> -static void tcp_clamp_window(struct sock *sk) >>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize) >>>>> >>>>> >>>>> I am unsure about this one. truesize can be 1MB here, do we want that >>>>> in general ? >>>> >>>> I'm unsure either. But I can't think of a different approach?!? If the >>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf >>>> size to accept it, right? >>> >>> What I meant was : >>> >>> This is the generic point, accepting skb->truesize as additional input >>> here would make us more vulnerable, or we could risk other >>> regressions. >> >> Understood, thanks for the clarification. >> >>> The question is : why does MPTCP end up here in the first place. >>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ? >> >> I collected a few more data. The issue happens even with plain TCP >> sockets[1]. >> >> The relevant transfer is on top of the loopback device. The scaling_rate >> rapidly grows to 254 - that is `truesize` and `len` are very near. >> >> The stall happens when the received get in a packet with a slightly less >> 'efficient' layout (in the experiment I have handy len is 71424, >> truesize 72320) (almost) filling the receiver window. >> >> On such input, tcp_clamp_window() shrinks the receiver buffer to the >> current rmem usage. The same happens on retransmissions until rcvbuf >> becomes 0. 
>> >> I *think* that catching only the !sk_rmem_alloc case would avoid the >> stall, but I think it's a bit 'late'. > > A packetdrill test here would help understanding your concern. I fear a complete working script would take a lot of time, let me try to sketch just the relevant part: # receiver state is: # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232 # no OoO data, no memory pressure, # the incoming packet is in sequence +0 < P. 109297:172529(63232) ack 1 With just the 0 rmem check in tcp_prune_queue(), that function will still invoke tcp_clamp_window(), which will shrink the receive buffer to 110592. tcp_collapse() can't make enough room and the incoming packet will be dropped. I think we should instead accept such a packet. Side note: the above data are taken from an actual reproduction of the issue. Please LMK if the above clarifies my doubt a bit, or if a full pktdrill is needed. Thanks, Paolo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 14:56 ` Paolo Abeni @ 2025-07-21 15:21 ` Eric Dumazet 2025-07-21 16:17 ` Paolo Abeni 2025-07-21 15:27 ` Jakub Kicinski 1 sibling, 1 reply; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 15:21 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@redhat.com> wrote: > > On 7/21/25 3:52 PM, Eric Dumazet wrote: > > On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote: > >> On 7/21/25 2:30 PM, Eric Dumazet wrote: > >>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: > >>>> On 7/21/25 10:04 AM, Eric Dumazet wrote: > >>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: > >>>>>> > >>>>>> The nipa CI is reporting frequent failures in the mptcp_connect > >>>>>> self-tests. > >>>>>> > >>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are > >>>>>> actually plain TCP ones, as fallback for passive socket at 2whs > >>>>>> time cause the MPTCP listener to actually create a TCP socket. > >>>>>> > >>>>>> The transfer is stuck due to the receiver buffer being zero. > >>>>>> With the stronger check in place, tcp_clamp_window() can be invoked > >>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer > >>>>>> will be zeroed, too. > >>>>>> > >>>>>> Pass to tcp_clamp_window() even the current skb truesize, so that > >>>>>> such helper could compute and use the actual limit enforced by > >>>>>> the stack. > >>>>>> > >>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > >>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> > >>>>>> --- > >>>>>> net/ipv4/tcp_input.c | 12 ++++++------ > >>>>>> 1 file changed, 6 insertions(+), 6 deletions(-) > >>>>>> > >>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > >>>>>> index 672cbfbdcec1..c98de02a3c57 100644 > >>>>>> --- a/net/ipv4/tcp_input.c > >>>>>> +++ b/net/ipv4/tcp_input.c > >>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) > >>>>>> } > >>>>>> > >>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */ > >>>>>> -static void tcp_clamp_window(struct sock *sk) > >>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize) > >>>>> > >>>>> > >>>>> I am unsure about this one. truesize can be 1MB here, do we want that > >>>>> in general ? > >>>> > >>>> I'm unsure either. But I can't think of a different approach?!? If the > >>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf > >>>> size to accept it, right? > >>> > >>> What I meant was : > >>> > >>> This is the generic point, accepting skb->truesize as additional input > >>> here would make us more vulnerable, or we could risk other > >>> regressions. > >> > >> Understood, thanks for the clarification. > >> > >>> The question is : why does MPTCP end up here in the first place. > >>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ? > >> > >> I collected a few more data. The issue happens even with plain TCP > >> sockets[1]. > >> > >> The relevant transfer is on top of the loopback device. The scaling_rate > >> rapidly grows to 254 - that is `truesize` and `len` are very near. 
> >> > >> The stall happens when the received get in a packet with a slightly less > >> 'efficient' layout (in the experiment I have handy len is 71424, > >> truesize 72320) (almost) filling the receiver window. > >> > >> On such input, tcp_clamp_window() shrinks the receiver buffer to the > >> current rmem usage. The same happens on retransmissions until rcvbuf > >> becomes 0. > >> > >> I *think* that catching only the !sk_rmem_alloc case would avoid the > >> stall, but I think it's a bit 'late'. > > > > A packetdrill test here would help understanding your concern. > > I fear like a complete working script would take a lot of time, let me > try to sketch just the relevant part: > > # receiver state is: > # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232 > # no OoO data, no memory pressure, > > # the incoming packet is in sequence > +0 > P. 109297:172528(63232) ack 1 > > With just the 0 rmem check in tcp_prune_queue(), such function will > still invoke tcp_clamp_window() that will shrink the receive buffer to > 110592. As long as an ACK is sent back with a smaller RWIN, I think this would be reasonable in this case. > tcp_collapse() can't make enough room and the incoming packet will be > dropped. I think we should instead accept such packet. Only if not completely off-the-limits... packetdrill test : cat eric.pkt // Test the calculation of receive window values by a bulk data receiver. --mss=1000 // Set up config. // Need to set tcp_rmem[1] to what tcp_fixup_rcvbuf() would have set to // make sure the same kernel behavior after removing tcp_fixup_rcvbuf() `../common/defaults.sh ../common/set_sysctls.py /proc/sys/net/ipv4/tcp_rmem="4096 131072 15728640" ` // Create a socket. 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 // Verify that the receive buffer is the tcp_rmem default. +0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // Establish a connection. +.01 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6> +0 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.01 < . 1:1(0) ack 1 win 457 +0 accept(3, ..., ...) = 4 // Verify that the receive buffer is the tcp_rmem default. +0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0 // Check first outgoing window after SYN +.01 write(4, ..., 1000) = 1000 +0 > P. 1:1001(1000) ack 1 win 251 // Phase 1: Data arrives but app doesn't read from the socket buffer. +.01 < . 1:60001(60000) ack 1001 win 457 +0 > . 1001:1001(0) ack 60001 win 263 +.01 < . 60001:120001(60000) ack 1001 win 457 +0~+.04 > . 1001:1001(0) ack 120001 win 29 // Incoming packet has a too big skb->truesize, lets send a lower RWIN +.01 < . 120001:127425(7424) ack 1001 win 457 +0~+.04 > . 1001:1001(0) ack 120001 win 0 // Reset to sysctls defaults `/tmp/sysctl_restore_${PPID}.sh` ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 15:21 ` Eric Dumazet @ 2025-07-21 16:17 ` Paolo Abeni 0 siblings, 0 replies; 18+ messages in thread From: Paolo Abeni @ 2025-07-21 16:17 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On 7/21/25 5:21 PM, Eric Dumazet wrote: > On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@redhat.com> wrote: >> On 7/21/25 3:52 PM, Eric Dumazet wrote: >>> On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@redhat.com> wrote: >>>> On 7/21/25 2:30 PM, Eric Dumazet wrote: >>>>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@redhat.com> wrote: >>>>>> On 7/21/25 10:04 AM, Eric Dumazet wrote: >>>>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: >>>>>>>> >>>>>>>> The nipa CI is reporting frequent failures in the mptcp_connect >>>>>>>> self-tests. >>>>>>>> >>>>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are >>>>>>>> actually plain TCP ones, as fallback for passive socket at 2whs >>>>>>>> time cause the MPTCP listener to actually create a TCP socket. >>>>>>>> >>>>>>>> The transfer is stuck due to the receiver buffer being zero. >>>>>>>> With the stronger check in place, tcp_clamp_window() can be invoked >>>>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer >>>>>>>> will be zeroed, too. >>>>>>>> >>>>>>>> Pass to tcp_clamp_window() even the current skb truesize, so that >>>>>>>> such helper could compute and use the actual limit enforced by >>>>>>>> the stack. >>>>>>>> >>>>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") >>>>>>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com> >>>>>>>> --- >>>>>>>> net/ipv4/tcp_input.c | 12 ++++++------ >>>>>>>> 1 file changed, 6 insertions(+), 6 deletions(-) >>>>>>>> >>>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >>>>>>>> index 672cbfbdcec1..c98de02a3c57 100644 >>>>>>>> --- a/net/ipv4/tcp_input.c >>>>>>>> +++ b/net/ipv4/tcp_input.c >>>>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk) >>>>>>>> } >>>>>>>> >>>>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */ >>>>>>>> -static void tcp_clamp_window(struct sock *sk) >>>>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize) >>>>>>> >>>>>>> >>>>>>> I am unsure about this one. truesize can be 1MB here, do we want that >>>>>>> in general ? >>>>>> >>>>>> I'm unsure either. But I can't think of a different approach?!? If the >>>>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf >>>>>> size to accept it, right? >>>>> >>>>> What I meant was : >>>>> >>>>> This is the generic point, accepting skb->truesize as additional input >>>>> here would make us more vulnerable, or we could risk other >>>>> regressions. >>>> >>>> Understood, thanks for the clarification. >>>> >>>>> The question is : why does MPTCP end up here in the first place. >>>>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ? >>>> >>>> I collected a few more data. The issue happens even with plain TCP >>>> sockets[1]. >>>> >>>> The relevant transfer is on top of the loopback device. The scaling_rate >>>> rapidly grows to 254 - that is `truesize` and `len` are very near. >>>> >>>> The stall happens when the received get in a packet with a slightly less >>>> 'efficient' layout (in the experiment I have handy len is 71424, >>>> truesize 72320) (almost) filling the receiver window. 
>>>> >>>> On such input, tcp_clamp_window() shrinks the receiver buffer to the >>>> current rmem usage. The same happens on retransmissions until rcvbuf >>>> becomes 0. >>>> >>>> I *think* that catching only the !sk_rmem_alloc case would avoid the >>>> stall, but I think it's a bit 'late'. >>> >>> A packetdrill test here would help understanding your concern. >> >> I fear like a complete working script would take a lot of time, let me >> try to sketch just the relevant part: >> >> # receiver state is: >> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232 >> # no OoO data, no memory pressure, >> >> # the incoming packet is in sequence >> +0 > P. 109297:172528(63232) ack 1 >> >> With just the 0 rmem check in tcp_prune_queue(), such function will >> still invoke tcp_clamp_window() that will shrink the receive buffer to >> 110592. > > As long as an ACK is sent back with a smaller RWIN, I think this would > be reasonable in this case. I fear some possible regression, as the sender will see some unexpected drops, even on loopback and while not misbehaving. But I tested your proposed code here and AFAICS solves the issue and I could not spot anything suspicious so far. So I'll send a v2 using that. Thanks! Paolo ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 14:56 ` Paolo Abeni 2025-07-21 15:21 ` Eric Dumazet @ 2025-07-21 15:27 ` Jakub Kicinski 2025-07-21 15:30 ` Jakub Kicinski 2025-07-21 16:50 ` Eric Dumazet 1 sibling, 2 replies; 18+ messages in thread From: Jakub Kicinski @ 2025-07-21 15:27 UTC (permalink / raw) To: Paolo Abeni Cc: Eric Dumazet, netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Simon Horman, Matthieu Baerts On Mon, 21 Jul 2025 16:56:06 +0200 Paolo Abeni wrote: > >> I *think* that catching only the !sk_rmem_alloc case would avoid the > >> stall, but I think it's a bit 'late'. > > > > A packetdrill test here would help understanding your concern. > > I fear like a complete working script would take a lot of time, let me > try to sketch just the relevant part: > > # receiver state is: > # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232 > # no OoO data, no memory pressure, > > # the incoming packet is in sequence > +0 > P. 109297:172528(63232) ack 1 > > With just the 0 rmem check in tcp_prune_queue(), such function will > still invoke tcp_clamp_window() that will shrink the receive buffer to > 110592. > tcp_collapse() can't make enough room and the incoming packet will be > dropped. I think we should instead accept such packet. > > Side note: the above data are taken from an actual reproduction of the issue > > Please LMK if the above clarifies a bit my doubt or if a full pktdrill > is needed. Not the first time we stumble on packetdrills for scaling ratio. Solving it is probably outside the scope of this discussion but I wonder what would be the best way to do it. My go to is to integrate packetdrill with netdevsim and have an option for netdevsim to inflate truesize on demand. But perhaps there's a clever way we can force something like tap to give us the ratio we desire. Other ideas? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 15:27 ` Jakub Kicinski @ 2025-07-21 15:30 ` Jakub Kicinski 2025-07-21 16:50 ` Eric Dumazet 1 sibling, 0 replies; 18+ messages in thread From: Jakub Kicinski @ 2025-07-21 15:30 UTC (permalink / raw) To: Paolo Abeni Cc: Eric Dumazet, netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Simon Horman, Matthieu Baerts On Mon, 21 Jul 2025 08:27:28 -0700 Jakub Kicinski wrote: > > With just the 0 rmem check in tcp_prune_queue(), such function will > > still invoke tcp_clamp_window() that will shrink the receive buffer to > > 110592. > > tcp_collapse() can't make enough room and the incoming packet will be > > dropped. I think we should instead accept such packet. > > > > Side note: the above data are taken from an actual reproduction of the issue > > > > Please LMK if the above clarifies a bit my doubt or if a full pktdrill > > is needed. > > Not the first time we stumble on packetdrills for scaling ratio. > Solving it is probably outside the scope of this discussion but > I wonder what would be the best way to do it. My go to is to > integrate packetdrill with netdevsim and have an option for netdevsim > to inflate truesize on demand. But perhaps there's a clever way we can > force something like tap to give us the ratio we desire. Other ideas? FWIW didn't see Eric's reply before hitting send.. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer 2025-07-21 15:27 ` Jakub Kicinski 2025-07-21 15:30 ` Jakub Kicinski @ 2025-07-21 16:50 ` Eric Dumazet 1 sibling, 0 replies; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 16:50 UTC (permalink / raw) To: Jakub Kicinski Cc: Paolo Abeni, netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Simon Horman, Matthieu Baerts On Mon, Jul 21, 2025 at 8:27 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Mon, 21 Jul 2025 16:56:06 +0200 Paolo Abeni wrote: > > >> I *think* that catching only the !sk_rmem_alloc case would avoid the > > >> stall, but I think it's a bit 'late'. > > > > > > A packetdrill test here would help understanding your concern. > > > > I fear like a complete working script would take a lot of time, let me > > try to sketch just the relevant part: > > > > # receiver state is: > > # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232 > > # no OoO data, no memory pressure, > > > > # the incoming packet is in sequence > > +0 > P. 109297:172528(63232) ack 1 > > > > With just the 0 rmem check in tcp_prune_queue(), such function will > > still invoke tcp_clamp_window() that will shrink the receive buffer to > > 110592. > > tcp_collapse() can't make enough room and the incoming packet will be > > dropped. I think we should instead accept such packet. > > > > Side note: the above data are taken from an actual reproduction of the issue > > > > Please LMK if the above clarifies a bit my doubt or if a full pktdrill > > is needed. > > Not the first time we stumble on packetdrills for scaling ratio. > Solving it is probably outside the scope of this discussion but > I wonder what would be the best way to do it. My go to is to > integrate packetdrill with netdevsim and have an option for netdevsim > to inflate truesize on demand. But perhaps there's a clever way we can > force something like tap to give us the ratio we desire. Other ideas? Adding a TUN option (ioctl) to be able to increase skb->truesize by a given amount / percentage is doable I think. Then in packetdrill we could add the ability to use this option for a given scenario or packet. ^ permalink raw reply [flat|nested] 18+ messages in thread
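The knob Eric suggests could be exercised from a test's setup code roughly as sketched below. Everything here is invented purely for illustration: TUNSETTRUESIZEPAD does not exist today, and the name, ioctl number and semantics are hypothetical placeholders for whatever interface would actually be merged.

/* Hypothetical sketch only: TUNSETTRUESIZEPAD is an invented ioctl. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/if_tun.h>

#ifndef TUNSETTRUESIZEPAD
#define TUNSETTRUESIZEPAD _IOW('T', 230, int)	/* invented number */
#endif

int main(void)
{
	int fd = open("/dev/net/tun", O_RDWR);
	int pad = 64 * 1024;	/* inflate skb->truesize by 64KB per packet */

	if (fd < 0 || ioctl(fd, TUNSETTRUESIZEPAD, &pad) < 0)
		perror("tun truesize pad (hypothetical)");
	return 0;
}

With something like this, a packetdrill scenario could declare a per-packet (or per-test) truesize inflation and deterministically reproduce the len/truesize ratios discussed earlier in the thread.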
* [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq 2025-07-18 17:25 [PATCH net-next 0/2] tcp: a couple of fixes Paolo Abeni 2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni @ 2025-07-18 17:25 ` Paolo Abeni 2025-07-18 22:16 ` Matthieu Baerts 2025-07-21 8:28 ` Eric Dumazet 1 sibling, 2 replies; 18+ messages in thread From: Paolo Abeni @ 2025-07-18 17:25 UTC (permalink / raw) To: netdev Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts The mentioned MIB is currently incremented even when a packet with an old sequence number (i.e. a zero window probe) is received, which is IMHO misleading. Explicitly restrict such MIB increments to the relevant events. Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW") Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- net/ipv4/tcp_input.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c98de02a3c57..b708287df9b7 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5911,7 +5911,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb, if (!th->rst) { if (th->syn) goto syn_challenge; - NET_INC_STATS(sock_net(sk), LINUX_MIB_BEYOND_WINDOW); + + if (reason == SKB_DROP_REASON_TCP_INVALID_SEQUENCE || + reason == SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE) + NET_INC_STATS(sock_net(sk), + LINUX_MIB_BEYOND_WINDOW); if (!tcp_oow_rate_limited(sock_net(sk), skb, LINUX_MIB_TCPACKSKIPPEDSEQ, &tp->last_oow_ack_time)) -- 2.50.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
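For context, here is a loose userspace model of the sequence validation that produces the drop reasons named in the patch; the kernel semantics are simplified, and this is only meant to show why an old segment is not "beyond window".

/* Loose model of the sequence check; not the kernel code. */
#include <stdio.h>
#include <stdbool.h>

enum reason { NOT_DROPPED, OLD_SEQUENCE, INVALID_SEQUENCE };

static bool seq_before(unsigned int a, unsigned int b)
{
	return (int)(a - b) < 0;	/* wrap-safe, like the kernel's before() */
}

static enum reason check_seq(unsigned int seq, unsigned int end_seq,
			     unsigned int rcv_wup, unsigned int rcv_nxt,
			     unsigned int rcv_wnd)
{
	if (seq_before(end_seq, rcv_wup))
		return OLD_SEQUENCE;		/* already-acked data */
	if (seq_before(rcv_nxt + rcv_wnd, seq))
		return INVALID_SEQUENCE;	/* truly beyond the window */
	return NOT_DROPPED;
}

int main(void)
{
	/* receiver at rcv_nxt == rcv_wup == 1000, window closed */
	printf("old seq    -> %d\n", check_seq(500, 600, 1000, 1000, 0));
	printf("beyond win -> %d\n", check_seq(5000, 6000, 1000, 1000, 0));
	return 0;
}

After the patch, only the invalid-(end-)sequence reasons bump BeyondWindow, so the old-sequence case above no longer shows up in that counter.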
* Re: [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq 2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni @ 2025-07-18 22:16 ` Matthieu Baerts 2025-07-21 8:28 ` Eric Dumazet 1 sibling, 0 replies; 18+ messages in thread From: Matthieu Baerts @ 2025-07-18 22:16 UTC (permalink / raw) To: Paolo Abeni, netdev Cc: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman Hi Paolo, On 18/07/2025 19:25, Paolo Abeni wrote: > The mentioned MIB is currently incremented even when a packet > with an old sequence number (i.e. a zero window probe) is received, > which is IMHO misleading. > > Explicitly restrict such MIB increment at the relevant events. Good idea, it is less confusing! Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Cheers, Matt -- Sponsored by the NGI0 Core fund. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq 2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni 2025-07-18 22:16 ` Matthieu Baerts @ 2025-07-21 8:28 ` Eric Dumazet 1 sibling, 0 replies; 18+ messages in thread From: Eric Dumazet @ 2025-07-21 8:28 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern, Jakub Kicinski, Simon Horman, Matthieu Baerts On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@redhat.com> wrote: > > The mentioned MIB is currently incremented even when a packet > with an old sequence number (i.e. a zero window probe) is received, > which is IMHO misleading. > > Explicitly restrict such MIB increment at the relevant events. > > Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW") > Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Thanks ! ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread (newest: 2025-07-21 16:50 UTC) Thread overview: 18+ messages -- 2025-07-18 17:25 [PATCH net-next 0/2] tcp: a couple of fixes Paolo Abeni 2025-07-18 17:25 ` [PATCH net-next 1/2] tcp: do not set a zero size receive buffer Paolo Abeni 2025-07-18 22:15 ` Matthieu Baerts 2025-07-21 8:04 ` Eric Dumazet 2025-07-21 10:50 ` Paolo Abeni 2025-07-21 12:30 ` Eric Dumazet 2025-07-21 12:41 ` Eric Dumazet 2025-07-21 13:32 ` Paolo Abeni 2025-07-21 13:52 ` Eric Dumazet 2025-07-21 14:56 ` Paolo Abeni 2025-07-21 15:21 ` Eric Dumazet 2025-07-21 16:17 ` Paolo Abeni 2025-07-21 15:27 ` Jakub Kicinski 2025-07-21 15:30 ` Jakub Kicinski 2025-07-21 16:50 ` Eric Dumazet 2025-07-18 17:25 ` [PATCH net-next 2/2] tcp: do not increment BeyondWindow MIB for old seq Paolo Abeni 2025-07-18 22:16 ` Matthieu Baerts 2025-07-21 8:28 ` Eric Dumazet