public inbox for netdev@vger.kernel.org
* [PATCH net] tcp: do not shrink window clamp when SO_RCVBUF is locked
From: Ankit Jain @ 2026-04-27 15:27 UTC
  To: netdev, davem, dsahern, edumazet, ncardwell, kuniyu, kuba, pabeni,
	horms, quic_stranche, quic_subashab
  Cc: linux-kernel, karen.badiryan, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu, Ankit Jain,
	stable

When an application explicitly sets SO_RCVBUF, the window clamp should
not be dynamically recalculated based on the memory scaling_ratio.

Currently, tcp_measure_rcv_mss() aggressively crushes the window clamp
down when it sees a poor skb->len to skb->truesize ratio. If the
application explicitly locked the buffer via SO_RCVBUF, this
recalculation causes the advertised window to drop severely.

If the window drops below the interface MSS, it triggers Silly Window
Syndrome (SWS) avoidance on the sender. The sender defers transmission
and drops the connection into a perpetual 200ms PROBE0 timer loop,
drastically reducing throughput.

This is highly reproducible on loopback interfaces (MTU 65536) using
Java-based workloads (like Tomcat/GemFire) where the JVM sets SO_RCVBUF
to 32K or 64K. The bloated loopback truesize forces the scaling ratio
to drop, crushing the window clamp to ~26K, instantly triggering SWS
stalls and causing gigabyte transfers to take minutes instead of
milliseconds.

Since the application locked the buffer, the kernel should respect the
clamp boundary and not dynamically crush it based on runtime ratios.

Fixes: a2cbb1603943 ("tcp: Update window clamping condition")
Cc: stable@vger.kernel.org
Reported-by: Karen Badiryan <karen.badiryan@broadcom.com>
Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
---
Note to reviewers:

Testing Context:
- The SWS deadlock was successfully reproduced on the latest netdev/net 
  tree (v7.1-rc1) using the actual enterprise Java workload.
- Applying this patch completely resolves the 504 Timeouts and restores 
  loopback throughput.
- Baseline iperf3 auto-tuning remains unaffected by this patch.

For context, here is the exact sequence of events that triggers the 
recalculation flaw, illustrated in a packetdrill-style flow. 
Unpatched kernels aggressively crush the window at step 3, triggering SWS.

// 1. Tomcat creates socket and hardcodes the buffer to 32K
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [32768]) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

// 2. GemFire connects over loopback (simulating Jumbo MSS of 65496)
+0 < S 0:0(0) win 65535 <mss 65496, sackOK, TS val 100 ecr 0, nop, wscale 8>
+0 > S. 0:0(0) ack 1 <...>
+0 < . 1:1(0) ack 1 win 65535 <TS val 200 ecr 1>
+0 accept(3, ..., ...) = 4

// 3. GemFire sends a 20KB packet, dropping the scaling_ratio.
// Without the patch, tcp_measure_rcv_mss() crushes the window_clamp here.
+0.1 < . 1:20001(20000) ack 1 win 65535 <TS val 300 ecr 1>
+0.1 read(4, ..., 20000) = 20000

// 4. Assert that the window clamp was not reduced.
// With the patch, the kernel honors SOCK_RCVBUF_LOCK.
+0 > . 1:1(0) ack 20001 win 65535
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d5c9e65d9..c1cb9d3ed 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -248,7 +248,8 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 			do_div(val, skb->truesize);
 			tcp_sk(sk)->scaling_ratio = val ? val : 1;
 
-			if (old_ratio != tcp_sk(sk)->scaling_ratio) {
+			if (old_ratio != tcp_sk(sk)->scaling_ratio &&
+			    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
 				struct tcp_sock *tp = tcp_sk(sk);
 
 				val = tcp_win_from_space(sk, sk->sk_rcvbuf);
-- 
2.53.0



* Re: [PATCH net] tcp: do not shrink window clamp when SO_RCVBUF is locked
From: Eric Dumazet @ 2026-04-27 15:38 UTC
  To: Ankit Jain
  Cc: netdev, davem, dsahern, ncardwell, kuniyu, kuba, pabeni, horms,
	quic_stranche, quic_subashab, linux-kernel, karen.badiryan,
	ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
	yin.ding, tapas.kundu, stable

On Mon, Apr 27, 2026 at 8:32 AM Ankit Jain <ankit-aj.jain@broadcom.com> wrote:
>
> When an application explicitly sets SO_RCVBUF, the window clamp should
> not be dynamically recalculated based on the memory scaling_ratio.
>
> Currently, tcp_measure_rcv_mss() aggressively crushes the window clamp
> down when it sees a poor skb->len to skb->truesize ratio. If the
> application explicitly locked the buffer via SO_RCVBUF, this
> recalculation causes the advertised window to drop severely.
>
> If the window drops below the interface MSS, it triggers Silly Window
> Syndrome (SWS) avoidance on the sender. The sender defers transmission
> and drops the connection into a perpetual 200ms PROBE0 timer loop,
> drastically reducing throughput.
>
> This is highly reproducible on loopback interfaces (MTU 65536) using
> Java-based workloads (like Tomcat/GemFire) where the JVM sets SO_RCVBUF
> to 32K or 64K. The bloated loopback truesize forces the scaling ratio
> to drop, crushing the window clamp to ~26K, instantly triggering SWS
> stalls and causing gigabyte transfers to take minutes instead of
> milliseconds.
>
> Since the application locked the buffer, the kernel should respect the
> clamp boundary and not dynamically crush it based on runtime ratios.
>
> Fixes: a2cbb1603943 ("tcp: Update window clamping condition")
> Cc: stable@vger.kernel.org
> Reported-by: Karen Badiryan <karen.badiryan@broadcom.com>
> Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>

Make sure to add a selftest (in ./tools/testing/selftests/net/packetdrill/).

Thanks.


* Re: [PATCH net] tcp: do not shrink window clamp when SO_RCVBUF is locked
From: Jakub Kicinski @ 2026-04-27 20:11 UTC
  To: Ankit Jain
  Cc: Eric Dumazet, netdev, davem, dsahern, ncardwell, kuniyu, pabeni,
	horms, quic_stranche, quic_subashab, linux-kernel, karen.badiryan,
	ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
	yin.ding, tapas.kundu, stable

On Mon, 27 Apr 2026 08:38:44 -0700 Eric Dumazet wrote:
> On Mon, Apr 27, 2026 at 8:32 AM Ankit Jain <ankit-aj.jain@broadcom.com> wrote:
> >
> > When an application explicitly sets SO_RCVBUF, the window clamp should
> > not be dynamically recalculated based on the memory scaling_ratio.
> >
> > Currently, tcp_measure_rcv_mss() aggressively crushes the window clamp
> > down when it sees a poor skb->len to skb->truesize ratio. If the
> > application explicitly locked the buffer via SO_RCVBUF, this
> > recalculation causes the advertised window to drop severely.
> >
> > If the window drops below the interface MSS, it triggers Silly Window
> > Syndrome (SWS) avoidance on the sender. The sender defers transmission
> > and drops the connection into a perpetual 200ms PROBE0 timer loop,
> > drastically reducing throughput.
> >
> > This is highly reproducible on loopback interfaces (MTU 65536) using
> > Java-based workloads (like Tomcat/GemFire) where the JVM sets SO_RCVBUF
> > to 32K or 64K. The bloated loopback truesize forces the scaling ratio
> > to drop, crushing the window clamp to ~26K, instantly triggering SWS
> > stalls and causing gigabyte transfers to take minutes instead of
> > milliseconds.
> >
> > Since the application locked the buffer, the kernel should respect the
> > clamp boundary and not dynamically crush it based on runtime ratios.
> >
> > Fixes: a2cbb1603943 ("tcp: Update window clamping condition")
> > Cc: stable@vger.kernel.org
> > Reported-by: Karen Badiryan <karen.badiryan@broadcom.com>
> > Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>  
> 
> Make sure to add a selftest (in ./tools/testing/selftests/net/packetdrill/).

Also, I think this patch makes tcp_rcv_neg_window.pkt fail.

Reminder: please wait 24h before posting v2 on netdev, and when posting
v2, start a new thread.

