public inbox for netdev@vger.kernel.org
* [PATCH net v2 0/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome
@ 2026-05-04 14:49 Ankit Jain
  2026-05-04 14:49 ` [PATCH net v2 1/2] " Ankit Jain
  2026-05-04 14:49 ` [PATCH net v2 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS Ankit Jain
  0 siblings, 2 replies; 7+ messages in thread
From: Ankit Jain @ 2026-05-04 14:49 UTC (permalink / raw)
  To: edumazet, kuba, netdev
  Cc: davem, pabeni, ncardwell, kuniyu, horms, shuah, quic_subashab,
	quic_stranche, linux-kselftest, linux-kernel, karen.badiryan,
	ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
	yin.ding, tapas.kundu, Ankit Jain

This series fixes a regression where locked SO_RCVBUF sockets suffer from
Silly Window Syndrome (SWS).

Recent memory fragmentation optimizations apply truesize penalties to the
scaling_ratio. For applications locking SO_RCVBUF (like Java/Tomcat) and
processing small packets, this drops the scaling_ratio to 1. This
collapses the advertised window and causes 504 Gateway Timeouts.
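The arithmetic behind this collapse can be sketched in user space. The block below is a simplified model, not the kernel code: the helper names are hypothetical, the shift of 8 is assumed to match TCP_RMEM_TO_WIN_SCALE, and the ratio is assumed to be the payload-to-truesize quotient with a floor of 1 (truesize must be non-zero).

```c
#include <stdint.h>

/* Assumption: mirrors the kernel's TCP_RMEM_TO_WIN_SCALE, so a ratio of
 * 128 means "half of the buffer space is usable as window".
 */
#define MODEL_RMEM_TO_WIN_SCALE 8

/* Hypothetical helper: payload-to-truesize ratio, never below 1. */
static uint8_t model_scaling_ratio(uint32_t payload_len, uint32_t truesize)
{
	uint64_t val = ((uint64_t)payload_len << MODEL_RMEM_TO_WIN_SCALE) / truesize;

	if (val > UINT8_MAX)
		val = UINT8_MAX;	/* model-only guard: the ratio is stored in a u8 */
	return val ? (uint8_t)val : 1;
}

/* Hypothetical helper: window obtainable from `space` bytes of buffer. */
static uint32_t model_win_from_space(uint32_t space, uint8_t ratio)
{
	return (uint32_t)(((uint64_t)space * ratio) >> MODEL_RMEM_TO_WIN_SCALE);
}
```

With illustrative (not measured) values, an 8-byte payload in a 4096-byte truesize buffer gives a ratio of 1, and a 65536-byte receive buffer at ratio 1 yields a 256-byte window in this model.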

Patch 1 bypasses this penalty for locked buffers unless skb->len > advmss.
Patch 2 adds a packetdrill test.

Link to v1:
https://lore.kernel.org/all/20260427152756.1205-1-ankit-aj.jain@broadcom.com/

v1 -> v2:
 - Shifted protection from window_clamp to scaling_ratio based on Jakub
   Kicinski's feedback.
 - Added skb->len > advmss check to ensure large aggregate payloads (GRO)
   are still penalized. This allows tcp_rcv_neg_window.pkt to pass.
 - Added a new packetdrill test (Patch 2/2).

Testing:
 - Verified fix in a live Java/Tomcat environment (504 timeouts resolved).
 - Verified deadlock prevention via the new tcp_locked_rcvbuf_sws.pkt.
 - Passed upstream regression tests: tcp_rcv_neg_window.pkt,
   tcp_rcv_wnd_shrink_allowed.pkt, tcp_rcv_wnd_shrink_nomem.pkt,
   tcp_rcv_zero_wnd_fin.pkt, and tcp_rcv_big_endseq.pkt.

Ankit Jain (2):
  tcp: protect locked SO_RCVBUF from Silly Window Syndrome
  selftests/net: add packetdrill test for locked SO_RCVBUF SWS

 net/ipv4/tcp_input.c                          |  8 ++++-
 .../net/packetdrill/tcp_locked_rcvbuf_sws.pkt | 34 +++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt

--
2.53.0



* [PATCH net v2 1/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome
  2026-05-04 14:49 [PATCH net v2 0/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome Ankit Jain
@ 2026-05-04 14:49 ` Ankit Jain
  2026-05-04 16:09   ` Eric Dumazet
  2026-05-04 14:49 ` [PATCH net v2 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS Ankit Jain
  1 sibling, 1 reply; 7+ messages in thread
From: Ankit Jain @ 2026-05-04 14:49 UTC (permalink / raw)
  To: edumazet, kuba, netdev
  Cc: davem, pabeni, ncardwell, kuniyu, horms, shuah, quic_subashab,
	quic_stranche, linux-kselftest, linux-kernel, karen.badiryan,
	ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
	yin.ding, tapas.kundu, Ankit Jain

When an application locks SO_RCVBUF, it expects strict memory bounds and
disables TCP window auto-tuning. However, recent TCP memory fragmentation
optimizations still apply dynamic truesize penalties to the `scaling_ratio`
of these locked sockets.
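For reference, a minimal user-space sketch of how an application ends up with a locked receive buffer on Linux: an explicit setsockopt(SO_RCVBUF) call. The function name is hypothetical; per socket(7), the kernel typically reports back double the requested value, subject to the rmem_max limit.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: request a receive buffer size explicitly and
 * return the effective size the kernel reports, or -1 on error.
 */
static int locked_rcvbuf_size(int requested)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	int effective = -1;
	socklen_t optlen = sizeof(effective);

	if (fd < 0)
		return -1;
	/* The explicit request is what marks the buffer as user-locked. */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &optlen) < 0)
		effective = -1;
	close(fd);
	return effective;
}
```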

For workloads processing small, fragmented packets (like Java's Tomcat),
this penalty drops the scaling_ratio to 1. This shrinks the dynamically
calculated advertised window, leading to Silly Window Syndrome (SWS)
deadlocks and 504 Gateway Timeouts.

This patch fixes the issue by bypassing the truesize penalty for sockets
with `SOCK_RCVBUF_LOCK` set. To ensure the kernel still defends against
memory exhaustion from large aggregate payloads (e.g., GRO), the penalty
is still applied if `skb->len` exceeds the advertised MSS.

Fixes: a2cbb1603943 ("tcp: Update window clamping condition")
Reported-by: Karen Badiryan <karen.badiryan@broadcom.com>
Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
---
 net/ipv4/tcp_input.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d5c9e65d9760..569299dafa88 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -240,8 +240,14 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 		/* Note: divides are still a bit expensive.
 		 * For the moment, only adjust scaling_ratio
 		 * when we update icsk_ack.rcv_mss.
+		 *
+		 * Protect locked SO_RCVBUF from Silly Window Syndrome
+		 * due to truesize penalties on small packets. Allow
+		 * penalty if aggregate payload (e.g., GRO) exceeds MSS.
 		 */
-		if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
+		if (unlikely(len != icsk->icsk_ack.rcv_mss &&
+			     (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
+			      skb->len > tcp_sk(sk)->advmss))) {
 			u64 val = (u64)skb->len << TCP_RMEM_TO_WIN_SCALE;
 			u8 old_ratio = tcp_sk(sk)->scaling_ratio;

--
2.53.0



* [PATCH net v2 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS
  2026-05-04 14:49 [PATCH net v2 0/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome Ankit Jain
  2026-05-04 14:49 ` [PATCH net v2 1/2] " Ankit Jain
@ 2026-05-04 14:49 ` Ankit Jain
  2026-05-04 16:13   ` Eric Dumazet
  1 sibling, 1 reply; 7+ messages in thread
From: Ankit Jain @ 2026-05-04 14:49 UTC (permalink / raw)
  To: edumazet, kuba, netdev
  Cc: davem, pabeni, ncardwell, kuniyu, horms, shuah, quic_subashab,
	quic_stranche, linux-kselftest, linux-kernel, karen.badiryan,
	ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
	yin.ding, tapas.kundu, Ankit Jain

Add a packetdrill test to verify that TCP does not aggressively penalize
the `scaling_ratio` when an application explicitly locks the receive
buffer via SO_RCVBUF.

This test establishes a connection with a very small MSS, then injects
small payloads. On buggy kernels, the payload-to-truesize memory penalty
drops the `scaling_ratio` to 1, which shrinks the dynamically calculated
window and causes Silly Window Syndrome (SWS).

The test asserts that `tcpi_rcv_ssthresh` (the exposed window clamp)
remains protected and stays at a healthy level.

Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
---
 .../net/packetdrill/tcp_locked_rcvbuf_sws.pkt | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt

diff --git a/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt b/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
new file mode 100644
index 000000000000..96d3a0813548
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+// Test that TCP does not apply aggressive truesize penalties
+// to the scaling_ratio when the application has explicitly locked
+// the receive buffer, preventing Silly Window Syndrome.
+
+// 1. Create a socket and lock SO_RCVBUF to 32K
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
++0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
++0 bind(3, ..., ...) = 0
++0 listen(3, 1) = 0
+
+// 2. Establish connection with a TINY MSS (48 bytes)
+// This forces the kernel's rcv_mss expectation to initialize at rock bottom.
++0 < S 0:0(0) win 65535 <mss 48,nop,wscale 8>
++0 > S. 0:0(0) ack 1 <...>
++0 < . 1:1(0) ack 1 win 65535
++0 accept(3, ..., ...) = 4
+
+// 3. Inject 100 bytes.
+// 100 is > 48, so this forces tcp_measure_rcv_mss() to recalculate the scaling ratio.
+// Because the payload (100B) is small compared to the sk_buff truesize,
+// kernels without the fix will incorrectly drop the scaling_ratio.
++0.1 < P. 1:101(100) ack 1 win 65535
++0 > . 1:1(0) ack 101 <...>
+
+// 4. Inject 110 bytes to bypass the length jitter optimization and force
+// a second recalculation, driving the scaling_ratio to 1 and crushing the clamp.
++0.1 < P. 101:211(110) ack 1 win 65535
++0 > . 1:1(0) ack 211 <...>
+
+// 5. Assert the window clamp (tcpi_rcv_ssthresh) is protected from the penalty.
++0.1 %{
+assert tcpi_rcv_ssthresh > 15000, f"BUG DETECTED: scaling_ratio crushed! rcv_ssthresh={tcpi_rcv_ssthresh}"
+}%
--
2.53.0



* Re: [PATCH net v2 1/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome
  2026-05-04 14:49 ` [PATCH net v2 1/2] " Ankit Jain
@ 2026-05-04 16:09   ` Eric Dumazet
  2026-05-05 18:19     ` Ankit Jain
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2026-05-04 16:09 UTC (permalink / raw)
  To: Ankit Jain
  Cc: kuba, netdev, davem, pabeni, ncardwell, kuniyu, horms, shuah,
	quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
	karen.badiryan, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu

On Mon, May 4, 2026 at 7:53 AM Ankit Jain <ankit-aj.jain@broadcom.com> wrote:
>
> When an application locks SO_RCVBUF, it expects strict memory bounds and
> disables TCP window auto-tuning. However, recent TCP memory fragmentation
> optimizations still apply dynamic truesize penalties to the `scaling_ratio`
> of these locked sockets.
>
> For workloads processing small, fragmented packets (like Java's Tomcat),
> this penalty drops the scaling_ratio to 1. This shrinks the dynamically
> calculated advertised window, leading to Silly Window Syndrome (SWS)
> deadlocks and 504 Gateway Timeouts.
>
> This patch fixes the issue by bypassing the truesize penalty for sockets
> with `SOCK_RCVBUF_LOCK` set. To ensure the kernel still defends against
> memory exhaustion from large aggregate payloads (e.g., GRO), the penalty
> is still applied if `skb->len` exceeds the advertised MSS.
>
> Fixes: a2cbb1603943 ("tcp: Update window clamping condition")
> Reported-by: Karen Badiryan <karen.badiryan@broadcom.com>
> Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
> ---
>  net/ipv4/tcp_input.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d5c9e65d9760..569299dafa88 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -240,8 +240,14 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
>                 /* Note: divides are still a bit expensive.
>                  * For the moment, only adjust scaling_ratio
>                  * when we update icsk_ack.rcv_mss.
> +                *
> +                * Protect locked SO_RCVBUF from Silly Window Syndrome
> +                * due to truesize penalties on small packets. Allow
> +                * penalty if aggregate payload (e.g., GRO) exceeds MSS.
>                  */
> -               if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
> +               if (unlikely(len != icsk->icsk_ack.rcv_mss &&
> +                            (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
> +                             skb->len > tcp_sk(sk)->advmss))) {

I don't think testing tp->advmss does what you want.

A remote peer can send GRO packets with tiny segments, regardless of tp->advmss.

If GRO is what you are looking for, why not test (skb->len > len)?
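The difference between the two conditions can be sketched as plain predicates. This is a simplified model under stated assumptions: the struct and function names are hypothetical, `len` is assumed to be the per-segment size when the skb is aggregated and the full skb length otherwise, and the `len != rcv_mss` part of the real check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few packet fields the check reads. */
struct model_pkt {
	uint32_t skb_len;	/* total payload in the (possibly aggregated) skb */
	uint32_t gso_size;	/* per-segment size; 0 when not aggregated */
};

/* Assumption: `len` is the segment size if set, else the skb length. */
static uint32_t model_len(const struct model_pkt *p)
{
	return p->gso_size ? p->gso_size : p->skb_len;
}

/* v2 condition: a locked socket is penalized only if skb_len > advmss. */
static bool penalize_v2(const struct model_pkt *p, bool locked, uint32_t advmss)
{
	return !locked || p->skb_len > advmss;
}

/* Suggested condition: a locked socket is penalized only if aggregated. */
static bool penalize_suggested(const struct model_pkt *p, bool locked)
{
	return !locked || p->skb_len > model_len(p);
}
```

In this model, an aggregate of ten 100-byte segments (skb_len 1000, gso_size 100) on a locked socket with advmss 1460 escapes the v2 check but is caught by the suggested one.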

>                         u64 val = (u64)skb->len << TCP_RMEM_TO_WIN_SCALE;
>                         u8 old_ratio = tcp_sk(sk)->scaling_ratio;
>
> --
> 2.53.0
>


* Re: [PATCH net v2 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS
  2026-05-04 14:49 ` [PATCH net v2 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS Ankit Jain
@ 2026-05-04 16:13   ` Eric Dumazet
  2026-05-05 18:23     ` Ankit Jain
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2026-05-04 16:13 UTC (permalink / raw)
  To: Ankit Jain
  Cc: kuba, netdev, davem, pabeni, ncardwell, kuniyu, horms, shuah,
	quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
	karen.badiryan, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu

On Mon, May 4, 2026 at 7:53 AM Ankit Jain <ankit-aj.jain@broadcom.com> wrote:
>
> Add a packetdrill test to verify that TCP does not aggressively penalize
> the `scaling_ratio` when an application explicitly locks the receive
> buffer via SO_RCVBUF.
>
> This test establishes a connection with a very small MSS, then injects
> small payloads. On buggy kernels, the payload-to-truesize memory penalty
> drops the `scaling_ratio` to 1, which shrinks the dynamically calculated
> window and causes Silly Window Syndrome (SWS).
>
> The test asserts that `tcpi_rcv_ssthresh` (the exposed window clamp)
> remains protected and stays at a healthy level.
>
> Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
> ---
>  .../net/packetdrill/tcp_locked_rcvbuf_sws.pkt | 34 +++++++++++++++++++
>  1 file changed, 34 insertions(+)
>  create mode 100644 tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
>
> diff --git a/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt b/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
> new file mode 100644
> index 000000000000..96d3a0813548
> --- /dev/null
> +++ b/tools/testing/selftests/net/packetdrill/tcp_locked_rcvbuf_sws.pkt
> @@ -0,0 +1,34 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Test that TCP does not apply aggressive truesize penalties
> +// to the scaling_ratio when the application has explicitly locked
> +// the receive buffer, preventing Silly Window Syndrome.
> +
> +// 1. Create a socket and lock SO_RCVBUF to 32K
> +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> ++0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
> ++0 bind(3, ..., ...) = 0
> ++0 listen(3, 1) = 0
> +
> +// 2. Establish connection with a TINY MSS (48 bytes)
> +// This forces the kernel's rcv_mss expectation to initialize at rock bottom.
> ++0 < S 0:0(0) win 65535 <mss 48,nop,wscale 8>
> ++0 > S. 0:0(0) ack 1 <...>
> ++0 < . 1:1(0) ack 1 win 65535
> ++0 accept(3, ..., ...) = 4
> +
> +// 3. Inject 100 bytes.
> +// 100 is > 48, so this forces tcp_measure_rcv_mss() to recalculate the scaling ratio.
> +// Because the payload (100B) is small compared to the sk_buff truesize,
> +// kernels without the fix will incorrectly drop the scaling_ratio.
> ++0.1 < P. 1:101(100) ack 1 win 65535
> ++0 > . 1:1(0) ack 101 <...>
> +
> +// 4. Inject 110 bytes to bypass the length jitter optimization and force
> +// a second recalculation, driving the scaling_ratio to 1 and crushing the clamp.
> ++0.1 < P. 101:211(110) ack 1 win 65535
> ++0 > . 1:1(0) ack 211 <...>
> +
> +// 5. Assert the window clamp (tcpi_rcv_ssthresh) is protected from the penalty.
> ++0.1 %{
> +assert tcpi_rcv_ssthresh > 15000, f"BUG DETECTED: scaling_ratio crushed! rcv_ssthresh={tcpi_rcv_ssthresh}"
> +}%


I do not see the SWS effect you want to avoid in the first place.

This test is an ad-hoc check of your change, but I still do not see
why recomputing tp->scaling_ratio every time rcv_mss increases is an
issue on a loopback interface.


* Re: [PATCH net v2 1/2] tcp: protect locked SO_RCVBUF from Silly Window Syndrome
  2026-05-04 16:09   ` Eric Dumazet
@ 2026-05-05 18:19     ` Ankit Jain
  0 siblings, 0 replies; 7+ messages in thread
From: Ankit Jain @ 2026-05-05 18:19 UTC (permalink / raw)
  To: edumazet
  Cc: kuba, netdev, davem, pabeni, ncardwell, kuniyu, horms, shuah,
	quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
	karen.badiryan, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu

Hi Eric,

Thanks for the review and suggestion.

> I don't think testing tp->advmss does what you want.
> 
> A remote peer can send GRO packets with tiny segments, regardless of
> tp->advmss.
> 
> If GRO is what you are looking for, why not test (skb->len > len)?

I tested your suggested `skb->len > len` logic on our reproduction
setup. It works perfectly and the 504 timeouts are completely resolved.

Thanks,
Ankit


* Re: [PATCH net v2 2/2] selftests/net: add packetdrill test for locked SO_RCVBUF SWS
  2026-05-04 16:13   ` Eric Dumazet
@ 2026-05-05 18:23     ` Ankit Jain
  0 siblings, 0 replies; 7+ messages in thread
From: Ankit Jain @ 2026-05-05 18:23 UTC (permalink / raw)
  To: edumazet
  Cc: kuba, netdev, davem, pabeni, ncardwell, kuniyu, horms, shuah,
	quic_subashab, quic_stranche, linux-kselftest, linux-kernel,
	karen.badiryan, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu

Thanks for the review.

> I do not see the SWS effect you want to avoid in the first place.
> 
> This test is an ad-hoc check of your change, but I still do not see
> why recomputing tp->scaling_ratio every time rcv_mss increases is an
> issue on a loopback interface.

The packetdrill script is taking longer than expected. To actually show
the window dropping to 0, I need a long script with many packets and
application reads, because TCP does not shrink the right edge of an
already open window.
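The "does not shrink the right edge" behaviour can be illustrated with a small user-space model. It is a sketch only: the struct and helpers are hypothetical, and window-scale rounding, sequence wraparound handling, and any sysctl that permits shrinking are ignored.

```c
#include <stdint.h>

/* Hypothetical receiver state for the model. */
struct model_rcv {
	uint32_t rcv_nxt;	/* next sequence number expected */
	uint32_t right_edge;	/* highest sequence number advertised so far */
};

/* Return the window to advertise given the window the receiver would
 * like to offer now; the right edge never moves to the left.
 */
static uint32_t model_select_window(struct model_rcv *r, uint32_t desired)
{
	uint32_t still_open = r->right_edge - r->rcv_nxt;
	uint32_t win = desired > still_open ? desired : still_open;

	r->right_edge = r->rcv_nxt + win;
	return win;
}

/* Consume `bytes` of in-window data from the peer. */
static void model_receive(struct model_rcv *r, uint32_t bytes)
{
	r->rcv_nxt += bytes;
}
```

In this model a smaller desired window only becomes visible once the peer has consumed the previously advertised space, which is why a test needs many packets before the advertised window reflects the reduced value.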

Since the C-code fix in Patch 1 is tested and working fine, should I
send v3 with just the code fix first? I can work on the packetdrill
script and send it later in a separate patch. Or should I wait and
send both together?

Kindly suggest how I should proceed.

Thanks,
Ankit
