public inbox for netdev@vger.kernel.org
* TCP OOM drops with the stricter rcvbuf checking
From: Jakub Kicinski @ 2026-02-25 20:23 UTC
  To: Eric Dumazet; +Cc: netdev

Hi Eric!

Even with commit f017c1f768b6 ("tcp: use skb->len instead of
skb->truesize in tcp_can_ingest()") we see a huge increase in
rcvq drops. Some uwsgi workloads trigger a ton of drops over loopback:
the prior kernel had 0.000003958 drops per second, with the
changes up to f017c1f768b6 it's 0.826685681 drops / sec
(for the most impacted workload).

After much digging I see that the worst workload hits the drops
with sockets in the following state:

 ifindex: 1 
 rcvbuf: 131072
 window_clamp: 129024
 scaling_ratio: 252
 rx_bytes: 2673351
 inq: 59392 (rcvq: skb_cnt:1 [truesize:64384,eaten:4096:frags:2|no-fraglist])
 sk_rmem_alloc: 64384
 incoming skb: len:67584
 deficit: -896

(I wasted quite a bit of time misled by the deficit looking like the
skb overhead :|)
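
The deficit in the dump can be reproduced with a small userspace model.
This is only a sketch: struct sock_model, can_ingest() and deficit() are
hypothetical stand-ins, and the check is assumed to compare
sk_rmem_alloc + skb->len against sk_rcvbuf, going by the title of
f017c1f768b6 rather than the kernel source.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the socket fields in the dump. */
struct sock_model {
	unsigned int rcvbuf;     /* sk_rcvbuf */
	unsigned int rmem_alloc; /* sk_rmem_alloc */
};

/* Assumed shape of the check: queued memory plus payload must fit rcvbuf. */
static bool can_ingest(const struct sock_model *sk, unsigned int skb_len)
{
	return sk->rmem_alloc + skb_len <= sk->rcvbuf;
}

/* Room left after accepting the skb; a negative value means a drop. */
static long deficit(const struct sock_model *sk, unsigned int skb_len)
{
	return (long)sk->rcvbuf - (long)sk->rmem_alloc - (long)skb_len;
}
```

With rcvbuf = 131072, rmem_alloc = 64384 and an skb of len 67584,
can_ingest() is false and deficit() is -896, matching the dump.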

I _think_ what happens is simpler, because we round up the window
we advertise:

  window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));

so we effectively grant extra window space to the sender which we
then don't honor. This matters less for real NICs, which have a lower
scaling_ratio, as the lie hides in the skb->len vs skb->truesize
relaxation that f017c1f768b6 made. But over loopback with a scaling
ratio >250 we can't hide 800B of overshoot, even on a 64kB skb.
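
To make the round-up concrete, here is a standalone sketch of how much
extra window a single round-up can grant. align_up() mirrors what ALIGN()
does for power-of-two sizes; advertised_overshoot() is a hypothetical
helper, and the rcv_wscale values used with it are assumptions, since the
dump above does not show the socket's rcv_wscale.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes granted beyond 'window' when it is rounded up to one wscale unit. */
static uint32_t advertised_overshoot(uint32_t window, unsigned int wscale)
{
	return align_up(window, UINT32_C(1) << wscale) - window;
}
```

The overshoot is at most (1 << wscale) - 1 bytes. With a hypothetical
rcv_wscale of 10, rounding up the 66688 bytes left in rcvbuf
(131072 - 64384) gives 67584, i.e. 896 bytes more than fits. That is only
arithmetic on the dump; the real window computation goes through more
steps than this.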

I'm not entirely sure how to fix this. Of course we can give:

  1 << tcp_sk(sk)->rx_opt.rcv_wscale;

of slack in tcp_can_ingest() (or maybe just a fixed value like 16kB?)

But aligning the window down instead of up feels much cleaner to me.
IDK if this can regress anything:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..9f7ed76a97aa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3383,7 +3383,8 @@ u32 __tcp_select_window(struct sock *sk)
                 * Import case: prevent zero window announcement if
                 * 1<<rcv_wscale > mss.
                 */
-               window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+               if (window < (1 << tp->rx_opt.rcv_wscale))
+                       window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
        } else {
                window = tp->rcv_wnd;
                /* Get the largest window that is a nice multiple of mss.

(possibly we could avoid the branch with some ALU magic)
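
One possible shape for the branch-free variant, as a userspace sketch
only. window_branchy() models the patched logic from the diff above;
window_branchless() is a hypothetical equivalent built from a compare and
a mask. Neither is kernel code, and a compiler may well emit a
conditional move for the branchy form already.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Model of the patched code: only round up windows below one unit. */
static uint32_t window_branchy(uint32_t window, unsigned int wscale)
{
	uint32_t unit = UINT32_C(1) << wscale;

	if (window < unit)
		window = align_up(window, unit);
	return window;
}

/* Hypothetical branch-free equivalent. */
static uint32_t window_branchless(uint32_t window, unsigned int wscale)
{
	uint32_t unit = UINT32_C(1) << wscale;
	/* All ones when 0 < window < unit, else zero (window == 0 wraps). */
	uint32_t mask = -(uint32_t)(window - 1 < unit - 1);

	/* Bump small non-zero windows up to exactly one unit. */
	return window + ((unit - window) & mask);
}
```

Both versions leave a zero window at zero, raise any window in
(0, 1 << wscale) to exactly one unit, and pass larger windows through
unchanged, so the later right shift by rcv_wscale rounds them down.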

Does this make sense?


