From: Jakub Kicinski <kuba@kernel.org>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Subject: TCP OOM drops with the stricter rcvbuf checking
Date: Wed, 25 Feb 2026 12:23:55 -0800
Message-ID: <20260225122355.585fd57b@kernel.org>

Hi Eric!

Even with commit f017c1f768b6 ("tcp: use skb->len instead of
skb->truesize in tcp_can_ingest()") we see a huge increase in
rcvq drops. Some uwsgi workloads trigger a ton of drops over loopback:
the prior kernel had 0.000003958 drops per second, with the changes
up to f017c1f768b6 it's 0.826685681 drops per second (for the most
impacted workload).

After much digging I see that the worst workload hits the drops
with sockets in the following state:

 ifindex: 1 
 rcvbuf: 131072
 window_clamp: 129024
 scaling_ratio: 252
 rx_bytes: 2673351
 inq: 59392 (rcvq: skb_cnt:1 [truesize:64384,eaten:4096:frags:2|no-fraglist])
 sk_rmem_alloc: 64384
 incoming skb: len:67584
 deficit: -896

(I wasted quite a bit of time misled by the deficit looking like the
skb overhead :|)
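
For reference, the numbers in the dump are self-consistent under two
assumptions that are not spelled out above: that window_clamp is rcvbuf
scaled by scaling_ratio / 256 (a shift of 8, as TCP_RMEM_TO_WIN_SCALE),
and that deficit is rcvbuf minus (sk_rmem_alloc + incoming skb len).
A small userspace sketch, with helper names made up for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: window space for a given buffer size, modelled as
 * space * scaling_ratio / 256 (assumed shift of 8).
 */
static inline uint32_t win_from_space(uint32_t space, uint8_t scaling_ratio)
{
	return (uint32_t)(((uint64_t)space * scaling_ratio) >> 8);
}

/* Hypothetical helper: room left in rcvbuf after queuing an skb of skb_len
 * bytes. Negative means the skb does not fit (assumed meaning of "deficit").
 */
static inline long rcvbuf_deficit(long rcvbuf, long rmem_alloc, long skb_len)
{
	return rcvbuf - (rmem_alloc + skb_len);
}
```

With the values above, win_from_space(131072, 252) gives 129024 and
rcvbuf_deficit(131072, 64384, 67584) gives -896.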

I _think_ what happens is simpler, because we round up the window
we advertise:

  window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));

so we effectively grant extra window space to the sender which we
then don't honor. This matters less for real NICs which have lower
scaling_ratio as the lie hides in the skb->len vs skb->truesize
relaxation that f017c1f768b6 made. But over loopback with scaling
ratio >250 we can't hide 800B of overshoot, even on a 64kB skb.
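
To make the overshoot concrete, here is a userspace sketch of the
round-up. ALIGN_UP mirrors the kernel's ALIGN() for power-of-two
alignments. rcv_wscale = 10 and a computed window of 66688 bytes are
assumptions picked so the result lines up with the dump, neither value
appears in it:

```c
#include <stdint.h>

/* Same rounding as the kernel's ALIGN() for a power-of-two alignment. */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* Window the sender ends up seeing when we round up before scaling. */
static inline uint32_t advertised_round_up(uint32_t window, unsigned int wscale)
{
	return ALIGN_UP(window, (uint32_t)1 << wscale);
}

/* Bytes granted to the sender beyond what was actually computed. */
static inline uint32_t window_overshoot(uint32_t window, unsigned int wscale)
{
	return advertised_round_up(window, wscale) - window;
}
```

Under those assumptions advertised_round_up(66688, 10) is 67584, the
length of the incoming skb, and window_overshoot(66688, 10) is 896, the
size of the deficit.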

I'm not entirely sure how to fix this. Of course we can give:

  1 << tcp_sk(sk)->rx_opt.rcv_wscale;

of slack in tcp_can_ingest() (or maybe just a fixed value like 16kB?)
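
A minimal model of the slack option, assuming tcp_can_ingest() boils
down to comparing sk_rmem_alloc + skb->len against sk_rcvbuf (an
assumption about the check, not a copy of the kernel function).
can_ingest() and its parameters are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: accept the skb if it fits in rcvbuf plus
 * some slack. slack == 0 models the strict check.
 */
static inline bool can_ingest(uint32_t rmem_alloc, uint32_t skb_len,
			      uint32_t rcvbuf, uint32_t slack)
{
	return rmem_alloc + skb_len <= rcvbuf + slack;
}
```

With the socket state above, can_ingest(64384, 67584, 131072, 0) is
false, while a slack of 1 << 10 (assuming rcv_wscale = 10, which the
dump does not show) or a fixed 16384 both let the skb in.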

But aligning the window down instead of up feels much cleaner to me.
IDK if this can regress anything:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..9f7ed76a97aa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3383,7 +3383,8 @@ u32 __tcp_select_window(struct sock *sk)
                 * Import case: prevent zero window announcement if
                 * 1<<rcv_wscale > mss.
                 */
-               window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+               if (window < (1 << tp->rx_opt.rcv_wscale))
+                       window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
        } else {
                window = tp->rcv_wnd;
                /* Get the largest window that is a nice multiple of mss.

(possibly we could avoid the branch with some ALU magic)
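
A userspace model of what the change does to the window the sender
sees, assuming the window field on the wire is later computed as
window >> rcv_wscale, so anything not aligned gets truncated. The
65535 cap on the field is ignored here and the function names are
hypothetical:

```c
#include <stdint.h>

/* Same rounding as the kernel's ALIGN() for a power-of-two alignment. */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

/* Proposed behaviour: round up only when the window is smaller than one
 * scaling unit, so a small non-zero window is not announced as zero.
 */
static inline uint32_t select_window_proposed(uint32_t window, unsigned int wscale)
{
	uint32_t unit = (uint32_t)1 << wscale;

	if (window < unit)
		window = ALIGN_UP(window, unit);
	return window;
}

/* What the peer reconstructs from the scaled window field. */
static inline uint32_t window_seen_by_peer(uint32_t window, unsigned int wscale)
{
	return (window >> wscale) << wscale;
}
```

With an assumed rcv_wscale of 10, a computed window of 66688 is seen by
the peer as 66560 (rounded down), a window of 500 is still announced as
1024, and 0 stays 0.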

Does this make sense?
