From: Jakub Kicinski <kuba@kernel.org>
To: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Subject: TCP OOM drops with the stricter rcvbuf checking
Date: Wed, 25 Feb 2026 12:23:55 -0800
Message-ID: <20260225122355.585fd57b@kernel.org>
Hi Eric!
Even with commit f017c1f768b6 ("tcp: use skb->len instead of
skb->truesize in tcp_can_ingest()") we see a huge increase in
rcvq drops. Some uwsgi workloads trigger a ton of drops over
loopback: the prior kernel had 0.000003958 drops / sec, with the
changes up to f017c1f768b6 it's 0.826685681 drops / sec
(for the most impacted workload).
After much digging I see that the worst workload hits the drops
with sockets in the following state:
ifindex: 1
rcvbuf: 131072
window_clamp: 129024
scaling_ratio: 252
rx_bytes: 2673351
inq: 59392 (rcvq: skb_cnt:1 [truesize:64384,eaten:4096:frags:2|no-fraglist])
sk_rmem_alloc: 64384
incoming skb: len:67584
deficit: -896
(I wasted quite a bit of time misled by the deficit being the skb
overhead :|)
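For reference, the numbers in the dump can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the post-f017c1f768b6 check compares sk_rmem_alloc + skb->len against rcvbuf (as the commit subject suggests) and that window_clamp is rcvbuf scaled by scaling_ratio / 256. The function names are made up for illustration and are not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed form of the check after f017c1f768b6: bytes already charged to
 * the socket plus the payload length of the new skb must fit in rcvbuf. */
bool can_ingest(uint32_t rmem_alloc, uint32_t skb_len, uint32_t rcvbuf)
{
	return (uint64_t)rmem_alloc + skb_len <= rcvbuf;
}

/* Negative result: by how many bytes the new skb overshoots rcvbuf. */
int64_t deficit(uint32_t rmem_alloc, uint32_t skb_len, uint32_t rcvbuf)
{
	return (int64_t)rcvbuf - ((int64_t)rmem_alloc + (int64_t)skb_len);
}

/* Assumed conversion from buffer space to window: space * ratio / 256. */
uint32_t win_from_space(uint32_t space, uint8_t scaling_ratio)
{
	return (uint32_t)(((uint64_t)space * scaling_ratio) >> 8);
}
```

With the values above, can_ingest(64384, 67584, 131072) is false, deficit(64384, 67584, 131072) is -896, and win_from_space(131072, 252) gives the 129024 shown as window_clamp.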
I _think_ what happens is simpler, because we round up the window
we advertise:
window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
so we effectively grant extra window space to the sender which we
then don't honor. This matters less for real NICs, which have a lower
scaling_ratio, because the lie hides in the skb->len vs skb->truesize
relaxation that f017c1f768b6 made. But over loopback, with
scaling_ratio > 250, we can't hide 800B of overshoot, even on a 64kB skb.
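To make the rounding effect concrete, here is a small sketch of what rounding up to the window-scale granule does. The inputs are assumptions for illustration only: rcv_wscale is not stated in this message (10 is a guess), and the 66688-byte window is simply rcvbuf - sk_rmem_alloc from the dump, ignoring the scaling_ratio adjustment the real code applies.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two.
 * Same result as the kernel's ALIGN() for such a. */
uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Round x down to a multiple of a; a must be a power of two. */
uint32_t align_down(uint32_t x, uint32_t a)
{
	return x & ~(a - 1);
}

/* Bytes granted to the sender beyond what was actually available. */
uint32_t overshoot(uint32_t window, unsigned int wscale)
{
	return align_up(window, 1u << wscale) - window;
}
```

Under those assumed inputs, align_up(66688, 1 << 10) is 67584, so overshoot(66688, 10) is 896 bytes, while align_down(66688, 1 << 10) is 66560 and never exceeds the real window. The overshoot is always smaller than 1 << rcv_wscale.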
I'm not entirely sure how to fix this. Of course we can give:
1 << tcp_sk(sk)->rx_opt.rcv_wscale;
of slack in tcp_can_ingest() (or maybe just a fixed value like 16kB?)
But aligning the window down instead of up feels much cleaner to me.
IDK if this can regress anything:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..9f7ed76a97aa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3383,7 +3383,8 @@ u32 __tcp_select_window(struct sock *sk)
* Import case: prevent zero window announcement if
* 1<<rcv_wscale > mss.
*/
- window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+ if (window < (1 << tp->rx_opt.rcv_wscale))
+ window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
} else {
window = tp->rcv_wnd;
/* Get the largest window that is a nice multiple of mss.
(possibly we could avoid the branch with some ALU magic)
Does this make sense?
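A userspace sketch of how the patched branch would behave, in case it helps review. It assumes the value that goes on the wire is the window shifted right by rcv_wscale, so anything not aligned is effectively rounded down; the function names and the wscale of 10 in the examples are hypothetical.

```c
#include <stdint.h>

/* Patched logic from the diff above: only round up when the window is
 * smaller than one scale unit, so a small non-zero window is not
 * announced as zero. Larger windows are left untouched. */
uint32_t select_window_patched(uint32_t window, unsigned int wscale)
{
	uint32_t unit = 1u << wscale;

	if (window < unit)
		window = (window + unit - 1) & ~(unit - 1);
	return window;
}

/* Bytes the peer can infer from the 16-bit header field, assuming the
 * field carries window >> wscale. */
uint32_t bytes_seen_by_peer(uint32_t window, unsigned int wscale)
{
	return (window >> wscale) << wscale;
}
```

For example, select_window_patched(66688, 10) stays 66688 and the peer sees 66560 bytes, which is below the real window, whereas select_window_patched(500, 10) still becomes 1024 so the announced window does not collapse to zero.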