* TCP OOM drops with the stricter rcvbuf checking
From: Jakub Kicinski @ 2026-02-25 20:23 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
Hi Eric!
Even with commit f017c1f768b6 ("tcp: use skb->len instead of
skb->truesize in tcp_can_ingest()") we see a huge increase in
rcvq drops. Some uwsgi workloads trigger a ton of drops over
loopback: the prior kernel had 0.000003958 drops per second, while
with the changes up to f017c1f768b6 it's 0.826685681 drops / sec
(for the most impacted workload).
After much digging I see that the worst workload hits the drops
with sockets in the following state:
ifindex: 1
rcvbuf: 131072
window_clamp: 129024
scaling_ratio: 252
rx_bytes: 2673351
inq: 59392 (rcvq: skb_cnt:1 [truesize:64384,eaten:4096:frags:2|no-fraglist])
sk_rmem_alloc: 64384
incoming skb: len:67584
deficit: -896
(I wasted quite a bit of time, misled into thinking the deficit was
the skb overhead :|)
I _think_ what happens is simpler, because we round up the window
we advertise:
window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
so we effectively grant extra window space to the sender which we
then don't honor. This matters less for real NICs which have lower
scaling_ratio as the lie hides in the skb->len vs skb->truesize
relaxation that f017c1f768b6 made. But over loopback with scaling
ratio >250 we can't hide 800B of overshoot, even on a 64kB skb.
I'm not entirely sure how to fix this. Of course we can give:
1 << tcp_sk(sk)->rx_opt.rcv_wscale;
of slack in tcp_can_ingest() (or maybe just a fixed value like 16kB?)
But aligning the window down instead of up feels much cleaner to me.
IDK if this can regress anything:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 326b58ff1118..9f7ed76a97aa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3383,7 +3383,8 @@ u32 __tcp_select_window(struct sock *sk)
* Import case: prevent zero window announcement if
* 1<<rcv_wscale > mss.
*/
- window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+ if (window < (1 << tp->rx_opt.rcv_wscale))
+ window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
} else {
window = tp->rcv_wnd;
/* Get the largest window that is a nice multiple of mss.
(possibly we could avoid the branch with some ALU magic)
Does this make sense?
* Re: TCP OOM drops with the stricter rcvbuf checking
From: Jakub Kicinski @ 2026-02-25 21:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Wed, 25 Feb 2026 12:23:55 -0800 Jakub Kicinski wrote:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118..9f7ed76a97aa 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3383,7 +3383,8 @@ u32 __tcp_select_window(struct sock *sk)
> * Import case: prevent zero window announcement if
> * 1<<rcv_wscale > mss.
> */
> - window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
> + if (window < (1 << tp->rx_opt.rcv_wscale))
> + window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
> } else {
> window = tp->rcv_wnd;
> /* Get the largest window that is a nice multiple of mss.
Hm, reading through __tcp_select_window() more carefully, I guess there's
already an attempt to solve this:
if (free_space < (full_space >> 1)) {
...
/* free_space might become our new window, make sure we don't
* increase it due to wscale.
*/
free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
the problem is that over loopback we can receive rather large skbs,
so we don't hit this round_down(). The drops I see have < 64k inq,
and the incoming skb is > 64k.
Perhaps this condition should check if free_space < gro_ipv*_max_size
as well (modulo gro_*_max_size vs tso_max_size on loopback)?
* Re: TCP OOM drops with the stricter rcvbuf checking
From: Eric Dumazet @ 2026-02-26 1:58 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev
On Wed, Feb 25, 2026 at 10:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 25 Feb 2026 12:23:55 -0800 Jakub Kicinski wrote:
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 326b58ff1118..9f7ed76a97aa 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -3383,7 +3383,8 @@ u32 __tcp_select_window(struct sock *sk)
> > * Import case: prevent zero window announcement if
> > * 1<<rcv_wscale > mss.
> > */
> > - window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
> > + if (window < (1 << tp->rx_opt.rcv_wscale))
> > + window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
> > } else {
> > window = tp->rcv_wnd;
> > /* Get the largest window that is a nice multiple of mss.
>
> Hm, reading thru __tcp_select_window() more carefully I guess there's
> already an attempt to solve this:
>
> if (free_space < (full_space >> 1)) {
> ...
> /* free_space might become our new window, make sure we don't
> * increase it due to wscale.
> */
> free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
>
> the problem is that over loopback we can receive rather large skbs,
> so we don't hit this round_down(). The drops I see have < 64k inq,
> and the incoming skb is > 64k.
>
> Perhaps this condition should check if free_space < gro_ipv*_max_size
> as well (modulo gro_*_max_size vs tso_max_size on loopback)?
Given the number of issues, maybe we can relax tcp_can_ingest()
until better days.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e7b41abb82aad33d8cab4fcfa989cc4771149b41..de6ad26537232b46bee2e1f168144e10faa4bf87 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5365,25 +5365,11 @@ static void tcp_ofo_queue(struct sock *sk)
static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
-/* Check if this incoming skb can be added to socket receive queues
- * while satisfying sk->sk_rcvbuf limit.
- *
- * In theory we should use skb->truesize, but this can cause problems
- * when applications use too small SO_RCVBUF values.
- * When LRO / hw gro is used, the socket might have a high tp->scaling_ratio,
- * allowing RWIN to be close to available space.
- * Whenever the receive queue gets full, we can receive a small packet
- * filling RWIN, but with a high skb->truesize, because most NIC use 4K page
- * plus sk_buff metadata even when receiving less than 1500 bytes of payload.
- *
- * Note that we use skb->len to decide to accept or drop this packet,
- * but sk->sk_rmem_alloc is the sum of all skb->truesize.
- */
static bool tcp_can_ingest(const struct sock *sk, const struct sk_buff *skb)
{
unsigned int rmem = atomic_read(&sk->sk_rmem_alloc);
- return rmem + skb->len <= sk->sk_rcvbuf;
+ return rmem <= sk->sk_rcvbuf;
}
static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,