netdev.vger.kernel.org archive mirror
* [PATCH net-next 00/11] tcp: receive side improvements
@ 2025-05-13 19:39 Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint Eric Dumazet
                   ` (13 more replies)
  0 siblings, 14 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

We have set tcp_rmem[2] to 15 MB at Google for about 8 years,
but have seen issues with high speed flows over very small RTTs.

TCP rx autotuning has a tendency to overestimate the RTT,
and therefore tp->rcvq_space.space and sk->sk_rcvbuf.

This makes TCP receive queues much bigger than necessary,
to the point where cpu caches are evicted before the application
can copy the data, on cpus using DDIO.

This series aims to fix this.

- The first patch adds the tcp_rcvbuf_grow() tracepoint, which was very
  convenient for studying the various issues fixed in this series.

- Seven patches fix receiver autotune issues.

- Two patches fix sender side issues.

- Final patch increases tcp_rmem[2] so that TCP speed over WAN
  can meet modern needs.

Tested on a 200Gbit NIC, average max throughput of a single flow:

Before:
 73593 Mbit.

After:
 122514 Mbit.

Eric Dumazet (11):
  tcp: add tcp_rcvbuf_grow() tracepoint
  tcp: fix sk_rcvbuf overshoot
  tcp: adjust rcvbuf in presence of reorders
  tcp: add receive queue awareness in tcp_rcv_space_adjust()
  tcp: remove zero TCP TS samples for autotuning
  tcp: fix initial tp->rcvq_space.space value for passive TS enabled
    flows
  tcp: always seek for minimal rtt in tcp_rcv_rtt_update()
  tcp: skip big rtt sample if receive queue is not empty
  tcp: increase tcp_limit_output_bytes default value to 4MB
  tcp: always use tcp_limit_output_bytes limitation
  tcp: increase tcp_rmem[2] to 32 MB

 Documentation/networking/ip-sysctl.rst |   4 +-
 include/linux/tcp.h                    |   2 +-
 include/trace/events/tcp.h             |  73 ++++++++++++++++
 net/ipv4/tcp.c                         |   2 +-
 net/ipv4/tcp_input.c                   | 110 ++++++++++++-------------
 net/ipv4/tcp_ipv4.c                    |   4 +-
 net/ipv4/tcp_output.c                  |   5 +-
 7 files changed, 134 insertions(+), 66 deletions(-)

-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-14 15:30   ` David Ahern
  2025-05-13 19:39 ` [PATCH net-next 02/11] tcp: fix sk_rcvbuf overshoot Eric Dumazet
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

Provide a new tracepoint to better understand the
(currently broken) behavior of tcp_rcv_space_adjust().

Call it only when tcp_rcv_space_adjust() has a chance
to make a change.

I chose to leave trace_tcp_rcv_space_adjust() as is,
because the intent of commit 6163849d289b ("net: introduce a new
tracepoint for tcp_rcv_space_adjust") was to have it called after
each data delivery to user space.

Tested:

Pair of hosts in the same rack. Ideally, sk->sk_rcvbuf should be kept small.

echo "4096 131072 33554432" >/proc/sys/net/ipv4/tcp_rmem
./netserver
perf record -C10 -e tcp:tcp_rcvbuf_grow sleep 30

<launch from client : netperf -H server -T,10>
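
The same events can also be captured without perf through tracefs
(assuming the standard /sys/kernel/tracing mount point):

echo 1 > /sys/kernel/tracing/events/tcp/tcp_rcvbuf_grow/enable
cat /sys/kernel/tracing/trace_pipe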

Trace for a TS enabled TCP flow (with standard ms granularity)

perf script // We can see that sk_rcvbuf is growing very fast to tcp_rmem[2]
  260.500397: tcp:tcp_rcvbuf_grow: time=291 rtt_us=274 copied=110592 inq=0 space=41080 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  260.501333: tcp:tcp_rcvbuf_grow: time=555 rtt_us=364 copied=333824 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
  260.501664: tcp:tcp_rcvbuf_grow: time=331 rtt_us=330 copied=798720 inq=0 space=333824 ooo=0 scaling_ratio=230 rcvbuf=4110551 ...
  260.502003: tcp:tcp_rcvbuf_grow: time=340 rtt_us=330 copied=1040384 inq=49152 space=798720 ooo=0 scaling_ratio=230 rcvbuf=7006410 ...
  260.502483: tcp:tcp_rcvbuf_grow: time=479 rtt_us=330 copied=2658304 inq=49152 space=1040384 ooo=0 scaling_ratio=230 rcvbuf=7006410 ...
  260.502899: tcp:tcp_rcvbuf_grow: time=416 rtt_us=413 copied=4026368 inq=147456 space=2658304 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.504233: tcp:tcp_rcvbuf_grow: time=493 rtt_us=487 copied=4800512 inq=196608 space=4026368 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.504792: tcp:tcp_rcvbuf_grow: time=559 rtt_us=551 copied=5672960 inq=49152 space=4800512 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.506614: tcp:tcp_rcvbuf_grow: time=610 rtt_us=607 copied=6688768 inq=180224 space=5672960 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.507280: tcp:tcp_rcvbuf_grow: time=666 rtt_us=656 copied=6868992 inq=49152 space=6688768 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.507979: tcp:tcp_rcvbuf_grow: time=699 rtt_us=699 copied=7000064 inq=0 space=6868992 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.508681: tcp:tcp_rcvbuf_grow: time=703 rtt_us=699 copied=7208960 inq=0 space=7000064 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.509426: tcp:tcp_rcvbuf_grow: time=744 rtt_us=737 copied=7569408 inq=0 space=7208960 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.510213: tcp:tcp_rcvbuf_grow: time=787 rtt_us=770 copied=7880704 inq=49152 space=7569408 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.511013: tcp:tcp_rcvbuf_grow: time=801 rtt_us=798 copied=8339456 inq=0 space=7880704 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.511860: tcp:tcp_rcvbuf_grow: time=847 rtt_us=824 copied=8601600 inq=49152 space=8339456 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.512710: tcp:tcp_rcvbuf_grow: time=850 rtt_us=846 copied=8814592 inq=65536 space=8601600 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.514428: tcp:tcp_rcvbuf_grow: time=871 rtt_us=865 copied=8855552 inq=49152 space=8814592 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.515333: tcp:tcp_rcvbuf_grow: time=905 rtt_us=882 copied=9228288 inq=49152 space=8855552 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.516237: tcp:tcp_rcvbuf_grow: time=905 rtt_us=896 copied=9371648 inq=49152 space=9228288 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.517149: tcp:tcp_rcvbuf_grow: time=911 rtt_us=909 copied=9543680 inq=49152 space=9371648 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.518070: tcp:tcp_rcvbuf_grow: time=921 rtt_us=921 copied=9793536 inq=0 space=9543680 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.520895: tcp:tcp_rcvbuf_grow: time=948 rtt_us=947 copied=10203136 inq=114688 space=9793536 ooo=0 scaling_ratio=230 rcvbuf=24622616 ...
  260.521853: tcp:tcp_rcvbuf_grow: time=959 rtt_us=954 copied=10293248 inq=57344 space=10203136 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.522818: tcp:tcp_rcvbuf_grow: time=964 rtt_us=959 copied=10330112 inq=0 space=10293248 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.524760: tcp:tcp_rcvbuf_grow: time=979 rtt_us=969 copied=10633216 inq=49152 space=10330112 ooo=0 scaling_ratio=230 rcvbuf=24691992 ...
  260.526709: tcp:tcp_rcvbuf_grow: time=975 rtt_us=973 copied=12013568 inq=163840 space=10633216 ooo=0 scaling_ratio=230 rcvbuf=25136755 ...
  260.527694: tcp:tcp_rcvbuf_grow: time=985 rtt_us=976 copied=12025856 inq=32768 space=12013568 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.530655: tcp:tcp_rcvbuf_grow: time=991 rtt_us=986 copied=12050432 inq=98304 space=12025856 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.533626: tcp:tcp_rcvbuf_grow: time=993 rtt_us=989 copied=12124160 inq=0 space=12050432 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.538606: tcp:tcp_rcvbuf_grow: time=1000 rtt_us=994 copied=12222464 inq=49152 space=12124160 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.545605: tcp:tcp_rcvbuf_grow: time=1005 rtt_us=998 copied=12263424 inq=81920 space=12222464 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.553626: tcp:tcp_rcvbuf_grow: time=1005 rtt_us=999 copied=12320768 inq=12288 space=12263424 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.589749: tcp:tcp_rcvbuf_grow: time=1001 rtt_us=1000 copied=12398592 inq=16384 space=12320768 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  260.806577: tcp:tcp_rcvbuf_grow: time=1010 rtt_us=1000 copied=12402688 inq=32768 space=12398592 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.002386: tcp:tcp_rcvbuf_grow: time=1002 rtt_us=1000 copied=12419072 inq=98304 space=12402688 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.803432: tcp:tcp_rcvbuf_grow: time=1013 rtt_us=1000 copied=12468224 inq=49152 space=12419072 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  261.829533: tcp:tcp_rcvbuf_grow: time=1004 rtt_us=1000 copied=12615680 inq=0 space=12468224 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...
  265.505435: tcp:tcp_rcvbuf_grow: time=1007 rtt_us=1000 copied=12632064 inq=32768 space=12615680 ooo=0 scaling_ratio=230 rcvbuf=33554432 ...

We also see rtt_us going gradually to 1000 usec, causing massive overshoot.

Trace for a usec TS enabled TCP flow (us granularity)

perf script // We can see that sk_rcvbuf is growing to a smaller value,
               thanks to tight rtt_us values.
 1509.273955: tcp:tcp_rcvbuf_grow: time=396 rtt_us=377 copied=110592 inq=0 space=41080 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
 1509.274366: tcp:tcp_rcvbuf_grow: time=412 rtt_us=365 copied=129024 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
 1509.274738: tcp:tcp_rcvbuf_grow: time=372 rtt_us=355 copied=194560 inq=0 space=129024 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
 1509.275020: tcp:tcp_rcvbuf_grow: time=282 rtt_us=257 copied=401408 inq=0 space=194560 ooo=0 scaling_ratio=230 rcvbuf=1399144 ...
 1509.275190: tcp:tcp_rcvbuf_grow: time=170 rtt_us=144 copied=741376 inq=229376 space=401408 ooo=0 scaling_ratio=230 rcvbuf=3021625 ...
 1509.275300: tcp:tcp_rcvbuf_grow: time=110 rtt_us=110 copied=1146880 inq=65536 space=741376 ooo=0 scaling_ratio=230 rcvbuf=4642390 ...
 1509.275449: tcp:tcp_rcvbuf_grow: time=149 rtt_us=106 copied=1310720 inq=737280 space=1146880 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275560: tcp:tcp_rcvbuf_grow: time=111 rtt_us=107 copied=1388544 inq=430080 space=1310720 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275674: tcp:tcp_rcvbuf_grow: time=114 rtt_us=113 copied=1495040 inq=421888 space=1388544 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275800: tcp:tcp_rcvbuf_grow: time=126 rtt_us=126 copied=1572864 inq=77824 space=1495040 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.275968: tcp:tcp_rcvbuf_grow: time=168 rtt_us=161 copied=1863680 inq=172032 space=1572864 ooo=0 scaling_ratio=230 rcvbuf=5498637 ...
 1509.276129: tcp:tcp_rcvbuf_grow: time=161 rtt_us=161 copied=1941504 inq=204800 space=1863680 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
 1509.276288: tcp:tcp_rcvbuf_grow: time=159 rtt_us=158 copied=1990656 inq=131072 space=1941504 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
 1509.276900: tcp:tcp_rcvbuf_grow: time=228 rtt_us=226 copied=2883584 inq=266240 space=1990656 ooo=0 scaling_ratio=230 rcvbuf=5782790 ...
 1509.277819: tcp:tcp_rcvbuf_grow: time=242 rtt_us=236 copied=3022848 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.278072: tcp:tcp_rcvbuf_grow: time=253 rtt_us=247 copied=3055616 inq=49152 space=3022848 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.279560: tcp:tcp_rcvbuf_grow: time=268 rtt_us=264 copied=3133440 inq=180224 space=3055616 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.279833: tcp:tcp_rcvbuf_grow: time=274 rtt_us=270 copied=3424256 inq=0 space=3133440 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.282187: tcp:tcp_rcvbuf_grow: time=277 rtt_us=273 copied=3465216 inq=180224 space=3424256 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.284685: tcp:tcp_rcvbuf_grow: time=292 rtt_us=292 copied=3481600 inq=147456 space=3465216 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.284983: tcp:tcp_rcvbuf_grow: time=297 rtt_us=295 copied=3702784 inq=45056 space=3481600 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.285596: tcp:tcp_rcvbuf_grow: time=311 rtt_us=310 copied=3723264 inq=40960 space=3702784 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.285909: tcp:tcp_rcvbuf_grow: time=313 rtt_us=304 copied=3846144 inq=196608 space=3723264 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.291654: tcp:tcp_rcvbuf_grow: time=322 rtt_us=311 copied=3960832 inq=49152 space=3846144 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.291986: tcp:tcp_rcvbuf_grow: time=333 rtt_us=330 copied=4075520 inq=360448 space=3960832 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.292319: tcp:tcp_rcvbuf_grow: time=332 rtt_us=332 copied=4079616 inq=65536 space=4075520 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.292666: tcp:tcp_rcvbuf_grow: time=348 rtt_us=347 copied=4177920 inq=212992 space=4079616 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.293015: tcp:tcp_rcvbuf_grow: time=349 rtt_us=345 copied=4276224 inq=262144 space=4177920 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.293371: tcp:tcp_rcvbuf_grow: time=356 rtt_us=346 copied=4415488 inq=49152 space=4276224 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...
 1509.515798: tcp:tcp_rcvbuf_grow: time=424 rtt_us=411 copied=4833280 inq=81920 space=4415488 ooo=0 scaling_ratio=230 rcvbuf=12316197 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
---
 include/trace/events/tcp.h | 73 ++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c       |  2 ++
 2 files changed, 75 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 53e878fa14d14ee1f6d072bfdd8179cd8b995d6f..006c2116c8f611c2caa401528af8cd11e6a7b703 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -213,6 +213,79 @@ DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
 	TP_ARGS(sk)
 );
 
+TRACE_EVENT(tcp_rcvbuf_grow,
+
+	TP_PROTO(struct sock *sk, int time),
+
+	TP_ARGS(sk, time),
+
+	TP_STRUCT__entry(
+		__field(int, time)
+		__field(__u32, rtt_us)
+		__field(__u32, copied)
+		__field(__u32, inq)
+		__field(__u32, space)
+		__field(__u32, ooo_space)
+		__field(__u32, rcvbuf)
+		__field(__u8, scaling_ratio)
+		__field(__u16, sport)
+		__field(__u16, dport)
+		__field(__u16, family)
+		__array(__u8, saddr, 4)
+		__array(__u8, daddr, 4)
+		__array(__u8, saddr_v6, 16)
+		__array(__u8, daddr_v6, 16)
+		__field(const void *, skaddr)
+		__field(__u64, sock_cookie)
+	),
+
+	TP_fast_assign(
+		struct inet_sock *inet = inet_sk(sk);
+		struct tcp_sock *tp = tcp_sk(sk);
+		__be32 *p32;
+
+		__entry->time = time;
+		__entry->rtt_us = tp->rcv_rtt_est.rtt_us >> 3;
+		__entry->copied = tp->copied_seq - tp->rcvq_space.seq;
+		__entry->inq = tp->rcv_nxt - tp->copied_seq;
+		__entry->space = tp->rcvq_space.space;
+		__entry->ooo_space = RB_EMPTY_ROOT(&tp->out_of_order_queue) ? 0 :
+				     TCP_SKB_CB(tp->ooo_last_skb)->end_seq -
+				     tp->rcv_nxt;
+
+		__entry->rcvbuf = sk->sk_rcvbuf;
+		__entry->scaling_ratio = tp->scaling_ratio;
+		__entry->sport = ntohs(inet->inet_sport);
+		__entry->dport = ntohs(inet->inet_dport);
+		__entry->family = sk->sk_family;
+
+		p32 = (__be32 *) __entry->saddr;
+		*p32 = inet->inet_saddr;
+
+		p32 = (__be32 *) __entry->daddr;
+		*p32 = inet->inet_daddr;
+
+		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
+			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+		__entry->skaddr = sk;
+		__entry->sock_cookie = sock_gen_cookie(sk);
+	),
+
+	TP_printk("time=%u rtt_us=%u copied=%u inq=%u space=%u ooo=%u scaling_ratio=%u rcvbuf=%u "
+		  "family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 "
+		  "saddrv6=%pI6c daddrv6=%pI6c skaddr=%p sock_cookie=%llx",
+		  __entry->time, __entry->rtt_us, __entry->copied,
+		  __entry->inq, __entry->space, __entry->ooo_space,
+		  __entry->scaling_ratio, __entry->rcvbuf,
+		  show_family_name(__entry->family),
+		  __entry->sport, __entry->dport,
+		  __entry->saddr, __entry->daddr,
+		  __entry->saddr_v6, __entry->daddr_v6,
+		  __entry->skaddr,
+		  __entry->sock_cookie)
+);
+
 TRACE_EVENT(tcp_retransmit_synack,
 
 	TP_PROTO(const struct sock *sk, const struct request_sock *req),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a35018e2d0ba27b14d0b59d3728f7181b1a51161..88beb6d0f7b5981e65937a6727a1111fd341335b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -769,6 +769,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	if (copied <= tp->rcvq_space.space)
 		goto new_measure;
 
+	trace_tcp_rcvbuf_grow(sk, time);
+
 	/* A bit of theory :
 	 * copied = bytes received in previous RTT, our base window
 	 * To cope with packet losses, we need a 2x factor
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 02/11] tcp: fix sk_rcvbuf overshoot
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 03/11] tcp: adjust rcvbuf in presence of reorders Eric Dumazet
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

Current autosizing in tcp_rcv_space_adjust() is too aggressive.

Instead of betting on possible losses and overestimating the BDP,
it is better to only account for slow start.

The following patch then adds more precise tuning
in the event of packet losses.
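
Rough arithmetic illustrating the new sizing (a sketch;
tcp_space_from_win() is approximately the inverse of
win = space * scaling_ratio / 256):

  rcvwin = 2 * rcvq_space.space          = 2 * 110592 = 221184
  rcvbuf ~= rcvwin * 256 / scaling_ratio = 221184 * 256 / 230 ~= 246187

matching the rcvbuf=246187 step visible in the trace shown in patch 5.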

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 59 +++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 34 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 88beb6d0f7b5981e65937a6727a1111fd341335b..89e886bb0fa11666ca4b51b032d536f233078dca 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -747,6 +747,29 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
 	}
 }
 
+static void tcp_rcvbuf_grow(struct sock *sk)
+{
+	const struct net *net = sock_net(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+	int rcvwin, rcvbuf, cap;
+
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_moderate_rcvbuf) ||
+	    (sk->sk_userlocks & SOCK_RCVBUF_LOCK))
+		return;
+
+	/* slow start: allow the sender to double its rate. */
+	rcvwin = tp->rcvq_space.space << 1;
+
+	cap = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
+
+	rcvbuf = min_t(u32, tcp_space_from_win(sk, rcvwin), cap);
+	if (rcvbuf > sk->sk_rcvbuf) {
+		WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
+		/* Make the window clamp follow along.  */
+		WRITE_ONCE(tp->window_clamp,
+			   tcp_win_from_space(sk, rcvbuf));
+	}
+}
 /*
  * This function should be called every time data is copied to user space.
  * It calculates the appropriate TCP receive buffer space.
@@ -771,42 +794,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
 
 	trace_tcp_rcvbuf_grow(sk, time);
 
-	/* A bit of theory :
-	 * copied = bytes received in previous RTT, our base window
-	 * To cope with packet losses, we need a 2x factor
-	 * To cope with slow start, and sender growing its cwin by 100 %
-	 * every RTT, we need a 4x factor, because the ACK we are sending
-	 * now is for the next RTT, not the current one :
-	 * <prev RTT . ><current RTT .. ><next RTT .... >
-	 */
-
-	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) &&
-	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
-		u64 rcvwin, grow;
-		int rcvbuf;
-
-		/* minimal window to cope with packet losses, assuming
-		 * steady state. Add some cushion because of small variations.
-		 */
-		rcvwin = ((u64)copied << 1) + 16 * tp->advmss;
-
-		/* Accommodate for sender rate increase (eg. slow start) */
-		grow = rcvwin * (copied - tp->rcvq_space.space);
-		do_div(grow, tp->rcvq_space.space);
-		rcvwin += (grow << 1);
-
-		rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin),
-			       READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
-		if (rcvbuf > sk->sk_rcvbuf) {
-			WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
-
-			/* Make the window clamp follow along.  */
-			WRITE_ONCE(tp->window_clamp,
-				   tcp_win_from_space(sk, rcvbuf));
-		}
-	}
 	tp->rcvq_space.space = copied;
 
+	tcp_rcvbuf_grow(sk);
+
 new_measure:
 	tp->rcvq_space.seq = tp->copied_seq;
 	tp->rcvq_space.time = tp->tcp_mstamp;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 03/11] tcp: adjust rcvbuf in presence of reorders
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 02/11] tcp: fix sk_rcvbuf overshoot Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 04/11] tcp: add receive queue awareness in tcp_rcv_space_adjust() Eric Dumazet
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

This patch takes care of the provisioning needed
when incoming packets are stored in the out-of-order queue.

This part was not implemented correctly; we need
to decouple it from the tcp_rcv_space_adjust() logic.

Without it, stalls in the pipe could happen.
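
A concrete illustration (hypothetical numbers): with rcv_nxt = 1000 and
an out-of-order queue holding bytes 2000..201000, the added code provisions

  rcvwin = 2 * rcvq_space.space + (201000 - 1000)

so the advertised window keeps covering the ~200 KB already buffered while
the sender retransmits the missing hole, instead of stalling the flow.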

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 89e886bb0fa11666ca4b51b032d536f233078dca..f799200db26492730fbd042a68c8d206d85455d4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -760,6 +760,9 @@ static void tcp_rcvbuf_grow(struct sock *sk)
 	/* slow start: allow the sender to double its rate. */
 	rcvwin = tp->rcvq_space.space << 1;
 
+	if (!RB_EMPTY_ROOT(&tp->out_of_order_queue))
+		rcvwin += TCP_SKB_CB(tp->ooo_last_skb)->end_seq - tp->rcv_nxt;
+
 	cap = READ_ONCE(net->ipv4.sysctl_tcp_rmem[2]);
 
 	rcvbuf = min_t(u32, tcp_space_from_win(sk, rcvwin), cap);
@@ -5166,6 +5169,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		skb_condense(skb);
 		skb_set_owner_r(skb, sk);
 	}
+	tcp_rcvbuf_grow(sk);
 }
 
 static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 04/11] tcp: add receive queue awareness in tcp_rcv_space_adjust()
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 03/11] tcp: adjust rcvbuf in presence of reorders Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 05/11] tcp: remove zero TCP TS samples for autotuning Eric Dumazet
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

If the application cannot drain a TCP socket queue fast enough,
tcp_rcv_space_adjust() can overestimate tp->rcvq_space.space.

Then sk->sk_rcvbuf can grow and hit tcp_rmem[2] for no good reason.

Fix this by taking into account the number of available bytes.

Keeping sk->sk_rcvbuf at the right size allows better cache efficiency.
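
Illustrative sketch (hypothetical numbers): if 1 MB arrived during the
last RTT but 256 KB still sits unread in the receive queue, only the
drained part is credited:

  copied = copied_seq - rcvq_space.seq   /* 1048576 */
  inq    = rcv_nxt - copied_seq          /*  262144 */
  copied -= inq                          /*  786432 bytes actually consumed */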

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
---
 include/linux/tcp.h  | 2 +-
 net/ipv4/tcp_input.c | 6 ++++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index a8af71623ba7ca16f211cb9884f431fc9462ce9e..29f59d50dc73f8c433865e6bc116cb1bac4eafb7 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -340,7 +340,7 @@ struct tcp_sock {
 	} rcv_rtt_est;
 /* Receiver queue space */
 	struct {
-		u32	space;
+		int	space;
 		u32	seq;
 		u64	time;
 	} rcvq_space;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f799200db26492730fbd042a68c8d206d85455d4..5d64a6ecfc8f78de3665afdea112d62c417cee27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -780,8 +780,7 @@ static void tcp_rcvbuf_grow(struct sock *sk)
 void tcp_rcv_space_adjust(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 copied;
-	int time;
+	int time, inq, copied;
 
 	trace_tcp_rcv_space_adjust(sk);
 
@@ -792,6 +791,9 @@ void tcp_rcv_space_adjust(struct sock *sk)
 
 	/* Number of bytes copied to user in last RTT */
 	copied = tp->copied_seq - tp->rcvq_space.seq;
+	/* Number of bytes in receive queue. */
+	inq = tp->rcv_nxt - tp->copied_seq;
+	copied -= inq;
 	if (copied <= tp->rcvq_space.space)
 		goto new_measure;
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 05/11] tcp: remove zero TCP TS samples for autotuning
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (3 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 04/11] tcp: add receive queue awareness in tcp_rcv_space_adjust() Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 06/11] tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows Eric Dumazet
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

For TCP flows using the ms granularity of RFC 7323 timestamps,
tcp_rcv_rtt_update() can be fed with 1 ms samples, breaking
TCP autotuning for data center flows with sub-ms RTT.

Instead, rely on the window based samples fed by tcp_rcv_rtt_measure().

tcp_rcvbuf_grow() output for a 10 second TCP_STREAM session now looks saner.
We can see rcvbuf is kept at a reasonable value.
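
A sketch of the arithmetic behind the fix: with the standard TCP_TS_HZ
of 1000, tcp_rtt_tsopt_us() measures delta in 1 ms ticks, so a sub-ms
RTT yields delta == 0.

  old: delta == 0 -> forced to 1 -> delta_us = 1000 (bogus 1 ms sample)
  new: delta == 0 -> delta_us = 0 -> dropped by the new 'delta > 0' check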

  222.234976: tcp:tcp_rcvbuf_grow: time=348 rtt_us=330 copied=110592 inq=0 space=40960 ooo=0 scaling_ratio=230 rcvbuf=131072 ...
  222.235276: tcp:tcp_rcvbuf_grow: time=300 rtt_us=288 copied=126976 inq=0 space=110592 ooo=0 scaling_ratio=230 rcvbuf=246187 ...
  222.235569: tcp:tcp_rcvbuf_grow: time=294 rtt_us=288 copied=184320 inq=0 space=126976 ooo=0 scaling_ratio=230 rcvbuf=282659 ...
  222.235833: tcp:tcp_rcvbuf_grow: time=264 rtt_us=244 copied=373760 inq=0 space=184320 ooo=0 scaling_ratio=230 rcvbuf=410312 ...
  222.236142: tcp:tcp_rcvbuf_grow: time=308 rtt_us=219 copied=424960 inq=20480 space=373760 ooo=0 scaling_ratio=230 rcvbuf=832022 ...
  222.236378: tcp:tcp_rcvbuf_grow: time=236 rtt_us=219 copied=692224 inq=49152 space=404480 ooo=0 scaling_ratio=230 rcvbuf=900407 ...
  222.236602: tcp:tcp_rcvbuf_grow: time=225 rtt_us=219 copied=730112 inq=49152 space=643072 ooo=0 scaling_ratio=230 rcvbuf=1431534 ...
  222.237050: tcp:tcp_rcvbuf_grow: time=229 rtt_us=219 copied=1160192 inq=49152 space=680960 ooo=0 scaling_ratio=230 rcvbuf=1515876 ...
  222.237618: tcp:tcp_rcvbuf_grow: time=305 rtt_us=218 copied=2228224 inq=49152 space=1111040 ooo=0 scaling_ratio=230 rcvbuf=2473271 ...
  222.238591: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3063808 inq=360448 space=2179072 ooo=0 scaling_ratio=230 rcvbuf=4850803 ...
  222.240647: tcp:tcp_rcvbuf_grow: time=260 rtt_us=218 copied=2752512 inq=0 space=2703360 ooo=0 scaling_ratio=230 rcvbuf=6017914 ...
  222.243535: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2834432 inq=49152 space=2752512 ooo=0 scaling_ratio=230 rcvbuf=6127331 ...
  222.245108: tcp:tcp_rcvbuf_grow: time=240 rtt_us=218 copied=2883584 inq=49152 space=2785280 ooo=0 scaling_ratio=230 rcvbuf=6200275 ...
  222.245333: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=2859008 inq=0 space=2834432 ooo=0 scaling_ratio=230 rcvbuf=6309692 ...
  222.301021: tcp:tcp_rcvbuf_grow: time=222 rtt_us=218 copied=2883584 inq=0 space=2859008 ooo=0 scaling_ratio=230 rcvbuf=6364400 ...
  222.989242: tcp:tcp_rcvbuf_grow: time=225 rtt_us=218 copied=2899968 inq=0 space=2883584 ooo=0 scaling_ratio=230 rcvbuf=6419108 ...
  224.139553: tcp:tcp_rcvbuf_grow: time=224 rtt_us=218 copied=3014656 inq=65536 space=2899968 ooo=0 scaling_ratio=230 rcvbuf=6455580 ...
  224.584608: tcp:tcp_rcvbuf_grow: time=232 rtt_us=218 copied=3014656 inq=49152 space=2949120 ooo=0 scaling_ratio=230 rcvbuf=6564997 ...
  230.145560: tcp:tcp_rcvbuf_grow: time=223 rtt_us=218 copied=2981888 inq=0 space=2965504 ooo=0 scaling_ratio=230 rcvbuf=6601469 ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
---
 net/ipv4/tcp_input.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5d64a6ecfc8f78de3665afdea112d62c417cee27..f3eae8f5ad2b6c5602542a1083328f71ec8cbded 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -712,7 +712,7 @@ static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
 	tp->rcv_rtt_est.time = tp->tcp_mstamp;
 }
 
-static s32 tcp_rtt_tsopt_us(const struct tcp_sock *tp)
+static s32 tcp_rtt_tsopt_us(const struct tcp_sock *tp, u32 min_delta)
 {
 	u32 delta, delta_us;
 
@@ -722,7 +722,7 @@ static s32 tcp_rtt_tsopt_us(const struct tcp_sock *tp)
 
 	if (likely(delta < INT_MAX / (USEC_PER_SEC / TCP_TS_HZ))) {
 		if (!delta)
-			delta = 1;
+			delta = min_delta;
 		delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);
 		return delta_us;
 	}
@@ -740,9 +740,9 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
 
 	if (TCP_SKB_CB(skb)->end_seq -
 	    TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss) {
-		s32 delta = tcp_rtt_tsopt_us(tp);
+		s32 delta = tcp_rtt_tsopt_us(tp, 0);
 
-		if (delta >= 0)
+		if (delta > 0)
 			tcp_rcv_rtt_update(tp, delta, 0);
 	}
 }
@@ -3224,7 +3224,7 @@ static bool tcp_ack_update_rtt(struct sock *sk, const int flag,
 	 */
 	if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp &&
 	    tp->rx_opt.rcv_tsecr && flag & FLAG_ACKED)
-		seq_rtt_us = ca_rtt_us = tcp_rtt_tsopt_us(tp);
+		seq_rtt_us = ca_rtt_us = tcp_rtt_tsopt_us(tp, 1);
 
 	rs->rtt_us = ca_rtt_us; /* RTT of last (S)ACKed packet (or -1) */
 	if (seq_rtt_us < 0)
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 06/11] tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (4 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 05/11] tcp: remove zero TCP TS samples for autotuning Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 07/11] tcp: always seek for minimal rtt in tcp_rcv_rtt_update() Eric Dumazet
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

tcp_rcv_state_process() must tweak tp->advmss for TS enabled flows
before the call to tcp_init_transfer() / tcp_init_buffer_space().

Otherwise tp->rcvq_space.space is off by 120 bytes
(TCP_INIT_CWND * TCPOLEN_TSTAMP_ALIGNED).
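
For reference, the arithmetic behind that figure (constants from
include/net/tcp.h):

  TCP_INIT_CWND * TCPOLEN_TSTAMP_ALIGNED = 10 * 12 = 120 bytes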

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
---
 net/ipv4/tcp_input.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f3eae8f5ad2b6c5602542a1083328f71ec8cbded..32b8b332c7d82e8c6a0716b26f2e048d68667864 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6872,6 +6872,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		if (!tp->srtt_us)
 			tcp_synack_rtt_meas(sk, req);
 
+		if (tp->rx_opt.tstamp_ok)
+			tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
+
 		if (req) {
 			tcp_rcv_synrecv_state_fastopen(sk);
 		} else {
@@ -6897,9 +6900,6 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
 
-		if (tp->rx_opt.tstamp_ok)
-			tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
-
 		if (!inet_csk(sk)->icsk_ca_ops->cong_control)
 			tcp_update_pacing_rate(sk);
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 07/11] tcp: always seek for minimal rtt in tcp_rcv_rtt_update()
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (5 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 06/11] tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 08/11] tcp: skip big rtt sample if receive queue is not empty Eric Dumazet
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

The goal of tcp_rcv_rtt_update() is to maintain an estimate of the RTT
in tp->rcv_rtt_est.rtt_us, used by tcp_rcv_space_adjust().

When TCP TS are enabled, tcp_rcv_rtt_update() uses an
EWMA to smooth the samples.

Change this to immediately latch the incoming value if it
is lower than tp->rcv_rtt_est.rtt_us, so that tcp_rcv_space_adjust()
does not overshoot tp->rcvq_space.space and sk->sk_rcvbuf.
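
Expressed on the 8x-scaled value kept in tp->rcv_rtt_est.rtt_us, the new
update rule is (a sketch of the diff below):

  if old == 0 or 8*sample < old:  new = 8*sample               (latch minimum)
  elif not win_dep:               new = old - old/8 + sample   (EWMA, 1/8 gain)
  else:                           old is kept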

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32b8b332c7d82e8c6a0716b26f2e048d68667864..4723d696492517143a2f3c035bfda6b05198a824 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -664,10 +664,12 @@ EXPORT_IPV6_MOD(tcp_initialize_rcv_mss);
  */
 static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 {
-	u32 new_sample = tp->rcv_rtt_est.rtt_us;
-	long m = sample;
+	u32 new_sample, old_sample = tp->rcv_rtt_est.rtt_us;
+	long m = sample << 3;
 
-	if (new_sample != 0) {
+	if (old_sample == 0 || m < old_sample) {
+		new_sample = m;
+	} else {
 		/* If we sample in larger samples in the non-timestamp
 		 * case, we could grossly overestimate the RTT especially
 		 * with chatty applications or bulk transfer apps which
@@ -678,17 +680,9 @@ static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 		 * else with timestamps disabled convergence takes too
 		 * long.
 		 */
-		if (!win_dep) {
-			m -= (new_sample >> 3);
-			new_sample += m;
-		} else {
-			m <<= 3;
-			if (m < new_sample)
-				new_sample = m;
-		}
-	} else {
-		/* No previous measure. */
-		new_sample = m << 3;
+		if (win_dep)
+			return;
+		new_sample = old_sample - (old_sample >> 3) + sample;
 	}
 
 	tp->rcv_rtt_est.rtt_us = new_sample;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 08/11] tcp: skip big rtt sample if receive queue is not empty
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (6 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 07/11] tcp: always seek for minimal rtt in tcp_rcv_rtt_update() Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 09/11] tcp: increase tcp_limit_output_bytes default value to 4MB Eric Dumazet
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

The role of tcp_rcv_rtt_update() is to keep an estimate
of the RTT (tp->rcv_rtt_est.rtt_us) for receivers.

If an application is too slow to drain the TCP receive
queue, it is better to keep the RTT estimate small,
so that tcp_rcv_space_adjust() does not inflate
tp->rcvq_space.space and sk->sk_rcvbuf.

Note that only the EWMA path is affected: the minimum-latching
path added in the previous patch still applies regardless of
receive queue occupancy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4723d696492517143a2f3c035bfda6b05198a824..8ec92dec321a909abe00203d0097c8bf4df1a240 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -682,6 +682,9 @@ static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 		 */
 		if (win_dep)
 			return;
+		/* Do not use this sample if receive queue is not empty. */
+		if (tp->rcv_nxt != tp->copied_seq)
+			return;
 		new_sample = old_sample - (old_sample >> 3) + sample;
 	}
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 09/11] tcp: increase tcp_limit_output_bytes default value to 4MB
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (7 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 08/11] tcp: skip big rtt sample if receive queue is not empty Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 10/11] tcp: always use tcp_limit_output_bytes limitation Eric Dumazet
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

The last change happened in 2018 with commit c73e5807e4f6
("tcp: tsq: no longer use limit_output_bytes for paced flows").

Modern NIC speeds have increased 4x since then.
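
The value can be checked or tuned at runtime, per network namespace:

  sysctl net.ipv4.tcp_limit_output_bytes
  net.ipv4.tcp_limit_output_bytes = 4194304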

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.rst | 2 +-
 net/ipv4/tcp_ipv4.c                    | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index b43222ee57cf9e54e38cb78f752f050c2f43a5cf..91b7d0a1c7fd884ee964d5be0d4dbd10ce040f76 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1099,7 +1099,7 @@ tcp_limit_output_bytes - INTEGER
 	limits the number of bytes on qdisc or device to reduce artificial
 	RTT/cwnd and reduce bufferbloat.
 
-	Default: 1048576 (16 * 65536)
+	Default: 4194304 (4 MB)
 
 tcp_challenge_ack_limit - INTEGER
 	Limits number of Challenge ACK sent per second, as recommended
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d5b5c32115d2ef84b0c91d43f584e571f342d9fb..6a14f9e6fef645511be5738e0ead22e168fb20b2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3495,8 +3495,8 @@ static int __net_init tcp_sk_init(struct net *net)
 	 * which are too large can cause TCP streams to be bursty.
 	 */
 	net->ipv4.sysctl_tcp_tso_win_divisor = 3;
-	/* Default TSQ limit of 16 TSO segments */
-	net->ipv4.sysctl_tcp_limit_output_bytes = 16 * 65536;
+	/* Default TSQ limit of 4 MB */
+	net->ipv4.sysctl_tcp_limit_output_bytes = 4 << 20;
 
 	/* rfc5961 challenge ack rate limiting, per net-ns, disabled by default. */
 	net->ipv4.sysctl_tcp_challenge_ack_limit = INT_MAX;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 10/11] tcp: always use tcp_limit_output_bytes limitation
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (8 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 09/11] tcp: increase tcp_limit_output_bytes default value to 4MB Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-13 19:39 ` [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB Eric Dumazet
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

This partially reverts commit c73e5807e4f6 ("tcp: tsq: no longer use
limit_output_bytes for paced flows")

Overriding the tcp_limit_output_bytes sysctl value
for FQ enabled flows has the following problem:

It allows TCP to queue around 2 ms worth of data per flow,
defeating tcp_rcv_rtt_update() accuracy on the receiver,
forcing it to increase sk->sk_rcvbuf even if the real
RTT is around 100 us.

After this change, we keep enough packets in flight to fill
the pipe, and keep receive queues small enough to get
good cache behavior (cpu caches and/or NIC driver page pools).
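
Rough arithmetic (illustrative, assuming the default sk_pacing_shift of 10):
at 100 Gbit/s, the pacing based limit sk_pacing_rate >> sk_pacing_shift
alone would allow

  12.5e9 B/s / 1024 ~= 12 MB

queued per flow below the socket; applying the clamp unconditionally caps
this at the 4 MB tcp_limit_output_bytes default from the previous patch.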

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 13295a59d22e65305d8c4094313e4aa37306cbff..3ac8d2d17e1ff42aaeb9adf0a9e0c99c13d141a8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2619,9 +2619,8 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
 	limit = max_t(unsigned long,
 		      2 * skb->truesize,
 		      READ_ONCE(sk->sk_pacing_rate) >> READ_ONCE(sk->sk_pacing_shift));
-	if (sk->sk_pacing_status == SK_PACING_NONE)
-		limit = min_t(unsigned long, limit,
-			      READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes));
+	limit = min_t(unsigned long, limit,
+		      READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes));
 	limit <<= factor;
 
 	if (static_branch_unlikely(&tcp_tx_delay_enabled) &&
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (9 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 10/11] tcp: always use tcp_limit_output_bytes limitation Eric Dumazet
@ 2025-05-13 19:39 ` Eric Dumazet
  2025-05-14 20:24   ` Jakub Kicinski
  2025-05-14 20:26 ` [PATCH net-next 00/11] tcp: receive side improvements Jakub Kicinski
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2025-05-13 19:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet, Eric Dumazet

The last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
("tcp: change tcp_adv_win_scale and tcp_rmem[2]").

TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.

After this series' improvements, it is time to increase the default.
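
The new ceiling is visible via the usual sysctl (values below assume a
host with enough RAM that the 32 MB cap, not nr_free_buffer_pages(), is
the limiting factor):

  sysctl net.ipv4.tcp_rmem
  net.ipv4.tcp_rmem = 4096 131072 33554432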

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/ip-sysctl.rst | 2 +-
 net/ipv4/tcp.c                         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 91b7d0a1c7fd884ee964d5be0d4dbd10ce040f76..0f1251cce31491930c3e446ae746e538d22fc5c7 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -735,7 +735,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
 	net.core.rmem_max.  Calling setsockopt() with SO_RCVBUF disables
 	automatic tuning of that socket's receive buffer size, in which
 	case this value is ignored.
-	Default: between 131072 and 6MB, depending on RAM size.
+	Default: between 131072 and 32MB, depending on RAM size.
 
 tcp_sack - BOOLEAN
 	Enable select acknowledgments (SACKS).
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0ae265d39184ed1a40a724a1ad6bb8f2f22d4fff..b7b6ab41b496f98bf82e099fab1da454dce1fe67 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5231,7 +5231,7 @@ void __init tcp_init(void)
 	/* Set per-socket limits to no more than 1/128 the pressure threshold */
 	limit = nr_free_buffer_pages() << (PAGE_SHIFT - 7);
 	max_wshare = min(4UL*1024*1024, limit);
-	max_rshare = min(6UL*1024*1024, limit);
+	max_rshare = min(32UL*1024*1024, limit);
 
 	init_net.ipv4.sysctl_tcp_wmem[0] = PAGE_SIZE;
 	init_net.ipv4.sysctl_tcp_wmem[1] = 16*1024;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint
  2025-05-13 19:39 ` [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint Eric Dumazet
@ 2025-05-14 15:30   ` David Ahern
  2025-05-14 15:38     ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Ahern @ 2025-05-14 15:30 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet

On 5/13/25 1:39 PM, Eric Dumazet wrote:
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a35018e2d0ba27b14d0b59d3728f7181b1a51161..88beb6d0f7b5981e65937a6727a1111fd341335b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -769,6 +769,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
>  	if (copied <= tp->rcvq_space.space)
>  		goto new_measure;
>  
> +	trace_tcp_rcvbuf_grow(sk, time);

tracepoints typically take on the name of the function. Patch 2 moves a
lot of logic from tcp_rcv_space_adjust to tcp_rcvbuf_grow but does not
move this tracepoint into it. For the sake of consistency, why not do that -
and add this patch after the code move?

> +
>  	/* A bit of theory :
>  	 * copied = bytes received in previous RTT, our base window
>  	 * To cope with packet losses, we need a 2x factor


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint
  2025-05-14 15:30   ` David Ahern
@ 2025-05-14 15:38     ` Eric Dumazet
  2025-05-14 15:46       ` David Ahern
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2025-05-14 15:38 UTC (permalink / raw)
  To: David Ahern
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet

On Wed, May 14, 2025 at 8:30 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 5/13/25 1:39 PM, Eric Dumazet wrote:
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index a35018e2d0ba27b14d0b59d3728f7181b1a51161..88beb6d0f7b5981e65937a6727a1111fd341335b 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -769,6 +769,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
> >       if (copied <= tp->rcvq_space.space)
> >               goto new_measure;
> >
> > +     trace_tcp_rcvbuf_grow(sk, time);
>
> tracepoints typically take on the name of the function. Patch 2 moves a
> lot of logic from tcp_rcv_space_adjust to tcp_rcvbuf_grow but does not
> move this tracepoint into it. For sake of consistency, why not do that -
> and add this patch after the code move?

The prior value is needed in the tracepoint, but in patch 2, I call
tcp_rcvbuf_grow() after it is overwritten.

I was planning to add a call to this tracepoint from
tcp_data_queue_ofo(), with 'time==0', in the third patch.

But I found this quite noisy and not useful, so I removed it from the OFO case.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint
  2025-05-14 15:38     ` Eric Dumazet
@ 2025-05-14 15:46       ` David Ahern
  2025-05-14 16:33         ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Ahern @ 2025-05-14 15:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet

On 5/14/25 9:38 AM, Eric Dumazet wrote:
> On Wed, May 14, 2025 at 8:30 AM David Ahern <dsahern@kernel.org> wrote:
>>
>> On 5/13/25 1:39 PM, Eric Dumazet wrote:
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index a35018e2d0ba27b14d0b59d3728f7181b1a51161..88beb6d0f7b5981e65937a6727a1111fd341335b 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -769,6 +769,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
>>>       if (copied <= tp->rcvq_space.space)
>>>               goto new_measure;
>>>
>>> +     trace_tcp_rcvbuf_grow(sk, time);
>>
>> tracepoints typically take on the name of the function. Patch 2 moves a
>> lot of logic from tcp_rcv_space_adjust to tcp_rcvbuf_grow but does not
>> move this tracepoint into it. For sake of consistency, why not do that -
>> and add this patch after the code move?
> 
> Prior value is needed in the tracepoint, but in patch 2, I call
> tcp_rcvbuf_grow() after it is overwritten.
> 

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8ec92dec321a..6bfbe9005fdb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -744,12 +744,16 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
        }
 }

-static void tcp_rcvbuf_grow(struct sock *sk)
+static void tcp_rcvbuf_grow(struct sock *sk, int time, int copied)
 {
        const struct net *net = sock_net(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        int rcvwin, rcvbuf, cap;

+       trace_tcp_rcvbuf_grow(sk, time);
+
+       tp->rcvq_space.space = copied;
+
        if (!READ_ONCE(net->ipv4.sysctl_tcp_moderate_rcvbuf) ||
            (sk->sk_userlocks & SOCK_RCVBUF_LOCK))
                return;
@@ -794,11 +798,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
        if (copied <= tp->rcvq_space.space)
                goto new_measure;

-       trace_tcp_rcvbuf_grow(sk, time);
-
-       tp->rcvq_space.space = copied;
-
-       tcp_rcvbuf_grow(sk);
+       tcp_rcvbuf_grow(sk, time, copied);

 new_measure:
        tp->rcvq_space.seq = tp->copied_seq;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint
  2025-05-14 15:46       ` David Ahern
@ 2025-05-14 16:33         ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-14 16:33 UTC (permalink / raw)
  To: David Ahern
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet

On Wed, May 14, 2025 at 8:46 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 5/14/25 9:38 AM, Eric Dumazet wrote:
> > On Wed, May 14, 2025 at 8:30 AM David Ahern <dsahern@kernel.org> wrote:
> >>
> >> On 5/13/25 1:39 PM, Eric Dumazet wrote:
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index a35018e2d0ba27b14d0b59d3728f7181b1a51161..88beb6d0f7b5981e65937a6727a1111fd341335b 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -769,6 +769,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
> >>>       if (copied <= tp->rcvq_space.space)
> >>>               goto new_measure;
> >>>
> >>> +     trace_tcp_rcvbuf_grow(sk, time);
> >>
> >> tracepoints typically take on the name of the function. Patch 2 moves a
> >> lot of logic from tcp_rcv_space_adjust to tcp_rcvbuf_grow but does not
> >> move this tracepoint into it. For sake of consistency, why not do that -
> >> and add this patch after the code move?
> >
> > Prior value is needed in the tracepoint, but in patch 2, I call
> > tcp_rcvbuf_grow() after it is overwritten.
> >
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8ec92dec321a..6bfbe9005fdb 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -744,12 +744,16 @@ static inline void tcp_rcv_rtt_measure_ts(struct
> sock *sk,
>         }
>  }
>
> -static void tcp_rcvbuf_grow(struct sock *sk)
> +static void tcp_rcvbuf_grow(struct sock *sk, int time, int copied)
>  {
>         const struct net *net = sock_net(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         int rcvwin, rcvbuf, cap;
>
> +       trace_tcp_rcvbuf_grow(sk, time);
> +
> +       tp->rcvq_space.space = copied;
> +
>         if (!READ_ONCE(net->ipv4.sysctl_tcp_moderate_rcvbuf) ||
>             (sk->sk_userlocks & SOCK_RCVBUF_LOCK))
>                 return;
> @@ -794,11 +798,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
>         if (copied <= tp->rcvq_space.space)
>                 goto new_measure;
>
> -       trace_tcp_rcvbuf_grow(sk, time);
> -
> -       tp->rcvq_space.space = copied;

I think I prefer leaving this write here, instead of having to go to
tcp_rcvbuf_grow()

> -
> -       tcp_rcvbuf_grow(sk);
> +       tcp_rcvbuf_grow(sk, time, copied);
>
>  new_measure:
>         tp->rcvq_space.seq = tp->copied_seq;
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB
  2025-05-13 19:39 ` [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB Eric Dumazet
@ 2025-05-14 20:24   ` Jakub Kicinski
  2025-05-14 20:53     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2025-05-14 20:24 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Eric Dumazet, David S . Miller, Paolo Abeni, Neal Cardwell,
	Simon Horman, Rick Jones, Wei Wang, netdev, eric.dumazet, bpf

On Tue, 13 May 2025 19:39:19 +0000 Eric Dumazet wrote:
> Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
> ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
> 
> TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.
> 
> After this series improvements, it is time to increase the default.

I think this breaks the BPF syncookie test, Kuniyuki any idea why?

https://github.com/kernel-patches/bpf/actions/runs/15016644781/job/42196471693

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 00/11] tcp: receive side improvements
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (10 preceding siblings ...)
  2025-05-13 19:39 ` [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB Eric Dumazet
@ 2025-05-14 20:26 ` Jakub Kicinski
  2025-05-15 18:50 ` patchwork-bot+netdevbpf
  2025-05-22 14:03 ` Daniel Borkmann
  13 siblings, 0 replies; 25+ messages in thread
From: Jakub Kicinski @ 2025-05-14 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, Neal Cardwell, Simon Horman,
	Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev, eric.dumazet

On Tue, 13 May 2025 19:39:08 +0000 Eric Dumazet wrote:
> Before:
>  73593 Mbit.
> 
> After:
>  122514 Mbit.

Very exciting, obviously :)

I hid it from patchwork temporarily until we figure out 
the BPF selftest issue.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB
  2025-05-14 20:24   ` Jakub Kicinski
@ 2025-05-14 20:53     ` Kuniyuki Iwashima
  2025-05-14 21:20       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-14 20:53 UTC (permalink / raw)
  To: kuba
  Cc: bpf, davem, edumazet, eric.dumazet, horms, jonesrick, kuniyu,
	ncardwell, netdev, pabeni, weiwan

From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 14 May 2025 13:24:22 -0700
> On Tue, 13 May 2025 19:39:19 +0000 Eric Dumazet wrote:
> > Last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
> > ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
> > 
> > TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.
> > 
> > After this series improvements, it is time to increase the default.
> 
> I think this breaks the BPF syncookie test, Kuniyuki any idea why?
> 
> https://github.com/kernel-patches/bpf/actions/runs/15016644781/job/42196471693

It seems the ACK was not handled by BPF at the tc hook on lo.

Was the ACK not sent, or did tcp_load_headers() fail to parse it?
Both sound unlikely though.

Will try to reproduce it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB
  2025-05-14 20:53     ` Kuniyuki Iwashima
@ 2025-05-14 21:20       ` Kuniyuki Iwashima
  2025-05-14 21:26         ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-14 21:20 UTC (permalink / raw)
  To: kuniyu
  Cc: bpf, davem, edumazet, eric.dumazet, horms, jonesrick, kuba,
	ncardwell, netdev, pabeni, weiwan

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Wed, 14 May 2025 13:53:39 -0700
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Wed, 14 May 2025 13:24:22 -0700
> > On Tue, 13 May 2025 19:39:19 +0000 Eric Dumazet wrote:
> > > The last change to tcp_rmem[2] happened in 2012, in commit b49960a05e32
> > > ("tcp: change tcp_adv_win_scale and tcp_rmem[2]").
> > > 
> > > TCP performance on WAN is mostly limited by tcp_rmem[2] for receivers.
> > > 
> > > After this series' improvements, it is time to increase the default.
> > 
> > I think this breaks the BPF syncookie test. Kuniyuki, any idea why?
> > 
> > https://github.com/kernel-patches/bpf/actions/runs/15016644781/job/42196471693
> 
> It seems the ACK was not handled by BPF at the tc hook on lo.
> 
> Was the ACK not sent, or did tcp_load_headers() fail to parse it?
> Both sound unlikely, though.
> 
> Will try to reproduce it.

I hard-coded the expected TCPOPT_WINDOW value to be 7, and this
series bumps it to 10, so the SYN was dropped as invalid.

The diff below fixes the failure, and I think it's not a blocker.

---8<---
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
index eb5cca1fce16..7d5293de1952 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
@@ -294,7 +294,9 @@ static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
 	    (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
 		goto err;
 
-	if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
+	if (!ctx->attrs.wscale_ok ||
+	    !ctx->attrs.snd_wscale ||
+	    ctx->attrs.snd_wscale >= BPF_SYNCOOKIE_WSCALE_MASK)
 		goto err;
 
 	if (!ctx->attrs.tstamp_ok)
---8<---
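For context on why the scale moved: the kernel derives the initial
window scale from the receive buffer limit, picking the smallest shift
that fits the buffer into the 16-bit window field, clamped to
TCP_MAX_WSCALE (14). A simplified model of that relation, ignoring the
window_clamp and rmem_max interactions in tcp_select_initial_window():

/* Simplified sketch: smallest shift fitting 'space' into 16 bits. */
static int wscale_for(unsigned int space)
{
	int ws = 0;

	while (ws < 14 && (space >> ws) > 65535)
		ws++;
	return ws;
}

/*
 * wscale_for(6 << 20)  == 7	(old tcp_rmem[2] default, 6 MB)
 * wscale_for(32 << 20) == 10	(new tcp_rmem[2] default, 32 MB)
 */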

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB
  2025-05-14 21:20       ` Kuniyuki Iwashima
@ 2025-05-14 21:26         ` Jakub Kicinski
  2025-05-14 21:28           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2025-05-14 21:26 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: bpf, davem, edumazet, eric.dumazet, horms, jonesrick, ncardwell,
	netdev, pabeni, weiwan

On Wed, 14 May 2025 14:20:05 -0700 Kuniyuki Iwashima wrote:
> > It seems the ACK was not handled by BPF at the tc hook on lo.
> > 
> > Was the ACK not sent, or did tcp_load_headers() fail to parse it?
> > Both sound unlikely, though.
> > 
> > Will try to reproduce it.
> 
> I hard-coded the expected TCPOPT_WINDOW value to be 7, and this
> series bumps it to 10, so the SYN was dropped as invalid.
> 
> The diff below fixes the failure, and I think it's not a blocker.
> 
> ---8<---
> diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
> index eb5cca1fce16..7d5293de1952 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
> @@ -294,7 +294,9 @@ static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
>  	    (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
>  		goto err;
>  
> -	if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
> +	if (!ctx->attrs.wscale_ok ||
> +	    !ctx->attrs.snd_wscale ||
> +	    ctx->attrs.snd_wscale >= BPF_SYNCOOKIE_WSCALE_MASK)
>  		goto err;
>  
>  	if (!ctx->attrs.tstamp_ok)

Awesome, could you submit it officially? As soon as your fix is in
patchwork I can put Eric's series back into the testing branch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB
  2025-05-14 21:26         ` Jakub Kicinski
@ 2025-05-14 21:28           ` Kuniyuki Iwashima
  0 siblings, 0 replies; 25+ messages in thread
From: Kuniyuki Iwashima @ 2025-05-14 21:28 UTC (permalink / raw)
  To: kuba
  Cc: bpf, davem, edumazet, eric.dumazet, horms, jonesrick, kuniyu,
	ncardwell, netdev, pabeni, weiwan

From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 14 May 2025 14:26:20 -0700
> On Wed, 14 May 2025 14:20:05 -0700 Kuniyuki Iwashima wrote:
> > > It seems the ACK was not handled by BPF at the tc hook on lo.
> > > 
> > > Was the ACK not sent, or did tcp_load_headers() fail to parse it?
> > > Both sound unlikely, though.
> > > 
> > > Will try to reproduce it.
> > 
> > I hard-coded the expected TCPOPT_WINDOW value to be 7, and this
> > series bumps it to 10, so the SYN was dropped as invalid.
> > 
> > The diff below fixes the failure, and I think it's not a blocker.
> > 
> > ---8<---
> > diff --git a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
> > index eb5cca1fce16..7d5293de1952 100644
> > --- a/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
> > +++ b/tools/testing/selftests/bpf/progs/test_tcp_custom_syncookie.c
> > @@ -294,7 +294,9 @@ static int tcp_validate_sysctl(struct tcp_syncookie *ctx)
> >  	    (ctx->ipv6 && ctx->attrs.mss != MSS_LOCAL_IPV6))
> >  		goto err;
> >  
> > -	if (!ctx->attrs.wscale_ok || ctx->attrs.snd_wscale != 7)
> > +	if (!ctx->attrs.wscale_ok ||
> > +	    !ctx->attrs.snd_wscale ||
> > +	    ctx->attrs.snd_wscale >= BPF_SYNCOOKIE_WSCALE_MASK)
> >  		goto err;
> >  
> >  	if (!ctx->attrs.tstamp_ok)
> 
> Awesome, could you submit it officially? As soon as your fix is in
> patchwork I can put Eric's series back into the testing branch.

For sure, will post a patch shortly.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 00/11] tcp: receive side improvements
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (11 preceding siblings ...)
  2025-05-14 20:26 ` [PATCH net-next 00/11] tcp: receive side improvements Jakub Kicinski
@ 2025-05-15 18:50 ` patchwork-bot+netdevbpf
  2025-05-22 14:03 ` Daniel Borkmann
  13 siblings, 0 replies; 25+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-05-15 18:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, ncardwell, horms, kuniyu, jonesrick, weiwan,
	netdev, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 13 May 2025 19:39:08 +0000 you wrote:
> We have set tcp_rmem[2] to 15 MB for about 8 years at Google,
> but had some issues for high speed flows on very small RTT.
> 
> TCP rx autotuning has a tendency to overestimate the RTT,
> thus tp->rcvq_space.space and sk->sk_rcvbuf.
> 
> This makes TCP receive queues much bigger than necessary,
> to a point cpu caches are evicted before application can
> copy the data, on cpus using DDIO.
> 
> [...]

Here is the summary with links:
  - [net-next,01/11] tcp: add tcp_rcvbuf_grow() tracepoint
    https://git.kernel.org/netdev/net-next/c/c1269d3d12b8
  - [net-next,02/11] tcp: fix sk_rcvbuf overshoot
    https://git.kernel.org/netdev/net-next/c/65c5287892e9
  - [net-next,03/11] tcp: adjust rcvbuf in presence of reorders
    https://git.kernel.org/netdev/net-next/c/63ad7dfedfae
  - [net-next,04/11] tcp: add receive queue awareness in tcp_rcv_space_adjust()
    https://git.kernel.org/netdev/net-next/c/ea33537d8292
  - [net-next,05/11] tcp: remove zero TCP TS samples for autotuning
    https://git.kernel.org/netdev/net-next/c/d59fc95be9d0
  - [net-next,06/11] tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows
    https://git.kernel.org/netdev/net-next/c/cd171461b90a
  - [net-next,07/11] tcp: always seek for minimal rtt in tcp_rcv_rtt_update()
    https://git.kernel.org/netdev/net-next/c/b879dcb1aeec
  - [net-next,08/11] tcp: skip big rtt sample if receive queue is not empty
    https://git.kernel.org/netdev/net-next/c/a00f135cd986
  - [net-next,09/11] tcp: increase tcp_limit_output_bytes default value to 4MB
    https://git.kernel.org/netdev/net-next/c/9ea3bfa61b09
  - [net-next,10/11] tcp: always use tcp_limit_output_bytes limitation
    https://git.kernel.org/netdev/net-next/c/c4221a8cc3a7
  - [net-next,11/11] tcp: increase tcp_rmem[2] to 32 MB
    https://git.kernel.org/netdev/net-next/c/572be9bf9d0d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 00/11] tcp: receive side improvements
  2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
                   ` (12 preceding siblings ...)
  2025-05-15 18:50 ` patchwork-bot+netdevbpf
@ 2025-05-22 14:03 ` Daniel Borkmann
  2025-05-22 14:11   ` Eric Dumazet
  13 siblings, 1 reply; 25+ messages in thread
From: Daniel Borkmann @ 2025-05-22 14:03 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet

Hi Eric,

On 5/13/25 9:39 PM, Eric Dumazet wrote:
> We have set tcp_rmem[2] to 15 MB for about 8 years at Google,
> but had some issues for high speed flows on very small RTT.

Are there plans to bump/modernize the rmem_default and rmem_max defaults,
too? It looks like the last time that was done was in commit eaa72dc4748
("neigh: increase queue_len_bytes to match wmem_default"). Fwiw, we've seen
deployments where vxlan/geneve is used for E/W traffic hit the distro
default limits, leading to UDP drops for the encapsulated TCP traffic.
Would it make sense to move these to e.g. 4MB as well, or have you used
another heuristic that worked well over the years?
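(A quick illustration of why those defaults matter even to applications
that size their own buffers: a per-socket SO_RCVBUF request is silently
capped at net.core.rmem_max, and kernel-created tunnel sockets start
from rmem_default and cannot be resized from userspace at all. The
sketch below is illustrative only; the helper name and values are made
up.)

#include <stdio.h>
#include <sys/socket.h>

/* Ask for a large receive buffer and print what the kernel granted.
 * getsockopt() reports the kernel's (doubled) bookkeeping value,
 * capped by net.core.rmem_max, so a 16 MB request on small distro
 * defaults comes back far smaller than asked.
 */
static void show_rcvbuf_clamp(int fd)
{
	int requested = 16 * 1024 * 1024;
	int effective = 0;
	socklen_t len = sizeof(effective);

	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len);
	printf("requested %d, got %d\n", requested, effective);
}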

> TCP rx autotuning has a tendency to overestimate the RTT,
> thus tp->rcvq_space.space and sk->sk_rcvbuf.
> 
> This makes TCP receive queues much bigger than necessary,
> to a point cpu caches are evicted before application can
> copy the data, on cpus using DDIO.
> 
> This series aims to fix this.
> 
> - First patch adds tcp_rcvbuf_grow() tracepoint, which was very
>    convenient to study the various issues fixed in this series.
> 
> - Seven patches fix receiver autotune issues.
> 
> - Two patches fix sender side issues.
> 
> - Final patch increases tcp_rmem[2] so that TCP speed over WAN
>    can meet modern needs.
> 
> Tested on a 200Gbit NIC, average max throughput of a single flow:
> 
> Before:
>   73593 Mbit.
> 
> After:
>   122514 Mbit.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next 00/11] tcp: receive side improvements
  2025-05-22 14:03 ` Daniel Borkmann
@ 2025-05-22 14:11   ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2025-05-22 14:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Rick Jones, Wei Wang, netdev,
	eric.dumazet

On Thu, May 22, 2025 at 7:03 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Hi Eric,
>
> On 5/13/25 9:39 PM, Eric Dumazet wrote:
> > We have set tcp_rmem[2] to 15 MB for about 8 years at Google,
> > but had some issues for high speed flows on very small RTT.
>
> Are there plans to bump/modernize the rmem_default and rmem_max defaults,
> too? It looks like the last time that was done was in commit eaa72dc4748
> ("neigh: increase queue_len_bytes to match wmem_default"). Fwiw, we've seen
> deployments where vxlan/geneve is used for E/W traffic hit the distro
> default limits, leading to UDP drops for the encapsulated TCP traffic.
> Would it make sense to move these to e.g. 4MB as well, or have you used
> another heuristic that worked well over the years?

Yes, I have a similar increase on the send side, with tcp_notsent_lowat
set to avoid eating too much kernel memory for bulk senders.

Extract from a Google server:

cat /proc/sys/net/ipv4/tcp_wmem
4096 262144 67108864
cat /proc/sys/net/ipv4/tcp_notsent_lowat
2097152
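
(For completeness: the same limit can also be set per socket with the
TCP_NOTSENT_LOWAT socket option; the 2 MB value below simply mirrors
the sysctl shown above and is not a recommendation.)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cap unsent data queued in the kernel for this socket to 2 MB;
 * poll()/epoll report the socket writable only while the amount of
 * unsent data stays below this threshold.
 */
static int set_notsent_lowat(int fd)
{
	int lowat = 2 * 1024 * 1024;

	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &lowat, sizeof(lowat));
}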

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
2025-05-13 19:39 [PATCH net-next 00/11] tcp: receive side improvements Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 01/11] tcp: add tcp_rcvbuf_grow() tracepoint Eric Dumazet
2025-05-14 15:30   ` David Ahern
2025-05-14 15:38     ` Eric Dumazet
2025-05-14 15:46       ` David Ahern
2025-05-14 16:33         ` Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 02/11] tcp: fix sk_rcvbuf overshoot Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 03/11] tcp: adjust rcvbuf in presence of reorders Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 04/11] tcp: add receive queue awareness in tcp_rcv_space_adjust() Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 05/11] tcp: remove zero TCP TS samples for autotuning Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 06/11] tcp: fix initial tp->rcvq_space.space value for passive TS enabled flows Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 07/11] tcp: always seek for minimal rtt in tcp_rcv_rtt_update() Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 08/11] tcp: skip big rtt sample if receive queue is not empty Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 09/11] tcp: increase tcp_limit_output_bytes default value to 4MB Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 10/11] tcp: always use tcp_limit_output_bytes limitation Eric Dumazet
2025-05-13 19:39 ` [PATCH net-next 11/11] tcp: increase tcp_rmem[2] to 32 MB Eric Dumazet
2025-05-14 20:24   ` Jakub Kicinski
2025-05-14 20:53     ` Kuniyuki Iwashima
2025-05-14 21:20       ` Kuniyuki Iwashima
2025-05-14 21:26         ` Jakub Kicinski
2025-05-14 21:28           ` Kuniyuki Iwashima
2025-05-14 20:26 ` [PATCH net-next 00/11] tcp: receive side improvements Jakub Kicinski
2025-05-15 18:50 ` patchwork-bot+netdevbpf
2025-05-22 14:03 ` Daniel Borkmann
2025-05-22 14:11   ` Eric Dumazet
