linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options
@ 2025-06-20 12:56 Petr Tesarik
  2025-06-20 12:56 ` [PATCH net v2 1/2] tcp_metrics: set congestion window clamp from the dst entry Petr Tesarik
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Petr Tesarik @ 2025-06-20 12:56 UTC (permalink / raw)
  To: Paolo Abeni, David S. Miller, Eric Dumazet, Neal Cardwell,
	Kuniyuki Iwashima, open list:NETWORKING [TCP]
  Cc: David Ahern, Jakub Kicinski, open list, Petr Tesarik

I ran into a couple of issues while trying to tweak TCP congestion
avoidance to analyze a potential performance regression. It turns out
that overriding the parameters with ip-route(8) does not work as
expected and appears to be buggy.

Changes in v2:
- more background information in the cwnd commit message
- fix handling of initial TCP Slow Start if cwnd is clamped

Petr Tesarik (2):
  tcp_metrics: set congestion window clamp from the dst entry
  tcp_metrics: use ssthresh value from dst if there is no metrics

 net/ipv4/tcp_metrics.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH net v2 1/2] tcp_metrics: set congestion window clamp from the dst entry
  2025-06-20 12:56 [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Petr Tesarik
@ 2025-06-20 12:56 ` Petr Tesarik
  2025-06-20 12:56 ` [PATCH net v2 2/2] tcp_metrics: use ssthresh value from dst if there is no metrics Petr Tesarik
  2025-06-20 13:24 ` [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Eric Dumazet
  2 siblings, 0 replies; 6+ messages in thread
From: Petr Tesarik @ 2025-06-20 12:56 UTC (permalink / raw)
  To: Paolo Abeni, David S. Miller, Eric Dumazet, Neal Cardwell,
	Kuniyuki Iwashima, open list:NETWORKING [TCP]
  Cc: David Ahern, Jakub Kicinski, open list, Petr Tesarik

If RTAX_CWND is locked, always initialize tp->snd_cwnd_clamp from the
corresponding dst entry.

Note that an unlocked RTAX_CWND does not have any effect on the kernel.
This behavior is even documented in the manual page of ip-route(8):

    cwnd NUMBER (Linux 2.3.15+ only)
	   the clamp for congestion window. It is ignored if the lock
	   flag is not used.

An unlocked RTAX_CWND was updated by tcp_update_metrics() until v3.6. Since
then, only the newly introduced TCP metric (TCP_METRIC_CWND) has been
updated, rendering unlocked RTAX_CWND useless.

TCP metrics are updated after a TCP connection finishes. If there are no
metrics for a given destination when a new connection is created, default
values are used instead.

This means there are two issues with setting tp->snd_cwnd_clamp from the
TCP metric:

1. If the cwnd option is changed in the routing table, the new value is not
   used for new connections as long as there is a cached TCP metric for the
   destination. An existing cached metric is not updated from the routing
   table unless it has seen no update for longer than TCP_METRICS_TIMEOUT
   (1 hour).

2. After evicting the corresponding cached metric, the new value from the
   routing table is still not used for new connections until one connection
   finishes, and a new cached entry is created.

As a result, the following shenanigan is required to set a new locked cwnd
clamp:

- update the route (``ip route replace ... cwnd lock $value``)
- flush any existing TCP metric entry (``ip tcp_metrics flush $dest``)
- create and finish a dummy connection to the destination to create a TCP
  metric entry with the new value
- *next* connection to this destination will use the new value

The above does not seem to be intentional.

NB there is also an initcwnd route parameter (RTAX_INITCWND) to set the
initial size of the congestion window; this patch does not change anything
about the handling of that parameter.

Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
---
 net/ipv4/tcp_metrics.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 4251670e328c..dd8f3457bd72 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -477,6 +477,9 @@ void tcp_init_metrics(struct sock *sk)
 	if (!dst)
 		goto reset;
 
+	if (dst_metric_locked(dst, RTAX_CWND))
+		tp->snd_cwnd_clamp = dst_metric(dst, RTAX_CWND);
+
 	rcu_read_lock();
 	tm = tcp_get_metrics(sk, dst, false);
 	if (!tm) {
@@ -484,9 +487,6 @@ void tcp_init_metrics(struct sock *sk)
 		goto reset;
 	}
 
-	if (tcp_metric_locked(tm, TCP_METRIC_CWND))
-		tp->snd_cwnd_clamp = tcp_metric_get(tm, TCP_METRIC_CWND);
-
 	val = READ_ONCE(net->ipv4.sysctl_tcp_no_ssthresh_metrics_save) ?
 	      0 : tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
 	if (val) {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH net v2 2/2] tcp_metrics: use ssthresh value from dst if there is no metrics
  2025-06-20 12:56 [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Petr Tesarik
  2025-06-20 12:56 ` [PATCH net v2 1/2] tcp_metrics: set congestion window clamp from the dst entry Petr Tesarik
@ 2025-06-20 12:56 ` Petr Tesarik
  2025-06-20 13:24 ` [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Eric Dumazet
  2 siblings, 0 replies; 6+ messages in thread
From: Petr Tesarik @ 2025-06-20 12:56 UTC (permalink / raw)
  To: Paolo Abeni, David S. Miller, Eric Dumazet, Neal Cardwell,
	Kuniyuki Iwashima, open list:NETWORKING [TCP]
  Cc: David Ahern, Jakub Kicinski, open list, Petr Tesarik

If there is no cached TCP metrics entry for a connection, initialize
tp->snd_ssthresh from the corresponding dst entry. Of course, this value
may have to be clamped to tp->snd_cwnd_clamp, but let's not copy and paste
the existing code. Instead, move the check to the common path.

When ssthresh value is zero, the connection should enter initial slow
start, indicated by setting tp->snd_ssthresh to infinity (ignoring the
value of tp->snd_cwnd_clamp). Move this check against zero to the common
path, too.

Fixes: 51c5d0c4b169 ("tcp: Maintain dynamic metrics in local cache.")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
---
 net/ipv4/tcp_metrics.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index dd8f3457bd72..6f7172ea8bc8 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -473,12 +473,13 @@ void tcp_init_metrics(struct sock *sk)
 	/* ssthresh may have been reduced unnecessarily during.
 	 * 3WHS. Restore it back to its initial default.
 	 */
-	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
+	tp->snd_ssthresh = 0;
 	if (!dst)
 		goto reset;
 
 	if (dst_metric_locked(dst, RTAX_CWND))
 		tp->snd_cwnd_clamp = dst_metric(dst, RTAX_CWND);
+	tp->snd_ssthresh = dst_metric(dst, RTAX_SSTHRESH);
 
 	rcu_read_lock();
 	tm = tcp_get_metrics(sk, dst, false);
@@ -487,13 +488,8 @@ void tcp_init_metrics(struct sock *sk)
 		goto reset;
 	}
 
-	val = READ_ONCE(net->ipv4.sysctl_tcp_no_ssthresh_metrics_save) ?
-	      0 : tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
-	if (val) {
-		tp->snd_ssthresh = val;
-		if (tp->snd_ssthresh > tp->snd_cwnd_clamp)
-			tp->snd_ssthresh = tp->snd_cwnd_clamp;
-	}
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_no_ssthresh_metrics_save))
+		tp->snd_ssthresh = tcp_metric_get(tm, TCP_METRIC_SSTHRESH);
 	val = tcp_metric_get(tm, TCP_METRIC_REORDERING);
 	if (val && tp->reordering != val)
 		tp->reordering = val;
@@ -537,6 +533,11 @@ void tcp_init_metrics(struct sock *sk)
 
 		inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK;
 	}
+
+	if (!tp->snd_ssthresh)
+		tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
+	else if (tp->snd_ssthresh > tp->snd_cwnd_clamp)
+		tp->snd_ssthresh = tp->snd_cwnd_clamp;
 }
 
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options
  2025-06-20 12:56 [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Petr Tesarik
  2025-06-20 12:56 ` [PATCH net v2 1/2] tcp_metrics: set congestion window clamp from the dst entry Petr Tesarik
  2025-06-20 12:56 ` [PATCH net v2 2/2] tcp_metrics: use ssthresh value from dst if there is no metrics Petr Tesarik
@ 2025-06-20 13:24 ` Eric Dumazet
  2025-06-23  7:36   ` Petr Tesarik
  2 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2025-06-20 13:24 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Paolo Abeni, David S. Miller, Neal Cardwell, Kuniyuki Iwashima,
	open list:NETWORKING [TCP], David Ahern, Jakub Kicinski,
	open list

On Fri, Jun 20, 2025 at 5:57 AM Petr Tesarik <ptesarik@suse.com> wrote:
>
> I ran into a couple of issues while trying to tweak TCP congestion
> avoidance to analyze a potential performance regression. It turns out
> that overriding the parameters with ip-route(8) does not work as
> expected and appears to be buggy.

Hi Petr

Could you add packetdrill tests as well ?

Given this could accidentally break user setups, maybe net-next would be safer.

Some of us disable tcp_metrics, because metrics used one minute (or
few seconds) in the past are not very helpful, and source of
confusion.

(/proc/sys/net/ipv4/tcp_no_metrics_save set to 1)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options
  2025-06-20 13:24 ` [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Eric Dumazet
@ 2025-06-23  7:36   ` Petr Tesarik
  2025-06-23 11:44     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Petr Tesarik @ 2025-06-23  7:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paolo Abeni, David S. Miller, Neal Cardwell, Kuniyuki Iwashima,
	open list:NETWORKING [TCP], David Ahern, Jakub Kicinski,
	open list

On Fri, 20 Jun 2025 06:24:23 -0700
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jun 20, 2025 at 5:57 AM Petr Tesarik <ptesarik@suse.com> wrote:
> >
> > I ran into a couple of issues while trying to tweak TCP congestion
> > avoidance to analyze a potential performance regression. It turns out
> > that overriding the parameters with ip-route(8) does not work as
> > expected and appears to be buggy.  
> 
> Hi Petr
> 
> Could you add packetdrill tests as well ?

Glad to do that. But it will be my first time. ;-) Is there a tutorial?
I looked under Documentation/ and didn't see anything.

> Given this could accidentally break user setups, maybe net-next would be safer.

Yeah, you're right. Technically, it is a bugfix, but if it's been
broken for more than a decade without anyone complaining, it can't be
super-urgent.

> Some of us disable tcp_metrics, because metrics used one minute (or
> few seconds) in the past are not very helpful, and source of
> confusion.
> 
> (/proc/sys/net/ipv4/tcp_no_metrics_save set to 1)

Yes, I know about that one. FWIW it didn't help at all in my case,
because then the value from the routing table was ALWAYS ignored...

Petr T

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options
  2025-06-23  7:36   ` Petr Tesarik
@ 2025-06-23 11:44     ` Eric Dumazet
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2025-06-23 11:44 UTC (permalink / raw)
  To: Petr Tesarik, Willem de Bruijn
  Cc: Paolo Abeni, David S. Miller, Neal Cardwell, Kuniyuki Iwashima,
	open list:NETWORKING [TCP], David Ahern, Jakub Kicinski,
	open list

On Mon, Jun 23, 2025 at 12:36 AM Petr Tesarik <ptesarik@suse.com> wrote:
>
> On Fri, 20 Jun 2025 06:24:23 -0700
> Eric Dumazet <edumazet@google.com> wrote:
>
> > On Fri, Jun 20, 2025 at 5:57 AM Petr Tesarik <ptesarik@suse.com> wrote:
> > >
> > > I ran into a couple of issues while trying to tweak TCP congestion
> > > avoidance to analyze a potential performance regression. It turns out
> > > that overriding the parameters with ip-route(8) does not work as
> > > expected and appears to be buggy.
> >
> > Hi Petr
> >
> > Could you add packetdrill tests as well ?
>
> Glad to do that. But it will be my first time. ;-) Is there a tutorial?
> I looked under Documentation/ and didn't see anything.

I came up with the following test (currently working for IPv4 only).
Neal, Willem, any idea on how to have this test done for upstream tree ?

tools/testing/selftests/net/packetdrill/ksft_runner.sh does not have a
way to make a test family dependent.


diff --git a/tools/testing/selftests/net/packetdrill/tcp_cwnd5.pkt
b/tools/testing/selftests/net/packetdrill/tcp_cwnd5.pkt
new file mode 100644
index 0000000000000000000000000000000000000000..e28b63b696d200ca447f613c30003571c1ff1ae8
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_cwnd5.pkt
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+// Test of RTAX_CWND routing attribute
+
+// Set up config.
+`./defaults.sh`
+
++0 `ip ro change 192.0.2.1 via 192.168.0.1 dev tun0 cwnd lock 6`
+
+   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+  +.1 < . 1:1(0) ack 1 win 257
+   +0 accept(3, ..., ...) = 4
+
+   +0 %{ assert tcpi_snd_cwnd == 6, tcpi_snd_cwnd }%
+
+   +0 write(4, ..., 20000) = 20000
+   +0 > P. 1:6001(6000) ack 1
+
+   +.1 < . 1:1(0) ack 6001 win 257
+
+   +0 %{ assert tcpi_snd_cwnd == 6, tcpi_snd_cwnd }%



>
> > Given this could accidentally break user setups, maybe net-next would be safer.
>
> Yeah, you're right. Technically, it is a bugfix, but if it's been
> broken for more than a decade without anyone complaining, it can't be
> super-urgent.
>
> > Some of us disable tcp_metrics, because metrics used one minute (or
> > few seconds) in the past are not very helpful, and source of
> > confusion.
> >
> > (/proc/sys/net/ipv4/tcp_no_metrics_save set to 1)
>
> Yes, I know about that one. FWIW it didn't help at all in my case,
> because then the value from the routing table was ALWAYS ignored...
>
> Petr T

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-06-23 11:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-20 12:56 [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Petr Tesarik
2025-06-20 12:56 ` [PATCH net v2 1/2] tcp_metrics: set congestion window clamp from the dst entry Petr Tesarik
2025-06-20 12:56 ` [PATCH net v2 2/2] tcp_metrics: use ssthresh value from dst if there is no metrics Petr Tesarik
2025-06-20 13:24 ` [PATCH net v2 0/2] tcp_metrics: fix hanlding of route options Eric Dumazet
2025-06-23  7:36   ` Petr Tesarik
2025-06-23 11:44     ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).