* [PATCH RFC net-next 0/2] tcp: support initcwnd adjustment
@ 2025-03-28 15:16 Jason Xing
From: Jason Xing @ 2025-03-28 15:16 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, horms, kuniyu, ncardwell
Cc: netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
Patch 1 introduces a normal set/getsockopt interface for initcwnd.
Patch 2 introduces a dynamic adjustment of initcwnd to help small data
transfers in data centers.
Jason Xing (2):
tcp: add TCP_IW for setsockopt
tcp: introduce dynamic initcwnd adjustment
include/linux/tcp.h | 4 +++-
include/uapi/linux/tcp.h | 2 ++
net/ipv4/tcp.c | 16 ++++++++++++++++
net/ipv4/tcp_input.c | 13 ++++++++++---
4 files changed, 31 insertions(+), 4 deletions(-)
--
2.43.5
* [PATCH RFC net-next 1/2] tcp: add TCP_IW for setsockopt
From: Jason Xing @ 2025-03-28 15:16 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, horms, kuniyu, ncardwell
Cc: netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
The 'ip route' command can adjust initcwnd for specific routes, taking
effect during slow start and slow start after idle.

This patch introduces a socket-level option that gives applications the
same ability. After this, TCP_BPF_IW could be adjusted accordingly for
the slow-start-after-idle case.

Introduce a new field to store the initial cwnd so that the socket
remembers the value when it restarts slow start after idle.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
include/linux/tcp.h | 1 +
include/uapi/linux/tcp.h | 1 +
net/ipv4/tcp.c | 8 ++++++++
net/ipv4/tcp_input.c | 2 +-
4 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1669d95bb0f9..aba0a1fe0e36 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -403,6 +403,7 @@ struct tcp_sock {
u32 snd_cwnd_used;
u32 snd_cwnd_stamp;
u32 prior_cwnd; /* cwnd right before starting loss recovery */
+ u32 init_cwnd; /* init cwnd controlled by setsockopt */
u32 prr_delivered; /* Number of newly delivered packets to
* receiver in Recovery. */
u32 last_oow_ack_time; /* timestamp of last out-of-window ACK */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dc8fdc80e16b..acf77114efed 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -142,6 +142,7 @@ enum {
#define TCP_RTO_MAX_MS 44 /* max rto time in ms */
#define TCP_RTO_MIN_US 45 /* min rto time in us */
#define TCP_DELACK_MAX_US 46 /* max delayed ack time in us */
+#define TCP_IW 47 /* initial congestion window */
#define TCP_REPAIR_ON 1
#define TCP_REPAIR_OFF 0
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ea8de00f669d..9da7ece57b20 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3863,6 +3863,11 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
WRITE_ONCE(inet_csk(sk)->icsk_delack_max, delack_max);
return 0;
}
+ case TCP_IW:
+ if (val <= 0 || tp->data_segs_out > tp->syn_data)
+ return -EINVAL;
+ tp->init_cwnd = val;
+ return 0;
}
sockopt_lock_sock(sk);
@@ -4708,6 +4713,9 @@ int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_DELACK_MAX_US:
val = jiffies_to_usecs(READ_ONCE(inet_csk(sk)->icsk_delack_max));
break;
+ case TCP_IW:
+ val = tp->init_cwnd;
+ break;
default:
return -ENOPROTOOPT;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e1f952fbac48..00cbe8970a1b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1019,7 +1019,7 @@ static void tcp_set_rto(struct sock *sk)
__u32 tcp_init_cwnd(const struct tcp_sock *tp, const struct dst_entry *dst)
{
- __u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
+ __u32 cwnd = tp->init_cwnd ? : (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
if (!cwnd)
cwnd = TCP_INIT_CWND;
--
2.43.5
* [PATCH RFC net-next 2/2] tcp: introduce dynamic initcwnd adjustment
From: Jason Xing @ 2025-03-28 15:16 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, horms, kuniyu, ncardwell
Cc: netdev, Jason Xing
From: Jason Xing <kernelxing@tencent.com>
More than a decade ago, Google published an important paper[1]
describing the impact of different initcwnd values. Three years later,
the default initcwnd was raised to 10[2] for common use. Nowadays, more
and more small features are being developed for particular cases rather
than for all cases.

Some CDN teams try to increase initcwnd even beyond 100 on the
uncontrollable public Internet to speed up data transfer during slow
start. In data centers we need a similar change to ramp up slow start,
especially when an application sometimes sends a small amount of data,
say 50K, at one time over a persistent connection. Asking users to tune
via 'ip route' may not be practical because 1) it can affect unrelated
flows, and 2) an overly large global value can cause bursts for all
kinds of flows.
This patch adds a dynamic adjustment feature for initcwnd in the slow
start and slow-start-after-idle phases, so that it only accelerates the
first round trip and has little effect on massive data transfers.
Use 65535 as an upper bound when calculating the initcwnd. This number
comes from the 65535 window an skb carries when the SYN-ACK is sent in
__tcp_transmit_skb(). Without the bound, the passive-open side sending
data may see a very large value in the last ACK of the 3-way handshake,
say 2699776, which would generate an initcwnd of 1912 and that is far
too big.
This patch accelerates the small-data-transfer case. In my test,
transmitting 50K at one time dropped from 1400us to 80us, roughly a
17.5x speedup.

The idea behind this is that I often see small data transfers consume
2 or 3 RTTs because of the limited snd_cwnd. In a data center, we can
afford the bandwidth if we choose to accelerate transmission.
Why tp->max_window / tp->mss_cache? Because cwnd is increased per
MSS-sized packet, and max_window is the peer's signal of the maximum
capacity it can bear. As tcp_set_skb_tso_segs() shows, tcp_gso_size is
equal to the MSS.
[1]: https://developers.google.com/speed/protocols/tcp_initcwnd_techreport.pdf
[2]: https://datatracker.ietf.org/doc/html/rfc6928
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
I'm not sure what the upper bound of this window should be. Using 65535
as the max window yields an initcwnd of 46 with the 1412-byte MSS in
my VM.
---
include/linux/tcp.h | 3 ++-
include/uapi/linux/tcp.h | 1 +
net/ipv4/tcp.c | 8 ++++++++
net/ipv4/tcp_input.c | 11 +++++++++--
4 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index aba0a1fe0e36..445db706f3cd 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -385,7 +385,8 @@ struct tcp_sock {
syn_fastopen:1, /* SYN includes Fast Open option */
syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
syn_fastopen_ch:1, /* Active TFO re-enabling probe */
- syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
+ syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
+ dynamic_initcwnd:1; /* dynamic adjustment for initcwnd */
u8 keepalive_probes; /* num of allowed keep alive probes */
u32 tcp_tx_delay; /* delay (in usec) added to TX packets */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index acf77114efed..7c63d0d0b5e1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -143,6 +143,7 @@ enum {
#define TCP_RTO_MIN_US 45 /* min rto time in us */
#define TCP_DELACK_MAX_US 46 /* max delayed ack time in us */
#define TCP_IW 47 /* initial congestion window */
+#define TCP_IW_DYNAMIC 48 /* dynamic adjustment for initcwnd */
#define TCP_REPAIR_ON 1
#define TCP_REPAIR_OFF 0
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9da7ece57b20..3d419a714f2d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3868,6 +3868,11 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
return -EINVAL;
tp->init_cwnd = val;
return 0;
+ case TCP_IW_DYNAMIC:
+ if (val < 0 || val > 1)
+ return -EINVAL;
+ tp->dynamic_initcwnd = val;
+ return 0;
}
sockopt_lock_sock(sk);
@@ -4716,6 +4721,9 @@ int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_IW:
val = tp->init_cwnd;
break;
+ case TCP_IW_DYNAMIC:
+ val = tp->dynamic_initcwnd;
+ break;
default:
return -ENOPROTOOPT;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 00cbe8970a1b..05dbec734aa5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6341,10 +6341,17 @@ void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
* initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
* retransmission has occurred.
*/
- if (tp->total_retrans > 1 && tp->undo_marker)
+ if (tp->total_retrans > 1 && tp->undo_marker) {
tcp_snd_cwnd_set(tp, 1);
- else
+ } else {
+ if (tp->dynamic_initcwnd) {
+ u32 win = min(tp->max_window, 65535);
+
+ tp->init_cwnd = max(win / tp->mss_cache, TCP_INIT_CWND);
+ }
+
tcp_snd_cwnd_set(tp, tcp_init_cwnd(tp, __sk_dst_get(sk)));
+ }
tp->snd_cwnd_stamp = tcp_jiffies32;
bpf_skops_established(sk, bpf_op, skb);
--
2.43.5
* Re: [PATCH RFC net-next 0/2] tcp: support initcwnd adjustment
From: Jason Xing @ 2025-04-07 1:11 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, horms, kuniyu, ncardwell
Cc: netdev, Jason Xing
On Fri, Mar 28, 2025 at 11:16 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> Patch 1 introduces a normal set/getsockopt for initcwnd.
>
> Patch 2 introduces a dynamic adjustment for initcwnd to contribute to
> small data transfer in data center.
After a few rounds of discussion at the IETF, I will postpone resending
this series, probably for a few months, so I can keep collecting
numbers from production and then consider this small-data acceleration
feature more comprehensively :)
Thanks,
Jason