* [PATCH net-next 0/3] tcp: avoid sending too small packets
@ 2024-04-18 21:45 Eric Dumazet
2024-04-18 21:45 ` [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test() Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:45 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet
tcp_sendmsg() cooks 'large' skbs that are later split,
if needed, by tcp_write_xmit().

After a split, the leftover skb size is smaller than the optimal
size, and this causes a performance drop.

In this series, a tcp_grow_skb() helper is added to shift
payload from the second skb in the write queue to the first
skb, so that we always send optimally sized skbs.

This increases TSO efficiency and decreases the number of ACK
packets.
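To make the sizing arithmetic concrete, here is a tiny userspace
sketch (illustrative only, not kernel code; the mss, segment quota
and packet sizes are taken from the BIG TCP example in patch 3/3):

#include <stdio.h>

/* Models the "how many payload bytes are missing from the head skb"
 * computation that patch 3/3 adds to tcp_write_xmit().
 * cwnd_quota is the number of segments currently allowed. */
static int missing_bytes(unsigned int cwnd_quota, unsigned int mss,
			 unsigned int skb_len)
{
	return (int)(cwnd_quota * mss) - (int)skb_len;
}

int main(void)
{
	/* a 172032-byte leftover skb vs 44 * 4096 = 180224 optimal */
	printf("%d\n", missing_bytes(44, 4096, 172032)); /* prints 8192 */
	return 0;
}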
Eric Dumazet (3):
tcp: remove dubious FIN exception from tcp_cwnd_test()
tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
tcp: try to send bigger TSO packets
net/ipv4/tcp_output.c | 78 +++++++++++++++++++++++++++++--------------
1 file changed, 53 insertions(+), 25 deletions(-)
--
2.44.0.769.g3c40516874-goog
* [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test()
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
@ 2024-04-18 21:45 ` Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit() Eric Dumazet
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:45 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

tcp_cwnd_test() has special handling for the last packet in the
write queue if it is smaller than one MSS and has the FIN flag.

This is in violation of the TCP RFCs and seems quite dubious.

This packet can be sent only if the current CWND is bigger than
the number of packets in flight.

Making the tcp_cwnd_test() result independent of the first skb in
the write queue is needed for the last patch of the series.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 61119d42b0fd27a3736e136b1e81f6fc2d4cb44b..acbc76ca3e640354880c62c2423cfe4ba99f0be3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2073,16 +2073,10 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
 /* Can at least one segment of SKB be sent right now, according to the
  * congestion window rules? If so, return how many segments are allowed.
  */
-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-					 const struct sk_buff *skb)
+static u32 tcp_cwnd_test(const struct tcp_sock *tp)
 {
 	u32 in_flight, cwnd, halfcwnd;
 
-	/* Don't be strict about the congestion window for the final FIN. */
-	if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
-	    tcp_skb_pcount(skb) == 1)
-		return 1;
-
 	in_flight = tcp_packets_in_flight(tp);
 	cwnd = tcp_snd_cwnd(tp);
 	if (in_flight >= cwnd)
@@ -2706,10 +2700,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	unsigned int tso_segs, sent_pkts;
-	int cwnd_quota;
+	u32 cwnd_quota, max_segs;
 	int result;
 	bool is_cwnd_limited = false, is_rwnd_limited = false;
-	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -2743,7 +2736,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tso_segs = tcp_init_tso_segs(skb, mss_now);
 		BUG_ON(!tso_segs);
 
-		cwnd_quota = tcp_cwnd_test(tp, skb);
+		cwnd_quota = tcp_cwnd_test(tp);
 		if (!cwnd_quota) {
 			if (push_one == 2)
 				/* Force out a loss probe pkt. */
@@ -2772,9 +2765,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  max_segs),
+						    min(cwnd_quota,
+							max_segs),
 						    nonagle);
 
 		if (skb->len > limit &&
--
2.44.0.769.g3c40516874-goog
* [PATCH net-next 2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test() Eric Dumazet
@ 2024-04-18 21:45 ` Eric Dumazet
  2024-04-18 21:46 ` [PATCH net-next 3/3] tcp: try to send bigger TSO packets Eric Dumazet
  2024-04-22 21:30 ` [PATCH net-next 0/3] tcp: avoid sending too small packets patchwork-bot+netdevbpf
  3 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:45 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

tcp_write_xmit() calls tcp_init_tso_segs() to set gso_size and
gso_segs on the packet.

tcp_init_tso_segs() requires the stack to maintain an up-to-date
tcp_skb_pcount(). This makes sense for packets in the rtx queue,
not so much for packets still in the write queue.

In the following patch, we don't want to deal with tcp_skb_pcount()
when moving payload from the 2nd skb to the 1st skb in the write queue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index acbc76ca3e640354880c62c2423cfe4ba99f0be3..5e8665241f9345f38ce56afffe473948aef66786 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1502,18 +1502,22 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 }
 
 /* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_now)
+static int tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_now)
 {
+	int tso_segs;
+
 	if (skb->len <= mss_now) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
 		 */
-		tcp_skb_pcount_set(skb, 1);
 		TCP_SKB_CB(skb)->tcp_gso_size = 0;
-	} else {
-		tcp_skb_pcount_set(skb, DIV_ROUND_UP(skb->len, mss_now));
-		TCP_SKB_CB(skb)->tcp_gso_size = mss_now;
+		tcp_skb_pcount_set(skb, 1);
+		return 1;
 	}
+	TCP_SKB_CB(skb)->tcp_gso_size = mss_now;
+	tso_segs = DIV_ROUND_UP(skb->len, mss_now);
+	tcp_skb_pcount_set(skb, tso_segs);
+	return tso_segs;
 }
 
 /* Pcount in the middle of the write queue got changed, we need to do various
@@ -2097,10 +2101,9 @@ static int tcp_init_tso_segs(struct sk_buff *skb, unsigned int mss_now)
 {
 	int tso_segs = tcp_skb_pcount(skb);
 
-	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
-		tcp_set_skb_tso_segs(skb, mss_now);
-		tso_segs = tcp_skb_pcount(skb);
-	}
+	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now))
+		return tcp_set_skb_tso_segs(skb, mss_now);
+
 	return tso_segs;
 }
 
@@ -2733,9 +2736,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		if (tcp_pacing_check(sk))
 			break;
 
-		tso_segs = tcp_init_tso_segs(skb, mss_now);
-		BUG_ON(!tso_segs);
-
 		cwnd_quota = tcp_cwnd_test(tp);
 		if (!cwnd_quota) {
 			if (push_one == 2)
@@ -2745,6 +2745,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
+		tso_segs = tcp_set_skb_tso_segs(skb, mss_now);
+
 		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
 			is_rwnd_limited = true;
 			break;
--
2.44.0.769.g3c40516874-goog
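For reference, the pcount computation above is a plain ceiling
division. A minimal userspace illustration (not kernel code; the
4096-byte mss is the BIG TCP example from patch 3/3):

#include <stdio.h>

/* Userspace copy of the kernel macro, for illustration only. */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int mss = 4096;	/* payload bytes per segment */

	printf("%u\n", DIV_ROUND_UP(172032u, mss));	/* 42 segments */
	printf("%u\n", DIV_ROUND_UP(180224u, mss));	/* 44 segments */
	printf("%u\n", DIV_ROUND_UP(8192u, mss));	/*  2 segments */
	return 0;
}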
* [PATCH net-next 3/3] tcp: try to send bigger TSO packets
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 1/3] tcp: remove dubious FIN exception from tcp_cwnd_test() Eric Dumazet
  2024-04-18 21:45 ` [PATCH net-next 2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit() Eric Dumazet
@ 2024-04-18 21:46 ` Eric Dumazet
  2025-02-13 14:45   ` Shahar Shitrit
  2024-04-22 21:30 ` [PATCH net-next 0/3] tcp: avoid sending too small packets patchwork-bot+netdevbpf
  3 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2024-04-18 21:46 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Eric Dumazet

While investigating TCP performance, I found that TCP would
sometimes send big skbs followed by a single MSS skb,
in a 'locked' pattern.

For instance, BIG TCP is enabled, and the MSS is set to carry
4096 bytes of payload per segment. gso_max_size is set to 181000.

This means that an optimal TCP packet should contain
44 * 4096 = 180224 bytes of payload.

However, I was seeing packet sizes interleaved in this pattern:

172032, 8192, 172032, 8192, 172032, 8192, <repeat>

The tcp_tso_should_defer() heuristic is defeated, because after a
split of a packet in the write queue for whatever reason (this might
be a too-small CWND or a small enough pacing_rate), the leftover
packet in the queue is smaller than the optimal size.

It is time to try to make 'leftover packets' bigger, so that
tcp_tso_should_defer() can reach its full potential.

After this patch, we can see the following output:

14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980

With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
of TSO packets (and ACK packets) of about 30%.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5e8665241f9345f38ce56afffe473948aef66786..99a1d88f7f47b9ef0334efe62f8fd34c0d693ced 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2683,6 +2683,36 @@ void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type)
 		tcp_chrono_set(tp, TCP_CHRONO_BUSY);
 }
 
+/* First skb in the write queue is smaller than ideal packet size.
+ * Check if we can move payload from the second skb in the queue.
+ */
+static void tcp_grow_skb(struct sock *sk, struct sk_buff *skb, int amount)
+{
+	struct sk_buff *next_skb = skb->next;
+	unsigned int nlen;
+
+	if (tcp_skb_is_last(sk, skb))
+		return;
+
+	if (!tcp_skb_can_collapse(skb, next_skb))
+		return;
+
+	nlen = min_t(u32, amount, next_skb->len);
+	if (!nlen || !skb_shift(skb, next_skb, nlen))
+		return;
+
+	TCP_SKB_CB(skb)->end_seq += nlen;
+	TCP_SKB_CB(next_skb)->seq += nlen;
+
+	if (!next_skb->len) {
+		TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
+		TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
+		TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
+		tcp_unlink_write_queue(next_skb, sk);
+		tcp_wmem_free_skb(sk, next_skb);
+	}
+}
+
 /* This routine writes packets to the network. It advances the
  * send_head. This happens as incoming acks open up the remote
  * window for us.
@@ -2723,6 +2753,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	max_segs = tcp_tso_segs(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
+		int missing_bytes;
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
@@ -2744,6 +2775,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			else
 				break;
 		}
+		cwnd_quota = min(cwnd_quota, max_segs);
+		missing_bytes = cwnd_quota * mss_now - skb->len;
+		if (missing_bytes > 0)
+			tcp_grow_skb(sk, skb, missing_bytes);
 
 		tso_segs = tcp_set_skb_tso_segs(skb, mss_now);
 
@@ -2767,8 +2802,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min(cwnd_quota,
-							max_segs),
+						    cwnd_quota,
 						    nonagle);
 
 		if (skb->len > limit &&
--
2.44.0.769.g3c40516874-goog
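As a side note, the 49980-byte lengths in the tcpdump output of the
changelog above are an exact multiple of the on-wire MSS. A quick
sanity check (the 1500-byte path MTU is an assumption, the trace does
not state it; the timestamp option is visible in the trace):

#include <stdio.h>

int main(void)
{
	unsigned int mtu = 1500;	/* assumed path MTU */
	unsigned int mss = mtu - 40 /* IPv6 */ - 20 /* TCP */ - 12 /* TS */;

	printf("mss = %u\n", mss);		/* 1428  */
	printf("35 * mss = %u\n", 35 * mss);	/* 49980 */
	return 0;
}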
* Re: [PATCH net-next 3/3] tcp: try to send bigger TSO packets
  2024-04-18 21:46 ` [PATCH net-next 3/3] tcp: try to send bigger TSO packets Eric Dumazet
@ 2025-02-13 14:45 ` Shahar Shitrit
  2025-02-13 14:51   ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Shahar Shitrit @ 2025-02-13 14:45 UTC
To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Gal Pressman, Tariq Toukan

Hello,

I'm troubleshooting an issue and would appreciate your input.

The problem occurs when the SYNPROXY extension is configured with
iptables on the server side, and the rmem_max value is set to 512MB on
the same server. The combination of these two settings results in a
significant performance drop - specifically, it reduces the iperf3
bitrate from approximately 30 Gbps to a few Gbps (around 5).

Here are some key points from my investigation:

• When either of these configurations is applied independently, there is
no noticeable impact on performance. The issue only arises when they are
used together.
• The issue persists even when TSO, GSO, and GRO are disabled on both sides.
• The issue also persists with different congestion control algorithms.
• In the pcap, I observe that the server's window size remains small (it
only increases up to 9728 bytes, compared to around 64KB in normal traffic).
• In the tcp_select_window() function, I noticed that increasing the
rmem_max value causes tp->rx_opt.rcv_wscale to become larger (14 instead
of the default value of 7). This, in turn, reduces the window size
returned from the function, because it gets shifted by
tp->rx_opt.rcv_wscale. Additionally, sk->sk_rcvbuf stays stuck at its
initial value (tcp_rmem[1]), whereas with normal traffic it grows
throughout the test. Similarly, sk->sk_backlog.len and sk->sk_rmem_alloc
do not increase and remain at 0 for most of the traffic.
• It appears that there may be an issue with the server's ability to
receive the skbs, which could explain why sk->sk_rmem_alloc doesn't grow.
• Based on the iptables counters, there doesn't seem to be an issue with
the SYNPROXY processing more packets than expected.

Additionally, with a kernel version containing the commit below, the
traffic performance worsens even further, dropping to 95 Kbps. As
observed in the pcap, the server's window size remains at 512 bytes
until it sends a RST. Moreover, from a certain point there is a 4-ms
delay in the server's ACKs that persists until the RST. No
retransmission is observed.

One indicator of the issue is that the TSO counters don't increment and
remain at 0, which is how we initially identified the problem.

I'm still not sure what the connection might be between the described
issue and this commit.

I would appreciate any insights you might have on this issue, as well as
suggestions for further investigation.

Steps to reproduce:

# server:
ifconfig eth2 1.1.1.1

sysctl -w net.netfilter.nf_conntrack_tcp_loose=0
iptables -t raw -I PREROUTING -i eth2 -w 2 -p tcp -m tcp --syn -j CT --notrack
iptables -A INPUT -i eth2 -w 2 -p tcp -m tcp -m state --state INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460

echo '536870912' > /proc/sys/net/core/rmem_max

iperf3 -B 1.1.1.1 -s

# client:
ifconfig eth2 1.1.1.2

iperf3 -B 1.1.1.2 -c 1.1.1.1

If needed, I will send the pcaps.

Thank you,
Shahar Shitrit

On 19/04/2024 0:46, Eric Dumazet wrote:
> While investigating TCP performance, I found that TCP would
> sometimes send big skbs followed by a single MSS skb,
> in a 'locked' pattern.
>
> [...]
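A toy calculation of the wscale effect described in the report above
(illustrative only, not kernel code; 9728 is the largest window seen
in the reported pcap):

#include <stdio.h>

/* The 16-bit window field carries (window >> rcv_wscale), so a large
 * wscale coarsens what can be advertised: with wscale 14, anything
 * below 16384 bytes is announced as a zero window. */
int main(void)
{
	unsigned int win = 9728;

	printf("wscale 7:  field %u -> peer sees %u bytes\n",
	       win >> 7, (win >> 7) << 7);	/* 76 -> 9728 */
	printf("wscale 14: field %u -> peer sees %u bytes\n",
	       win >> 14, (win >> 14) << 14);	/* 0 -> 0 */
	return 0;
}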
* Re: [PATCH net-next 3/3] tcp: try to send bigger TSO packets
  2025-02-13 14:45 ` Shahar Shitrit
@ 2025-02-13 14:51   ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2025-02-13 14:51 UTC
To: Shahar Shitrit
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, netdev, Neal Cardwell, Kevin Yang, eric.dumazet, Gal Pressman, Tariq Toukan

Please do not top post on netdev mailing list.

On Thu, Feb 13, 2025 at 3:45 PM Shahar Shitrit <shshitrit@nvidia.com> wrote:
>
> Hello,
>
> I'm troubleshooting an issue and would appreciate your input.
>
> [...]
>
> echo '536870912' > /proc/sys/net/core/rmem_max

What happens if you set a more reasonable value?

Note that TCP really uses /proc/sys/net/ipv4/tcp_rmem for its limits.

Setting a big /proc/sys/net/core/rmem_max value might have adverse effects.

> [...]
>
> Thank you,
> Shahar Shitrit

Seeing that you force a small mss value, my guess is that the
(skb->len / skb->truesize) ratio is small on your driver RX packets.

On the receiver:

ss -temoi dst other_host

might help.
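A back-of-the-envelope sketch of the truesize effect pointed at above
(illustrative only; the 4096-byte per-skb cost and the tcp_rmem[1]
value are assumptions, drivers and configurations vary):

#include <stdio.h>

/* Receive buffer accounting charges skb->truesize, not payload
 * length, so small-mss skbs in big buffers fill sk_rcvbuf long
 * before much data is actually queued for the application. */
int main(void)
{
	unsigned int len = 1460;	/* payload per skb (forced mss) */
	unsigned int truesize = 4096;	/* assumed memory cost per skb  */
	unsigned int rcvbuf = 131072;	/* assumed tcp_rmem[1] value    */
	unsigned int nr_skbs = rcvbuf / truesize;

	printf("skbs until rcvbuf is charged full: %u\n", nr_skbs); /* 32 */
	printf("payload actually buffered: %u bytes\n",
	       nr_skbs * len);					/* 46720 */
	return 0;
}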
* Re: [PATCH net-next 0/3] tcp: avoid sending too small packets
  2024-04-18 21:45 [PATCH net-next 0/3] tcp: avoid sending too small packets Eric Dumazet
                   ` (2 preceding siblings ...)
  2024-04-18 21:46 ` [PATCH net-next 3/3] tcp: try to send bigger TSO packets Eric Dumazet
@ 2024-04-22 21:30 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 7+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-04-22 21:30 UTC
To: Eric Dumazet; +Cc: davem, kuba, pabeni, netdev, ncardwell, yyd, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 18 Apr 2024 21:45:57 +0000 you wrote:
> tcp_sendmsg() cooks 'large' skbs that are later split,
> if needed, by tcp_write_xmit().
>
> After a split, the leftover skb size is smaller than the optimal
> size, and this causes a performance drop.
>
> In this series, a tcp_grow_skb() helper is added to shift
> payload from the second skb in the write queue to the first
> skb, so that we always send optimally sized skbs.
>
> [...]

Here is the summary with links:
  - [net-next,1/3] tcp: remove dubious FIN exception from tcp_cwnd_test()
    https://git.kernel.org/netdev/net-next/c/22555032c513
  - [net-next,2/3] tcp: call tcp_set_skb_tso_segs() from tcp_write_xmit()
    https://git.kernel.org/netdev/net-next/c/d5b38a71d333
  - [net-next,3/3] tcp: try to send bigger TSO packets
    https://git.kernel.org/netdev/net-next/c/8ee602c63520

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html