[PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls

Netdev List
 help / color / mirror / Atom feed

From: Eric Dumazet <eric.dumazet@gmail.com>
To: Neal Cardwell <ncardwell@google.com>
Cc: "David Miller" <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	"Tom Herbert" <therbert@google.com>,
	"Maciej Żenczykowski" <maze@google.com>,
	"Yuchung Cheng" <ycheng@google.com>
Subject: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
Date: Wed, 18 Apr 2012 21:51:47 +0200	[thread overview]
Message-ID: <1334778707.2472.333.camel@edumazet-glaptop> (raw)
In-Reply-To: <1334776707.2472.316.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
unexpected profiling results, with pskb_expand_head() being in the top.

After further analysis, I found we hit badly page refcounts,
because when we transmit full size skb (64 KB), we can receive ACK for
the first segments of the frame while skb was not completely sent by
NIC.

It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
close peer, we can receive TCP ACK in less than 50 us rtt.

This is also true on 1Gb links but we were limited by wire speed, not
cpu.

When we try to trim skb, tcp_trim_head() has to call pskb_expand_head(),
because the skb clone we did for transmit is still alive in TX ring
buffer.

pskb_expand_head() is really expensive : It has to make about 16+16
atomic operations on page refcounts, not counting the skb head
reallocation/copy. It increases chances of false sharing.

In fact, we dont really need to trim skb. This costly operation can be
delayed to the point it is really needed : Thats when a retransmit must
happen.

Most of the time, upcoming ACKS will ack the whole packet, and we can
free it with minimal cost (since clone was already freed by TX
completion)

Of course, this means we dont uncharge the acked part from socket limits
until retransmit, but this is hardly a concern with current autotuning
(around 4MB per socket)
Even with small cwnd limit, a single packet can not hold more than half
the window.

Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
About 3% less cpu used for same workload (single netperf TCP_STREAM),
bounded by x4 PCI-e slots (4660 Mbits).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
v2 : added Neal suggestions

 include/net/tcp.h     |    6 ++++--
 net/ipv4/tcp_input.c  |   22 +++++++++++-----------
 net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
 3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..0f57706 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_retransmit_timer(struct sock *sk);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
-extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
+extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+				 unsigned int mss_now);
 extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
 
 extern void tcp_send_probe0(struct sock *);
@@ -640,7 +641,8 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-	} header;	/* For incoming frames		*/
+		unsigned int offset_ack; /* part of acked data in this skb */
+	} header;
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;		/* used to compute rtt's	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..bdec2e6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
 	}
 }
 
-/* If we get here, the whole TSO packet has not been acked. */
+/* If we get here, the whole packet has not been acked.
+ * We used to call tcp_trim_head() to remove acked data from skb,
+ * but its expensive with TSO if our previous clone is still in flight.
+ * We thus maintain an offset_ack, and hope no pskb_expand_head()
+ * is needed until whole packet is acked by upcoming ACKs.
+ */
 static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 packets_acked;
+	u32 oldpcount = tcp_skb_pcount(skb);
 
 	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
 
-	packets_acked = tcp_skb_pcount(skb);
-	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
-		return 0;
-	packets_acked -= tcp_skb_pcount(skb);
+	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;
 
-	if (packets_acked) {
-		BUG_ON(tcp_skb_pcount(skb) == 0);
-		BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-	}
+	if (oldpcount > 1)
+		tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
 
-	return packets_acked;
+	return oldpcount - tcp_skb_pcount(skb);
 }
 
 /* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..f66df37 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -927,11 +927,15 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
 	sk_mem_charge(sk, skb->truesize);
 }
 
-/* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-				 unsigned int mss_now)
+/* Initialize TSO segments for a packet.
+ * Part of skb (offset_ack) might have been acked.
+ */
+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+			  unsigned int mss_now)
 {
-	if (skb->len <= mss_now || !sk_can_gso(sk) ||
+	unsigned int len = skb->len - TCP_SKB_CB(skb)->header.offset_ack;
+
+	if (len <= mss_now || !sk_can_gso(sk) ||
 	    skb->ip_summed == CHECKSUM_NONE) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
@@ -940,7 +944,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
 		skb_shinfo(skb)->gso_size = 0;
 		skb_shinfo(skb)->gso_type = 0;
 	} else {
-		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len, mss_now);
 		skb_shinfo(skb)->gso_size = mss_now;
 		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
 	}
@@ -1125,15 +1129,20 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
 	skb->len = skb->data_len;
 }
 
-/* Remove acked data from a packet in the transmit queue. */
-int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+/* Remove acked data from a packet in the transmit queue.
+ * Ony called before retransmit.
+ */
+static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
 {
+	u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
+
 	if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
 
 	__pskb_trim_head(skb, len);
 
 	TCP_SKB_CB(skb)->seq += len;
+	TCP_SKB_CB(skb)->header.offset_ack = 0;
 	skb->ip_summed = CHECKSUM_PARTIAL;
 
 	skb->truesize	     -= len;
@@ -2096,7 +2105,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
-		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+		if (tcp_trim_head(sk, skb))
 			return -ENOMEM;
 	}

next prev parent reply	other threads:[~2012-04-18 19:51 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-17  9:06 [BUG] ixgbe: something wrong with queue selection ? Eric Dumazet
2012-04-17  9:16 ` Jeff Kirsher
2012-04-17 16:01   ` Alexander Duyck
2012-04-17 16:38     ` John Fastabend
2012-04-17 17:07       ` Ben Hutchings
2012-04-17 16:46     ` Eric Dumazet
2012-04-17 21:38       ` TSO not 10G friendly if peer is close enough Eric Dumazet
2012-04-17 21:47         ` David Miller
2012-04-18  3:00           ` Eric Dumazet
2012-04-18 15:49         ` [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls Eric Dumazet
     [not found]           ` <4F8EF317.10504@hp.com>
2012-04-18 17:16             ` Eric Dumazet
2012-04-18 17:30               ` Rick Jones
2012-04-18 17:40                 ` Eric Dumazet
2012-04-18 18:40           ` Neal Cardwell
2012-04-18 19:18             ` Eric Dumazet
2012-04-18 19:51               ` Eric Dumazet [this message]
2012-04-19 11:10                 ` [PATCH v2 " Ilpo Järvinen
2012-04-19 11:30                   ` Eric Dumazet
2012-04-19 11:40                     ` Eric Dumazet
2012-04-19 11:57                       ` Ilpo Järvinen
2012-04-19 12:44                         ` Eric Dumazet
2012-04-20 12:27                           ` Ilpo Järvinen
2012-04-19 13:18                     ` Eric Dumazet
2012-04-19 13:52                       ` Eric Dumazet
2012-04-19 14:10                         ` Eric Dumazet
2012-04-19 17:20                           ` Rick Jones
2012-04-19 17:25                             ` Eric Dumazet
2012-04-19 17:48                               ` Rick Jones
2012-04-19 18:00                                 ` Eric Dumazet
2012-04-19 18:05                                   ` Rick Jones
2012-04-18 19:41           ` [PATCH " Vijay Subramanian
2012-04-18 19:49             ` Eric Dumazet

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:d5984e3 dfblob:0f57706 dfblob:99448f0 dfblob:bdec2e6
dfblob:de8790c dfblob:f66df37 )
 OR (
bs:"[PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1334778707.2472.333.camel@edumazet-glaptop \
    --to=eric.dumazet@gmail.com \
    --cc=davem@davemloft.net \
    --cc=maze@google.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=therbert@google.com \
    --cc=ycheng@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox