From: "David S. Miller" <davem@davemloft.net>
To: Herbert Xu <herbert@gondor.apana.org.au>
Cc: jheffner@psc.edu, ak@suse.de, niv@us.ibm.com,
andy.grover@gmail.com, anton@samba.org, netdev@oss.sgi.com
Subject: Re: bad TSO performance in 2.6.9-rc2-BK
Date: Mon, 27 Sep 2004 21:59:01 -0700
Message-ID: <20040927215901.6f65dc15.davem@davemloft.net>
In-Reply-To: <20040928003412.GA8755@gondor.apana.org.au>
On Tue, 28 Sep 2004 10:34:12 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Yes this looks very good.
It's got a nasty bug, though: I'm not updating
TCP_SKB_CB(skb)->seq, so retransmits have corrupted
sequence numbers, doh! And hey, if I update that, then
I do not need this tso_offset thingy at all.
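To make that concrete, here is a small made-up sketch (not code
from the patch) of what updating the sequence number on a partial
ACK buys: a later retransmit simply starts from the advanced seq,
so no separate tso_offset is needed. The struct and helper names
are invented for illustration, and plain comparisons stand in for
the kernel's before()/after() wraparound-safe macros:

#include <stdint.h>

/* Simplified stand-in for the tcp_skb_cb fields that matter here. */
struct example_scb {
        uint32_t seq;        /* first byte this skb still covers */
        uint32_t end_seq;    /* one past the last byte it covers */
        uint16_t tso_factor; /* MSS-sized frames remaining       */
};

/* On an ACK up to snd_una that only partially covers the TSO frame,
 * slide seq forward by whole MSS units.  A retransmit that later
 * reads scb->seq then begins at the first unacked byte, so no extra
 * tso_offset bookkeeping is required.
 */
static void example_partial_ack(struct example_scb *scb,
                                uint32_t snd_una, uint32_t mss)
{
        while (scb->tso_factor > 1 && scb->seq + mss <= snd_una) {
                scb->seq += mss;
                scb->tso_factor--;
        }
}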
> > +static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
> > + __u32 now, __s32 *seq_rtt)
> > +{
> > + struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
> > + __u32 tso_seq = scb->seq + scb->tso_offset;
> > + __u32 mss = tp->mss_cache_std;
>
> In future we should probably record the MSS in the scb just in case
> it changes.
Fixed in my updated patch below; the above bug made me
consider this exact issue.
> > - if (after(scb->end_seq, tp->snd_una))
> > + if (after(scb->end_seq, tp->snd_una)) {
> > + if (scb->tso_factor)
>
> tso_factor > 1
Good catch; fixed below as well.
> > TCP_SKB_CB(skb)->tso_factor = 1;
> > + TCP_SKB_CB(skb)->tso_offset = 1;
>
> Is this a clever trick that I don't understand? :)
What a dumb bug; also fixed below.
This should do it for RTX queue purging and congestion
window growth.
Next I'll work on the other issues. In particular:
1) Moving TSO mss calculations to tcp_current_mss()
as you suggested. Also integrating the 'large'
usage bug fixes you sent a patch for earlier.
I'm making this thing non-inline too, since it gets
expanded a lot.
2) The tcp_init_metrics() consistency fix you made.
3) Consider limiting 'factor' as jheffner and I
have suggested over the past few days. It might not
be necessary; I'll do some tests.
Note the trick below wrt. adjusting the SKB data
area lazily. I defer it to tcp_retransmit_skb();
otherwise we would copy the data area around
several times, and that would undo the gain of
zerocopy + TSO, wouldn't it? :-)
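To put made-up numbers on it: say the skb originally carried
8000 bytes (seq 1000..9000) and a partial ACK has advanced
TCP_SKB_CB(skb)->seq to 4000. Nothing touches skb->len at ACK
time; only if we really retransmit does the mismatch between
skb->len and the remaining sequence span tell us to trim the
already-acked bytes off the head. A tiny standalone sketch of
that check (the numbers and program are purely illustrative;
the real check is in the tcp_retransmit_skb() hunk below):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t seq     = 4000; /* advanced by the partial ACK          */
        uint32_t end_seq = 9000; /* end of the original TSO super-packet */
        uint32_t skb_len = 8000; /* data area untouched, still 8000 bytes */
        uint32_t span    = end_seq - seq; /* 5000 bytes still unacked    */

        if (skb_len != span)
                printf("retransmit path must trim %u already-acked bytes\n",
                       skb_len - span);
        return 0;
}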
I mention that we might want to mess with skb->truesize
and liberate space from sk->sk_wmem_alloc... but on
further reflection that might not really gain us
anything. We'll make less work for the process by
waking it up only on full TSO packet liberation.
Please scream loudly if you can find a bug in this
current patch.
===== include/net/tcp.h 1.89 vs edited =====
--- 1.89/include/net/tcp.h 2004-09-27 11:57:52 -07:00
+++ edited/include/net/tcp.h 2004-09-27 18:02:21 -07:00
@@ -1180,7 +1180,8 @@
__u16 urg_ptr; /* Valid w/URG flags is set. */
__u32 ack_seq; /* Sequence number ACK'd */
- __u32 tso_factor;
+ __u16 tso_factor; /* If > 1, TSO frame */
+ __u16 tso_mss; /* MSS that FACTOR's in terms of*/
};
#define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
===== net/ipv4/tcp_input.c 1.75 vs edited =====
--- 1.75/net/ipv4/tcp_input.c 2004-09-27 12:00:32 -07:00
+++ edited/net/ipv4/tcp_input.c 2004-09-27 21:27:08 -07:00
@@ -2355,6 +2355,86 @@
}
}
+/* There is one downside to this scheme. Although we keep the
+ * ACK clock ticking, adjusting packet counters and advancing
+ * congestion window, we do not liberate socket send buffer
+ * space.
+ *
+ * Mucking with skb->truesize and sk->sk_wmem_alloc et al.
+ * then making a write space wakeup callback is a possible
+ * future enhancement. WARNING: it is not trivial to make.
+ */
+static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
+ __u32 now, __s32 *seq_rtt)
+{
+ struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+ __u32 mss = scb->tso_mss;
+ __u32 snd_una = tp->snd_una;
+ __u32 seq = scb->seq;
+ __u32 packets_acked = 0;
+ int acked = 0;
+
+ /* If we get here, the whole TSO packet has not been
+ * acked.
+ */
+ BUG_ON(!after(scb->end_seq, snd_una));
+
+ while (!after(seq + mss, snd_una)) {
+ packets_acked++;
+ seq += mss;
+ }
+
+ if (packets_acked) {
+ __u8 sacked = scb->sacked;
+
+ /* We adjust scb->seq but we do not pskb_pull() the
+ * SKB. We let tcp_retransmit_skb() handle this case
+ * by checking skb->len against the data sequence span.
+ * This way, we avoid the pskb_pull() work unless we
+ * actually need to retransmit the SKB.
+ */
+ scb->seq = seq;
+
+ acked |= FLAG_DATA_ACKED;
+ if (sacked) {
+ if (sacked & TCPCB_RETRANS) {
+ if (sacked & TCPCB_SACKED_RETRANS)
+ tcp_dec_pcount_explicit(&tp->retrans_out,
+ packets_acked);
+ acked |= FLAG_RETRANS_DATA_ACKED;
+ *seq_rtt = -1;
+ } else if (*seq_rtt < 0)
+ *seq_rtt = now - scb->when;
+ if (sacked & TCPCB_SACKED_ACKED)
+ tcp_dec_pcount_explicit(&tp->sacked_out,
+ packets_acked);
+ if (sacked & TCPCB_LOST)
+ tcp_dec_pcount_explicit(&tp->lost_out,
+ packets_acked);
+ if (sacked & TCPCB_URG) {
+ if (tp->urg_mode &&
+ !before(scb->seq, tp->snd_up))
+ tp->urg_mode = 0;
+ }
+ } else if (*seq_rtt < 0)
+ *seq_rtt = now - scb->when;
+
+ if (tcp_get_pcount(&tp->fackets_out)) {
+ __u32 dval = min(tcp_get_pcount(&tp->fackets_out),
+ packets_acked);
+ tcp_dec_pcount_explicit(&tp->fackets_out, dval);
+ }
+ tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
+ scb->tso_factor -= packets_acked;
+
+ BUG_ON(scb->tso_factor == 0);
+ BUG_ON(!before(scb->seq, scb->end_seq));
+ }
+
+ return acked;
+}
+
+
/* Remove acknowledged frames from the retransmission queue. */
static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p)
{
@@ -2373,8 +2453,12 @@
* discard it as it's confirmed to have arrived at
* the other end.
*/
- if (after(scb->end_seq, tp->snd_una))
+ if (after(scb->end_seq, tp->snd_una)) {
+ if (scb->tso_factor > 1)
+ acked |= tcp_tso_acked(tp, skb,
+ now, &seq_rtt);
break;
+ }
/* Initial outgoing SYN's get put onto the write_queue
* just like anything else we transmit. It is not
===== net/ipv4/tcp_output.c 1.59 vs edited =====
--- 1.59/net/ipv4/tcp_output.c 2004-09-27 11:57:52 -07:00
+++ edited/net/ipv4/tcp_output.c 2004-09-27 21:27:50 -07:00
@@ -436,6 +436,7 @@
factor /= mss_std;
TCP_SKB_CB(skb)->tso_factor = factor;
}
+ TCP_SKB_CB(skb)->tso_mss = mss_std;
}
/* Function to create two new TCP segments. Shrinks the given segment
@@ -552,7 +553,7 @@
return skb->tail;
}
-static int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+static int __tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
{
if (skb_cloned(skb) &&
pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
@@ -565,11 +566,20 @@
return -ENOMEM;
}
- TCP_SKB_CB(skb)->seq += len;
skb->ip_summed = CHECKSUM_HW;
return 0;
}
+static inline int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+{
+ int err = __tcp_trim_head(sk, skb, len);
+
+ if (!err)
+ TCP_SKB_CB(skb)->seq += len;
+
+ return err;
+}
+
/* This function synchronize snd mss to current pmtu/exthdr set.
tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts
@@ -949,6 +959,7 @@
{
struct tcp_opt *tp = tcp_sk(sk);
unsigned int cur_mss = tcp_current_mss(sk, 0);
+ __u32 data_seq, data_end_seq;
int err;
/* Do not sent more than we queued. 1/4 is reserved for possible
@@ -958,6 +969,22 @@
min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf))
return -EAGAIN;
+ /* What is going on here? When TSO packets are partially ACK'd,
+ * we adjust the TCP_SKB_CB(skb)->seq value forward but we do
+ * not adjust the data area of the SKB. We defer that to here
+ * so that we can avoid the work unless we really retransmit
+ * the packet.
+ */
+ data_seq = TCP_SKB_CB(skb)->seq;
+ data_end_seq = TCP_SKB_CB(skb)->end_seq;
+ if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
+ data_end_seq--;
+
+ if (skb->len != (data_end_seq - data_seq)) {
+ if (__tcp_trim_head(sk, skb, data_end_seq - data_seq))
+ return -ENOMEM;
+ }
+
if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
BUG();
@@ -1191,6 +1218,7 @@
TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->tso_factor = 1;
+ TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
TCP_SKB_CB(skb)->seq = tp->write_seq;
@@ -1223,6 +1251,7 @@
TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST);
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->tso_factor = 1;
+ TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
/* Send it off. */
TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk, tp);
@@ -1304,6 +1333,7 @@
TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1;
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->tso_factor = 1;
+ TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
th->seq = htonl(TCP_SKB_CB(skb)->seq);
th->ack_seq = htonl(req->rcv_isn + 1);
if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
@@ -1406,6 +1436,7 @@
TCP_ECN_send_syn(sk, tp, buff);
TCP_SKB_CB(buff)->sacked = 0;
TCP_SKB_CB(buff)->tso_factor = 1;
+ TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
buff->csum = 0;
TCP_SKB_CB(buff)->seq = tp->write_seq++;
TCP_SKB_CB(buff)->end_seq = tp->write_seq;
@@ -1506,6 +1537,7 @@
TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK;
TCP_SKB_CB(buff)->sacked = 0;
TCP_SKB_CB(buff)->tso_factor = 1;
+ TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
/* Send it off, this clears delayed acks for us. */
TCP_SKB_CB(buff)->seq = TCP_SKB_CB(buff)->end_seq = tcp_acceptable_seq(sk, tp);
@@ -1541,6 +1573,7 @@
TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK;
TCP_SKB_CB(skb)->sacked = urgent;
TCP_SKB_CB(skb)->tso_factor = 1;
+ TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
/* Use a previous sequence. This should cause the other
* end to send an ack. Don't queue or clone SKB, just