netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Current 2.6.x TSO state
@ 2004-10-01  4:32 David S. Miller
  2004-10-01 10:11 ` Andi Kleen
  2004-10-01 13:06 ` Herbert Xu
  0 siblings, 2 replies; 12+ messages in thread
From: David S. Miller @ 2004-10-01  4:32 UTC (permalink / raw)
  To: netdev; +Cc: ak, jheffner, herbert

[-- Attachment #1: Type: text/plain, Size: 1696 bytes --]


Attached are the 4 TCP TSO patches I have in my tree.
They are relative to Linus's current tree and also
available at:

	bk://kernel.bkbits.net/davem/net-2.6

The quick summary is:

diff1) Smooth out TSO ack clocking by calling
       tcp_trim_head() at tcp_tso_ack() time
       and making tcp_trim_head() liberate
       socket send buffer space.

diff2) URG fix in tcp_tso_ack()

diff3) Add tcp_tso_win_divisor sysctl knob.

diff4) Obey MSS in TSO handling; shrink tcp_skb_cb

Existing known problems requiring a fix in
time for 2.6.9-final are:

1) Andi sees a performance anomaly relative to
   2.6.5 kernels.  Hopefully fixed by diff3 above;
   merely awaiting retesting by him.

2) John Heffner sees some kind of weird transfer
   hang; down/up'ing the interface makes the
   transfer finish successfully.

   He has told me he will spend some time this weekend
   trying to debug it.

Future enhancements which are not as critical as
the above:

1) Handle SACK tagging of TSO frames... somehow.
   I don't have any brilliant ideas currently.

   We could use bits to represent sub-TSO SACK
   regions.  This would impose a hard limit of
   something like 32 on the maximum TSO factor.

   Another idea is to resegment a TSO frame when
   SACKs cover portions.  This approach is my
   least favorite because SACKs can be common in
   the presence of even minor packet reordering.
   So we'd be splitting up + copying a lot.

2) Leave TSO enabled even during loss events.
   #1 is pretty much a prerequisite for #2.

   If we don't do #1 first, most SACKs get entirely
   ignored.

3) Fix up the packet counting once we have sub-TSO
   SACK tagging in place.

OK, that's enough TSO hacking for me today.


[-- Attachment #2: diff1 --]
[-- Type: application/octet-stream, Size: 8037 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/29 21:12:18-07:00 davem@nuts.davemloft.net 
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_output.c
#   2004/09/29 21:11:53-07:00 davem@nuts.davemloft.net +15 -35
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_input.c
#   2004/09/29 21:11:53-07:00 davem@nuts.davemloft.net +9 -13
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# include/net/tcp.h
#   2004/09/29 21:11:52-07:00 davem@nuts.davemloft.net +1 -0
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-09-30 21:02:34 -07:00
+++ b/include/net/tcp.h	2004-09-30 21:02:34 -07:00
@@ -944,6 +944,7 @@
 extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
+extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
 
 extern void tcp_send_probe0(struct sock *);
 extern void tcp_send_partial(struct sock *);
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-09-30 21:02:34 -07:00
+++ b/net/ipv4/tcp_input.c	2004-09-30 21:02:34 -07:00
@@ -2364,13 +2364,14 @@
  * then making a write space wakeup callback is a possible
  * future enhancement.  WARNING: it is not trivial to make.
  */
-static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
+static int tcp_tso_acked(struct sock *sk, struct sk_buff *skb,
 			 __u32 now, __s32 *seq_rtt)
 {
+	struct tcp_opt *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb); 
 	__u32 mss = scb->tso_mss;
 	__u32 snd_una = tp->snd_una;
-	__u32 seq = scb->seq;
+	__u32 orig_seq, seq;
 	__u32 packets_acked = 0;
 	int acked = 0;
 
@@ -2379,22 +2380,18 @@
 	 */
 	BUG_ON(!after(scb->end_seq, snd_una));
 
+	seq = orig_seq = scb->seq;
 	while (!after(seq + mss, snd_una)) {
 		packets_acked++;
 		seq += mss;
 	}
 
+	if (tcp_trim_head(sk, skb, (seq - orig_seq)))
+		return 0;
+
 	if (packets_acked) {
 		__u8 sacked = scb->sacked;
 
-		/* We adjust scb->seq but we do not pskb_pull() the
-		 * SKB.  We let tcp_retransmit_skb() handle this case
-		 * by checking skb->len against the data sequence span.
-		 * This way, we avoid the pskb_pull() work unless we
-		 * actually need to retransmit the SKB.
-		 */
-		scb->seq = seq;
-
 		acked |= FLAG_DATA_ACKED;
 		if (sacked) {
 			if (sacked & TCPCB_RETRANS) {
@@ -2413,7 +2410,7 @@
 							packets_acked);
 			if (sacked & TCPCB_URG) {
 				if (tp->urg_mode &&
-				    !before(scb->seq, tp->snd_up))
+				    !before(orig_seq, tp->snd_up))
 					tp->urg_mode = 0;
 			}
 		} else if (*seq_rtt < 0)
@@ -2425,7 +2422,6 @@
 			tcp_dec_pcount_explicit(&tp->fackets_out, dval);
 		}
 		tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
-		scb->tso_factor -= packets_acked;
 
 		BUG_ON(scb->tso_factor == 0);
 		BUG_ON(!before(scb->seq, scb->end_seq));
@@ -2455,7 +2451,7 @@
 		 */
 		if (after(scb->end_seq, tp->snd_una)) {
 			if (scb->tso_factor > 1)
-				acked |= tcp_tso_acked(tp, skb,
+				acked |= tcp_tso_acked(sk, skb,
 						       now, &seq_rtt);
 			break;
 		}
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-30 21:02:34 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-30 21:02:34 -07:00
@@ -525,7 +525,7 @@
  * eventually). The difference is that pulled data not copied, but
  * immediately discarded.
  */
-unsigned char * __pskb_trim_head(struct sk_buff *skb, int len)
+static unsigned char *__pskb_trim_head(struct sk_buff *skb, int len)
 {
 	int i, k, eat;
 
@@ -553,8 +553,10 @@
 	return skb->tail;
 }
 
-static int __tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len)
+int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 {
+	struct tcp_opt *tp = tcp_sk(sk);
+
 	if (skb_cloned(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
@@ -566,8 +568,14 @@
 			return -ENOMEM;
 	}
 
+	TCP_SKB_CB(skb)->seq += len;
 	skb->ip_summed = CHECKSUM_HW;
 
+	skb->truesize	     -= len;
+	sk->sk_queue_shrunk   = 1;
+	sk->sk_wmem_queued   -= len;
+	sk->sk_forward_alloc += len;
+
 	/* Any change of skb->len requires recalculation of tso
 	 * factor and mss.
 	 */
@@ -576,16 +584,6 @@
 	return 0;
 }
 
-static inline int tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len)
-{
-	int err = __tcp_trim_head(tp, skb, len);
-
-	if (!err)
-		TCP_SKB_CB(skb)->seq += len;
-
-	return err;
-}
-
 /* This function synchronize snd mss to current pmtu/exthdr set.
 
    tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts
@@ -686,11 +684,12 @@
 					68U - tp->tcp_header_len);
 
 		/* Always keep large mss multiple of real mss, but
-		 * do not exceed congestion window.
+		 * do not exceed 1/4 of the congestion window so we
+		 * can keep the ACK clock ticking.
 		 */
 		factor = large_mss / mss_now;
-		if (factor > tp->snd_cwnd)
-			factor = tp->snd_cwnd;
+		if (factor > (tp->snd_cwnd >> 2))
+			factor = max(1, tp->snd_cwnd >> 2);
 
 		tp->mss_cache = mss_now * factor;
 
@@ -1003,7 +1002,6 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
  	unsigned int cur_mss = tcp_current_mss(sk, 0);
-	__u32 data_seq, data_end_seq;
 	int err;
 
 	/* Do not sent more than we queued. 1/4 is reserved for possible
@@ -1013,24 +1011,6 @@
 	    min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf))
 		return -EAGAIN;
 
-	/* What is going on here?  When TSO packets are partially ACK'd,
-	 * we adjust the TCP_SKB_CB(skb)->seq value forward but we do
-	 * not adjust the data area of the SKB.  We defer that to here
-	 * so that we can avoid the work unless we really retransmit
-	 * the packet.
-	 */
-	data_seq = TCP_SKB_CB(skb)->seq;
-	data_end_seq = TCP_SKB_CB(skb)->end_seq;
-	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
-		data_end_seq--;
-
-	if (skb->len > (data_end_seq - data_seq)) {
-		u32 to_trim = skb->len - (data_end_seq - data_seq);
-
-		if (__tcp_trim_head(tp, skb, to_trim))
-			return -ENOMEM;
-	}		
-
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
@@ -1041,7 +1021,7 @@
 			tp->mss_cache = tp->mss_cache_std;
 		}
 
-		if (tcp_trim_head(tp, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
 			return -ENOMEM;
 	}
 

[-- Attachment #3: diff2 --]
[-- Type: application/octet-stream, Size: 845 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/30 12:42:29-07:00 davem@nuts.davemloft.net 
#   [TCP]: Check correct sequence number for URG in tcp_tso_acked().
#   
#   Noticed by Herbert Xu.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_input.c
#   2004/09/30 12:41:57-07:00 davem@nuts.davemloft.net +1 -1
#   [TCP]: Check correct sequence number for URG in tcp_tso_acked().
# 
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-09-30 21:02:47 -07:00
+++ b/net/ipv4/tcp_input.c	2004-09-30 21:02:47 -07:00
@@ -2410,7 +2410,7 @@
 							packets_acked);
 			if (sacked & TCPCB_URG) {
 				if (tp->urg_mode &&
-				    !before(orig_seq, tp->snd_up))
+				    !before(seq, tp->snd_up))
 					tp->urg_mode = 0;
 			}
 		} else if (*seq_rtt < 0)

[-- Attachment #4: diff3 --]
[-- Type: application/octet-stream, Size: 4066 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/30 20:09:28-07:00 davem@nuts.davemloft.net 
#   [TCP]: Add tcp_tso_win_divisor sysctl.
#   
#   This allows control over what percentage of
#   the congestion window can be consumed by a
#   single TSO frame.
#   
#   The setting of this parameter is a choice
#   between burstiness and building larger TSO
#   frames.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_output.c
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +19 -7
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
# net/ipv4/sysctl_net_ipv4.c
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +8 -0
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
# include/net/tcp.h
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +1 -0
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
# include/linux/sysctl.h
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +1 -0
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h
--- a/include/linux/sysctl.h	2004-09-30 21:03:00 -07:00
+++ b/include/linux/sysctl.h	2004-09-30 21:03:00 -07:00
@@ -341,6 +341,7 @@
 	NET_TCP_BIC_LOW_WINDOW=104,
 	NET_TCP_DEFAULT_WIN_SCALE=105,
 	NET_TCP_MODERATE_RCVBUF=106,
+	NET_TCP_TSO_WIN_DIVISOR=107,
 };
 
 enum {
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-09-30 21:03:00 -07:00
+++ b/include/net/tcp.h	2004-09-30 21:03:00 -07:00
@@ -609,6 +609,7 @@
 extern int sysctl_tcp_bic_fast_convergence;
 extern int sysctl_tcp_bic_low_window;
 extern int sysctl_tcp_moderate_rcvbuf;
+extern int sysctl_tcp_tso_win_divisor;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
diff -Nru a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
--- a/net/ipv4/sysctl_net_ipv4.c	2004-09-30 21:03:01 -07:00
+++ b/net/ipv4/sysctl_net_ipv4.c	2004-09-30 21:03:01 -07:00
@@ -674,6 +674,14 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= NET_TCP_TSO_WIN_DIVISOR,
+		.procname	= "tcp_tso_win_divisor",
+		.data		= &sysctl_tcp_tso_win_divisor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };
 
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-30 21:03:00 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-30 21:03:01 -07:00
@@ -45,6 +45,12 @@
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse = 1;
 
+/* This limits the percentage of the congestion window which we
+ * will allow a single TSO frame to consume.  Building TSO frames
+ * which are too large can cause TCP streams to be bursty.
+ */
+int sysctl_tcp_tso_win_divisor = 8;
+
 static __inline__
 void update_send_head(struct sock *sk, struct tcp_opt *tp, struct sk_buff *skb)
 {
@@ -658,7 +664,7 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct dst_entry *dst = __sk_dst_get(sk);
-	int do_large, mss_now;
+	unsigned int do_large, mss_now;
 
 	mss_now = tp->mss_cache_std;
 	if (dst) {
@@ -673,7 +679,7 @@
 		    !tp->urg_mode);
 
 	if (do_large) {
-		int large_mss, factor;
+		unsigned int large_mss, factor, limit;
 
 		large_mss = 65535 - tp->af_specific->net_header_len -
 			tp->ext_header_len - tp->ext2_header_len -
@@ -683,13 +689,19 @@
 			large_mss = max((tp->max_window>>1),
 					68U - tp->tcp_header_len);
 
+		factor = large_mss / mss_now;
+
 		/* Always keep large mss multiple of real mss, but
-		 * do not exceed 1/4 of the congestion window so we
-		 * can keep the ACK clock ticking.
+		 * do not exceed 1/tso_win_divisor of the congestion window
+		 * so we can keep the ACK clock ticking and minimize
+		 * bursting.
 		 */
-		factor = large_mss / mss_now;
-		if (factor > (tp->snd_cwnd >> 2))
-			factor = max(1, tp->snd_cwnd >> 2);
+		limit = tp->snd_cwnd;
+		if (sysctl_tcp_tso_win_divisor)
+			limit /= sysctl_tcp_tso_win_divisor;
+		limit = max(1U, limit);
+		if (factor > limit)
+			factor = limit;
 
 		tp->mss_cache = mss_now * factor;
 

[-- Attachment #5: diff4 --]
[-- Type: application/octet-stream, Size: 12442 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/30 20:58:53-07:00 davem@nuts.davemloft.net 
#   [TCP]: Kill tso_{factor,mss}.
#   
#   We can just use skb_shinfo(skb)->tso_{segs,size}
#   directly.  This also allows us to kill the
#   hack zone code in ip_output.c
#   
#   The original impetus for this change was a problem
#   noted by John Heffner.  We do not abide by the MSS
#   of the connection for TCP segmentation, we were using
#   the path MTU instead.  This broke various local
#   network setups with TSO enabled and is fixed as a side
#   effect of these changes.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_output.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +30 -28
#   [TCP]: Kill tso_{factor,mss}.
# 
# net/ipv4/tcp_input.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +7 -7
#   [TCP]: Kill tso_{factor,mss}.
# 
# net/ipv4/tcp.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +2 -2
#   [TCP]: Kill tso_{factor,mss}.
# 
# net/ipv4/ip_output.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +1 -14
#   [TCP]: Kill tso_{factor,mss}.
# 
# include/net/tcp.h
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +11 -7
#   [TCP]: Kill tso_{factor,mss}.
# 
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-09-30 21:03:14 -07:00
+++ b/include/net/tcp.h	2004-09-30 21:03:14 -07:00
@@ -1152,8 +1152,6 @@
 
 	__u16		urg_ptr;	/* Valid w/URG flags is set.	*/
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
-	__u16		tso_factor;	/* If > 1, TSO frame		*/
-	__u16		tso_mss;	/* MSS that FACTOR's in terms of*/
 };
 
 #define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))
@@ -1165,7 +1163,13 @@
  */
 static inline int tcp_skb_pcount(struct sk_buff *skb)
 {
-	return TCP_SKB_CB(skb)->tso_factor;
+	return skb_shinfo(skb)->tso_segs;
+}
+
+/* This is valid iff tcp_skb_pcount() > 1. */
+static inline int tcp_skb_psize(struct sk_buff *skb)
+{
+	return skb_shinfo(skb)->tso_size;
 }
 
 static inline void tcp_inc_pcount(tcp_pcount_t *count, struct sk_buff *skb)
@@ -1440,7 +1444,7 @@
 		  tcp_minshall_check(tp))));
 }
 
-extern void tcp_set_skb_tso_factor(struct sk_buff *, unsigned int);
+extern void tcp_set_skb_tso_segs(struct sk_buff *, unsigned int);
 
 /* This checks if the data bearing packet SKB (usually sk->sk_send_head)
  * should be put on the wire right now.
@@ -1448,11 +1452,11 @@
 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
 				   unsigned cur_mss, int nonagle)
 {
-	int pkts = TCP_SKB_CB(skb)->tso_factor;
+	int pkts = tcp_skb_pcount(skb);
 
 	if (!pkts) {
-		tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
-		pkts = TCP_SKB_CB(skb)->tso_factor;
+		tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+		pkts = tcp_skb_pcount(skb);
 	}
 
 	/*	RFC 1122 - section 4.2.3.4
diff -Nru a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
--- a/net/ipv4/ip_output.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/ip_output.c	2004-09-30 21:03:14 -07:00
@@ -305,7 +305,6 @@
 	struct ip_options *opt = inet->opt;
 	struct rtable *rt;
 	struct iphdr *iph;
-	u32 mtu;
 
 	/* Skip all of this if the packet is already routed,
 	 * f.e. by something like SCTP.
@@ -366,21 +365,9 @@
 	skb->nh.iph   = iph;
 	/* Transport layer set skb->h.foo itself. */
 
-	if(opt && opt->optlen) {
+	if (opt && opt->optlen) {
 		iph->ihl += opt->optlen >> 2;
 		ip_options_build(skb, opt, inet->daddr, rt, 0);
-	}
-
-	mtu = dst_pmtu(&rt->u.dst);
-	if (skb->len > mtu && (sk->sk_route_caps & NETIF_F_TSO)) {
-		unsigned int hlen;
-
-		/* Hack zone: all this must be done by TCP. */
-		hlen = ((skb->h.raw - skb->data) + (skb->h.th->doff << 2));
-		skb_shinfo(skb)->tso_size = mtu - hlen;
-		skb_shinfo(skb)->tso_segs =
-			(skb->len - hlen + skb_shinfo(skb)->tso_size - 1)/
-				skb_shinfo(skb)->tso_size - 1;
 	}
 
 	ip_select_ident_more(iph, &rt->u.dst, sk, skb_shinfo(skb)->tso_segs);
diff -Nru a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/tcp.c	2004-09-30 21:03:14 -07:00
@@ -691,7 +691,7 @@
 		skb->ip_summed = CHECKSUM_HW;
 		tp->write_seq += copy;
 		TCP_SKB_CB(skb)->end_seq += copy;
-		TCP_SKB_CB(skb)->tso_factor = 0;
+		skb_shinfo(skb)->tso_segs = 0;
 
 		if (!copied)
 			TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
@@ -938,7 +938,7 @@
 
 			tp->write_seq += copy;
 			TCP_SKB_CB(skb)->end_seq += copy;
-			TCP_SKB_CB(skb)->tso_factor = 0;
+			skb_shinfo(skb)->tso_segs = 0;
 
 			from += copy;
 			copied += copy;
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/tcp_input.c	2004-09-30 21:03:14 -07:00
@@ -1035,7 +1035,7 @@
 			if(!before(TCP_SKB_CB(skb)->seq, end_seq))
 				break;
 
-			fack_count += TCP_SKB_CB(skb)->tso_factor;
+			fack_count += tcp_skb_pcount(skb);
 
 			in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
 				!before(end_seq, TCP_SKB_CB(skb)->end_seq);
@@ -1224,7 +1224,7 @@
 	tcp_set_pcount(&tp->fackets_out, 0);
 
 	sk_stream_for_retrans_queue(skb, sk) {
-		cnt += TCP_SKB_CB(skb)->tso_factor;;
+		cnt += tcp_skb_pcount(skb);
 		TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
 		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED)) {
 
@@ -1299,7 +1299,7 @@
 		tp->undo_marker = tp->snd_una;
 
 	sk_stream_for_retrans_queue(skb, sk) {
-		cnt += TCP_SKB_CB(skb)->tso_factor;
+		cnt += tcp_skb_pcount(skb);
 		if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS)
 			tp->undo_marker = 0;
 		TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
@@ -1550,7 +1550,7 @@
 	BUG_TRAP(cnt <= tcp_get_pcount(&tp->packets_out));
 
 	sk_stream_for_retrans_queue(skb, sk) {
-		cnt -= TCP_SKB_CB(skb)->tso_factor;
+		cnt -= tcp_skb_pcount(skb);
 		if (cnt < 0 || after(TCP_SKB_CB(skb)->end_seq, high_seq))
 			break;
 		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_TAGBITS)) {
@@ -2369,7 +2369,7 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb); 
-	__u32 mss = scb->tso_mss;
+	__u32 mss = tcp_skb_psize(skb);
 	__u32 snd_una = tp->snd_una;
 	__u32 orig_seq, seq;
 	__u32 packets_acked = 0;
@@ -2423,7 +2423,7 @@
 		}
 		tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
 
-		BUG_ON(scb->tso_factor == 0);
+		BUG_ON(tcp_skb_pcount(skb) == 0);
 		BUG_ON(!before(scb->seq, scb->end_seq));
 	}
 
@@ -2450,7 +2450,7 @@
 		 * the other end.
 		 */
 		if (after(scb->end_seq, tp->snd_una)) {
-			if (scb->tso_factor > 1)
+			if (tcp_skb_pcount(skb) > 1)
 				acked |= tcp_tso_acked(sk, skb,
 						       now, &seq_rtt);
 			break;
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-30 21:03:14 -07:00
@@ -274,7 +274,7 @@
 		int sysctl_flags;
 		int err;
 
-		BUG_ON(!TCP_SKB_CB(skb)->tso_factor);
+		BUG_ON(!tcp_skb_pcount(skb));
 
 #define SYSCTL_FLAG_TSTAMPS	0x1
 #define SYSCTL_FLAG_WSCALE	0x2
@@ -428,21 +428,22 @@
 	}
 }
 
-void tcp_set_skb_tso_factor(struct sk_buff *skb, unsigned int mss_std)
+void tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_std)
 {
 	if (skb->len <= mss_std) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
 		 */
-		TCP_SKB_CB(skb)->tso_factor = 1;
+		skb_shinfo(skb)->tso_segs = 1;
+		skb_shinfo(skb)->tso_size = 0;
 	} else {
 		unsigned int factor;
 
 		factor = skb->len + (mss_std - 1);
 		factor /= mss_std;
-		TCP_SKB_CB(skb)->tso_factor = factor;
+		skb_shinfo(skb)->tso_segs = factor;
+		skb_shinfo(skb)->tso_size = mss_std;
 	}
-	TCP_SKB_CB(skb)->tso_mss = mss_std;
 }
 
 /* Function to create two new TCP segments.  Shrinks the given segment
@@ -508,8 +509,8 @@
 	}
 
 	/* Fix up tso_factor for both original and new SKB.  */
-	tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
-	tcp_set_skb_tso_factor(buff, tp->mss_cache_std);
+	tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+	tcp_set_skb_tso_segs(buff, tp->mss_cache_std);
 
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) {
 		tcp_inc_pcount(&tp->lost_out, skb);
@@ -585,7 +586,7 @@
 	/* Any change of skb->len requires recalculation of tso
 	 * factor and mss.
 	 */
-	tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
+	tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
 
 	return 0;
 }
@@ -914,8 +915,8 @@
 		    ((skb_size + next_skb_size) > mss_now))
 			return;
 
-		BUG_ON(TCP_SKB_CB(skb)->tso_factor != 1 ||
-		       TCP_SKB_CB(next_skb)->tso_factor != 1);
+		BUG_ON(tcp_skb_pcount(skb) != 1 ||
+		       tcp_skb_pcount(next_skb) != 1);
 
 		/* Ok.  We will be able to collapse the packet. */
 		__skb_unlink(next_skb, next_skb->list);
@@ -1047,14 +1048,14 @@
 		return -EAGAIN;
 
 	if (skb->len > cur_mss) {
-		int old_factor = TCP_SKB_CB(skb)->tso_factor;
+		int old_factor = tcp_skb_pcount(skb);
 		int new_factor;
 
 		if (tcp_fragment(sk, skb, cur_mss))
 			return -ENOMEM; /* We'll try again later. */
 
 		/* New SKB created, account for it. */
-		new_factor = TCP_SKB_CB(skb)->tso_factor;
+		new_factor = tcp_skb_pcount(skb);
 		tcp_dec_pcount_explicit(&tp->packets_out,
 					old_factor - new_factor);
 		tcp_inc_pcount(&tp->packets_out, skb->next);
@@ -1081,7 +1082,8 @@
 	   tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) {
 		if (!pskb_trim(skb, 0)) {
 			TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1;
-			TCP_SKB_CB(skb)->tso_factor = 1;
+			skb_shinfo(skb)->tso_segs = 1;
+			skb_shinfo(skb)->tso_size = 0;
 			skb->ip_summed = CHECKSUM_NONE;
 			skb->csum = 0;
 		}
@@ -1166,7 +1168,7 @@
 						tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS, tp->rto);
 				}
 
-				packet_cnt -= TCP_SKB_CB(skb)->tso_factor;
+				packet_cnt -= tcp_skb_pcount(skb);
 				if (packet_cnt <= 0)
 					break;
 			}
@@ -1256,8 +1258,8 @@
 		skb->csum = 0;
 		TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
 		TCP_SKB_CB(skb)->sacked = 0;
-		TCP_SKB_CB(skb)->tso_factor = 1;
-		TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+		skb_shinfo(skb)->tso_segs = 1;
+		skb_shinfo(skb)->tso_size = 0;
 
 		/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
 		TCP_SKB_CB(skb)->seq = tp->write_seq;
@@ -1289,8 +1291,8 @@
 	skb->csum = 0;
 	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST);
 	TCP_SKB_CB(skb)->sacked = 0;
-	TCP_SKB_CB(skb)->tso_factor = 1;
-	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(skb)->tso_segs = 1;
+	skb_shinfo(skb)->tso_size = 0;
 
 	/* Send it off. */
 	TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk, tp);
@@ -1371,8 +1373,8 @@
 	TCP_SKB_CB(skb)->seq = req->snt_isn;
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1;
 	TCP_SKB_CB(skb)->sacked = 0;
-	TCP_SKB_CB(skb)->tso_factor = 1;
-	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(skb)->tso_segs = 1;
+	skb_shinfo(skb)->tso_size = 0;
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(req->rcv_isn + 1);
 	if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
@@ -1474,8 +1476,8 @@
 	TCP_SKB_CB(buff)->flags = TCPCB_FLAG_SYN;
 	TCP_ECN_send_syn(sk, tp, buff);
 	TCP_SKB_CB(buff)->sacked = 0;
-	TCP_SKB_CB(buff)->tso_factor = 1;
-	TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(buff)->tso_segs = 1;
+	skb_shinfo(buff)->tso_size = 0;
 	buff->csum = 0;
 	TCP_SKB_CB(buff)->seq = tp->write_seq++;
 	TCP_SKB_CB(buff)->end_seq = tp->write_seq;
@@ -1575,8 +1577,8 @@
 		buff->csum = 0;
 		TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK;
 		TCP_SKB_CB(buff)->sacked = 0;
-		TCP_SKB_CB(buff)->tso_factor = 1;
-		TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
+		skb_shinfo(buff)->tso_segs = 1;
+		skb_shinfo(buff)->tso_size = 0;
 
 		/* Send it off, this clears delayed acks for us. */
 		TCP_SKB_CB(buff)->seq = TCP_SKB_CB(buff)->end_seq = tcp_acceptable_seq(sk, tp);
@@ -1611,8 +1613,8 @@
 	skb->csum = 0;
 	TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(skb)->sacked = urgent;
-	TCP_SKB_CB(skb)->tso_factor = 1;
-	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(skb)->tso_segs = 1;
+	skb_shinfo(skb)->tso_size = 0;
 
 	/* Use a previous sequence.  This should cause the other
 	 * end to send an ack.  Don't queue or clone SKB, just
@@ -1656,8 +1658,8 @@
 					sk->sk_route_caps &= ~NETIF_F_TSO;
 					tp->mss_cache = tp->mss_cache_std;
 				}
-			} else if (!TCP_SKB_CB(skb)->tso_factor)
-				tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
+			} else if (!tcp_skb_pcount(skb))
+				tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
 
 			TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
 			TCP_SKB_CB(skb)->when = tcp_time_stamp;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01  4:32 Current 2.6.x TSO state David S. Miller
@ 2004-10-01 10:11 ` Andi Kleen
  2004-10-01 19:47   ` David S. Miller
  2004-10-01 13:06 ` Herbert Xu
  1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2004-10-01 10:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, jheffner, herbert

On Thu, 30 Sep 2004 21:32:21 -0700
"David S. Miller" <davem@davemloft.net> wrote:


> 
> 1) Andi sees performance anomaly to 2.6.5 kernels.
>    Hopefully fixed by diff3 above, merely awaiting
>    retesting by him.

Unfortunately it didn't fix the problem; it's even
a bit slower with TSO than the previous kernel I
tested.

The stretch ACKs are still quite visible; see
http://www.firstfloor.org/~andi/tso-stretch-ack1.gz
for the full log.

I will try to do the ACK instrumentation on the
receiver you suggested later; unfortunately I have
some other urgent things to do first.

Do you want me to play with the new sysctl too?

tcptrace output for the last run as overview:

      complete conn: yes
        first packet:  Fri Oct  1 12:02:39.725840 2004
        last packet:   Fri Oct  1 12:02:49.729142 2004
        elapsed time:  0:00:10.003302
        total packets: 315102
        filename:      /tmp/LOG
   e->f:                              f->e:
     total packets:        298495           total packets:         16607      
     ack pkts sent:        298494           ack pkts sent:         16607      
     pure acks sent:            2           pure acks sent:        16605      
     unique bytes sent: 432215424           unique bytes sent:         0      
     actual data pkts:     298492           actual data pkts:          0      
     actual data bytes: 432215424           actual data bytes:         0      
     rexmt data pkts:           0           rexmt data pkts:           0      
     rexmt data bytes:          0           rexmt data bytes:          0      
     outoforder pkts:           0           outoforder pkts:           0      
     pushed data pkts:      16715           pushed data pkts:          0      
     SYN/FIN pkts sent:       1/1           SYN/FIN pkts sent:       1/1      
     req 1323 ws/ts:          Y/Y           req 1323 ws/ts:          Y/Y      
     adv wind scale:            2           adv wind scale:            0      
     req sack:                  Y           req sack:                  Y      
     sacks sent:                0           sacks sent:                0      
     mss requested:          1460 bytes     mss requested:          1460 bytes
     max segm size:          1448 bytes     max segm size:             0 bytes
     min segm size:           456 bytes     min segm size:             0 bytes
     avg segm size:          1447 bytes     avg segm size:             0 bytes
     max win adv:            5840 bytes     max win adv:           63712 bytes
     min win adv:            5840 bytes     min win adv:            5792 bytes
     zero win adv:              0 times     zero win adv:              0 times
     avg win adv:            5840 bytes     avg win adv:           63673 bytes
     initial window:         4344 bytes     initial window:            0 bytes
     initial window:            3 pkts      initial window:            0 pkts 
     ttl stream length: 432215424 bytes     ttl stream length:         0 bytes
     missed data:               0 bytes     missed data:               0 bytes
     truncated data:    423260664 bytes     truncated data:            0 bytes
     truncated packets:    298492 pkts      truncated packets:         0 pkts 
     data xmit time:       10.003 secs      data xmit time:        0.000 secs 
     idletime max:            8.4 ms        idletime max:            8.6 ms   
     throughput:         43207275 Bps       throughput:                0 Bps  


No SACKs etc. 

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01  4:32 Current 2.6.x TSO state David S. Miller
  2004-10-01 10:11 ` Andi Kleen
@ 2004-10-01 13:06 ` Herbert Xu
  2004-10-03 21:50   ` David S. Miller
  1 sibling, 1 reply; 12+ messages in thread
From: Herbert Xu @ 2004-10-01 13:06 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, ak, jheffner

On Thu, Sep 30, 2004 at 09:32:21PM -0700, David S. Miller wrote:
> 
> diff4) Obey MSS in tso handling, shrink tcp_skb_cb

This looks great.  But can we please rename tcp_skb_psize to
tcp_skb_mss?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 10:11 ` Andi Kleen
@ 2004-10-01 19:47   ` David S. Miller
  2004-10-01 19:51     ` Andi Kleen
  0 siblings, 1 reply; 12+ messages in thread
From: David S. Miller @ 2004-10-01 19:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: netdev, jheffner, herbert

On Fri, 1 Oct 2004 12:11:23 +0200
Andi Kleen <ak@suse.de> wrote:

> Do you want me to play with the new sysctl too?

Yes.

>      max win adv:            5840 bytes     max win adv:           63712 bytes
>      min win adv:            5840 bytes     min win adv:            5792 bytes

That stinks that the receiver is only using a 64K window,
that's way too small for gigabit.
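
The window a sender needs scales with the bandwidth-delay product, so a quick sketch of the arithmetic shows why 64K chokes gigabit (the RTT figures below are my own illustrative assumptions, not measurements from this thread):

```python
# Bandwidth-delay product sketch: the receive window needed to keep a
# link full is rate (bytes/sec) * RTT (sec), and a window-limited
# sender can move at most one window per round trip.

def window_needed(rate_bps, rtt_s):
    """Bytes of receive window needed to saturate the link."""
    return int(rate_bps / 8 * rtt_s)

def max_rate(window_bytes, rtt_s):
    """Max throughput (bytes/sec) a window-limited sender can achieve."""
    return window_bytes / rtt_s

gigabit = 1_000_000_000  # 1 Gbit/s

for rtt_ms in (0.1, 1.0, 10.0):
    rtt = rtt_ms / 1000.0
    print(f"RTT {rtt_ms:5.1f} ms: need {window_needed(gigabit, rtt):>9} bytes; "
          f"a 64KB window caps at {max_rate(65535, rtt) / 1e6:8.2f} MB/s")
```

Even at a 1 ms RTT a 64KB window caps the transfer around 65 MB/s, and it degrades linearly as the RTT grows.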

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 19:47   ` David S. Miller
@ 2004-10-01 19:51     ` Andi Kleen
  2004-10-01 19:56       ` David S. Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2004-10-01 19:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, netdev, jheffner, herbert

On Fri, Oct 01, 2004 at 12:47:33PM -0700, David S. Miller wrote:
> On Fri, 1 Oct 2004 12:11:23 +0200
> Andi Kleen <ak@suse.de> wrote:
> 
> > Do you want me to play with the new sysctl too?
> 
> Yes.

Already did, see my other mail (I should not read incoming mail in the wrong order.)

> 
> >      max win adv:            5840 bytes     max win adv:           63712 bytes
> >      min win adv:            5840 bytes     min win adv:            5792 bytes
> 
> That stinks that the receiver is only using a 64K window,
> that's way too small for gigabit.

Just using the default, no tuning.

I have some patches in the pipeline to do automatic window tuning
based on link speed, as reported via dev->features. But they were
really intended for 10Gbit/s (where all the defaults are completely
inadequate), and they need a bit more work.

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 19:51     ` Andi Kleen
@ 2004-10-01 19:56       ` David S. Miller
  2004-10-01 20:01         ` Andi Kleen
  2004-10-01 23:19         ` Andi Kleen
  0 siblings, 2 replies; 12+ messages in thread
From: David S. Miller @ 2004-10-01 19:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, netdev, jheffner, herbert

On Fri, 1 Oct 2004 21:51:47 +0200
Andi Kleen <ak@suse.de> wrote:

> > >      max win adv:            5840 bytes     max win adv:           63712 bytes
> > >      min win adv:            5840 bytes     min win adv:            5792 bytes
> > 
> > That stinks that the receiver is only using a 64K window,
> > that's way too small for gigabit.
> 
> Just using the default, no tuning.
> 
> I have some patches in the pipeline to do automatic window tuning
> based on link speed based on dev->features. But they were actually more 
> intended for 10Gbit/s (where all the defaults are completely inadequate)  
> And it needs a bit more work. 

As mentioned, the TCP receive buffer auto-tuning takes care
of all of this in 2.6.6 and later.  It's just that 2.6.5 doesn't
have John Heffner's auto-tuning code, which is why your test
case is so stuck in the mud.

Also, the stretch ACKs are quite normal.  If the receiver can't
advertise a larger window, we won't spit out an ACK until
the ACK timeout.
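
That ACK-stretching behavior can be sketched with a toy model (my own illustration of the delayed-ACK idea, not the actual kernel heuristic; the `delack_every` threshold is an arbitrary assumption):

```python
# Toy model of why ACKs get "stretched" when the receive window can't
# open: the receiver ACKs promptly when it has a window update to
# advertise, otherwise it sits on the data until the delayed-ACK
# timer fires.

def acks_for_segments(n_segments, window_can_open, delack_every=8):
    """Count ACKs sent for n_segments arriving back-to-back."""
    acks = 0
    unacked = 0
    for _ in range(n_segments):
        unacked += 1
        if window_can_open and unacked >= 2:
            acks += 1          # quick ACK carrying a window update
            unacked = 0
        elif unacked >= delack_every:
            acks += 1          # delayed-ACK timer fires
            unacked = 0
    return acks

print(acks_for_segments(44, window_can_open=True))   # 22 ACKs
print(acks_for_segments(44, window_can_open=False))  # 5 ACKs
```

The same amount of data produces far fewer, larger-stride ACKs once the window stops opening, which is exactly the stretch-ACK pattern in the trace.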

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 19:56       ` David S. Miller
@ 2004-10-01 20:01         ` Andi Kleen
  2004-10-01 20:15           ` David S. Miller
  2004-10-01 20:33           ` John Heffner
  2004-10-01 23:19         ` Andi Kleen
  1 sibling, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2004-10-01 20:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, netdev, jheffner, herbert

> As mentioned, the TCP receive buffer auto-tuning takes care
> of all of this in 2.6.6 and later.  It's just that 2.6.5 doesn't
> have John Heffner's auto-tuning code, which is why your test
> case is so stuck in the mud.
> 
> Also, the stretch ACKs are quite normal.  If the receiver can't
> advertise a larger window, we won't spit out an ACK until
> the ACK timeout.

Ok, but why is the TSO case still slower? 

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 20:01         ` Andi Kleen
@ 2004-10-01 20:15           ` David S. Miller
  2004-10-01 20:33           ` John Heffner
  1 sibling, 0 replies; 12+ messages in thread
From: David S. Miller @ 2004-10-01 20:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, netdev, jheffner, herbert

On Fri, 1 Oct 2004 22:01:02 +0200
Andi Kleen <ak@suse.de> wrote:

> > As mentioned, the TCP receive buffer auto-tuning takes care
> > of all of this in 2.6.6 and later.  It's just that 2.6.5 doesn't
> > have John Heffner's auto-tuning code, which is why your test
> > case is so stuck in the mud.
> > 
> > Also, the stretch ACKs are quite normal.  If the receiver can't
> > advertise a larger window, we won't spit out an ACK until
> > the ACK timeout.
> 
> Ok, but why is the TSO case still slower? 

It isn't for me.  With the auto-tuning code present at the
receiver, at least in my case, the TSO case runs more quickly
because my sender is PCI bandwidth limited: the tg3 sits
on a 33MHz/32-bit PCI bus.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 20:01         ` Andi Kleen
  2004-10-01 20:15           ` David S. Miller
@ 2004-10-01 20:33           ` John Heffner
  1 sibling, 0 replies; 12+ messages in thread
From: John Heffner @ 2004-10-01 20:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, netdev, herbert

On Fri, 1 Oct 2004, Andi Kleen wrote:

> > As mentioned, the TCP receive buffer auto-tuning takes care
> > of all of this in 2.6.6 and later.  It's just that 2.6.5 doesn't
> > have John Heffner's auto-tuning code, which is why your test
> > case is so stuck in the mud.
> >
> > Also, the stretch ACKs are quite normal.  If the receiver can't
> > advertise a larger window, we won't spit out an ACK until
> > the ACK timeout.
>
> Ok, but why is the TSO case still slower?

Because with TSO enabled, you get bigger back-to-back bursts during which
the receiving app can't run.
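
The burst arithmetic is easy to sketch (the 1448-byte MSS matches Andi's trace earlier in the thread; the TSO factor and link rate here are illustrative assumptions on my part):

```python
# Back-of-the-envelope wire time of a single TSO burst: a ~64KB TSO
# frame goes out as back-to-back segments that monopolize the link,
# and the receiving application gets no chance to drain the socket
# until the burst ends.

def burst_wire_time_us(burst_bytes, rate_bps):
    """Microseconds the burst occupies the link."""
    return burst_bytes * 8 / rate_bps * 1e6

mss = 1448
tso_factor = 44                 # roughly a 64KB TSO frame's worth
burst = mss * tso_factor        # 63712 bytes

print(f"{burst} byte burst at 1 Gbit/s: "
      f"{burst_wire_time_us(burst, 1_000_000_000):.0f} us on the wire")
# roughly half a millisecond of solid back-to-back data
```

Half a millisecond of uninterrupted arrival is plenty to overrun a receiver whose application isn't scheduled, which fits the slowdown Andi observes.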

  -John

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 19:56       ` David S. Miller
  2004-10-01 20:01         ` Andi Kleen
@ 2004-10-01 23:19         ` Andi Kleen
  2004-10-02  0:04           ` David S. Miller
  1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2004-10-01 23:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: Andi Kleen, netdev, jheffner, herbert

> As mentioned, the TCP receive buffer auto-tuning takes care
> of all of this in 2.6.6 and later.  It's just 2.6.5 doesn't

How do you explain the 2-4MB/s less with manually increased 
receive buffers? 

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 23:19         ` Andi Kleen
@ 2004-10-02  0:04           ` David S. Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David S. Miller @ 2004-10-02  0:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, netdev, jheffner, herbert

On Sat, 2 Oct 2004 01:19:39 +0200
Andi Kleen <ak@suse.de> wrote:

> > As mentioned, the TCP receive buffer auto-tuning takes care
> > of all of this in 2.6.6 and later.  It's just 2.6.5 doesn't
> 
> How do you explain the 2-4MB/s less with manually increased 
> receive buffers? 

Some scheduling differences, I suppose.

Frankly, I've done what I can with the TSO stuff at this point.

All I get from you is "it's slower" and no code; I've had to
write and fix and debug everything for you.  The performance
is close to or on par with non-TSO, and more importantly TSO
abides by the congestion window and MSS values properly now.
That's 10 times more important than a 2-4MB/s performance
difference.

Or maybe I should revert all of the TSO work so that all SpecWEB
submissions done with 2.6.x kernels get invalidated?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Current 2.6.x TSO state
  2004-10-01 13:06 ` Herbert Xu
@ 2004-10-03 21:50   ` David S. Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David S. Miller @ 2004-10-03 21:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev, ak, jheffner

On Fri, 1 Oct 2004 23:06:09 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Thu, Sep 30, 2004 at 09:32:21PM -0700, David S. Miller wrote:
> > 
> > diff4) Obey MSS in tso handling, shrink tcp_skb_cb
> 
> This looks great.  But can we please rename tcp_skb_psize to
> tcp_skb_mss?

Sure, no problem, I've done that in my tree.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2004-10-03 21:50 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-01  4:32 Current 2.6.x TSO state David S. Miller
2004-10-01 10:11 ` Andi Kleen
2004-10-01 19:47   ` David S. Miller
2004-10-01 19:51     ` Andi Kleen
2004-10-01 19:56       ` David S. Miller
2004-10-01 20:01         ` Andi Kleen
2004-10-01 20:15           ` David S. Miller
2004-10-01 20:33           ` John Heffner
2004-10-01 23:19         ` Andi Kleen
2004-10-02  0:04           ` David S. Miller
2004-10-01 13:06 ` Herbert Xu
2004-10-03 21:50   ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).