From: "David S. Miller" <davem@davemloft.net>
To: netdev@oss.sgi.com
Cc: ak@suse.de, jheffner@psc.edu, herbert@gondor.apana.org.au
Subject: Current 2.6.x TSO state
Date: Thu, 30 Sep 2004 21:32:21 -0700
Message-ID: <20040930213221.06a3f5b3.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1696 bytes --]


Attached are the 4 TCP TSO patches I have in my tree.
They are relative to Linus's current tree and also
available at:

	bk://kernel.bkbits.net/davem/net-2.6

The quick summary is:

diff1) Smooth out TSO ack clocking by calling
       tcp_trim_head() at tcp_tso_acked() time
       and making tcp_trim_head() liberate
       socket send buffer space.

diff2) URG sequence check fix in tcp_tso_acked()

diff3) Add tcp_tso_win_divisor sysctl knob (see the
       sketch below).

diff4) Obey the connection MSS in TSO handling, shrink
       tcp_skb_cb (second sketch below).
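
The TSO factor limit that diff1 introduces and diff3 makes
tunable boils down to roughly the following (a minimal sketch
in plain C with illustrative names, not the exact
tcp_current_mss() code):

	/* Clamp the TSO factor so that a single TSO frame
	 * consumes at most 1/tso_win_divisor of the congestion
	 * window, keeping the ACK clock ticking.  A divisor of
	 * zero means "no limit".
	 */
	unsigned int tso_factor_limit(unsigned int large_mss,
				      unsigned int mss_now,
				      unsigned int snd_cwnd,
				      int win_divisor)
	{
		unsigned int factor = large_mss / mss_now;
		unsigned int limit = snd_cwnd;

		if (win_divisor)
			limit /= win_divisor;
		if (limit < 1)
			limit = 1;	/* always allow one segment */
		if (factor > limit)
			factor = limit;

		return factor;	/* mss_cache = mss_now * factor */
	}

With the default divisor of 8, one frame can consume at most
1/8 of the congestion window; raising the divisor reduces
burstiness, lowering it builds larger frames, and 0 removes
the limit entirely.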

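The MSS obedience in diff4 means the segment count comes from
the connection's MSS rather than being guessed from the path
MTU in ip_output.c.  The accounting itself is just a ceiling
division (sketch only; the real tcp_set_skb_tso_segs() also
stores the MSS in skb_shinfo(skb)->tso_size):

	/* tso_segs = ceil(len / mss), skipping the divide in
	 * the common single-segment case.
	 */
	unsigned int tso_seg_count(unsigned int len, unsigned int mss)
	{
		if (len <= mss)
			return 1;
		return (len + mss - 1) / mss;
	}
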
Existing known problems requiring a fix in
time for 2.6.9-final are:

1) Andi sees a performance anomaly relative to 2.6.5
   kernels.  Hopefully fixed by diff3 above; merely
   awaiting retesting by him.

2) John Heffner sees some kind of weird transfer
   hang; downing and upping the interface makes the
   transfer finish successfully.

   He has told me he will spend some time this weekend
   trying to debug it.

Future enhancements that are not as critical as
the above:

1) Handle SACK tagging of TSO frames... somehow.
   I don't have any brilliant ideas currently.

   We could use a bitmap to represent sub-TSO SACK
   regions (see the sketch after this list).  This
   would impose a hard limit of something like 32 on
   the maximum TSO factor.

   Another idea is to resegment a TSO frame when
   SACKs cover portions of it.  This is my least
   favorite approach because SACKs can be common in
   the presence of even minor packet reordering,
   so we'd be doing a lot of splitting and copying.

2) Leave TSO enabled even during loss events.
   #1 is pretty much a prerequisite for #2.

   If we don't do #1 first, most SACKs get entirely
   ignored.

3) Fix up the packet counting once we have sub-TSO
   SACK tagging in place.
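
To make the bitmap idea in #1 concrete, here is what sub-TSO
SACK tagging could look like.  This is purely hypothetical --
none of these names exist in the tree -- and it assumes the
kernel's before()/after() sequence-comparison helpers plus a
u32 mask, which is where the factor-32 limit above comes from:

	/* One bit per MSS-sized sub-segment of a TSO frame.
	 * Bit i gets set iff sub-segment i lies entirely
	 * inside the SACK block [sack_start, sack_end).
	 */
	static void tso_mark_sacked(u32 *sacked_mask,
				    u32 skb_seq, u32 skb_end_seq,
				    u32 mss, u32 sack_start,
				    u32 sack_end)
	{
		u32 seq = skb_seq;
		int i;

		for (i = 0; i < 32 && !after(seq + mss, skb_end_seq);
		     i++, seq += mss) {
			if (!before(seq, sack_start) &&
			    !after(seq + mss, sack_end))
				*sacked_mask |= (1U << i);
		}
	}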

Ok, that's enough TSO hacking for me today.


[-- Attachment #2: diff1 --]
[-- Type: application/octet-stream, Size: 8037 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/29 21:12:18-07:00 davem@nuts.davemloft.net 
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_output.c
#   2004/09/29 21:11:53-07:00 davem@nuts.davemloft.net +15 -35
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_input.c
#   2004/09/29 21:11:53-07:00 davem@nuts.davemloft.net +9 -13
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# include/net/tcp.h
#   2004/09/29 21:11:52-07:00 davem@nuts.davemloft.net +1 -0
#   [TCP]: Smooth out TSO ack clocking.
#   
#   - Export tcp_trim_head() and call it directly from
#     tcp_tso_acked().  This also fixes URG handling.
#   
#   - Make tcp_trim_head() adjust the skb->truesize of
#     the packet and liberate that space from the socket
#     send buffer.
#   
#   - In tcp_current_mss(), limit TSO factor to 1/4 of
#     snd_cwnd.  The idea is from John Heffner.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-09-30 21:02:34 -07:00
+++ b/include/net/tcp.h	2004-09-30 21:02:34 -07:00
@@ -944,6 +944,7 @@
 extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
 extern void tcp_xmit_retransmit_queue(struct sock *);
 extern void tcp_simple_retransmit(struct sock *);
+extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
 
 extern void tcp_send_probe0(struct sock *);
 extern void tcp_send_partial(struct sock *);
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-09-30 21:02:34 -07:00
+++ b/net/ipv4/tcp_input.c	2004-09-30 21:02:34 -07:00
@@ -2364,13 +2364,14 @@
  * then making a write space wakeup callback is a possible
  * future enhancement.  WARNING: it is not trivial to make.
  */
-static int tcp_tso_acked(struct tcp_opt *tp, struct sk_buff *skb,
+static int tcp_tso_acked(struct sock *sk, struct sk_buff *skb,
 			 __u32 now, __s32 *seq_rtt)
 {
+	struct tcp_opt *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb); 
 	__u32 mss = scb->tso_mss;
 	__u32 snd_una = tp->snd_una;
-	__u32 seq = scb->seq;
+	__u32 orig_seq, seq;
 	__u32 packets_acked = 0;
 	int acked = 0;
 
@@ -2379,22 +2380,18 @@
 	 */
 	BUG_ON(!after(scb->end_seq, snd_una));
 
+	seq = orig_seq = scb->seq;
 	while (!after(seq + mss, snd_una)) {
 		packets_acked++;
 		seq += mss;
 	}
 
+	if (tcp_trim_head(sk, skb, (seq - orig_seq)))
+		return 0;
+
 	if (packets_acked) {
 		__u8 sacked = scb->sacked;
 
-		/* We adjust scb->seq but we do not pskb_pull() the
-		 * SKB.  We let tcp_retransmit_skb() handle this case
-		 * by checking skb->len against the data sequence span.
-		 * This way, we avoid the pskb_pull() work unless we
-		 * actually need to retransmit the SKB.
-		 */
-		scb->seq = seq;
-
 		acked |= FLAG_DATA_ACKED;
 		if (sacked) {
 			if (sacked & TCPCB_RETRANS) {
@@ -2413,7 +2410,7 @@
 							packets_acked);
 			if (sacked & TCPCB_URG) {
 				if (tp->urg_mode &&
-				    !before(scb->seq, tp->snd_up))
+				    !before(orig_seq, tp->snd_up))
 					tp->urg_mode = 0;
 			}
 		} else if (*seq_rtt < 0)
@@ -2425,7 +2422,6 @@
 			tcp_dec_pcount_explicit(&tp->fackets_out, dval);
 		}
 		tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
-		scb->tso_factor -= packets_acked;
 
 		BUG_ON(scb->tso_factor == 0);
 		BUG_ON(!before(scb->seq, scb->end_seq));
@@ -2455,7 +2451,7 @@
 		 */
 		if (after(scb->end_seq, tp->snd_una)) {
 			if (scb->tso_factor > 1)
-				acked |= tcp_tso_acked(tp, skb,
+				acked |= tcp_tso_acked(sk, skb,
 						       now, &seq_rtt);
 			break;
 		}
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-30 21:02:34 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-30 21:02:34 -07:00
@@ -525,7 +525,7 @@
  * eventually). The difference is that pulled data not copied, but
  * immediately discarded.
  */
-unsigned char * __pskb_trim_head(struct sk_buff *skb, int len)
+static unsigned char *__pskb_trim_head(struct sk_buff *skb, int len)
 {
 	int i, k, eat;
 
@@ -553,8 +553,10 @@
 	return skb->tail;
 }
 
-static int __tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len)
+int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 {
+	struct tcp_opt *tp = tcp_sk(sk);
+
 	if (skb_cloned(skb) &&
 	    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
 		return -ENOMEM;
@@ -566,8 +568,14 @@
 			return -ENOMEM;
 	}
 
+	TCP_SKB_CB(skb)->seq += len;
 	skb->ip_summed = CHECKSUM_HW;
 
+	skb->truesize	     -= len;
+	sk->sk_queue_shrunk   = 1;
+	sk->sk_wmem_queued   -= len;
+	sk->sk_forward_alloc += len;
+
 	/* Any change of skb->len requires recalculation of tso
 	 * factor and mss.
 	 */
@@ -576,16 +584,6 @@
 	return 0;
 }
 
-static inline int tcp_trim_head(struct tcp_opt *tp, struct sk_buff *skb, u32 len)
-{
-	int err = __tcp_trim_head(tp, skb, len);
-
-	if (!err)
-		TCP_SKB_CB(skb)->seq += len;
-
-	return err;
-}
-
 /* This function synchronize snd mss to current pmtu/exthdr set.
 
    tp->user_mss is mss set by user by TCP_MAXSEG. It does NOT counts
@@ -686,11 +684,12 @@
 					68U - tp->tcp_header_len);
 
 		/* Always keep large mss multiple of real mss, but
-		 * do not exceed congestion window.
+		 * do not exceed 1/4 of the congestion window so we
+		 * can keep the ACK clock ticking.
 		 */
 		factor = large_mss / mss_now;
-		if (factor > tp->snd_cwnd)
-			factor = tp->snd_cwnd;
+		if (factor > (tp->snd_cwnd >> 2))
+			factor = max(1, tp->snd_cwnd >> 2);
 
 		tp->mss_cache = mss_now * factor;
 
@@ -1003,7 +1002,6 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
  	unsigned int cur_mss = tcp_current_mss(sk, 0);
-	__u32 data_seq, data_end_seq;
 	int err;
 
 	/* Do not sent more than we queued. 1/4 is reserved for possible
@@ -1013,24 +1011,6 @@
 	    min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf))
 		return -EAGAIN;
 
-	/* What is going on here?  When TSO packets are partially ACK'd,
-	 * we adjust the TCP_SKB_CB(skb)->seq value forward but we do
-	 * not adjust the data area of the SKB.  We defer that to here
-	 * so that we can avoid the work unless we really retransmit
-	 * the packet.
-	 */
-	data_seq = TCP_SKB_CB(skb)->seq;
-	data_end_seq = TCP_SKB_CB(skb)->end_seq;
-	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
-		data_end_seq--;
-
-	if (skb->len > (data_end_seq - data_seq)) {
-		u32 to_trim = skb->len - (data_end_seq - data_seq);
-
-		if (__tcp_trim_head(tp, skb, to_trim))
-			return -ENOMEM;
-	}		
-
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
@@ -1041,7 +1021,7 @@
 			tp->mss_cache = tp->mss_cache_std;
 		}
 
-		if (tcp_trim_head(tp, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
 			return -ENOMEM;
 	}
 

[-- Attachment #3: diff2 --]
[-- Type: application/octet-stream, Size: 845 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/30 12:42:29-07:00 davem@nuts.davemloft.net 
#   [TCP]: Check correct sequence number for URG in tcp_tso_acked().
#   
#   Noticed by Herbert Xu.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_input.c
#   2004/09/30 12:41:57-07:00 davem@nuts.davemloft.net +1 -1
#   [TCP]: Check correct sequence number for URG in tcp_tso_acked().
# 
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-09-30 21:02:47 -07:00
+++ b/net/ipv4/tcp_input.c	2004-09-30 21:02:47 -07:00
@@ -2410,7 +2410,7 @@
 							packets_acked);
 			if (sacked & TCPCB_URG) {
 				if (tp->urg_mode &&
-				    !before(orig_seq, tp->snd_up))
+				    !before(seq, tp->snd_up))
 					tp->urg_mode = 0;
 			}
 		} else if (*seq_rtt < 0)

[-- Attachment #4: diff3 --]
[-- Type: application/octet-stream, Size: 4066 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/30 20:09:28-07:00 davem@nuts.davemloft.net 
#   [TCP]: Add tcp_tso_win_divisor sysctl.
#   
#   This allows control over what percentage of
#   the congestion window can be consumed by a
#   single TSO frame.
#   
#   The setting of this parameter is a choice
#   between burstiness and building larger TSO
#   frames.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_output.c
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +19 -7
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
# net/ipv4/sysctl_net_ipv4.c
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +8 -0
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
# include/net/tcp.h
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +1 -0
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
# include/linux/sysctl.h
#   2004/09/30 20:07:20-07:00 davem@nuts.davemloft.net +1 -0
#   [TCP]: Add tcp_tso_win_divisor sysctl.
# 
diff -Nru a/include/linux/sysctl.h b/include/linux/sysctl.h
--- a/include/linux/sysctl.h	2004-09-30 21:03:00 -07:00
+++ b/include/linux/sysctl.h	2004-09-30 21:03:00 -07:00
@@ -341,6 +341,7 @@
 	NET_TCP_BIC_LOW_WINDOW=104,
 	NET_TCP_DEFAULT_WIN_SCALE=105,
 	NET_TCP_MODERATE_RCVBUF=106,
+	NET_TCP_TSO_WIN_DIVISOR=107,
 };
 
 enum {
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-09-30 21:03:00 -07:00
+++ b/include/net/tcp.h	2004-09-30 21:03:00 -07:00
@@ -609,6 +609,7 @@
 extern int sysctl_tcp_bic_fast_convergence;
 extern int sysctl_tcp_bic_low_window;
 extern int sysctl_tcp_moderate_rcvbuf;
+extern int sysctl_tcp_tso_win_divisor;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
diff -Nru a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
--- a/net/ipv4/sysctl_net_ipv4.c	2004-09-30 21:03:01 -07:00
+++ b/net/ipv4/sysctl_net_ipv4.c	2004-09-30 21:03:01 -07:00
@@ -674,6 +674,14 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= NET_TCP_TSO_WIN_DIVISOR,
+		.procname	= "tcp_tso_win_divisor",
+		.data		= &sysctl_tcp_tso_win_divisor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };
 
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-30 21:03:00 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-30 21:03:01 -07:00
@@ -45,6 +45,12 @@
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse = 1;
 
+/* This limits the percentage of the congestion window which we
+ * will allow a single TSO frame to consume.  Building TSO frames
+ * which are too large can cause TCP streams to be bursty.
+ */
+int sysctl_tcp_tso_win_divisor = 8;
+
 static __inline__
 void update_send_head(struct sock *sk, struct tcp_opt *tp, struct sk_buff *skb)
 {
@@ -658,7 +664,7 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct dst_entry *dst = __sk_dst_get(sk);
-	int do_large, mss_now;
+	unsigned int do_large, mss_now;
 
 	mss_now = tp->mss_cache_std;
 	if (dst) {
@@ -673,7 +679,7 @@
 		    !tp->urg_mode);
 
 	if (do_large) {
-		int large_mss, factor;
+		unsigned int large_mss, factor, limit;
 
 		large_mss = 65535 - tp->af_specific->net_header_len -
 			tp->ext_header_len - tp->ext2_header_len -
@@ -683,13 +689,19 @@
 			large_mss = max((tp->max_window>>1),
 					68U - tp->tcp_header_len);
 
+		factor = large_mss / mss_now;
+
 		/* Always keep large mss multiple of real mss, but
-		 * do not exceed 1/4 of the congestion window so we
-		 * can keep the ACK clock ticking.
+		 * do not exceed 1/tso_win_divisor of the congestion window
+		 * so we can keep the ACK clock ticking and minimize
+		 * bursting.
 		 */
-		factor = large_mss / mss_now;
-		if (factor > (tp->snd_cwnd >> 2))
-			factor = max(1, tp->snd_cwnd >> 2);
+		limit = tp->snd_cwnd;
+		if (sysctl_tcp_tso_win_divisor)
+			limit /= sysctl_tcp_tso_win_divisor;
+		limit = max(1U, limit);
+		if (factor > limit)
+			factor = limit;
 
 		tp->mss_cache = mss_now * factor;
 

[-- Attachment #5: diff4 --]
[-- Type: application/octet-stream, Size: 12442 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/09/30 20:58:53-07:00 davem@nuts.davemloft.net 
#   [TCP]: Kill tso_{factor,mss}.
#   
#   We can just use skb_shinfo(skb)->tso_{segs,size}
#   directly.  This also allows us to kill the
#   hack zone code in ip_output.c
#   
#   The original impetus for this change was a problem
#   noted by John Heffner: we were not abiding by the
#   MSS of the connection for TCP segmentation, but
#   using the path MTU instead.  This broke various
#   local network setups with TSO enabled and is fixed
#   as a side effect of these changes.
#   
#   Signed-off-by: David S. Miller <davem@davemloft.net>
# 
# net/ipv4/tcp_output.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +30 -28
#   [TCP]: Kill tso_{factor,mss}.
# 
# net/ipv4/tcp_input.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +7 -7
#   [TCP]: Kill tso_{factor,mss}.
# 
# net/ipv4/tcp.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +2 -2
#   [TCP]: Kill tso_{factor,mss}.
# 
# net/ipv4/ip_output.c
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +1 -14
#   [TCP]: Kill tso_{factor,mss}.
# 
# include/net/tcp.h
#   2004/09/30 20:56:45-07:00 davem@nuts.davemloft.net +11 -7
#   [TCP]: Kill tso_{factor,mss}.
# 
diff -Nru a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2004-09-30 21:03:14 -07:00
+++ b/include/net/tcp.h	2004-09-30 21:03:14 -07:00
@@ -1152,8 +1152,6 @@
 
 	__u16		urg_ptr;	/* Valid w/URG flags is set.	*/
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
-	__u16		tso_factor;	/* If > 1, TSO frame		*/
-	__u16		tso_mss;	/* MSS that FACTOR's in terms of*/
 };
 
 #define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))
@@ -1165,7 +1163,13 @@
  */
 static inline int tcp_skb_pcount(struct sk_buff *skb)
 {
-	return TCP_SKB_CB(skb)->tso_factor;
+	return skb_shinfo(skb)->tso_segs;
+}
+
+/* This is valid iff tcp_skb_pcount() > 1. */
+static inline int tcp_skb_psize(struct sk_buff *skb)
+{
+	return skb_shinfo(skb)->tso_size;
 }
 
 static inline void tcp_inc_pcount(tcp_pcount_t *count, struct sk_buff *skb)
@@ -1440,7 +1444,7 @@
 		  tcp_minshall_check(tp))));
 }
 
-extern void tcp_set_skb_tso_factor(struct sk_buff *, unsigned int);
+extern void tcp_set_skb_tso_segs(struct sk_buff *, unsigned int);
 
 /* This checks if the data bearing packet SKB (usually sk->sk_send_head)
  * should be put on the wire right now.
@@ -1448,11 +1452,11 @@
 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
 				   unsigned cur_mss, int nonagle)
 {
-	int pkts = TCP_SKB_CB(skb)->tso_factor;
+	int pkts = tcp_skb_pcount(skb);
 
 	if (!pkts) {
-		tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
-		pkts = TCP_SKB_CB(skb)->tso_factor;
+		tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+		pkts = tcp_skb_pcount(skb);
 	}
 
 	/*	RFC 1122 - section 4.2.3.4
diff -Nru a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
--- a/net/ipv4/ip_output.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/ip_output.c	2004-09-30 21:03:14 -07:00
@@ -305,7 +305,6 @@
 	struct ip_options *opt = inet->opt;
 	struct rtable *rt;
 	struct iphdr *iph;
-	u32 mtu;
 
 	/* Skip all of this if the packet is already routed,
 	 * f.e. by something like SCTP.
@@ -366,21 +365,9 @@
 	skb->nh.iph   = iph;
 	/* Transport layer set skb->h.foo itself. */
 
-	if(opt && opt->optlen) {
+	if (opt && opt->optlen) {
 		iph->ihl += opt->optlen >> 2;
 		ip_options_build(skb, opt, inet->daddr, rt, 0);
-	}
-
-	mtu = dst_pmtu(&rt->u.dst);
-	if (skb->len > mtu && (sk->sk_route_caps & NETIF_F_TSO)) {
-		unsigned int hlen;
-
-		/* Hack zone: all this must be done by TCP. */
-		hlen = ((skb->h.raw - skb->data) + (skb->h.th->doff << 2));
-		skb_shinfo(skb)->tso_size = mtu - hlen;
-		skb_shinfo(skb)->tso_segs =
-			(skb->len - hlen + skb_shinfo(skb)->tso_size - 1)/
-				skb_shinfo(skb)->tso_size - 1;
 	}
 
 	ip_select_ident_more(iph, &rt->u.dst, sk, skb_shinfo(skb)->tso_segs);
diff -Nru a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/tcp.c	2004-09-30 21:03:14 -07:00
@@ -691,7 +691,7 @@
 		skb->ip_summed = CHECKSUM_HW;
 		tp->write_seq += copy;
 		TCP_SKB_CB(skb)->end_seq += copy;
-		TCP_SKB_CB(skb)->tso_factor = 0;
+		skb_shinfo(skb)->tso_segs = 0;
 
 		if (!copied)
 			TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
@@ -938,7 +938,7 @@
 
 			tp->write_seq += copy;
 			TCP_SKB_CB(skb)->end_seq += copy;
-			TCP_SKB_CB(skb)->tso_factor = 0;
+			skb_shinfo(skb)->tso_segs = 0;
 
 			from += copy;
 			copied += copy;
diff -Nru a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/tcp_input.c	2004-09-30 21:03:14 -07:00
@@ -1035,7 +1035,7 @@
 			if(!before(TCP_SKB_CB(skb)->seq, end_seq))
 				break;
 
-			fack_count += TCP_SKB_CB(skb)->tso_factor;
+			fack_count += tcp_skb_pcount(skb);
 
 			in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
 				!before(end_seq, TCP_SKB_CB(skb)->end_seq);
@@ -1224,7 +1224,7 @@
 	tcp_set_pcount(&tp->fackets_out, 0);
 
 	sk_stream_for_retrans_queue(skb, sk) {
-		cnt += TCP_SKB_CB(skb)->tso_factor;;
+		cnt += tcp_skb_pcount(skb);
 		TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
 		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED)) {
 
@@ -1299,7 +1299,7 @@
 		tp->undo_marker = tp->snd_una;
 
 	sk_stream_for_retrans_queue(skb, sk) {
-		cnt += TCP_SKB_CB(skb)->tso_factor;
+		cnt += tcp_skb_pcount(skb);
 		if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS)
 			tp->undo_marker = 0;
 		TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
@@ -1550,7 +1550,7 @@
 	BUG_TRAP(cnt <= tcp_get_pcount(&tp->packets_out));
 
 	sk_stream_for_retrans_queue(skb, sk) {
-		cnt -= TCP_SKB_CB(skb)->tso_factor;
+		cnt -= tcp_skb_pcount(skb);
 		if (cnt < 0 || after(TCP_SKB_CB(skb)->end_seq, high_seq))
 			break;
 		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_TAGBITS)) {
@@ -2369,7 +2369,7 @@
 {
 	struct tcp_opt *tp = tcp_sk(sk);
 	struct tcp_skb_cb *scb = TCP_SKB_CB(skb); 
-	__u32 mss = scb->tso_mss;
+	__u32 mss = tcp_skb_psize(skb);
 	__u32 snd_una = tp->snd_una;
 	__u32 orig_seq, seq;
 	__u32 packets_acked = 0;
@@ -2423,7 +2423,7 @@
 		}
 		tcp_dec_pcount_explicit(&tp->packets_out, packets_acked);
 
-		BUG_ON(scb->tso_factor == 0);
+		BUG_ON(tcp_skb_pcount(skb) == 0);
 		BUG_ON(!before(scb->seq, scb->end_seq));
 	}
 
@@ -2450,7 +2450,7 @@
 		 * the other end.
 		 */
 		if (after(scb->end_seq, tp->snd_una)) {
-			if (scb->tso_factor > 1)
+			if (tcp_skb_pcount(skb) > 1)
 				acked |= tcp_tso_acked(sk, skb,
 						       now, &seq_rtt);
 			break;
diff -Nru a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c	2004-09-30 21:03:14 -07:00
+++ b/net/ipv4/tcp_output.c	2004-09-30 21:03:14 -07:00
@@ -274,7 +274,7 @@
 		int sysctl_flags;
 		int err;
 
-		BUG_ON(!TCP_SKB_CB(skb)->tso_factor);
+		BUG_ON(!tcp_skb_pcount(skb));
 
 #define SYSCTL_FLAG_TSTAMPS	0x1
 #define SYSCTL_FLAG_WSCALE	0x2
@@ -428,21 +428,22 @@
 	}
 }
 
-void tcp_set_skb_tso_factor(struct sk_buff *skb, unsigned int mss_std)
+void tcp_set_skb_tso_segs(struct sk_buff *skb, unsigned int mss_std)
 {
 	if (skb->len <= mss_std) {
 		/* Avoid the costly divide in the normal
 		 * non-TSO case.
 		 */
-		TCP_SKB_CB(skb)->tso_factor = 1;
+		skb_shinfo(skb)->tso_segs = 1;
+		skb_shinfo(skb)->tso_size = 0;
 	} else {
 		unsigned int factor;
 
 		factor = skb->len + (mss_std - 1);
 		factor /= mss_std;
-		TCP_SKB_CB(skb)->tso_factor = factor;
+		skb_shinfo(skb)->tso_segs = factor;
+		skb_shinfo(skb)->tso_size = mss_std;
 	}
-	TCP_SKB_CB(skb)->tso_mss = mss_std;
 }
 
 /* Function to create two new TCP segments.  Shrinks the given segment
@@ -508,8 +509,8 @@
 	}
 
 	/* Fix up tso_factor for both original and new SKB.  */
-	tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
-	tcp_set_skb_tso_factor(buff, tp->mss_cache_std);
+	tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
+	tcp_set_skb_tso_segs(buff, tp->mss_cache_std);
 
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) {
 		tcp_inc_pcount(&tp->lost_out, skb);
@@ -585,7 +586,7 @@
 	/* Any change of skb->len requires recalculation of tso
 	 * factor and mss.
 	 */
-	tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
+	tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
 
 	return 0;
 }
@@ -914,8 +915,8 @@
 		    ((skb_size + next_skb_size) > mss_now))
 			return;
 
-		BUG_ON(TCP_SKB_CB(skb)->tso_factor != 1 ||
-		       TCP_SKB_CB(next_skb)->tso_factor != 1);
+		BUG_ON(tcp_skb_pcount(skb) != 1 ||
+		       tcp_skb_pcount(next_skb) != 1);
 
 		/* Ok.  We will be able to collapse the packet. */
 		__skb_unlink(next_skb, next_skb->list);
@@ -1047,14 +1048,14 @@
 		return -EAGAIN;
 
 	if (skb->len > cur_mss) {
-		int old_factor = TCP_SKB_CB(skb)->tso_factor;
+		int old_factor = tcp_skb_pcount(skb);
 		int new_factor;
 
 		if (tcp_fragment(sk, skb, cur_mss))
 			return -ENOMEM; /* We'll try again later. */
 
 		/* New SKB created, account for it. */
-		new_factor = TCP_SKB_CB(skb)->tso_factor;
+		new_factor = tcp_skb_pcount(skb);
 		tcp_dec_pcount_explicit(&tp->packets_out,
 					old_factor - new_factor);
 		tcp_inc_pcount(&tp->packets_out, skb->next);
@@ -1081,7 +1082,8 @@
 	   tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) {
 		if (!pskb_trim(skb, 0)) {
 			TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1;
-			TCP_SKB_CB(skb)->tso_factor = 1;
+			skb_shinfo(skb)->tso_segs = 1;
+			skb_shinfo(skb)->tso_size = 0;
 			skb->ip_summed = CHECKSUM_NONE;
 			skb->csum = 0;
 		}
@@ -1166,7 +1168,7 @@
 						tcp_reset_xmit_timer(sk, TCP_TIME_RETRANS, tp->rto);
 				}
 
-				packet_cnt -= TCP_SKB_CB(skb)->tso_factor;
+				packet_cnt -= tcp_skb_pcount(skb);
 				if (packet_cnt <= 0)
 					break;
 			}
@@ -1256,8 +1258,8 @@
 		skb->csum = 0;
 		TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN);
 		TCP_SKB_CB(skb)->sacked = 0;
-		TCP_SKB_CB(skb)->tso_factor = 1;
-		TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+		skb_shinfo(skb)->tso_segs = 1;
+		skb_shinfo(skb)->tso_size = 0;
 
 		/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
 		TCP_SKB_CB(skb)->seq = tp->write_seq;
@@ -1289,8 +1291,8 @@
 	skb->csum = 0;
 	TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST);
 	TCP_SKB_CB(skb)->sacked = 0;
-	TCP_SKB_CB(skb)->tso_factor = 1;
-	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(skb)->tso_segs = 1;
+	skb_shinfo(skb)->tso_size = 0;
 
 	/* Send it off. */
 	TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk, tp);
@@ -1371,8 +1373,8 @@
 	TCP_SKB_CB(skb)->seq = req->snt_isn;
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1;
 	TCP_SKB_CB(skb)->sacked = 0;
-	TCP_SKB_CB(skb)->tso_factor = 1;
-	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(skb)->tso_segs = 1;
+	skb_shinfo(skb)->tso_size = 0;
 	th->seq = htonl(TCP_SKB_CB(skb)->seq);
 	th->ack_seq = htonl(req->rcv_isn + 1);
 	if (req->rcv_wnd == 0) { /* ignored for retransmitted syns */
@@ -1474,8 +1476,8 @@
 	TCP_SKB_CB(buff)->flags = TCPCB_FLAG_SYN;
 	TCP_ECN_send_syn(sk, tp, buff);
 	TCP_SKB_CB(buff)->sacked = 0;
-	TCP_SKB_CB(buff)->tso_factor = 1;
-	TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(buff)->tso_segs = 1;
+	skb_shinfo(buff)->tso_size = 0;
 	buff->csum = 0;
 	TCP_SKB_CB(buff)->seq = tp->write_seq++;
 	TCP_SKB_CB(buff)->end_seq = tp->write_seq;
@@ -1575,8 +1577,8 @@
 		buff->csum = 0;
 		TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK;
 		TCP_SKB_CB(buff)->sacked = 0;
-		TCP_SKB_CB(buff)->tso_factor = 1;
-		TCP_SKB_CB(buff)->tso_mss = tp->mss_cache_std;
+		skb_shinfo(buff)->tso_segs = 1;
+		skb_shinfo(buff)->tso_size = 0;
 
 		/* Send it off, this clears delayed acks for us. */
 		TCP_SKB_CB(buff)->seq = TCP_SKB_CB(buff)->end_seq = tcp_acceptable_seq(sk, tp);
@@ -1611,8 +1613,8 @@
 	skb->csum = 0;
 	TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(skb)->sacked = urgent;
-	TCP_SKB_CB(skb)->tso_factor = 1;
-	TCP_SKB_CB(skb)->tso_mss = tp->mss_cache_std;
+	skb_shinfo(skb)->tso_segs = 1;
+	skb_shinfo(skb)->tso_size = 0;
 
 	/* Use a previous sequence.  This should cause the other
 	 * end to send an ack.  Don't queue or clone SKB, just
@@ -1656,8 +1658,8 @@
 					sk->sk_route_caps &= ~NETIF_F_TSO;
 					tp->mss_cache = tp->mss_cache_std;
 				}
-			} else if (!TCP_SKB_CB(skb)->tso_factor)
-				tcp_set_skb_tso_factor(skb, tp->mss_cache_std);
+			} else if (!tcp_skb_pcount(skb))
+				tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
 
 			TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
 			TCP_SKB_CB(skb)->when = tcp_time_stamp;

