* [PATCH 0/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:08 UTC
To: netdev; +Cc: herbert, jheffner
Some folks, notably the S2IO guys, get performance degradation
from the Super TSO v2 patch (they get it from the first version
as well). It's a real pain to spot what causes such things
in such a huge patch... so I started splitting things up in
a very fine-grained manner so we can catch regressions more
precisely.
There are several bugs spotted by this first set of 9 patches,
and I'd really appreciate good high-quality testing reports.
Please do not mail such reports privately to me, as some have
done, always include netdev@oss.sgi.com, thanks a lot.
Herbert, I'm CC:'ing you because one of the bugs fixed here
has to do with the TSO header COW'ing stuff you did. You
missed one case where a skb_header_release() call was needed,
namely tcp_fragment(), where it does its __skb_append().
John, I'm CC:'ing you because there are several cwnd handling
related cures in here. I did _not_ fix the TSO cwnd growth
bug yet in these patches, but it is at the very top of my
TODO list for my next batch of work on this stuff. The most
notable fix here is the bogus extra cwnd validation done by
__tcp_push_pending_frames(). That validation should only
occur if we _do_ send some packets, and tcp_write_xmit() takes
care of that just fine. The other one is that the 'nonagle'
argument to __tcp_push_pending_frames() is clobbered by its
tcp_skb_is_last() logic, causing TCP_NAGLE_PUSH to be used for
all packets processed by tcp_write_xmit(), whoops...
Please help me review this stuff, thanks.
The patches will show up as followups to this email.
* [PATCH 1/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:16 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.
The ideal layout for an SKB when doing
scatter-gather is to put all the headers at skb->data, and
all the user data in the page array.
This makes SKB splitting and combining extremely simple,
especially before a packet goes onto the wire the first
time.
So, when sk_stream_alloc_pskb() is given a zero size, make
sure there is no skb_tailroom(). This is achieved by applying
SKB_DATA_ALIGN() to the header length used here.
Next, make select_size() in TCP output segmentation use a
length of zero when NETIF_F_SG is true on the outgoing
interface.
Signed-off-by: David S. Miller <davem@davemloft.net>
28f78ef8dcc90a2a26499dab76678bd6813d7793 (from 3f5948fa2cbbda1261eec9a39ef3004b3caf73fb)
diff --git a/include/net/sock.h b/include/net/sock.h
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1130,13 +1130,16 @@ static inline void sk_stream_moderate_sn
static inline struct sk_buff *sk_stream_alloc_pskb(struct sock *sk,
int size, int mem, int gfp)
{
- struct sk_buff *skb = alloc_skb(size + sk->sk_prot->max_header, gfp);
+ struct sk_buff *skb;
+ int hdr_len;
+ hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
+ skb = alloc_skb(size + hdr_len, gfp);
if (skb) {
skb->truesize += mem;
if (sk->sk_forward_alloc >= (int)skb->truesize ||
sk_stream_mem_schedule(sk, skb->truesize, 0)) {
- skb_reserve(skb, sk->sk_prot->max_header);
+ skb_reserve(skb, hdr_len);
return skb;
}
__kfree_skb(skb);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -775,13 +775,9 @@ static inline int select_size(struct soc
{
int tmp = tp->mss_cache_std;
- if (sk->sk_route_caps & NETIF_F_SG) {
- int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
+ if (sk->sk_route_caps & NETIF_F_SG)
+ tmp = 0;
- if (tmp >= pgbreak &&
- tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
- tmp = pgbreak;
- }
return tmp;
}
@@ -891,11 +887,6 @@ new_segment:
tcp_mark_push(tp, skb);
goto new_segment;
} else if (page) {
- /* If page is cached, align
- * offset to L1 cache boundary
- */
- off = (off + L1_CACHE_BYTES - 1) &
- ~(L1_CACHE_BYTES - 1);
if (off == PAGE_SIZE) {
put_page(page);
TCP_PAGE(sk) = page = NULL;
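To see why the alignment matters: alloc_skb() itself rounds the data
area up with SKB_DATA_ALIGN(), so reserving the unaligned max_header
would leave stray tailroom behind.  The arithmetic, sketched with
made-up example values (SMP_CACHE_BYTES == 32, max_header == 120;
this is illustration, not code from the patch):

#define SKB_DATA_ALIGN(x) (((x) + (SMP_CACHE_BYTES - 1)) & \
			   ~(SMP_CACHE_BYTES - 1))

/* Old code, size == 0:
 *	alloc_skb(0 + 120)	-> data area rounded up to 128 bytes
 *	skb_reserve(skb, 120)	-> skb_tailroom() == 8, so small user
 *				   writes can still land in the linear area
 *
 * New code, size == 0:
 *	hdr_len = SKB_DATA_ALIGN(120) == 128
 *	alloc_skb(0 + 128)	-> 128-byte data area
 *	skb_reserve(skb, 128)	-> skb_tailroom() == 0, all user data
 *				   is forced into the page frags
 */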
* [PATCH 2/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:17 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Fix quick-ack decrementing with TSO.
On each packet output, we call tcp_dec_quickack_mode()
if the ACK flag is set. It drops tp->ack.quick until
it hits zero, at which time we deflate the ATO value.
When doing TSO, we are emitting multiple packets with
ACK set, so we should decrement tp->ack.quick by that many
segments.
Note that, unlike this case, tcp_enter_cwr() should not take
tcp_skb_pcount(skb) into consideration: that function makes a
one-time adjustment of tp->snd_cwnd as it moves the connection
into the TCP_CA_CWR state.
Signed-off-by: David S. Miller <davem@davemloft.net>
00cb08b2ec091f4b461210026392edeaccf31d9c (from 28f78ef8dcc90a2a26499dab76678bd6813d7793)
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -817,11 +817,16 @@ static inline int tcp_ack_scheduled(stru
return tp->ack.pending&TCP_ACK_SCHED;
}
-static __inline__ void tcp_dec_quickack_mode(struct tcp_sock *tp)
+static __inline__ void tcp_dec_quickack_mode(struct tcp_sock *tp, unsigned int pkts)
{
- if (tp->ack.quick && --tp->ack.quick == 0) {
- /* Leaving quickack mode we deflate ATO. */
- tp->ack.ato = TCP_ATO_MIN;
+ if (tp->ack.quick) {
+ if (pkts >= tp->ack.quick) {
+ tp->ack.quick = 0;
+
+ /* Leaving quickack mode we deflate ATO. */
+ tp->ack.ato = TCP_ATO_MIN;
+ } else
+ tp->ack.quick -= pkts;
}
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -141,11 +141,11 @@ static inline void tcp_event_data_sent(s
tp->ack.pingpong = 1;
}
-static __inline__ void tcp_event_ack_sent(struct sock *sk)
+static __inline__ void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
{
struct tcp_sock *tp = tcp_sk(sk);
- tcp_dec_quickack_mode(tp);
+ tcp_dec_quickack_mode(tp, pkts);
tcp_clear_xmit_timer(sk, TCP_TIME_DACK);
}
@@ -361,7 +361,7 @@ static int tcp_transmit_skb(struct sock
tp->af_specific->send_check(sk, th, skb->len, skb);
if (tcb->flags & TCPCB_FLAG_ACK)
- tcp_event_ack_sent(sk);
+ tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
if (skb->len != tcp_header_size)
tcp_event_data_sent(tp, skb, sk);
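As a sanity check of the new decrement arithmetic, here is a tiny
standalone harness (my sketch, not part of the patch; TCP_ATO_MIN is
HZ/25 in the real tree, i.e. 40 assuming HZ == 1000):

#include <stdio.h>

#define TCP_ATO_MIN 40	/* HZ/25, assuming HZ == 1000 */

struct ack_state { unsigned int quick, ato; };

static void dec_quickack_mode(struct ack_state *a, unsigned int pkts)
{
	if (a->quick) {
		if (pkts >= a->quick) {
			a->quick = 0;
			/* Leaving quickack mode we deflate ATO. */
			a->ato = TCP_ATO_MIN;
		} else
			a->quick -= pkts;
	}
}

int main(void)
{
	struct ack_state a = { .quick = 4, .ato = 200 };

	dec_quickack_mode(&a, 3);	/* one 3-segment TSO frame */
	printf("quick=%u ato=%u\n", a.quick, a.ato);	/* quick=1 ato=200 */
	dec_quickack_mode(&a, 3);	/* another one */
	printf("quick=%u ato=%u\n", a.quick, a.ato);	/* quick=0 ato=40 */
	return 0;
}

The old code only decremented quick by one per transmit call, leaving
the socket in quickack mode far longer than intended when each frame
covers several segments.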
* [PATCH 3/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:17 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Move send test logic out of net/tcp.h
This just moves the code into tcp_output.c; no code logic
changes are made by this patch.
Using this as a baseline, we can begin to untangle the mess of
comparisons for the Nagle test et al. We will also be able to reduce
all of the redundant computation that occurs when outputting data
packets.
Signed-off-by: David S. Miller <davem@davemloft.net>
cba5d690f46699d37df7dc087247d1f7c7155692 (from 00cb08b2ec091f4b461210026392edeaccf31d9c)
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -945,6 +945,9 @@ extern __u32 cookie_v4_init_sequence(str
/* tcp_output.c */
extern int tcp_write_xmit(struct sock *, int nonagle);
+extern void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
+ unsigned cur_mss, int nonagle);
+extern int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp);
extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
extern void tcp_xmit_retransmit_queue(struct sock *);
extern void tcp_simple_retransmit(struct sock *);
@@ -1389,12 +1392,6 @@ static __inline__ __u32 tcp_max_burst(co
return 3;
}
-static __inline__ int tcp_minshall_check(const struct tcp_sock *tp)
-{
- return after(tp->snd_sml,tp->snd_una) &&
- !after(tp->snd_sml, tp->snd_nxt);
-}
-
static __inline__ void tcp_minshall_update(struct tcp_sock *tp, int mss,
const struct sk_buff *skb)
{
@@ -1402,122 +1399,18 @@ static __inline__ void tcp_minshall_upda
tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
}
-/* Return 0, if packet can be sent now without violation Nagle's rules:
- 1. It is full sized.
- 2. Or it contains FIN.
- 3. Or TCP_NODELAY was set.
- 4. Or TCP_CORK is not set, and all sent packets are ACKed.
- With Minshall's modification: all sent small packets are ACKed.
- */
-
-static __inline__ int
-tcp_nagle_check(const struct tcp_sock *tp, const struct sk_buff *skb,
- unsigned mss_now, int nonagle)
-{
- return (skb->len < mss_now &&
- !(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
- ((nonagle&TCP_NAGLE_CORK) ||
- (!nonagle &&
- tp->packets_out &&
- tcp_minshall_check(tp))));
-}
-
-extern void tcp_set_skb_tso_segs(struct sock *, struct sk_buff *);
-
-/* This checks if the data bearing packet SKB (usually sk->sk_send_head)
- * should be put on the wire right now.
- */
-static __inline__ int tcp_snd_test(struct sock *sk,
- struct sk_buff *skb,
- unsigned cur_mss, int nonagle)
-{
- struct tcp_sock *tp = tcp_sk(sk);
- int pkts = tcp_skb_pcount(skb);
-
- if (!pkts) {
- tcp_set_skb_tso_segs(sk, skb);
- pkts = tcp_skb_pcount(skb);
- }
-
- /* RFC 1122 - section 4.2.3.4
- *
- * We must queue if
- *
- * a) The right edge of this frame exceeds the window
- * b) There are packets in flight and we have a small segment
- * [SWS avoidance and Nagle algorithm]
- * (part of SWS is done on packetization)
- * Minshall version sounds: there are no _small_
- * segments in flight. (tcp_nagle_check)
- * c) We have too many packets 'in flight'
- *
- * Don't use the nagle rule for urgent data (or
- * for the final FIN -DaveM).
- *
- * Also, Nagle rule does not apply to frames, which
- * sit in the middle of queue (they have no chances
- * to get new data) and if room at tail of skb is
- * not enough to save something seriously (<32 for now).
- */
-
- /* Don't be strict about the congestion window for the
- * final FIN frame. -DaveM
- */
- return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode
- || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
- (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) ||
- (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
- !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
-}
-
static __inline__ void tcp_check_probe_timer(struct sock *sk, struct tcp_sock *tp)
{
if (!tp->packets_out && !tp->pending)
tcp_reset_xmit_timer(sk, TCP_TIME_PROBE0, tp->rto);
}
-static __inline__ int tcp_skb_is_last(const struct sock *sk,
- const struct sk_buff *skb)
-{
- return skb->next == (struct sk_buff *)&sk->sk_write_queue;
-}
-
-/* Push out any pending frames which were held back due to
- * TCP_CORK or attempt at coalescing tiny packets.
- * The socket must be locked by the caller.
- */
-static __inline__ void __tcp_push_pending_frames(struct sock *sk,
- struct tcp_sock *tp,
- unsigned cur_mss,
- int nonagle)
-{
- struct sk_buff *skb = sk->sk_send_head;
-
- if (skb) {
- if (!tcp_skb_is_last(sk, skb))
- nonagle = TCP_NAGLE_PUSH;
- if (!tcp_snd_test(sk, skb, cur_mss, nonagle) ||
- tcp_write_xmit(sk, nonagle))
- tcp_check_probe_timer(sk, tp);
- }
- tcp_cwnd_validate(sk, tp);
-}
-
static __inline__ void tcp_push_pending_frames(struct sock *sk,
struct tcp_sock *tp)
{
__tcp_push_pending_frames(sk, tp, tcp_current_mss(sk, 1), tp->nonagle);
}
-static __inline__ int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp)
-{
- struct sk_buff *skb = sk->sk_send_head;
-
- return (skb &&
- tcp_snd_test(sk, skb, tcp_current_mss(sk, 1),
- tcp_skb_is_last(sk, skb) ? TCP_NAGLE_PUSH : tp->nonagle));
-}
-
static __inline__ void tcp_init_wl(struct tcp_sock *tp, u32 ack, u32 seq)
{
tp->snd_wl1 = seq;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -419,6 +419,135 @@ static inline void tcp_tso_set_push(stru
TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH;
}
+static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (skb->len <= tp->mss_cache_std ||
+ !(sk->sk_route_caps & NETIF_F_TSO)) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+ skb_shinfo(skb)->tso_segs = 1;
+ skb_shinfo(skb)->tso_size = 0;
+ } else {
+ unsigned int factor;
+
+ factor = skb->len + (tp->mss_cache_std - 1);
+ factor /= tp->mss_cache_std;
+ skb_shinfo(skb)->tso_segs = factor;
+ skb_shinfo(skb)->tso_size = tp->mss_cache_std;
+ }
+}
+
+static inline int tcp_minshall_check(const struct tcp_sock *tp)
+{
+ return after(tp->snd_sml,tp->snd_una) &&
+ !after(tp->snd_sml, tp->snd_nxt);
+}
+
+/* Return 0, if packet can be sent now without violation Nagle's rules:
+ * 1. It is full sized.
+ * 2. Or it contains FIN.
+ * 3. Or TCP_NODELAY was set.
+ * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
+ * With Minshall's modification: all sent small packets are ACKed.
+ */
+
+static inline int tcp_nagle_check(const struct tcp_sock *tp,
+ const struct sk_buff *skb,
+ unsigned mss_now, int nonagle)
+{
+ return (skb->len < mss_now &&
+ !(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
+ ((nonagle&TCP_NAGLE_CORK) ||
+ (!nonagle &&
+ tp->packets_out &&
+ tcp_minshall_check(tp))));
+}
+
+/* This checks if the data bearing packet SKB (usually sk->sk_send_head)
+ * should be put on the wire right now.
+ */
+static int tcp_snd_test(struct sock *sk, struct sk_buff *skb,
+ unsigned cur_mss, int nonagle)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ int pkts = tcp_skb_pcount(skb);
+
+ if (!pkts) {
+ tcp_set_skb_tso_segs(sk, skb);
+ pkts = tcp_skb_pcount(skb);
+ }
+
+ /* RFC 1122 - section 4.2.3.4
+ *
+ * We must queue if
+ *
+ * a) The right edge of this frame exceeds the window
+ * b) There are packets in flight and we have a small segment
+ * [SWS avoidance and Nagle algorithm]
+ * (part of SWS is done on packetization)
+ * Minshall version sounds: there are no _small_
+ * segments in flight. (tcp_nagle_check)
+ * c) We have too many packets 'in flight'
+ *
+ * Don't use the nagle rule for urgent data (or
+ * for the final FIN -DaveM).
+ *
+ * Also, Nagle rule does not apply to frames, which
+ * sit in the middle of queue (they have no chances
+ * to get new data) and if room at tail of skb is
+ * not enough to save something seriously (<32 for now).
+ */
+
+ /* Don't be strict about the congestion window for the
+ * final FIN frame. -DaveM
+ */
+ return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode
+ || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
+ (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) ||
+ (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
+ !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
+}
+
+static inline int tcp_skb_is_last(const struct sock *sk,
+ const struct sk_buff *skb)
+{
+ return skb->next == (struct sk_buff *)&sk->sk_write_queue;
+}
+
+/* Push out any pending frames which were held back due to
+ * TCP_CORK or attempt at coalescing tiny packets.
+ * The socket must be locked by the caller.
+ */
+void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
+ unsigned cur_mss, int nonagle)
+{
+ struct sk_buff *skb = sk->sk_send_head;
+
+ if (skb) {
+ if (!tcp_skb_is_last(sk, skb))
+ nonagle = TCP_NAGLE_PUSH;
+ if (!tcp_snd_test(sk, skb, cur_mss, nonagle) ||
+ tcp_write_xmit(sk, nonagle))
+ tcp_check_probe_timer(sk, tp);
+ }
+ tcp_cwnd_validate(sk, tp);
+}
+
+int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp)
+{
+ struct sk_buff *skb = sk->sk_send_head;
+
+ return (skb &&
+ tcp_snd_test(sk, skb, tcp_current_mss(sk, 1),
+ (tcp_skb_is_last(sk, skb) ?
+ TCP_NAGLE_PUSH :
+ tp->nonagle)));
+}
+
+
/* Send _single_ skb sitting at the send head. This function requires
* true push pending frames to setup probe timer etc.
*/
@@ -440,27 +569,6 @@ void tcp_push_one(struct sock *sk, unsig
}
}
-void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (skb->len <= tp->mss_cache_std ||
- !(sk->sk_route_caps & NETIF_F_TSO)) {
- /* Avoid the costly divide in the normal
- * non-TSO case.
- */
- skb_shinfo(skb)->tso_segs = 1;
- skb_shinfo(skb)->tso_size = 0;
- } else {
- unsigned int factor;
-
- factor = skb->len + (tp->mss_cache_std - 1);
- factor /= tp->mss_cache_std;
- skb_shinfo(skb)->tso_segs = factor;
- skb_shinfo(skb)->tso_size = tp->mss_cache_std;
- }
-}
-
/* Function to create two new TCP segments. Shrinks the given segment
* to the specified size and appends a new segment with the rest of the
* packet to the list. This won't be called frequently, I hope.
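For reference, the segment-count math in tcp_set_skb_tso_segs() above
is a plain round-up division.  Worked through with example numbers
(mine, not from the patch):

/* skb->len == 4000, tp->mss_cache_std == 1460:
 *
 *	factor = (4000 + 1460 - 1) / 1460 == 3
 *
 * so tso_segs == 3 and tso_size == 1460: the NIC emits two full
 * 1460-byte segments plus a 1080-byte tail.  A 1400-byte skb takes
 * the cheap branch instead: tso_segs == 1, tso_size == 0, and the
 * divide is never executed.
 */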
* [PATCH 4/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:18 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Move __tcp_data_snd_check into tcp_output.c
It reimplements portions of tcp_snd_test(), so if we move it
to tcp_output.c we can consolidate its logic much more easily
in a later change.
Signed-off-by: David S. Miller <davem@davemloft.net>
bdbf09522de5be3ada129dceaa3ad9da9be078bc (from cba5d690f46699d37df7dc087247d1f7c7155692)
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -945,6 +945,7 @@ extern __u32 cookie_v4_init_sequence(str
/* tcp_output.c */
extern int tcp_write_xmit(struct sock *, int nonagle);
+extern void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb);
extern void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
unsigned cur_mss, int nonagle);
extern int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3975,16 +3975,6 @@ static inline void tcp_check_space(struc
}
}
-static void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
- tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
- tcp_write_xmit(sk, tp->nonagle))
- tcp_check_probe_timer(sk, tp);
-}
-
static __inline__ void tcp_data_snd_check(struct sock *sk)
{
struct sk_buff *skb = sk->sk_send_head;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -536,6 +536,16 @@ void __tcp_push_pending_frames(struct so
tcp_cwnd_validate(sk, tp);
}
+void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
+ tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
+ tcp_write_xmit(sk, tp->nonagle))
+ tcp_check_probe_timer(sk, tp);
+}
+
int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp)
{
struct sk_buff *skb = sk->sk_send_head;
* [PATCH 5/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:19 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Add missing skb_header_release() call to tcp_fragment().
When we add any new packet to the TCP socket write queue,
we must call skb_header_release() on it in order for the
TSO sharing checks in the drivers to work.
Signed-off-by: David S. Miller <davem@davemloft.net>
79eb6b25499ed5470cb7b20428c435288fcb3502 (from bdbf09522de5be3ada129dceaa3ad9da9be078bc)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -660,6 +660,7 @@ static int tcp_fragment(struct sock *sk,
}
/* Link BUFF into the send queue. */
+ skb_header_release(buff);
__skb_append(skb, buff);
return 0;
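The rule being enforced here, spelled out as a hypothetical helper
(my sketch; the tree open-codes this at each queueing site):

static inline void tcp_queue_skb_tail(struct sock *sk, struct sk_buff *skb)
{
	/* Mark the header as not shared, so clones taken at
	 * transmit time pass the header-sharing checks without
	 * forcing a copy-on-write of the header area.
	 */
	skb_header_release(skb);
	__skb_queue_tail(&sk->sk_write_queue, skb);
}

tcp_fragment() inserts into the middle of the queue with
__skb_append() rather than at the tail, but the skb_header_release()
requirement is the same.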
* [PATCH 6/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:20 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Kill extra cwnd validate in __tcp_push_pending_frames().
The tcp_cwnd_validate() function should only be invoked
if we actually send some frames, yet __tcp_push_pending_frames()
will always invoke it. tcp_write_xmit() does the call for us,
so the call here can simply be removed.
Also, tcp_write_xmit() can be marked static.
Signed-off-by: David S. Miller <davem@davemloft.net>
ae083bd3447865cbaf0996a69ba03807fd9fce01 (from 79eb6b25499ed5470cb7b20428c435288fcb3502)
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -944,7 +944,6 @@ extern __u32 cookie_v4_init_sequence(str
/* tcp_output.c */
-extern int tcp_write_xmit(struct sock *, int nonagle);
extern void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb);
extern void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
unsigned cur_mss, int nonagle);
@@ -964,6 +963,9 @@ extern void tcp_push_one(struct sock *,
extern void tcp_send_ack(struct sock *sk);
extern void tcp_send_delayed_ack(struct sock *sk);
+/* tcp_input.c */
+extern void tcp_cwnd_application_limited(struct sock *sk);
+
/* tcp_timer.c */
extern void tcp_init_xmit_timers(struct sock *);
extern void tcp_clear_xmit_timers(struct sock *);
@@ -1339,28 +1341,6 @@ static inline void tcp_sync_left_out(str
tp->left_out = tp->sacked_out + tp->lost_out;
}
-extern void tcp_cwnd_application_limited(struct sock *sk);
-
-/* Congestion window validation. (RFC2861) */
-
-static inline void tcp_cwnd_validate(struct sock *sk, struct tcp_sock *tp)
-{
- __u32 packets_out = tp->packets_out;
-
- if (packets_out >= tp->snd_cwnd) {
- /* Network is feed fully. */
- tp->snd_cwnd_used = 0;
- tp->snd_cwnd_stamp = tcp_time_stamp;
- } else {
- /* Network starves. */
- if (tp->packets_out > tp->snd_cwnd_used)
- tp->snd_cwnd_used = tp->packets_out;
-
- if ((s32)(tcp_time_stamp - tp->snd_cwnd_stamp) >= tp->rto)
- tcp_cwnd_application_limited(sk);
- }
-}
-
/* Set slow start threshould and cwnd not falling to slow start */
static inline void __tcp_enter_cwr(struct tcp_sock *tp)
{
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -517,35 +517,6 @@ static inline int tcp_skb_is_last(const
return skb->next == (struct sk_buff *)&sk->sk_write_queue;
}
-/* Push out any pending frames which were held back due to
- * TCP_CORK or attempt at coalescing tiny packets.
- * The socket must be locked by the caller.
- */
-void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
- unsigned cur_mss, int nonagle)
-{
- struct sk_buff *skb = sk->sk_send_head;
-
- if (skb) {
- if (!tcp_skb_is_last(sk, skb))
- nonagle = TCP_NAGLE_PUSH;
- if (!tcp_snd_test(sk, skb, cur_mss, nonagle) ||
- tcp_write_xmit(sk, nonagle))
- tcp_check_probe_timer(sk, tp);
- }
- tcp_cwnd_validate(sk, tp);
-}
-
-void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
- tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
- tcp_write_xmit(sk, tp->nonagle))
- tcp_check_probe_timer(sk, tp);
-}
-
int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp)
{
struct sk_buff *skb = sk->sk_send_head;
@@ -846,6 +817,26 @@ unsigned int tcp_current_mss(struct sock
return mss_now;
}
+/* Congestion window validation. (RFC2861) */
+
+static inline void tcp_cwnd_validate(struct sock *sk, struct tcp_sock *tp)
+{
+ __u32 packets_out = tp->packets_out;
+
+ if (packets_out >= tp->snd_cwnd) {
+ /* Network is feed fully. */
+ tp->snd_cwnd_used = 0;
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ } else {
+ /* Network starves. */
+ if (tp->packets_out > tp->snd_cwnd_used)
+ tp->snd_cwnd_used = tp->packets_out;
+
+ if ((s32)(tcp_time_stamp - tp->snd_cwnd_stamp) >= tp->rto)
+ tcp_cwnd_application_limited(sk);
+ }
+}
+
/* This routine writes packets to the network. It advances the
* send_head. This happens as incoming acks open up the remote
* window for us.
@@ -853,7 +844,7 @@ unsigned int tcp_current_mss(struct sock
* Returns 1, if no segments are in flight and we have queued segments, but
* cannot send anything now because of SWS or another problem.
*/
-int tcp_write_xmit(struct sock *sk, int nonagle)
+static int tcp_write_xmit(struct sock *sk, int nonagle)
{
struct tcp_sock *tp = tcp_sk(sk);
unsigned int mss_now;
@@ -906,6 +897,34 @@ int tcp_write_xmit(struct sock *sk, int
return 0;
}
+/* Push out any pending frames which were held back due to
+ * TCP_CORK or attempt at coalescing tiny packets.
+ * The socket must be locked by the caller.
+ */
+void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
+ unsigned cur_mss, int nonagle)
+{
+ struct sk_buff *skb = sk->sk_send_head;
+
+ if (skb) {
+ if (!tcp_skb_is_last(sk, skb))
+ nonagle = TCP_NAGLE_PUSH;
+ if (!tcp_snd_test(sk, skb, cur_mss, nonagle) ||
+ tcp_write_xmit(sk, nonagle))
+ tcp_check_probe_timer(sk, tp);
+ }
+}
+
+void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
+ tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
+ tcp_write_xmit(sk, tp->nonagle))
+ tcp_check_probe_timer(sk, tp);
+}
+
/* This function returns the amount that we can raise the
* usable window based on the following constraints
*
* [PATCH 7/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:21 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: tcp_write_xmit() tabbing cleanup
Put the main basic block of work at the top-level of
tabbing, and mark the TCP_CLOSE test with unlikely().
Signed-off-by: David S. Miller <davem@davemloft.net>
b8d892e4dc753d796e80da6e17f2a88aede0695e (from ae083bd3447865cbaf0996a69ba03807fd9fce01)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -847,54 +847,54 @@ static inline void tcp_cwnd_validate(str
static int tcp_write_xmit(struct sock *sk, int nonagle)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
unsigned int mss_now;
+ int sent_pkts;
/* If we are closed, the bytes will have to remain here.
* In time closedown will finish, we empty the write queue and all
* will be happy.
*/
- if (sk->sk_state != TCP_CLOSE) {
- struct sk_buff *skb;
- int sent_pkts = 0;
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return 0;
- /* Account for SACKS, we may need to fragment due to this.
- * It is just like the real MSS changing on us midstream.
- * We also handle things correctly when the user adds some
- * IP options mid-stream. Silly to do, but cover it.
- */
- mss_now = tcp_current_mss(sk, 1);
- while ((skb = sk->sk_send_head) &&
- tcp_snd_test(sk, skb, mss_now,
- tcp_skb_is_last(sk, skb) ? nonagle :
- TCP_NAGLE_PUSH)) {
- if (skb->len > mss_now) {
- if (tcp_fragment(sk, skb, mss_now))
- break;
- }
-
- TCP_SKB_CB(skb)->when = tcp_time_stamp;
- tcp_tso_set_push(skb);
- if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
+ /* Account for SACKS, we may need to fragment due to this.
+ * It is just like the real MSS changing on us midstream.
+ * We also handle things correctly when the user adds some
+ * IP options mid-stream. Silly to do, but cover it.
+ */
+ mss_now = tcp_current_mss(sk, 1);
+ sent_pkts = 0;
+ while ((skb = sk->sk_send_head) &&
+ tcp_snd_test(sk, skb, mss_now,
+ tcp_skb_is_last(sk, skb) ? nonagle :
+ TCP_NAGLE_PUSH)) {
+ if (skb->len > mss_now) {
+ if (tcp_fragment(sk, skb, mss_now))
break;
+ }
- /* Advance the send_head. This one is sent out.
- * This call will increment packets_out.
- */
- update_send_head(sk, tp, skb);
+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ tcp_tso_set_push(skb);
+ if (tcp_transmit_skb(sk, skb_clone(skb, GFP_ATOMIC)))
+ break;
- tcp_minshall_update(tp, mss_now, skb);
- sent_pkts = 1;
- }
+ /* Advance the send_head. This one is sent out.
+ * This call will increment packets_out.
+ */
+ update_send_head(sk, tp, skb);
- if (sent_pkts) {
- tcp_cwnd_validate(sk, tp);
- return 0;
- }
+ tcp_minshall_update(tp, mss_now, skb);
+ sent_pkts = 1;
+ }
- return !tp->packets_out && sk->sk_send_head;
+ if (sent_pkts) {
+ tcp_cwnd_validate(sk, tp);
+ return 0;
}
- return 0;
+
+ return !tp->packets_out && sk->sk_send_head;
}
/* Push out any pending frames which were held back due to
* [PATCH 8/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:22 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Fix redundant calculations of tcp_current_mss()
tcp_write_xmit() uses tcp_current_mss(), but some of its
callers, namely __tcp_push_pending_frames(), already have this
value available.
While we're here, fix the "cur_mss" argument to be "unsigned int"
instead of plain "unsigned".
Signed-off-by: David S. Miller <davem@davemloft.net>
f22c7890049ef8c51b0cdcc5d7e0cd06333de6b0 (from b8d892e4dc753d796e80da6e17f2a88aede0695e)
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -946,7 +946,7 @@ extern __u32 cookie_v4_init_sequence(str
extern void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb);
extern void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
- unsigned cur_mss, int nonagle);
+ unsigned int cur_mss, int nonagle);
extern int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp);
extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
extern void tcp_xmit_retransmit_queue(struct sock *);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -844,11 +844,10 @@ static inline void tcp_cwnd_validate(str
* Returns 1, if no segments are in flight and we have queued segments, but
* cannot send anything now because of SWS or another problem.
*/
-static int tcp_write_xmit(struct sock *sk, int nonagle)
+static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
- unsigned int mss_now;
int sent_pkts;
/* If we are closed, the bytes will have to remain here.
@@ -858,13 +857,6 @@ static int tcp_write_xmit(struct sock *s
if (unlikely(sk->sk_state == TCP_CLOSE))
return 0;
-
- /* Account for SACKS, we may need to fragment due to this.
- * It is just like the real MSS changing on us midstream.
- * We also handle things correctly when the user adds some
- * IP options mid-stream. Silly to do, but cover it.
- */
- mss_now = tcp_current_mss(sk, 1);
sent_pkts = 0;
while ((skb = sk->sk_send_head) &&
tcp_snd_test(sk, skb, mss_now,
@@ -902,7 +894,7 @@ static int tcp_write_xmit(struct sock *s
* The socket must be locked by the caller.
*/
void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
- unsigned cur_mss, int nonagle)
+ unsigned int cur_mss, int nonagle)
{
struct sk_buff *skb = sk->sk_send_head;
@@ -910,7 +902,7 @@ void __tcp_push_pending_frames(struct so
if (!tcp_skb_is_last(sk, skb))
nonagle = TCP_NAGLE_PUSH;
if (!tcp_snd_test(sk, skb, cur_mss, nonagle) ||
- tcp_write_xmit(sk, nonagle))
+ tcp_write_xmit(sk, cur_mss, nonagle))
tcp_check_probe_timer(sk, tp);
}
}
@@ -921,7 +913,7 @@ void __tcp_data_snd_check(struct sock *s
if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
- tcp_write_xmit(sk, tp->nonagle))
+ tcp_write_xmit(sk, tcp_current_mss(sk, 1), tp->nonagle))
tcp_check_probe_timer(sk, tp);
}
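The resulting call graph, sketched for clarity (the function names
are the real ones, the layout is mine):

/*
 *	tcp_push_pending_frames(sk, tp)
 *	  __tcp_push_pending_frames(sk, tp, tcp_current_mss(sk, 1),
 *				    tp->nonagle)
 *	    tcp_write_xmit(sk, cur_mss, nonagle)
 *
 *	__tcp_data_snd_check(sk, skb)
 *	  tcp_write_xmit(sk, tcp_current_mss(sk, 1), tp->nonagle)
 *
 * tcp_current_mss() is now evaluated once per entry point instead
 * of again inside tcp_write_xmit().
 */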
* [PATCH 9/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 4:23 UTC
To: netdev; +Cc: herbert, jheffner
[TCP]: Fix __tcp_push_pending_frames() 'nonagle' handling.
'nonagle' should be passed to the tcp_snd_test() function
as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the
tail of the write_queue. This is because Nagle does not
apply to such frames since we cannot possibly tack more
data onto them.
However, while doing this __tcp_push_pending_frames() makes
all of the packets in the write_queue use this modified
'nonagle' value.
Fix the bug and simplify this function by just calling
tcp_write_xmit() directly if sk_send_head is non-NULL.
As a result, we can now make tcp_data_snd_check() just call
tcp_push_pending_frames() instead of the specialized
__tcp_data_snd_check().
Signed-off-by: David S. Miller <davem@davemloft.net>
45d0377c7d18e1a036b0a1f96788a998dccf73cf (from f22c7890049ef8c51b0cdcc5d7e0cd06333de6b0)
diff --git a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -944,7 +944,6 @@ extern __u32 cookie_v4_init_sequence(str
/* tcp_output.c */
-extern void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb);
extern void __tcp_push_pending_frames(struct sock *sk, struct tcp_sock *tp,
unsigned int cur_mss, int nonagle);
extern int tcp_may_send_now(struct sock *sk, struct tcp_sock *tp);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3975,12 +3975,9 @@ static inline void tcp_check_space(struc
}
}
-static __inline__ void tcp_data_snd_check(struct sock *sk)
+static __inline__ void tcp_data_snd_check(struct sock *sk, struct tcp_sock *tp)
{
- struct sk_buff *skb = sk->sk_send_head;
-
- if (skb != NULL)
- __tcp_data_snd_check(sk, skb);
+ tcp_push_pending_frames(sk, tp);
tcp_check_space(sk);
}
@@ -4274,7 +4271,7 @@ int tcp_rcv_established(struct sock *sk,
*/
tcp_ack(sk, skb, 0);
__kfree_skb(skb);
- tcp_data_snd_check(sk);
+ tcp_data_snd_check(sk, tp);
return 0;
} else { /* Header too small */
TCP_INC_STATS_BH(TCP_MIB_INERRS);
@@ -4340,7 +4337,7 @@ int tcp_rcv_established(struct sock *sk,
if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
/* Well, only one small jumplet in fast path... */
tcp_ack(sk, skb, FLAG_DATA);
- tcp_data_snd_check(sk);
+ tcp_data_snd_check(sk, tp);
if (!tcp_ack_scheduled(tp))
goto no_ack;
}
@@ -4418,7 +4415,7 @@ step5:
/* step 7: process the segment text */
tcp_data_queue(sk, skb);
- tcp_data_snd_check(sk);
+ tcp_data_snd_check(sk, tp);
tcp_ack_snd_check(sk);
return 0;
@@ -4732,7 +4729,7 @@ int tcp_rcv_state_process(struct sock *s
/* Do step6 onward by hand. */
tcp_urg(sk, skb, th);
__kfree_skb(skb);
- tcp_data_snd_check(sk);
+ tcp_data_snd_check(sk, tp);
return 0;
}
@@ -4921,7 +4918,7 @@ int tcp_rcv_state_process(struct sock *s
/* tcp_data could move socket to TIME-WAIT */
if (sk->sk_state != TCP_CLOSE) {
- tcp_data_snd_check(sk);
+ tcp_data_snd_check(sk, tp);
tcp_ack_snd_check(sk);
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -899,24 +899,11 @@ void __tcp_push_pending_frames(struct so
struct sk_buff *skb = sk->sk_send_head;
if (skb) {
- if (!tcp_skb_is_last(sk, skb))
- nonagle = TCP_NAGLE_PUSH;
- if (!tcp_snd_test(sk, skb, cur_mss, nonagle) ||
- tcp_write_xmit(sk, cur_mss, nonagle))
+ if (tcp_write_xmit(sk, cur_mss, nonagle))
tcp_check_probe_timer(sk, tp);
}
}
-void __tcp_data_snd_check(struct sock *sk, struct sk_buff *skb)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd) ||
- tcp_packets_in_flight(tp) >= tp->snd_cwnd ||
- tcp_write_xmit(sk, tcp_current_mss(sk, 1), tp->nonagle))
- tcp_check_probe_timer(sk, tp);
-}
-
/* This function returns the amount that we can raise the
* usable window based on the following constraints
*
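For reference, the per-skb Nagle choice that makes the deleted logic
redundant already lives in tcp_write_xmit()'s loop (quoting the tree
as of patch 7 in this series):

	while ((skb = sk->sk_send_head) &&
	       tcp_snd_test(sk, skb, mss_now,
			    tcp_skb_is_last(sk, skb) ? nonagle :
						       TCP_NAGLE_PUSH)) {
		...
	}

With the old duplicated check, the clobbered 'nonagle' value was
handed to tcp_write_xmit(), and this ternary then yielded
TCP_NAGLE_PUSH for the true tail skb as well.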
* Re: [PATCH 0/9]: TCP: The Road to Super TSO
From: Stephen Hemminger @ 2005-06-07 4:56 UTC
To: David S. Miller; +Cc: netdev, herbert, jheffner
I'll merge these with the TCP infrastructure stuff and
send it off to Andrew. Actually, it is more of a fix of the TCP
infrastructure to match TSO + rc6, but you get the idea.
* Re: [PATCH 0/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-07 5:51 UTC
To: shemminger; +Cc: netdev, herbert, jheffner
From: Stephen Hemminger <shemminger@osdl.org>
Date: Mon, 06 Jun 2005 21:56:16 -0700
> I'll merge these with the TCP infrastructure stuff and
> send it off to Andrew. Actually, it is more of a fix of the TCP
> infrastructure to match TSO + rc6, but you get the idea.
Probably not a good idea; it's 75% of the implementation
of Super TSO and totally conflicts with the super TSO patch.
Probably best to keep the existing Super TSO stuff in there
until I'm done with this stuff. :)
* Re: [PATCH 0/9]: TCP: The Road to Super TSO
From: John Heffner @ 2005-06-08 21:40 UTC
To: David S. Miller; +Cc: netdev, herbert
On Tuesday 07 June 2005 12:08 am, David S. Miller wrote:
> Some folks, notably the S2IO guys, get performance degradation
> from the Super TSO v2 patch (they get it from the first version
> as well). It's a real pain to spot what causes such things
> in such a huge patch... so I started splitting things up in
> a very fine-grained manner so we can catch regressions more
> precisely.
I'm curious about the details of this. Is there decreased performance
relative to current TSO? Relative to no TSO? Sending to just one receiver
or many, and is it receiver limited?
-John
* Re: [PATCH 0/9]: TCP: The Road to Super TSO
From: David S. Miller @ 2005-06-08 21:49 UTC
To: jheffner; +Cc: netdev, herbert
From: John Heffner <jheffner@psc.edu>
Subject: Re: [PATCH 0/9]: TCP: The Road to Super TSO
Date: Wed, 8 Jun 2005 17:40:10 -0400
> On Tuesday 07 June 2005 12:08 am, David S. Miller wrote:
> > Some folks, notably the S2IO guys, get performance degradation
> > from the Super TSO v2 patch (they get it from the first version
> > as well). It's a real pain to spot what causes such things
> > in such a huge patch... so I started splitting things up in
> > a very fine-grained manner so we can catch regressions more
> > precisely.
>
> I'm curious about the details of this. Is there decreased performance
> relative to current TSO? Relative to no TSO? Sending to just one receiver
> or many, and is it receiver limited?
Their tests are receiver-limited. No current-generation system
can fill a 10Gbit pipe fully, especially at 1500-byte MTU.
Performance went down, with both TSO enabled and disabled, compared to
not having the patches applied.
That's why I'm going through this entire exercise of doing things one
piece at a time.
* Re: [PATCH 0/9]: TCP: The Road to Super TSO
From: Herbert Xu @ 2005-06-08 22:10 UTC
To: David S. Miller; +Cc: jheffner, netdev
On Wed, Jun 08, 2005 at 02:49:06PM -0700, David S. Miller wrote:
>
> Performance went down, with both TSO enabled and disabled, compared to
> not having the patches applied.
What was the receiver running? Was the performance degradation more
pronounced with TSO enabled?
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* RE: [PATCH 0/9]: TCP: The Road to Super TSO
From: Leonid Grossman @ 2005-06-08 22:19 UTC
To: 'David S. Miller', jheffner; +Cc: netdev, herbert
> -----Original Message-----
> From: netdev-bounce@oss.sgi.com
> [mailto:netdev-bounce@oss.sgi.com] On Behalf Of David S. Miller
> Sent: Wednesday, June 08, 2005 2:49 PM
> To: jheffner@psc.edu
> Cc: netdev@oss.sgi.com; herbert@gondor.apana.org.au
> Subject: Re: [PATCH 0/9]: TCP: The Road to Super TSO
>
> From: John Heffner <jheffner@psc.edu>
> Subject: Re: [PATCH 0/9]: TCP: The Road to Super TSO
> Date: Wed, 8 Jun 2005 17:40:10 -0400
>
> > On Tuesday 07 June 2005 12:08 am, David S. Miller wrote:
> > > Some folks, notably the S2IO guys, get performance degradation
> > > from the Super TSO v2 patch (they get it from the first version
> > > as well). It's a real pain to spot what causes such things in
> > > such a huge patch... so I started splitting things up in a very fine-grained
> > > manner so we can catch regressions more precisely.
> >
> > I'm curious about the details of this. Is there decreased
> > performance relative to current TSO? Relative to no TSO?
> > Sending to just one receiver or many, and is it receiver limited?
>
> Their tests are receiver-limited. No current-generation
> system can fill a 10Gbit pipe fully, especially
> at 1500-byte MTU.
With jumbo frames, a single receiver can handle 10GbE line rate.
With 1500 mtu, a single receiver becomes a bottleneck. I will forward the
numbers later today.
>
> Performance went down, with both TSO enabled and disabled,
> compared to not having the patches applied.
>
> That's why I'm going through this entire exercise of doing
> things one piece at a time.
>
>
* Re: [PATCH 0/9]: TCP: The Road to Super TSO
From: Leonid Grossman @ 2005-06-09 4:30 UTC
To: 'David S. Miller'; +Cc: netdev
FYI, looks like the code in the nine patches is not responsible for the
performance drop; the problem is elsewhere in the Super TSO code.
-----Original Message-----
From: kshaw [mailto:kim.shaw@neterion.com]
Sent: Wednesday, June 08, 2005 8:34 PM
To: 'David S. Miller'
Cc: ravinandan.arakali@neterion.com; leonid.grossman@neterion.com
Subject: RE: test Super TSO
David,
I have applied all 9 patches (6-9 were applied by editing the source
files). I don't see a Tx performance drop from any patch; Tx
throughput remains at 6.17-6.18 Gb/s.
The following is the configuration:
4-way Opteron system .247 with shipping kernel 2.6.12-rc5 as the Tx
system, and 4-way Opteron system .226 with kernel 2.6.11.5 as the Rx
system. NIC driver REL_1-7-7-7_LX is installed on both systems, and
MTU is set to 9000 on both.
The systems are connected back to back.
Run 8 nttcp connections from the Tx system to the Rx system for 60 seconds.
TSO is at its default (on) on both systems.
I also re-tested the original TSO patch which I used weeks ago: with
the same hardware and kernel 2.6.12-rc4 plus the original TSO patch
on the Tx system, Tx throughput drops to 5.28 Gb/s.
* RE: [PATCH 0/9]: TCP: The Road to Super TSO
From: Leonid Grossman @ 2005-06-09 4:55 UTC
To: 'Herbert Xu', 'David S. Miller'; +Cc: jheffner, netdev
Some of the original data that we got a couple of weeks ago is
attached. On the questions from Herbert and others:

- The performance drop from the "super-TSO" patch with TSO off is
  marginal; with TSO on it is quite noticeable.
- The numbers are similar in back-to-back and switch-based (sender
  vs. two receivers) tests.
- The numbers are relative; we tested in PCI-X 1.0 slots, where
  ~7.5 Gbps is a practical bus limit for TCP traffic. In PCI-X 2.0
  slots, the numbers are ~10 Gbps with either jumbo frames or
  1500 MTU + TSO (against two 1500-MTU receivers), at a fraction of
  a single Opteron's CPU.
- David is correct: with 1500 MTU the single receiver's CPU becomes
  a bottleneck; the best throughput I've seen with 1500 MTU was
  ~5 Gbps. So, in a back-to-back setup with 1500 MTU the advantages
  of TSO are mostly wasted, since there is no TSO counterpart on the
  receive side. Receive-side stateless offloads fix this, but we
  have not gotten around to deploying these ASIC capabilities in
  Linux yet.
Anyway, here it goes:
----------------------------------------------------------
Configuration:
Dual Opteron system .243 as Rx, dual Opteron system .117 as Rx,
4-way Opteron system .247 as Tx, connected via a Cisco switch.
The .243 and .117 kernel sources are patched with tcp_ack26.diff;
the .247 kernel source is patched with tcp_super_tso.diff.
Run 8 nttcp connections from the Tx system to each Rx system;
use packet size 65535 for MTU 1500 and packet size 300000 for
MTU 9000.
Tx throughput on the 4-way Opteron system .247:

2.6.12-rc4
          Tx-1500     CPU usage      Tx-9000     CPU usage
          --------    -----------    --------    -----------
TSO off   2.5 Gb/s    55% (note 1)   5.3 Gb/s    40% (note 3)
TSO on    4.0 Gb/s    47% (note 2)   6.1 Gb/s    35% (note 4)

2.6.12-rc4 with tcp_super_tso.diff patch
          Tx-1500     CPU usage      Tx-9000     CPU usage
          --------    -----------    --------    -----------
TSO off   2.4 Gb/s    60% (note 5)   5.0 Gb/s    41% (note 7)
TSO on    3.5 Gb/s    45% (note 6)   5.7 Gb/s    35% (note 8)
Note(1):
1500 tso off
top - 08:45:41 up 13 min, 2 users, load average: 2.03, 1.01, 0.54
Tasks: 90 total, 3 running, 87 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 50.7% hi, 49.3% si
Cpu1 : 0.3% us, 29.2% sy, 0.0% ni, 53.2% id, 0.0% wa, 0.0% hi, 17.3% si
Cpu2 : 0.3% us, 27.9% sy, 0.0% ni, 53.2% id, 0.0% wa, 0.0% hi, 18.6% si
Cpu3 : 0.3% us, 23.6% sy, 0.0% ni, 59.5% id, 0.0% wa, 0.0% hi, 16.6% si
Mem: 2055724k total, 203172k used, 1852552k free, 24112k buffers
Swap: 2040244k total, 0k used, 2040244k free, 79384k cached
Note(2):
1500 tso on
top - 08:48:19 up 16 min, 2 users, load average: 0.74, 0.71, 0.49
Tasks: 90 total, 4 running, 86 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3% us, 1.1% sy, 0.0% ni, 71.9% id, 0.6% wa, 12.2% hi, 13.8% si
Cpu1 : 0.5% us, 7.8% sy, 0.0% ni, 88.2% id, 0.5% wa, 0.0% hi, 3.0% si
Cpu2 : 0.4% us, 8.1% sy, 0.0% ni, 88.2% id, 0.5% wa, 0.0% hi, 2.9% si
Cpu3 : 0.3% us, 6.6% sy, 0.0% ni, 90.3% id, 0.1% wa, 0.0% hi, 2.7% si
Mem: 2055724k total, 203652k used, 1852072k free, 25308k buffers
Swap: 2040244k total, 0k used, 2040244k free, 79412k cached
Note(3):
9000 off
top - 08:58:19 up 6 min, 2 users, load average: 0.88, 0.47, 0.21
Tasks: 90 total, 2 running, 88 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.8% us, 8.8% sy, 0.0% ni, 79.1% id, 1.4% wa, 3.5% hi, 6.4% si
Cpu1 : 0.7% us, 7.3% sy, 0.0% ni, 90.8% id, 0.4% wa, 0.0% hi, 0.8% si
Cpu2 : 0.7% us, 6.9% sy, 0.0% ni, 90.8% id, 1.0% wa, 0.1% hi, 0.5% si
Cpu3 : 0.5% us, 5.1% sy, 0.0% ni, 93.9% id, 0.3% wa, 0.0% hi, 0.2% si
Mem: 2055724k total, 378620k used, 1677104k free, 18400k buffers
Swap: 2040244k total, 0k used, 2040244k free, 72788k cached
Note(4):
9000 on
top - 08:55:55 up 4 min, 2 users, load average: 0.53, 0.26, 0.12
Tasks: 90 total, 2 running, 88 sleeping, 0 stopped, 0 zombie
Cpu0 : 1.1% us, 4.4% sy, 0.0% ni, 89.2% id, 2.2% wa, 1.2% hi, 1.9% si
Cpu1 : 1.0% us, 3.5% sy, 0.0% ni, 94.3% id, 0.6% wa, 0.0% hi, 0.5% si
Cpu2 : 1.1% us, 6.4% sy, 0.0% ni, 90.7% id, 1.6% wa, 0.1% hi, 0.2% si
Cpu3 : 0.8% us, 5.3% sy, 0.0% ni, 93.5% id, 0.4% wa, 0.0% hi, 0.1% si
Mem: 2055724k total, 375892k used, 1679832k free, 17424k buffers
Swap: 2040244k total, 0k used, 2040244k free, 72676k cached
Note (5):
1500 tso off
top - 05:54:20 up 10 min, 2 users, load average: 1.48, 0.62, 0.29
Tasks: 91 total, 3 running, 88 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.5% us, 0.5% sy, 0.0% ni, 81.3% id, 0.9% wa, 7.6% hi, 9.1% si
Cpu1 : 0.7% us, 5.4% sy, 0.0% ni, 91.5% id, 0.7% wa, 0.0% hi, 1.8% si
Cpu2 : 0.6% us, 6.5% sy, 0.0% ni, 90.2% id, 0.7% wa, 0.0% hi, 2.0% si
Cpu3 : 0.4% us, 5.5% sy, 0.0% ni, 92.1% id, 0.2% wa, 0.0% hi, 1.8% si
Mem: 2055724k total, 204100k used, 1851624k free, 24056k buffers
Swap: 2040244k total, 0k used, 2040244k free, 79440k cached
Note (6):
1500 tso on
top - 05:49:36 up 6 min, 2 users, load average: 1.28, 0.45, 0.18
Tasks: 91 total, 6 running, 85 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 41.5% hi, 58.5% si
Cpu1 : 0.0% us, 26.4% sy, 0.0% ni, 69.9% id, 0.0% wa, 0.0% hi, 3.7% si
Cpu2 : 0.3% us, 24.3% sy, 0.0% ni, 71.3% id, 0.0% wa, 0.0% hi, 4.0% si
Cpu3 : 0.0% us, 19.1% sy, 0.0% ni, 77.6% id, 0.0% wa, 0.0% hi, 3.3% si
Mem: 2055724k total, 200496k used, 1855228k free, 22644k buffers
Swap: 2040244k total, 0k used, 2040244k free, 79288k cached
Note (7):
9000 off
top - 06:03:13 up 19 min, 2 users, load average: 0.52, 0.27, 0.23
Tasks: 91 total, 3 running, 88 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3% us, 1.0% sy, 0.0% ni, 86.0% id, 0.5% wa, 5.3% hi, 6.8%
Cpu1 : 0.4% us, 4.3% sy, 0.0% ni, 93.7% id, 0.4% wa, 0.0% hi, 1.3%
Cpu2 : 0.3% us, 4.5% sy, 0.0% ni, 93.2% id, 0.4% wa, 0.0% hi, 1.5%
Cpu3 : 0.2% us, 3.8% sy, 0.0% ni, 94.7% id, 0.1% wa, 0.0% hi, 1.2%
Mem: 2055724k total, 399540k used, 1656184k free, 25816k buffers
Swap: 2040244k total, 0k used, 2040244k free, 79516k cached
Note (8):
9000 on
top - 06:05:16 up 21 min, 2 users, load average: 0.79, 0.42, 0.29
Tasks: 91 total, 1 running, 90 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.3% us, 2.5% sy, 0.0% ni, 83.5% id, 0.5% wa, 5.6% hi, 7.7%
Cpu1 : 0.4% us, 5.1% sy, 0.0% ni, 92.9% id, 0.3% wa, 0.0% hi, 1.3%
Cpu2 : 0.3% us, 4.9% sy, 0.0% ni, 92.9% id, 0.4% wa, 0.0% hi, 1.4%
Cpu3 : 0.2% us, 3.9% sy, 0.0% ni, 94.7% id, 0.1% wa, 0.0% hi, 1.2%
Mem: 2055724k total, 397784k used, 1657940k free, 26892k buffers
Swap: 2040244k total, 0k used, 2040244k free, 79528k cached
> -----Original Message-----
> From: netdev-bounce@oss.sgi.com
> [mailto:netdev-bounce@oss.sgi.com] On Behalf Of Herbert Xu
> Sent: Wednesday, June 08, 2005 3:11 PM
> To: David S. Miller
> Cc: jheffner@psc.edu; netdev@oss.sgi.com
> Subject: Re: [PATCH 0/9]: TCP: The Road to Super TSO
>
> On Wed, Jun 08, 2005 at 02:49:06PM -0700, David S. Miller wrote:
> >
> > Performance went down, with both TSO enabled and disabled,
> compared to
> > not having the patches applied.
>
> What was the receiver running? Was the performance
> degradation more pronounced with TSO enabled?
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
>