* [PATCH 0/2] David Miller's rbtree patches for 2.6.22.6
@ 2007-09-20  1:39 Tom Quetchenbach
  2007-09-20  1:42 ` [PATCH 1/2] " Tom Quetchenbach
  2007-09-21 14:08 ` [PATCH 0/2] " Ilpo Järvinen
  0 siblings, 2 replies; 7+ messages in thread
From: Tom Quetchenbach @ 2007-09-20  1:39 UTC (permalink / raw)
  To: netdev

Hello,

I've been experimenting with David Miller's red-black tree patch for
SACK processing. We're sending TCP traffic between two machines with
10Gbps cards over a 1Gbps bottleneck link and were getting very high CPU
load with large windows. With a few tweaks, this patch seems to provide
a pretty substantial improvement. David: this seems like excellent work
so far.

Here are a couple of patches against 2.6.22.6. The first one is just
David's patches tweaked for 2.6.22.6, with a couple of minor bugfixes to
get it to compile and not crash. (I also changed
__tcp_add_write_queue_tail() to set the fack_count of the new packet
to the fack_count of the tail plus the packet count of the tail, not the
packet count of the new skb, because I think that's how it was intended
to be. Right?)
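
For reference, the assignment in question looks roughly like this (a
sketch only; the real change is the __tcp_add_write_queue_tail() hunk in
patch 1 below):

	struct sk_buff *tail = tcp_write_queue_tail(sk);
	unsigned int fc = 0;

	if (tail)
		fc = TCP_SKB_CB(tail)->fack_count + tcp_skb_pcount(tail);
	/* rather than TCP_SKB_CB(tail)->fack_count + tcp_skb_pcount(skb) */
	TCP_SKB_CB(skb)->fack_count = fc;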

In the second patch there are a couple of significant changes.  One is
(as Baruch suggested) to modify the existing SACK fast path so that we
don't tag packets we've already tagged when we advance by a packet.
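
Roughly, the fast-path change amounts to this (sketch only; patch 2 has
the actual hunks, including where prev_end_seq is recorded from
recv_sack_cache):

	/* If only the end of the first SACK block grew since the last ACK,
	 * resume tagging from the previous end_seq rather than re-walking
	 * the whole block from start_seq. */
	if (prev_end_seq)
		start_seq = prev_end_seq;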

The other issue is that the cached fack_counts seem to be wrong, because
they're set when we insert into the queue, but tcp_set_tso_segs() is
called later, just before we send, so all the fack_counts are zero. My
solution was to set the fack_count when we advance the send_head. Also I
changed tcp_reset_fack_counts() so that it exits when it hits an skb
whose tcp_skb_pcount() is zero or whose fack_count is already correct.
(This really helps when TSO is on, since there's lots of inserting into
the middle of the queue.)
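
Concretely, the two changes look something like this (sketches only; the
real hunks are in patch 2):

	/* when advancing sk_send_head past skb: skb has been sent, so its
	 * packet count is known by now */
	TCP_SKB_CB(sk->sk_send_head)->fack_count =
		TCP_SKB_CB(skb)->fack_count + tcp_skb_pcount(skb);

	/* in tcp_reset_fack_counts(): stop as soon as the counts are already
	 * consistent or the pcount has not been set yet */
	if (TCP_SKB_CB(skb)->fack_count == fc || !tcp_skb_pcount(skb))
		break;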

Please let me know how I can help get this tested and debugged. Reducing
the SACK processing load is really going to be essential for us to start
testing experimental TCP variants with large windows.

Thanks
-Tom




* [PATCH 1/2] David Miller's rbtree patches for 2.6.22.6
  2007-09-20  1:39 [PATCH 0/2] David Miller's rbtree patches for 2.6.22.6 Tom Quetchenbach
@ 2007-09-20  1:42 ` Tom Quetchenbach
  2007-09-20  1:44   ` [PATCH 2/2] " Tom Quetchenbach
                     ` (2 more replies)
  2007-09-21 14:08 ` [PATCH 0/2] " Ilpo Järvinen
  1 sibling, 3 replies; 7+ messages in thread
From: Tom Quetchenbach @ 2007-09-20  1:42 UTC (permalink / raw)
  To: netdev

Patch 1: David Miller's red-black tree code, tweaked for 2.6.22.6,
with some bugfixes

-Tom

---
diff -ur linux-2.6.22.6/include/linux/skbuff.h linux-2.6.22.6-rbtree-davem-fixed/include/linux/skbuff.h
--- linux-2.6.22.6/include/linux/skbuff.h	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/include/linux/skbuff.h	2007-09-13 18:23:16.000000000 -0700
@@ -18,6 +18,7 @@
 #include <linux/compiler.h>
 #include <linux/time.h>
 #include <linux/cache.h>
+#include <linux/rbtree.h>
 
 #include <asm/atomic.h>
 #include <asm/types.h>
@@ -242,6 +243,8 @@
 	struct sk_buff		*next;
 	struct sk_buff		*prev;
 
+	struct rb_node		rb;
+
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
diff -ur linux-2.6.22.6/include/linux/tcp.h linux-2.6.22.6-rbtree-davem-fixed/include/linux/tcp.h
--- linux-2.6.22.6/include/linux/tcp.h	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/include/linux/tcp.h	2007-09-13 18:23:16.000000000 -0700
@@ -174,6 +174,7 @@
 
 #include <linux/skbuff.h>
 #include <linux/dmaengine.h>
+#include <linux/rbtree.h>
 #include <net/sock.h>
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -321,6 +322,7 @@
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
 
+	struct rb_root		write_queue_rb;
 	struct sk_buff_head	out_of_order_queue; /* Out of order segments go here */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
@@ -339,9 +341,7 @@
 	struct sk_buff *scoreboard_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 	struct sk_buff *forward_skb_hint;
-	struct sk_buff *fastpath_skb_hint;
 
-	int     fastpath_cnt_hint;
 	int     lost_cnt_hint;
 	int     retransmit_cnt_hint;
 	int     forward_cnt_hint;
diff -ur linux-2.6.22.6/include/net/tcp.h linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h
--- linux-2.6.22.6/include/net/tcp.h	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h	2007-09-19 17:36:07.000000000 -0700
@@ -540,6 +540,7 @@
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;		/* used to compute rtt's	*/
+	unsigned int	fack_count;	/* speed up SACK processing	*/
 	__u8		flags;		/* TCP header flags.		*/
 
 	/* NOTE: These must match up to the flags byte in a
@@ -1043,12 +1044,12 @@
 }
 
 /*from STCP */
-static inline void clear_all_retrans_hints(struct tcp_sock *tp){
+static inline void clear_all_retrans_hints(struct tcp_sock *tp)
+{
 	tp->lost_skb_hint = NULL;
 	tp->scoreboard_skb_hint = NULL;
 	tp->retransmit_skb_hint = NULL;
 	tp->forward_skb_hint = NULL;
-	tp->fastpath_skb_hint = NULL;
 }
 
 /* MD5 Signature */
@@ -1166,6 +1167,7 @@
 
 	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
 		sk_stream_free_skb(sk, skb);
+	tcp_sk(sk)->write_queue_rb = RB_ROOT;
 	sk_stream_mem_reclaim(sk);
 }
 
@@ -1208,7 +1210,7 @@
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	sk->sk_send_head = skb->next;
+	sk->sk_send_head = tcp_write_queue_next(sk, skb);
 	if (sk->sk_send_head == (struct sk_buff *)&sk->sk_write_queue)
 		sk->sk_send_head = NULL;
 	/* Don't override Nagle indefinately with F-RTO */
@@ -1227,9 +1229,61 @@
 	sk->sk_send_head = NULL;
 }
 
+static inline struct sk_buff *tcp_write_queue_find(struct sock *sk, __u32 seq)
+{
+	struct rb_node *rb_node = tcp_sk(sk)->write_queue_rb.rb_node;
+	struct sk_buff *skb = NULL;
+
+	while (rb_node) {
+		struct sk_buff *tmp = rb_entry(rb_node,struct sk_buff,rb);
+		if (TCP_SKB_CB(tmp)->end_seq > seq) {
+			skb = tmp;
+			if (TCP_SKB_CB(tmp)->seq <= seq)
+				break;
+			rb_node = rb_node->rb_left;
+		} else
+			rb_node = rb_node->rb_right;
+
+	}
+	return skb;
+}
+
+static inline void tcp_rb_insert(struct sk_buff *skb, struct rb_root *root)
+{
+	struct rb_node **rb_link, *rb_parent;
+	__u32 seq = TCP_SKB_CB(skb)->seq;
+
+	rb_link = &root->rb_node;
+	rb_parent = NULL;
+	while (*rb_link != NULL) {
+		struct sk_buff *tmp = rb_entry(*rb_link,struct sk_buff,rb);
+		rb_parent = *rb_link;
+		if (TCP_SKB_CB(tmp)->end_seq > seq) {
+			BUG_ON(TCP_SKB_CB(tmp)->seq <= seq);
+			rb_link = &rb_parent->rb_left;
+		} else {
+			rb_link = &rb_parent->rb_right;
+		}
+	}
+	rb_link_node(&skb->rb, rb_parent, rb_link);
+	rb_insert_color(&skb->rb, root);
+}
+
+static inline void tcp_rb_unlink(struct sk_buff *skb, struct rb_root *root)
+{
+	rb_erase(&skb->rb, root);
+}
+
 static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
 {
+	struct sk_buff *tail = tcp_write_queue_tail(sk);
+	unsigned int fc = 0;
+
+	if (tail)
+		fc = TCP_SKB_CB(tail)->fack_count + tcp_skb_pcount(tail);
+	TCP_SKB_CB(skb)->fack_count = fc;
 	__skb_queue_tail(&sk->sk_write_queue, skb);
+	tcp_rb_insert(skb, &tcp_sk(sk)->write_queue_rb);
 }
 
 static inline void tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
@@ -1241,9 +1295,35 @@
 		sk->sk_send_head = skb;
 }
 
+/* This is only used for tcp_send_synack(), so the write queue should
+ * be empty.  If that stops being true, the fack_count assignment
+ * will need to be more elaborate.
+ */
 static inline void __tcp_add_write_queue_head(struct sock *sk, struct sk_buff *skb)
 {
+	BUG_ON(!skb_queue_empty(&sk->sk_write_queue));
 	__skb_queue_head(&sk->sk_write_queue, skb);
+	TCP_SKB_CB(skb)->fack_count = 0;
+	tcp_rb_insert(skb, &tcp_sk(sk)->write_queue_rb);
+}
+
+/* An insert into the middle of the write queue causes the fack
+ * counts in subsequent packets to become invalid, fix them up.
+ */
+static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *first)
+{
+	struct sk_buff *prev = first->prev;
+	unsigned int fc = 0;
+
+	if (prev != (struct sk_buff *) &sk->sk_write_queue)
+		fc = TCP_SKB_CB(prev)->fack_count + tcp_skb_pcount(prev);
+
+	while (first != (struct sk_buff *)&sk->sk_write_queue) {
+		TCP_SKB_CB(first)->fack_count = fc;
+
+		fc += tcp_skb_pcount(first);
+		first = first->next;
+	}
 }
 
 /* Insert buff after skb on the write queue of sk.  */
@@ -1252,19 +1332,24 @@
 						struct sock *sk)
 {
 	__skb_append(skb, buff, &sk->sk_write_queue);
+	tcp_reset_fack_counts(sk, buff);
+	tcp_rb_insert(buff, &tcp_sk(sk)->write_queue_rb);
 }
 
-/* Insert skb between prev and next on the write queue of sk.  */
+/* Insert new before skb on the write queue of sk.  */
 static inline void tcp_insert_write_queue_before(struct sk_buff *new,
 						  struct sk_buff *skb,
 						  struct sock *sk)
 {
 	__skb_insert(new, skb->prev, skb, &sk->sk_write_queue);
+	tcp_reset_fack_counts(sk, new);
+	tcp_rb_insert(new, &tcp_sk(sk)->write_queue_rb);
 }
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
 	__skb_unlink(skb, &sk->sk_write_queue);
+	tcp_rb_unlink(skb, &tcp_sk(sk)->write_queue_rb);
 }
 
 static inline int tcp_skb_is_last(const struct sock *sk,
diff -ur linux-2.6.22.6/net/ipv4/tcp_input.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c
--- linux-2.6.22.6/net/ipv4/tcp_input.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c	2007-09-13 18:23:16.000000000 -0700
@@ -947,14 +947,13 @@
 	unsigned char *ptr = (skb_transport_header(ack_skb) +
 			      TCP_SKB_CB(ack_skb)->sacked);
 	struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2);
-	struct sk_buff *cached_skb;
 	int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
 	int reord = tp->packets_out;
 	int prior_fackets;
 	u32 lost_retrans = 0;
 	int flag = 0;
 	int found_dup_sack = 0;
-	int cached_fack_count;
+	int fack_count_base;
 	int i;
 	int first_sack_index;
 
@@ -1020,7 +1019,6 @@
 		num_sacks = 1;
 	else {
 		int j;
-		tp->fastpath_skb_hint = NULL;
 
 		/* order SACK blocks to allow in order walk of the retrans queue */
 		for (i = num_sacks-1; i > 0; i--) {
@@ -1045,14 +1043,7 @@
 	/* clear flag as used for different purpose in following code */
 	flag = 0;
 
-	/* Use SACK fastpath hint if valid */
-	cached_skb = tp->fastpath_skb_hint;
-	cached_fack_count = tp->fastpath_cnt_hint;
-	if (!cached_skb) {
-		cached_skb = tcp_write_queue_head(sk);
-		cached_fack_count = 0;
-	}
-
+	fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count;
 	for (i=0; i<num_sacks; i++, sp++) {
 		struct sk_buff *skb;
 		__u32 start_seq = ntohl(sp->start_seq);
@@ -1060,8 +1051,10 @@
 		int fack_count;
 		int dup_sack = (found_dup_sack && (i == first_sack_index));
 
-		skb = cached_skb;
-		fack_count = cached_fack_count;
+		skb = tcp_write_queue_find(sk, start_seq);
+		if (!skb)
+			continue;
+		fack_count = TCP_SKB_CB(skb)->fack_count - fack_count_base;
 
 		/* Event "B" in the comment above. */
 		if (after(end_seq, tp->high_seq))
@@ -1074,13 +1067,6 @@
 			if (skb == tcp_send_head(sk))
 				break;
 
-			cached_skb = skb;
-			cached_fack_count = fack_count;
-			if (i == first_sack_index) {
-				tp->fastpath_skb_hint = skb;
-				tp->fastpath_cnt_hint = fack_count;
-			}
-
 			/* The retransmission queue is always in order, so
 			 * we can short-circuit the walk early.
 			 */
diff -ur linux-2.6.22.6/net/ipv4/tcp_ipv4.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_ipv4.c
--- linux-2.6.22.6/net/ipv4/tcp_ipv4.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_ipv4.c	2007-09-13 18:23:16.000000000 -0700
@@ -1845,6 +1845,7 @@
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	tp->write_queue_rb = RB_ROOT;
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
diff -ur linux-2.6.22.6/net/ipv4/tcp_minisocks.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_minisocks.c
--- linux-2.6.22.6/net/ipv4/tcp_minisocks.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_minisocks.c	2007-09-13 18:23:16.000000000 -0700
@@ -421,6 +421,7 @@
 
 		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
+		newtp->write_queue_rb = RB_ROOT;
 		skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = treq->snt_isn + 1;
 		newtp->pushed_seq = newtp->write_seq;
diff -ur linux-2.6.22.6/net/ipv4/tcp_output.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_output.c
--- linux-2.6.22.6/net/ipv4/tcp_output.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_output.c	2007-09-13 18:23:16.000000000 -0700
@@ -1278,11 +1278,11 @@
 	sk_charge_skb(sk, nskb);
 
 	skb = tcp_send_head(sk);
+	TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(skb)->seq;
+	TCP_SKB_CB(nskb)->end_seq = TCP_SKB_CB(skb)->seq + probe_size;
 	tcp_insert_write_queue_before(nskb, skb, sk);
 	tcp_advance_send_head(sk, skb);
 
-	TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(skb)->seq;
-	TCP_SKB_CB(nskb)->end_seq = TCP_SKB_CB(skb)->seq + probe_size;
 	TCP_SKB_CB(nskb)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(nskb)->sacked = 0;
 	nskb->csum = 0;
diff -ur linux-2.6.22.6/net/ipv6/tcp_ipv6.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv6/tcp_ipv6.c
--- linux-2.6.22.6/net/ipv6/tcp_ipv6.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv6/tcp_ipv6.c	2007-09-13 18:23:16.000000000 -0700
@@ -1900,6 +1900,7 @@
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	tp->write_queue_rb = RB_ROOT;
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);


* [PATCH 2/2] David Miller's rbtree patches for 2.6.22.6
  2007-09-20  1:42 ` [PATCH 1/2] " Tom Quetchenbach
@ 2007-09-20  1:44   ` Tom Quetchenbach
  2007-09-21 13:48     ` Ilpo Järvinen
  2007-09-20 21:27   ` [PATCH 1/2 revised] " Tom Quetchenbach
  2007-09-21 13:37   ` [PATCH 1/2] " Ilpo Järvinen
  2 siblings, 1 reply; 7+ messages in thread
From: Tom Quetchenbach @ 2007-09-20  1:44 UTC (permalink / raw)
  To: netdev

Patch 2: fixes to fack_counts and enhancement of SACK fast path

-Tom

---
diff -ur linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h linux-2.6.22.6-rbtree-tomq/include/net/tcp.h
--- linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h	2007-09-19 17:36:07.000000000 -0700
+++ linux-2.6.22.6-rbtree-tomq/include/net/tcp.h	2007-09-19 12:22:06.000000000 -0700
@@ -1213,6 +1213,11 @@
 	sk->sk_send_head = tcp_write_queue_next(sk, skb);
 	if (sk->sk_send_head == (struct sk_buff *)&sk->sk_write_queue)
 		sk->sk_send_head = NULL;
+	else
+		/* update fack_count of send_head. Since we've sent skb already,
+ 		 * its packet count must be set by now. */
+		TCP_SKB_CB(sk->sk_send_head)->fack_count =
+			TCP_SKB_CB(skb)->fack_count + tcp_skb_pcount(skb);
 	/* Don't override Nagle indefinately with F-RTO */
 	if (tp->frto_counter == 2)
 		tp->frto_counter = 3;
@@ -1310,19 +1315,22 @@
 /* An insert into the middle of the write queue causes the fack
  * counts in subsequent packets to become invalid, fix them up.
  */
-static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *first)
+static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *skb)
 {
-	struct sk_buff *prev = first->prev;
+	struct sk_buff *prev = skb->prev;
 	unsigned int fc = 0;
 
 	if (prev != (struct sk_buff *) &sk->sk_write_queue)
 		fc = TCP_SKB_CB(prev)->fack_count + tcp_skb_pcount(prev);
 
-	while (first != (struct sk_buff *)&sk->sk_write_queue) {
-		TCP_SKB_CB(first)->fack_count = fc;
+	while (skb != (struct sk_buff *)&sk->sk_write_queue) {
+		if (TCP_SKB_CB(skb)->fack_count == fc || !tcp_skb_pcount(skb))
+			break;
 
-		fc += tcp_skb_pcount(first);
-		first = first->next;
+		TCP_SKB_CB(skb)->fack_count = fc;
+
+		fc += tcp_skb_pcount(skb);
+		skb = skb->next;
 	}
 }
 
diff -ur linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c linux-2.6.22.6-rbtree-tomq/net/ipv4/tcp_input.c
--- linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c	2007-09-13 18:23:16.000000000 -0700
+++ linux-2.6.22.6-rbtree-tomq/net/ipv4/tcp_input.c	2007-09-19 12:27:42.000000000 -0700
@@ -956,6 +956,7 @@
 	int fack_count_base;
 	int i;
 	int first_sack_index;
+	u32 prev_end_seq = 0;
 
 	if (!tp->sacked_out)
 		tp->fackets_out = 0;
@@ -1000,6 +1001,7 @@
 		if (i == 0) {
 			if (tp->recv_sack_cache[i].start_seq != start_seq)
 				flag = 0;
+			prev_end_seq = ntohl(tp->recv_sack_cache[i].end_seq);
 		} else {
 			if ((tp->recv_sack_cache[i].start_seq != start_seq) ||
 			    (tp->recv_sack_cache[i].end_seq != end_seq))
@@ -1016,9 +1018,16 @@
 
 	first_sack_index = 0;
 	if (flag)
+		/* all that has changed is end of first SACK block. So all we
+ 		 * need to do is tag those skbs that weren't tagged last time. */
 		num_sacks = 1;
 	else {
 		int j;
+		
+		/* more than just end of first SACK block has changed; invalidate
+		 * prev_end_seq */
+
+		prev_end_seq = 0;
 
 		/* order SACK blocks to allow in order walk of the retrans queue */
 		for (i = num_sacks-1; i > 0; i--) {
@@ -1051,6 +1060,8 @@
 		int fack_count;
 		int dup_sack = (found_dup_sack && (i == first_sack_index));
 
+		if (prev_end_seq) start_seq = prev_end_seq;
+
 		skb = tcp_write_queue_find(sk, start_seq);
 		if (!skb)
 			continue;


* [PATCH 1/2 revised] David Miller's rbtree patches for 2.6.22.6
  2007-09-20  1:42 ` [PATCH 1/2] " Tom Quetchenbach
  2007-09-20  1:44   ` [PATCH 2/2] " Tom Quetchenbach
@ 2007-09-20 21:27   ` Tom Quetchenbach
  2007-09-21 13:37   ` [PATCH 1/2] " Ilpo Järvinen
  2 siblings, 0 replies; 7+ messages in thread
From: Tom Quetchenbach @ 2007-09-20 21:27 UTC (permalink / raw)
  To: netdev

> Patch 1: David Miller's red-black tree code, tweaked for 2.6.22.6,
> with some bugfixes

oops...

The original patch can cause a kernel panic if tcp_sacktag_write_queue
is called when the write queue is empty. Use this one instead.

-Tom

---
diff -ur linux-2.6.22.6/include/linux/skbuff.h linux-2.6.22.6-rbtree-davem-fixed/include/linux/skbuff.h
--- linux-2.6.22.6/include/linux/skbuff.h	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/include/linux/skbuff.h	2007-09-13 18:23:16.000000000 -0700
@@ -18,6 +18,7 @@
 #include <linux/compiler.h>
 #include <linux/time.h>
 #include <linux/cache.h>
+#include <linux/rbtree.h>
 
 #include <asm/atomic.h>
 #include <asm/types.h>
@@ -242,6 +243,8 @@
 	struct sk_buff		*next;
 	struct sk_buff		*prev;
 
+	struct rb_node		rb;
+
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
diff -ur linux-2.6.22.6/include/linux/tcp.h linux-2.6.22.6-rbtree-davem-fixed/include/linux/tcp.h
--- linux-2.6.22.6/include/linux/tcp.h	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/include/linux/tcp.h	2007-09-13 18:23:16.000000000 -0700
@@ -174,6 +174,7 @@
 
 #include <linux/skbuff.h>
 #include <linux/dmaengine.h>
+#include <linux/rbtree.h>
 #include <net/sock.h>
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -321,6 +322,7 @@
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
 
+	struct rb_root		write_queue_rb;
 	struct sk_buff_head	out_of_order_queue; /* Out of order segments go here */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
@@ -339,9 +341,7 @@
 	struct sk_buff *scoreboard_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 	struct sk_buff *forward_skb_hint;
-	struct sk_buff *fastpath_skb_hint;
 
-	int     fastpath_cnt_hint;
 	int     lost_cnt_hint;
 	int     retransmit_cnt_hint;
 	int     forward_cnt_hint;
diff -ur linux-2.6.22.6/include/net/tcp.h linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h
--- linux-2.6.22.6/include/net/tcp.h	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h	2007-09-19 17:36:07.000000000 -0700
@@ -540,6 +540,7 @@
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	__u32		when;		/* used to compute rtt's	*/
+	unsigned int	fack_count;	/* speed up SACK processing	*/
 	__u8		flags;		/* TCP header flags.		*/
 
 	/* NOTE: These must match up to the flags byte in a
@@ -1043,12 +1044,12 @@
 }
 
 /*from STCP */
-static inline void clear_all_retrans_hints(struct tcp_sock *tp){
+static inline void clear_all_retrans_hints(struct tcp_sock *tp)
+{
 	tp->lost_skb_hint = NULL;
 	tp->scoreboard_skb_hint = NULL;
 	tp->retransmit_skb_hint = NULL;
 	tp->forward_skb_hint = NULL;
-	tp->fastpath_skb_hint = NULL;
 }
 
 /* MD5 Signature */
@@ -1166,6 +1167,7 @@
 
 	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
 		sk_stream_free_skb(sk, skb);
+	tcp_sk(sk)->write_queue_rb = RB_ROOT;
 	sk_stream_mem_reclaim(sk);
 }
 
@@ -1208,7 +1210,7 @@
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	sk->sk_send_head = skb->next;
+	sk->sk_send_head = tcp_write_queue_next(sk, skb);
 	if (sk->sk_send_head == (struct sk_buff *)&sk->sk_write_queue)
 		sk->sk_send_head = NULL;
 	/* Don't override Nagle indefinately with F-RTO */
@@ -1227,9 +1229,61 @@
 	sk->sk_send_head = NULL;
 }
 
+static inline struct sk_buff *tcp_write_queue_find(struct sock *sk, __u32 seq)
+{
+	struct rb_node *rb_node = tcp_sk(sk)->write_queue_rb.rb_node;
+	struct sk_buff *skb = NULL;
+
+	while (rb_node) {
+		struct sk_buff *tmp = rb_entry(rb_node,struct sk_buff,rb);
+		if (TCP_SKB_CB(tmp)->end_seq > seq) {
+			skb = tmp;
+			if (TCP_SKB_CB(tmp)->seq <= seq)
+				break;
+			rb_node = rb_node->rb_left;
+		} else
+			rb_node = rb_node->rb_right;
+
+	}
+	return skb;
+}
+
+static inline void tcp_rb_insert(struct sk_buff *skb, struct rb_root *root)
+{
+	struct rb_node **rb_link, *rb_parent;
+	__u32 seq = TCP_SKB_CB(skb)->seq;
+
+	rb_link = &root->rb_node;
+	rb_parent = NULL;
+	while (*rb_link != NULL) {
+		struct sk_buff *tmp = rb_entry(*rb_link,struct sk_buff,rb);
+		rb_parent = *rb_link;
+		if (TCP_SKB_CB(tmp)->end_seq > seq) {
+			BUG_ON(TCP_SKB_CB(tmp)->seq <= seq);
+			rb_link = &rb_parent->rb_left;
+		} else {
+			rb_link = &rb_parent->rb_right;
+		}
+	}
+	rb_link_node(&skb->rb, rb_parent, rb_link);
+	rb_insert_color(&skb->rb, root);
+}
+
+static inline void tcp_rb_unlink(struct sk_buff *skb, struct rb_root *root)
+{
+	rb_erase(&skb->rb, root);
+}
+
 static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
 {
+	struct sk_buff *tail = tcp_write_queue_tail(sk);
+	unsigned int fc = 0;
+
+	if (tail)
+		fc = TCP_SKB_CB(tail)->fack_count + tcp_skb_pcount(tail);
+	TCP_SKB_CB(skb)->fack_count = fc;
 	__skb_queue_tail(&sk->sk_write_queue, skb);
+	tcp_rb_insert(skb, &tcp_sk(sk)->write_queue_rb);
 }
 
 static inline void tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
@@ -1241,9 +1295,35 @@
 		sk->sk_send_head = skb;
 }
 
+/* This is only used for tcp_send_synack(), so the write queue should
+ * be empty.  If that stops being true, the fack_count assignment
+ * will need to be more elaborate.
+ */
 static inline void __tcp_add_write_queue_head(struct sock *sk, struct sk_buff *skb)
 {
+	BUG_ON(!skb_queue_empty(&sk->sk_write_queue));
 	__skb_queue_head(&sk->sk_write_queue, skb);
+	TCP_SKB_CB(skb)->fack_count = 0;
+	tcp_rb_insert(skb, &tcp_sk(sk)->write_queue_rb);
+}
+
+/* An insert into the middle of the write queue causes the fack
+ * counts in subsequent packets to become invalid, fix them up.
+ */
+static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *first)
+{
+	struct sk_buff *prev = first->prev;
+	unsigned int fc = 0;
+
+	if (prev != (struct sk_buff *) &sk->sk_write_queue)
+		fc = TCP_SKB_CB(prev)->fack_count + tcp_skb_pcount(prev);
+
+	while (first != (struct sk_buff *)&sk->sk_write_queue) {
+		TCP_SKB_CB(first)->fack_count = fc;
+
+		fc += tcp_skb_pcount(first);
+		first = first->next;
+	}
 }
 
 /* Insert buff after skb on the write queue of sk.  */
@@ -1252,19 +1332,24 @@
 						struct sock *sk)
 {
 	__skb_append(skb, buff, &sk->sk_write_queue);
+	tcp_reset_fack_counts(sk, buff);
+	tcp_rb_insert(buff, &tcp_sk(sk)->write_queue_rb);
 }
 
-/* Insert skb between prev and next on the write queue of sk.  */
+/* Insert new before skb on the write queue of sk.  */
 static inline void tcp_insert_write_queue_before(struct sk_buff *new,
 						  struct sk_buff *skb,
 						  struct sock *sk)
 {
 	__skb_insert(new, skb->prev, skb, &sk->sk_write_queue);
+	tcp_reset_fack_counts(sk, new);
+	tcp_rb_insert(new, &tcp_sk(sk)->write_queue_rb);
 }
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
 	__skb_unlink(skb, &sk->sk_write_queue);
+	tcp_rb_unlink(skb, &tcp_sk(sk)->write_queue_rb);
 }
 
 static inline int tcp_skb_is_last(const struct sock *sk,
diff -ur linux-2.6.22.6/net/ipv4/tcp_input.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c
--- linux-2.6.22.6/net/ipv4/tcp_input.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c	2007-09-20 14:10:26.000000000 -0700
@@ -947,14 +947,13 @@
 	unsigned char *ptr = (skb_transport_header(ack_skb) +
 			      TCP_SKB_CB(ack_skb)->sacked);
 	struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2);
-	struct sk_buff *cached_skb;
 	int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
 	int reord = tp->packets_out;
 	int prior_fackets;
 	u32 lost_retrans = 0;
 	int flag = 0;
 	int found_dup_sack = 0;
-	int cached_fack_count;
+	int fack_count_base;
 	int i;
 	int first_sack_index;
 
@@ -1020,7 +1019,6 @@
 		num_sacks = 1;
 	else {
 		int j;
-		tp->fastpath_skb_hint = NULL;
 
 		/* order SACK blocks to allow in order walk of the retrans queue */
 		for (i = num_sacks-1; i > 0; i--) {
@@ -1045,13 +1043,8 @@
 	/* clear flag as used for different purpose in following code */
 	flag = 0;
 
-	/* Use SACK fastpath hint if valid */
-	cached_skb = tp->fastpath_skb_hint;
-	cached_fack_count = tp->fastpath_cnt_hint;
-	if (!cached_skb) {
-		cached_skb = tcp_write_queue_head(sk);
-		cached_fack_count = 0;
-	}
+	if (likely(tcp_write_queue_head(sk)))
+		fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count;
 
 	for (i=0; i<num_sacks; i++, sp++) {
 		struct sk_buff *skb;
@@ -1060,8 +1053,10 @@
 		int fack_count;
 		int dup_sack = (found_dup_sack && (i == first_sack_index));
 
-		skb = cached_skb;
-		fack_count = cached_fack_count;
+		skb = tcp_write_queue_find(sk, start_seq);
+		if (!skb)
+			continue;
+		fack_count = TCP_SKB_CB(skb)->fack_count - fack_count_base;
 
 		/* Event "B" in the comment above. */
 		if (after(end_seq, tp->high_seq))
@@ -1074,13 +1069,6 @@
 			if (skb == tcp_send_head(sk))
 				break;
 
-			cached_skb = skb;
-			cached_fack_count = fack_count;
-			if (i == first_sack_index) {
-				tp->fastpath_skb_hint = skb;
-				tp->fastpath_cnt_hint = fack_count;
-			}
-
 			/* The retransmission queue is always in order, so
 			 * we can short-circuit the walk early.
 			 */
diff -ur linux-2.6.22.6/net/ipv4/tcp_ipv4.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_ipv4.c
--- linux-2.6.22.6/net/ipv4/tcp_ipv4.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_ipv4.c	2007-09-13 18:23:16.000000000 -0700
@@ -1845,6 +1845,7 @@
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	tp->write_queue_rb = RB_ROOT;
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
diff -ur linux-2.6.22.6/net/ipv4/tcp_minisocks.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_minisocks.c
--- linux-2.6.22.6/net/ipv4/tcp_minisocks.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_minisocks.c	2007-09-13 18:23:16.000000000 -0700
@@ -421,6 +421,7 @@
 
 		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
+		newtp->write_queue_rb = RB_ROOT;
 		skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = treq->snt_isn + 1;
 		newtp->pushed_seq = newtp->write_seq;
diff -ur linux-2.6.22.6/net/ipv4/tcp_output.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_output.c
--- linux-2.6.22.6/net/ipv4/tcp_output.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_output.c	2007-09-13 18:23:16.000000000 -0700
@@ -1278,11 +1278,11 @@
 	sk_charge_skb(sk, nskb);
 
 	skb = tcp_send_head(sk);
+	TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(skb)->seq;
+	TCP_SKB_CB(nskb)->end_seq = TCP_SKB_CB(skb)->seq + probe_size;
 	tcp_insert_write_queue_before(nskb, skb, sk);
 	tcp_advance_send_head(sk, skb);
 
-	TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(skb)->seq;
-	TCP_SKB_CB(nskb)->end_seq = TCP_SKB_CB(skb)->seq + probe_size;
 	TCP_SKB_CB(nskb)->flags = TCPCB_FLAG_ACK;
 	TCP_SKB_CB(nskb)->sacked = 0;
 	nskb->csum = 0;
diff -ur linux-2.6.22.6/net/ipv6/tcp_ipv6.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv6/tcp_ipv6.c
--- linux-2.6.22.6/net/ipv6/tcp_ipv6.c	2007-08-30 23:21:01.000000000 -0700
+++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv6/tcp_ipv6.c	2007-09-13 18:23:16.000000000 -0700
@@ -1900,6 +1900,7 @@
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	tp->write_queue_rb = RB_ROOT;
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);


* Re: [PATCH 1/2] David Miller's rbtree patches for 2.6.22.6
  2007-09-20  1:42 ` [PATCH 1/2] " Tom Quetchenbach
  2007-09-20  1:44   ` [PATCH 2/2] " Tom Quetchenbach
  2007-09-20 21:27   ` [PATCH 1/2 revised] " Tom Quetchenbach
@ 2007-09-21 13:37   ` Ilpo Järvinen
  2 siblings, 0 replies; 7+ messages in thread
From: Ilpo Järvinen @ 2007-09-21 13:37 UTC (permalink / raw)
  To: Tom Quetchenbach; +Cc: netdev

On Wed, 19 Sep 2007, Tom Quetchenbach wrote:

> Patch 1: David Miller's red-black tree code, tweaked for 2.6.22.6,
> with some bugfixes

It would help if you would leave the original changes as is (rb-tree and 
fack_count separated) and add your work on top of that...

> diff -ur linux-2.6.22.6/include/net/tcp.h linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h
> --- linux-2.6.22.6/include/net/tcp.h	2007-08-30 23:21:01.000000000 -0700
> +++ linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h	2007-09-19 17:36:07.000000000 -0700
> @@ -540,6 +540,7 @@
>  	__u32		seq;		/* Starting sequence number	*/
>  	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
>  	__u32		when;		/* used to compute rtt's	*/
> +	unsigned int	fack_count;	/* speed up SACK processing	*/
>  	__u8		flags;		/* TCP header flags.		*/
>  
>  	/* NOTE: These must match up to the flags byte in a
> @@ -1043,12 +1044,12 @@
>  }
>  
>  /*from STCP */
> -static inline void clear_all_retrans_hints(struct tcp_sock *tp){
> +static inline void clear_all_retrans_hints(struct tcp_sock *tp)
> +{

Unrelated change, please don't do that. Besides, it's already fixed in 
net-2.6.24.

>  	tp->lost_skb_hint = NULL;
>  	tp->scoreboard_skb_hint = NULL;
>  	tp->retransmit_skb_hint = NULL;
>  	tp->forward_skb_hint = NULL;
> -	tp->fastpath_skb_hint = NULL;
>  }
>  
>  /* MD5 Signature */
> @@ -1227,9 +1229,61 @@
>  	sk->sk_send_head = NULL;
>  }
>  
> +static inline struct sk_buff *tcp_write_queue_find(struct sock *sk, __u32 seq)
> +{
> +	struct rb_node *rb_node = tcp_sk(sk)->write_queue_rb.rb_node;
> +	struct sk_buff *skb = NULL;
> +
> +	while (rb_node) {
> +		struct sk_buff *tmp = rb_entry(rb_node,struct sk_buff,rb);
> +		if (TCP_SKB_CB(tmp)->end_seq > seq) {

This is an old and buggy version of the rb-tree code. Get the latest
rb-tree patch from the tcp-2.6 tree.

> +			skb = tmp;
> +			if (TCP_SKB_CB(tmp)->seq <= seq)

...fixed in tcp-2.6.

> +				break;
> +			rb_node = rb_node->rb_left;
> +		} else
> +			rb_node = rb_node->rb_right;
> +
> +	}
> +	return skb;
> +}
> +
> +static inline void tcp_rb_insert(struct sk_buff *skb, struct rb_root *root)
> +{
> +	struct rb_node **rb_link, *rb_parent;
> +	__u32 seq = TCP_SKB_CB(skb)->seq;
> +
> +	rb_link = &root->rb_node;
> +	rb_parent = NULL;
> +	while (*rb_link != NULL) {
> +		struct sk_buff *tmp = rb_entry(*rb_link,struct sk_buff,rb);
> +		rb_parent = *rb_link;
> +		if (TCP_SKB_CB(tmp)->end_seq > seq) {
> +			BUG_ON(TCP_SKB_CB(tmp)->seq <= seq);

...these are broken as well.

>
> +			rb_link = &rb_parent->rb_left;
> +		} else {
> +			rb_link = &rb_parent->rb_right;
> +		}
> +	}
> +	rb_link_node(&skb->rb, rb_parent, rb_link);
> +	rb_insert_color(&skb->rb, root);
> +}
> +
> +static inline void tcp_rb_unlink(struct sk_buff *skb, struct rb_root *root)
> +{
> +	rb_erase(&skb->rb, root);
> +}
> +
>  static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
>  {
> +	struct sk_buff *tail = tcp_write_queue_tail(sk);
> +	unsigned int fc = 0;
> +
> +	if (tail)
> +		fc = TCP_SKB_CB(tail)->fack_count + tcp_skb_pcount(tail);
> +	TCP_SKB_CB(skb)->fack_count = fc;
>  	__skb_queue_tail(&sk->sk_write_queue, skb);
> +	tcp_rb_insert(skb, &tcp_sk(sk)->write_queue_rb);
>  }
>  
>  static inline void tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
> diff -ur linux-2.6.22.6/net/ipv4/tcp_input.c linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c
> --- linux-2.6.22.6/net/ipv4/tcp_input.c	2007-08-30 23:21:01.000000000 -0700
> +++ linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c	2007-09-13 18:23:16.000000000 -0700
> @@ -947,14 +947,13 @@
>  	unsigned char *ptr = (skb_transport_header(ack_skb) +
>  			      TCP_SKB_CB(ack_skb)->sacked);
>  	struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2);
> -	struct sk_buff *cached_skb;
>  	int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
>  	int reord = tp->packets_out;
>  	int prior_fackets;
>  	u32 lost_retrans = 0;
>  	int flag = 0;
>  	int found_dup_sack = 0;
> -	int cached_fack_count;
> +	int fack_count_base;
>  	int i;
>  	int first_sack_index;
>  
> @@ -1020,7 +1019,6 @@
>  		num_sacks = 1;
>  	else {
>  		int j;
> -		tp->fastpath_skb_hint = NULL;
>  
>  		/* order SACK blocks to allow in order walk of the retrans queue */
>  		for (i = num_sacks-1; i > 0; i--) {
> @@ -1045,14 +1043,7 @@
>  	/* clear flag as used for different purpose in following code */
>  	flag = 0;
>  
> -	/* Use SACK fastpath hint if valid */
> -	cached_skb = tp->fastpath_skb_hint;
> -	cached_fack_count = tp->fastpath_cnt_hint;
> -	if (!cached_skb) {
> -		cached_skb = tcp_write_queue_head(sk);
> -		cached_fack_count = 0;
> -	}
> -
> +	fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count;
>  	for (i=0; i<num_sacks; i++, sp++) {
>  		struct sk_buff *skb;
>  		__u32 start_seq = ntohl(sp->start_seq);
> @@ -1060,8 +1051,10 @@
>  		int fack_count;
>  		int dup_sack = (found_dup_sack && (i == first_sack_index));
>  
> -		skb = cached_skb;
> -		fack_count = cached_fack_count;
> +		skb = tcp_write_queue_find(sk, start_seq);
> +		if (!skb)
> +			continue;

In net-2.6.24 we validate SACK blocks early. ...This is not a working
solution anyway, since tcp_write_queue_find(end_seq) might still be valid,
don't you think (though with the validator it should only happen when
dup_sack is set)? For non-DSACK cases, a better alternative is something
like this:

if (WARN_ON(skb == NULL))
	continue;

> +		fack_count = TCP_SKB_CB(skb)->fack_count - fack_count_base;
>  
>  		/* Event "B" in the comment above. */
>  		if (after(end_seq, tp->high_seq))

-- 
 i.


* Re: [PATCH 2/2] David Miller's rbtree patches for 2.6.22.6
  2007-09-20  1:44   ` [PATCH 2/2] " Tom Quetchenbach
@ 2007-09-21 13:48     ` Ilpo Järvinen
  0 siblings, 0 replies; 7+ messages in thread
From: Ilpo Järvinen @ 2007-09-21 13:48 UTC (permalink / raw)
  To: Tom Quetchenbach; +Cc: Netdev

On Wed, 19 Sep 2007, Tom Quetchenbach wrote:

> Patch 2: fixes to fack_counts and enhancement of SACK fast path

...Usually these are not combined into one patch; there should be a
separate patch per change.

> diff -ur linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h linux-2.6.22.6-rbtree-tomq/include/net/tcp.h
> --- linux-2.6.22.6-rbtree-davem-fixed/include/net/tcp.h	2007-09-19 17:36:07.000000000 -0700
> +++ linux-2.6.22.6-rbtree-tomq/include/net/tcp.h	2007-09-19 12:22:06.000000000 -0700
> @@ -1213,6 +1213,11 @@

...Btw, please always pass -p to diff as well, as it helps review :-).

>  	sk->sk_send_head = tcp_write_queue_next(sk, skb);
>  	if (sk->sk_send_head == (struct sk_buff *)&sk->sk_write_queue)
>  		sk->sk_send_head = NULL;
> +	else
> +		/* update fack_count of send_head. Since we've sent skb already,
> + 		 * its packet count must be set by now. */
> +		TCP_SKB_CB(sk->sk_send_head)->fack_count =
> +			TCP_SKB_CB(skb)->fack_count + tcp_skb_pcount(skb);
>
>  	/* Don't override Nagle indefinately with F-RTO */
>  	if (tp->frto_counter == 2)
>  		tp->frto_counter = 3;
> @@ -1310,19 +1315,22 @@
>  /* An insert into the middle of the write queue causes the fack
>   * counts in subsequent packets to become invalid, fix them up.
>   */
> -static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *first)
> +static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *skb)
>  {
> -	struct sk_buff *prev = first->prev;
> +	struct sk_buff *prev = skb->prev;
>  	unsigned int fc = 0;
>  
>  	if (prev != (struct sk_buff *) &sk->sk_write_queue)
>  		fc = TCP_SKB_CB(prev)->fack_count + tcp_skb_pcount(prev);
>  
> -	while (first != (struct sk_buff *)&sk->sk_write_queue) {
> -		TCP_SKB_CB(first)->fack_count = fc;
> +	while (skb != (struct sk_buff *)&sk->sk_write_queue) {
> +		if (TCP_SKB_CB(skb)->fack_count == fc || !tcp_skb_pcount(skb))

What is this !tcp_skb_pcount(skb) for???? I think what we really want here 
is this: 	|| (skb == tcp_send_head(sk))

> +			break;
>
> -		fc += tcp_skb_pcount(first);
> -		first = first->next;
> +		TCP_SKB_CB(skb)->fack_count = fc;
> +
> +		fc += tcp_skb_pcount(skb);
> +		skb = skb->next;
>  	}
>  }
>  
> diff -ur linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c linux-2.6.22.6-rbtree-tomq/net/ipv4/tcp_input.c
> --- linux-2.6.22.6-rbtree-davem-fixed/net/ipv4/tcp_input.c	2007-09-13 18:23:16.000000000 -0700
> +++ linux-2.6.22.6-rbtree-tomq/net/ipv4/tcp_input.c	2007-09-19 12:27:42.000000000 -0700
> @@ -956,6 +956,7 @@
>  	int fack_count_base;
>  	int i;
>  	int first_sack_index;
> +	u32 prev_end_seq = 0;
>  
>  	if (!tp->sacked_out)
>  		tp->fackets_out = 0;
> @@ -1000,6 +1001,7 @@
>  		if (i == 0) {
>  			if (tp->recv_sack_cache[i].start_seq != start_seq)
>  				flag = 0;
> +			prev_end_seq = ntohl(tp->recv_sack_cache[i].end_seq);
>  		} else {
>  			if ((tp->recv_sack_cache[i].start_seq != start_seq) ||
>  			    (tp->recv_sack_cache[i].end_seq != end_seq))
> @@ -1016,9 +1018,16 @@
>  
>  	first_sack_index = 0;
>  	if (flag)
> +		/* all that has changed is end of first SACK block. So all we
> + 		 * need to do is tag those skbs that weren't tagged last time. */

IMHO, that's a bit verbose for a comment. Besides, we already explain
above what the SACK fastpath is all about...

>  		num_sacks = 1;
>  	else {
>  		int j;
> +		
> +		/* more than just end of first SACK block has changed; invalidate
> +		 * prev_end_seq */
> +
> +		prev_end_seq = 0;

Don't state what's already obvious from the code, either.

>  		/* order SACK blocks to allow in order walk of the retrans queue */
>  		for (i = num_sacks-1; i > 0; i--) {
> @@ -1051,6 +1060,8 @@
>  		int fack_count;
>  		int dup_sack = (found_dup_sack && (i == first_sack_index));
>  
> +		if (prev_end_seq) start_seq = prev_end_seq;
> +

...We never add statements on the same line after if statements (see 
Documentation/CodingStyle)



-- 
 i.


* Re: [PATCH 0/2] David Miller's rbtree patches for 2.6.22.6
  2007-09-20  1:39 [PATCH 0/2] David Miller's rbtree patches for 2.6.22.6 Tom Quetchenbach
  2007-09-20  1:42 ` [PATCH 1/2] " Tom Quetchenbach
@ 2007-09-21 14:08 ` Ilpo Järvinen
  1 sibling, 0 replies; 7+ messages in thread
From: Ilpo Järvinen @ 2007-09-21 14:08 UTC (permalink / raw)
  To: Tom Quetchenbach; +Cc: Netdev

On Wed, 19 Sep 2007, Tom Quetchenbach wrote:

> Here are a couple of patches against 2.6.22.6. The first one is just
> David's patches tweaked for 2.6.22.6, with a couple of minor bugfixes to
> get it to compile and not crash.

Why did you combine the original patches into a single larger one? I think
Dave made them separate on purpose.

> (I also changed
> __tcp_add_write_queue_tail() to set the fack_count of the new packet
> to the fack_count of the tail plus the packet count of the tail, not the
> packet count of the new skb, because I think that's how it was intended
> to be. Right?)

I think I noticed a similar "off-by-pcount" error when I looked at this a
long time ago, so I guess you're correct. We're only interested in the
delta of it anyway, and we add the current skb's pcount to it (which is
not fixed until tcp_fragment in sacktag has run).
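
For example (numbers made up purely to illustrate the delta): if the tail
skb has fack_count 10 and pcount 3, the next skb queued should start at
fack_count 13; adding the new skb's own pcount instead gives the wrong
offset whenever the two pcounts differ.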

> In the second patch there are a couple of significant changes.  One is
> (as Baruch suggested) to modify the existing SACK fast path so that we
> don't tag packets we've already tagged when we advance by a packet.

This solution would still spend an extensive amount of time in the
processing loop whenever the recv_sack_cache fast path is not taken, that
is, e.g. when a cumulative ACK after retransmissions arrives or a new hole
becomes visible (which are not very exceptional events after all :-)). In
the cumulative ACK case especially, this processing is very likely _fully_
wasted walking.

So there is still room for large improvements. I made an improved version
of the current sacktag walk a couple of days ago (it's in a state where it
didn't crash, but it is likely still very buggy); I'll post it here
soon... The idea is to embed the recv_sack_cache checking fully into the
walking loop. By doing that, previously known work is not duplicated. The
patch is currently against non-rbtree stuff, but incorporating the rb-tree
things on top of it should be very trivial, and the synergy benefits with
the rbtree should be considerable, because the non-rbtree code has to do
"fast walk" skipping for skbs that are under highest_sack, which is prone
to cache misses.

> The other issue is that the cached fack_counts seem to be wrong, because
> they're set when we insert into the queue, but tcp_set_tso_segs() is
> called later, just before we send, so all the fack_counts are zero. My
> solution was to set the fack_count when we advance the send_head.

I think it's a better solution anyway, since we might have to do
reset_fack_counts() in between and there's no need to update past
sk_send_head.

> Also I
> changed tcp_reset_fack_counts() so that it exits when it hits an skb
> whose tcp_skb_pcount() is zero

Do you mind explaining what the purpose of that is?

> or whose fack_count is already correct.
> (This really helps when TSO is on, since there's lots of inserting into
> the middle of the queue.)

Good point.

> Please let me know how I can help get this tested and debugged.

Most network development happens against the latest net-2.6(.x) trees.
In addition, there's the experimental tcp-2.6 tree, but it's currently a
bit outdated (and DaveM is very busy with the phenomenal merge they're
doing for 2.6.24 :-), so it's not too likely to be updated very soon).

...Anyway, thanks for your interest in these things.


-- 
 i.

