[RFC PATCH net-2.6.24 0/3]: Attempt to fix lost

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness
@ 2007-10-09 12:19 Ilpo Järvinen
  2007-10-09 12:20 ` [PATCH] [TCP]: Separate lost_retrans loop into own function Ilpo Järvinen
  2007-10-09 13:03 ` [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
  0 siblings, 2 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2007-10-09 12:19 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Stephen Hemminger, TAKANO Ryousei

Lost_retrans handling of sacktag was found to be flawed, two
problems that were found have an intertwined solution. Fastpath
problem has existed since hints got added and the other problem
has probably been there even longer than that. ...This change
may add non-trivial processing cost.

Initial sketch, only compile tested. This will become more and
more useful, when sacktag starts to process less and less skbs,
which hopefully happens quite soon... :-) Sadly enough it will
probably then be consuming part of the benefits we're able to
achieve by less skb walking...

First one is trivial, so Dave might want to apply it already.

--
 i.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] [TCP]: Separate lost_retrans loop into own function
  2007-10-09 12:19 [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
@ 2007-10-09 12:20 ` Ilpo Järvinen
  2007-10-09 12:20   ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Ilpo Järvinen
  2007-10-10  9:45   ` [PATCH] [TCP]: Separate lost_retrans loop into own function David Miller
  2007-10-09 13:03 ` [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
  1 sibling, 2 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2007-10-09 12:20 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Stephen Hemminger, TAKANO Ryousei

Follows own function for each task principle, this is really
somewhat separate task being done in sacktag. Also reduces
indentation.

In addition, added ack_seq local var to break some long
lines & fixed coding style things.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   80 +++++++++++++++++++++++++++-----------------------
 1 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a04e78e..56aa34a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1106,6 +1106,47 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
 	return !before(start_seq, end_seq - tp->max_window);
 }
 
+/* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
+ * Event "C". Later note: FACK people cheated me again 8), we have to account
+ * for reordering! Ugly, but should help.
+ */
+static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	int flag = 0;
+
+	tcp_for_write_queue(skb, sk) {
+		u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
+
+		if (skb == tcp_send_head(sk))
+			break;
+		if (after(TCP_SKB_CB(skb)->seq, lost_retrans))
+			break;
+		if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
+			continue;
+
+		if ((TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) &&
+		    after(lost_retrans, ack_seq) &&
+		    (tcp_is_fack(tp) ||
+		     !before(lost_retrans,
+			     ack_seq + tp->reordering * tp->mss_cache))) {
+			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
+			tp->retrans_out -= tcp_skb_pcount(skb);
+
+			/* clear lost hint */
+			tp->retransmit_skb_hint = NULL;
+
+			if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
+				tp->lost_out += tcp_skb_pcount(skb);
+				TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
+				flag |= FLAG_DATA_SACKED;
+				NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
+			}
+		}
+	}
+	return flag;
+}
 
 static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb,
 			   struct tcp_sack_block_wire *sp, int num_sacks,
@@ -1422,43 +1463,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		}
 	}
 
-	/* Check for lost retransmit. This superb idea is
-	 * borrowed from "ratehalving". Event "C".
-	 * Later note: FACK people cheated me again 8),
-	 * we have to account for reordering! Ugly,
-	 * but should help.
-	 */
-	if (lost_retrans && icsk->icsk_ca_state == TCP_CA_Recovery) {
-		struct sk_buff *skb;
-
-		tcp_for_write_queue(skb, sk) {
-			if (skb == tcp_send_head(sk))
-				break;
-			if (after(TCP_SKB_CB(skb)->seq, lost_retrans))
-				break;
-			if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
-				continue;
-			if ((TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_RETRANS) &&
-			    after(lost_retrans, TCP_SKB_CB(skb)->ack_seq) &&
-			    (tcp_is_fack(tp) ||
-			     !before(lost_retrans,
-				     TCP_SKB_CB(skb)->ack_seq + tp->reordering *
-				     tp->mss_cache))) {
-				TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
-				tp->retrans_out -= tcp_skb_pcount(skb);
-
-				/* clear lost hint */
-				tp->retransmit_skb_hint = NULL;
-
-				if (!(TCP_SKB_CB(skb)->sacked&(TCPCB_LOST|TCPCB_SACKED_ACKED))) {
-					tp->lost_out += tcp_skb_pcount(skb);
-					TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
-					flag |= FLAG_DATA_SACKED;
-					NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
-				}
-			}
-		}
-	}
+	if (lost_retrans && icsk->icsk_ca_state == TCP_CA_Recovery)
+		flag |= tcp_mark_lost_retrans(sk, lost_retrans);
 
 	tcp_verify_left_out(tp);
 
-- 
1.5.0.6


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
  2007-10-09 12:20 ` [PATCH] [TCP]: Separate lost_retrans loop into own function Ilpo Järvinen
@ 2007-10-09 12:20   ` Ilpo Järvinen
  2007-10-09 12:20     ` [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases Ilpo Järvinen
  2007-10-11  1:55     ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems TAKANO Ryousei
  2007-10-10  9:45   ` [PATCH] [TCP]: Separate lost_retrans loop into own function David Miller
  1 sibling, 2 replies; 29+ messages in thread
From: Ilpo Järvinen @ 2007-10-09 12:20 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Stephen Hemminger, TAKANO Ryousei

Detection implemented with lost_retrans must work also when
fastpath is taken, yet most of the queue is skipped including
(very likely) those retransmitted skb's we're interested in.
This problem appeared when the hints got added, which removed
a need to always walk over the whole write queue head.
Therefore decicion for the lost_retrans worker loop entry must
be separated from the sacktag processing more than it was
necessary before.

It turns out to be problematic to optimize the worker loop
very heavily because ack_seqs of skb may have a number of
discontinuity points. Maybe similar approach as currently is
implemented could be attempted but that's becoming more and
more complex because the trend is towards less skb walking
in sacktag marker.

Maybe after(highest_sack_end_seq, tp->high_seq) checking is not
sufficiently accurate and causes entry too often in no-work-to-do
cases. Since that's not known, I've separated solution to that
from this patch.

Noticed because of report against a related problem from TAKANO
Ryousei <takano@axe-inc.co.jp>. He also provided a patch to
that part of the problem. This patch includes solution to it
(though this patch has to use somewhat different placement).
TAKANO's description and patch is available here:

  http://marc.info/?l=linux-netdev&m=119149311913288&w=2

...In short, TAKANO's problem is that end_seq the loop is using
not necessarily the largest SACK block's end_seq because the
current ACK may still have higher SACK blocks which are later
by the loop.

Cc: TAKANO Ryousei <takano@axe-inc.co.jp>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   37 ++++++++++++++++++++++---------------
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 56aa34a..b90c2fc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1109,27 +1109,34 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
 /* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
  * Event "C". Later note: FACK people cheated me again 8), we have to account
  * for reordering! Ugly, but should help.
+ *
+ * Search retransmitted skbs from write_queue that were sent when snd_nxt was
+ * less than what is now known to be received by the other end (derived from
+ * SACK blocks by the caller).
  */
-static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans)
+static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int flag = 0;
+	int cnt = 0;
 
 	tcp_for_write_queue(skb, sk) {
 		u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
 
 		if (skb == tcp_send_head(sk))
 			break;
-		if (after(TCP_SKB_CB(skb)->seq, lost_retrans))
+		if (cnt == tp->retrans_out)
 			break;
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			continue;
 
-		if ((TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) &&
-		    after(lost_retrans, ack_seq) &&
+		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS))
+			continue;
+
+		if (after(received_upto, ack_seq) &&
 		    (tcp_is_fack(tp) ||
-		     !before(lost_retrans,
+		     !before(received_upto,
 			     ack_seq + tp->reordering * tp->mss_cache))) {
 			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
 			tp->retrans_out -= tcp_skb_pcount(skb);
@@ -1143,6 +1150,8 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans)
 				flag |= FLAG_DATA_SACKED;
 				NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
 			}
+		} else {
+			cnt += tcp_skb_pcount(skb);
 		}
 	}
 	return flag;
@@ -1193,7 +1202,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 	int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
 	int reord = tp->packets_out;
 	int prior_fackets;
-	u32 lost_retrans = 0;
+	u32 highest_sack_end_seq = 0;
 	int flag = 0;
 	int found_dup_sack = 0;
 	int cached_fack_count;
@@ -1383,11 +1392,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 				continue;
 			}
 
-			if ((sacked&TCPCB_SACKED_RETRANS) &&
-			    after(end_seq, TCP_SKB_CB(skb)->ack_seq) &&
-			    (!lost_retrans || after(end_seq, lost_retrans)))
-				lost_retrans = end_seq;
-
 			if (!in_sack)
 				continue;
 
@@ -1441,9 +1445,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 				if (fack_count > tp->fackets_out)
 					tp->fackets_out = fack_count;
 
-				if (after(TCP_SKB_CB(skb)->seq,
-				    tp->highest_sack))
+				if (after(TCP_SKB_CB(skb)->seq, tp->highest_sack)) {
 					tp->highest_sack = TCP_SKB_CB(skb)->seq;
+					highest_sack_end_seq = TCP_SKB_CB(skb)->end_seq;
+				}
 			} else {
 				if (dup_sack && (sacked&TCPCB_RETRANS))
 					reord = min(fack_count, reord);
@@ -1463,8 +1468,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		}
 	}
 
-	if (lost_retrans && icsk->icsk_ca_state == TCP_CA_Recovery)
-		flag |= tcp_mark_lost_retrans(sk, lost_retrans);
+	if (tp->retrans_out && highest_sack_end_seq &&
+	    after(highest_sack_end_seq, tp->high_seq) &&
+	    icsk->icsk_ca_state == TCP_CA_Recovery)
+		flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq);
 
 	tcp_verify_left_out(tp);
 
-- 
1.5.0.6


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases
  2007-10-09 12:20   ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Ilpo Järvinen
@ 2007-10-09 12:20     ` Ilpo Järvinen
  2007-10-10  9:44       ` David Miller
  2007-10-11  1:55     ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems TAKANO Ryousei
  1 sibling, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2007-10-09 12:20 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Stephen Hemminger, TAKANO Ryousei

This addition of lost_retrans_low to tcp_sock might be
unnecessary, it's not clear how often lost_retrans worker is
executed when there wasn't work to do.

Cc: TAKANO Ryousei <takano@axe-inc.co.jp>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/tcp.h   |    2 ++
 net/ipv4/tcp_input.c  |   15 +++++++++++----
 net/ipv4/tcp_output.c |    2 ++
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9ff456e..c5b94c1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -348,6 +348,8 @@ struct tcp_sock {
 	int     lost_cnt_hint;
 	int     retransmit_cnt_hint;
 
+	u32	lost_retrans_low;	/* Sent seq after any rxmit (lowest) */
+
 	u16	advmss;		/* Advertised MSS			*/
 	u16	prior_ssthresh; /* ssthresh saved at recovery start	*/
 	u32	lost_out;	/* Lost packets			*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b90c2fc..e7bc720 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1112,7 +1112,8 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
  *
  * Search retransmitted skbs from write_queue that were sent when snd_nxt was
  * less than what is now known to be received by the other end (derived from
- * SACK blocks by the caller).
+ * SACK blocks by the caller). Also calculate the lowest snd_nxt among the
+ * remaining retransmitted skbs to avoid some costly processing per ACKs.
  */
 static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
 {
@@ -1120,6 +1121,7 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
 	struct sk_buff *skb;
 	int flag = 0;
 	int cnt = 0;
+	u32 new_low_seq = 0;
 
 	tcp_for_write_queue(skb, sk) {
 		u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
@@ -1150,10 +1152,15 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
 				flag |= FLAG_DATA_SACKED;
 				NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
 			}
-		} else {
+		} else if (!new_low_seq || before(ack_seq, new_low_seq)) {
+			new_low_seq = ack_seq;
 			cnt += tcp_skb_pcount(skb);
 		}
 	}
+
+	if (tp->retrans_out)
+		tp->lost_retrans_low = new_low_seq;
+
 	return flag;
 }
 
@@ -1468,8 +1475,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		}
 	}
 
-	if (tp->retrans_out && highest_sack_end_seq &&
-	    after(highest_sack_end_seq, tp->high_seq) &&
+	if (tp->retrans_out &&
+	    after(highest_sack_end_seq, tp->lost_retrans_low) &&
 	    icsk->icsk_ca_state == TCP_CA_Recovery)
 		flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5329675..324b420 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1914,6 +1914,8 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 				printk(KERN_DEBUG "retrans_out leaked.\n");
 		}
 #endif
+		if (!tp->retrans_out)
+			tp->lost_retrans_low = tp->snd_nxt;
 		TCP_SKB_CB(skb)->sacked |= TCPCB_RETRANS;
 		tp->retrans_out += tcp_skb_pcount(skb);
 
-- 
1.5.0.6


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness
  2007-10-09 12:19 [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
  2007-10-09 12:20 ` [PATCH] [TCP]: Separate lost_retrans loop into own function Ilpo Järvinen
@ 2007-10-09 13:03 ` Ilpo Järvinen
  2007-10-10  9:48   ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2007-10-09 13:03 UTC (permalink / raw)
  To: Netdev; +Cc: David Miller, Stephen Hemminger, TAKANO Ryousei

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1261 bytes --]

On Tue, 9 Oct 2007, Ilpo Järvinen wrote:

> Lost_retrans handling of sacktag was found to be flawed, two
> problems that were found have an intertwined solution. Fastpath
> problem has existed since hints got added and the other problem
> has probably been there even longer than that. ...This change
> may add non-trivial processing cost.
> 
> Initial sketch, only compile tested. This will become more and
> more useful, when sacktag starts to process less and less skbs,
> which hopefully happens quite soon... :-) Sadly enough it will
> probably then be consuming part of the benefits we're able to
> achieve by less skb walking...
> 
> First one is trivial, so Dave might want to apply it already.

Hmm, forgot to add -n to git-format-patch. Since it's currently
RFC, I won't bother to resubmit with numbers unless somebody
really wants that. Here's the correct ordering, if it's not
obvious from the patches alone:

$ git-log -n 3 --pretty=oneline HEAD | cat
c628f5040cf31323f9166d41cb46070aa7be8cc5 [TCP]: Limit processing lost_retrans loop to work-to-do cases
b7388980b7bfa1b9159a65a81b9501df43c67b08 [TCP]: Fix lost_retrans loop vs fastpath problems
a1c39117034c58bf786a5941dd7e9a5968439007 [TCP]: Separate lost_retrans loop into own function


-- 
 i.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases
  2007-10-09 12:20     ` [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases Ilpo Järvinen
@ 2007-10-10  9:44       ` David Miller
  2007-10-10  9:44         ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2007-10-10  9:44 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, shemminger, takano

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue,  9 Oct 2007 15:20:02 +0300

> This addition of lost_retrans_low to tcp_sock might be
> unnecessary, it's not clear how often lost_retrans worker is
> executed when there wasn't work to do.
> 
> Cc: TAKANO Ryousei <takano@axe-inc.co.jp>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

I wanted to apply this, but it doesn't go cleanly on top of
net-2.6.24, can you respin this patch for me?

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases
  2007-10-10  9:44       ` David Miller
@ 2007-10-10  9:44         ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-10  9:44 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, shemminger, takano

From: David Miller <davem@davemloft.net>
Date: Wed, 10 Oct 2007 02:44:03 -0700 (PDT)

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue,  9 Oct 2007 15:20:02 +0300
> 
> > This addition of lost_retrans_low to tcp_sock might be
> > unnecessary, it's not clear how often lost_retrans worker is
> > executed when there wasn't work to do.
> > 
> > Cc: TAKANO Ryousei <takano@axe-inc.co.jp>
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> 
> I wanted to apply this, but it doesn't go cleanly on top of
> net-2.6.24, can you respin this patch for me?

Nevermind, I mis-interpreted the ordering of the 3 patches,
sorry...

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] [TCP]: Separate lost_retrans loop into own function
  2007-10-09 12:20 ` [PATCH] [TCP]: Separate lost_retrans loop into own function Ilpo Järvinen
  2007-10-09 12:20   ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Ilpo Järvinen
@ 2007-10-10  9:45   ` David Miller
  1 sibling, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-10  9:45 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, shemminger, takano

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue,  9 Oct 2007 15:20:00 +0300

> Follows own function for each task principle, this is really
> somewhat separate task being done in sacktag. Also reduces
> indentation.
> 
> In addition, added ack_seq local var to break some long
> lines & fixed coding style things.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied, thanks Ilpo!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness
  2007-10-09 13:03 ` [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
@ 2007-10-10  9:48   ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-10  9:48 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, shemminger, takano

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 9 Oct 2007 16:03:29 +0300 (EEST)

> On Tue, 9 Oct 2007, Ilpo Järvinen wrote:
> 
> > Lost_retrans handling of sacktag was found to be flawed, two
> > problems that were found have an intertwined solution. Fastpath
> > problem has existed since hints got added and the other problem
> > has probably been there even longer than that. ...This change
> > may add non-trivial processing cost.
> > 
> > Initial sketch, only compile tested. This will become more and
> > more useful, when sacktag starts to process less and less skbs,
> > which hopefully happens quite soon... :-) Sadly enough it will
> > probably then be consuming part of the benefits we're able to
> > achieve by less skb walking...
> > 
> > First one is trivial, so Dave might want to apply it already.
> 
> Hmm, forgot to add -n to git-format-patch. Since it's currently
> RFC, I won't bother to resubmit with numbers unless somebody
> really wants that. Here's the correct ordering, if it's not
> obvious from the patches alone:

I'm going to leave the 2nd patches and 3rd patches alone for
now so they can cook a little bit longer.

Thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
  2007-10-09 12:20   ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Ilpo Järvinen
  2007-10-09 12:20     ` [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases Ilpo Järvinen
@ 2007-10-11  1:55     ` TAKANO Ryousei
  2007-10-11 10:12       ` Ilpo Järvinen
  1 sibling, 1 reply; 29+ messages in thread
From: TAKANO Ryousei @ 2007-10-11  1:55 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev, davem, shemminger

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Subject: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
Date: Tue,  9 Oct 2007 15:20:01 +0300

> Detection implemented with lost_retrans must work also when
> fastpath is taken, yet most of the queue is skipped including
> (very likely) those retransmitted skb's we're interested in.
> This problem appeared when the hints got added, which removed
> a need to always walk over the whole write queue head.
> Therefore decicion for the lost_retrans worker loop entry must
> be separated from the sacktag processing more than it was
> necessary before.
> 
> It turns out to be problematic to optimize the worker loop
> very heavily because ack_seqs of skb may have a number of
> discontinuity points. Maybe similar approach as currently is
> implemented could be attempted but that's becoming more and
> more complex because the trend is towards less skb walking
> in sacktag marker.
> 
> Maybe after(highest_sack_end_seq, tp->high_seq) checking is not
> sufficiently accurate and causes entry too often in no-work-to-do
> cases. Since that's not known, I've separated solution to that
> from this patch.
> 
> Noticed because of report against a related problem from TAKANO
> Ryousei <takano@axe-inc.co.jp>. He also provided a patch to
> that part of the problem. This patch includes solution to it
> (though this patch has to use somewhat different placement).
> TAKANO's description and patch is available here:
> 
>   http://marc.info/?l=linux-netdev&m=119149311913288&w=2
> 
> ...In short, TAKANO's problem is that end_seq the loop is using
> not necessarily the largest SACK block's end_seq because the
> current ACK may still have higher SACK blocks which are later
> by the loop.
> 
Thanks Ilpo!  I am trying to evaluate this patch. But, I got 
a kernel panic at net_rx_action() in our experimental setting.
I am using the net-2.6.24 kernel _without_ this patch. 
(I will post a bug report separately).
Anyway, I will report the result as soon as possible.

Ryousei Takano

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
  2007-10-11  1:55     ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems TAKANO Ryousei
@ 2007-10-11 10:12       ` Ilpo Järvinen
  2007-10-11 13:51         ` Regression in net-2.6.24? TAKANO Ryousei
  0 siblings, 1 reply; 29+ messages in thread
From: Ilpo Järvinen @ 2007-10-11 10:12 UTC (permalink / raw)
  To: TAKANO Ryousei; +Cc: Netdev, David Miller, Stephen Hemminger

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1009 bytes --]

On Thu, 11 Oct 2007, TAKANO Ryousei wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Subject: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
> Date: Tue,  9 Oct 2007 15:20:01 +0300
> 
> Thanks Ilpo!  I am trying to evaluate this patch.

There's a minor problem in this 2nd patch, it's just preventing the 
cnt == tp->retrans_out short-circuit from working, not a correctness 
problem though it could affect the performance. I'll post larger patch 
series among which is a fixed version (hopefully today).

> But, I got 
> a kernel panic at net_rx_action() in our experimental setting.
> I am using the net-2.6.24 kernel _without_ this patch. 
> (I will post a bug report separately).

...Please do. :-)

> Anyway, I will report the result as soon as possible.

Thanks. ...It's very interesting to see because it's not that clear
cut how the extra processing that is necessary affects high-speed 
performance, it could add yet another source of RTOs due
processing latency.

-- 
 i.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Regression in net-2.6.24?
  2007-10-11 10:12       ` Ilpo Järvinen
@ 2007-10-11 13:51         ` TAKANO Ryousei
  2007-10-11 23:48           ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: TAKANO Ryousei @ 2007-10-11 13:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, ilpo.jarvinen, shemminger

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Subject: Re: [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
Date: Thu, 11 Oct 2007 13:12:49 +0300 (EEST)

> > But, I got 
> > a kernel panic at net_rx_action() in our experimental setting.
> > I am using the net-2.6.24 kernel _without_ this patch. 
> > (I will post a bug report separately).
> 
> ...Please do. :-)
> 
I have got a kernel panic, when I run the Iperf benchmark in the following
setting. Each node has x86_64 dual processors and 2 tg3 NICs (BCM95704A6).
If the router does not regulate the bandwidth, a kernel panic does not occur.

Node A ----> Router -------> Delay -------> Node B
            (Policing rate:  emulator
   	     500Mbps)        (RTT: 20ms)


Ggit-bisect told me that the following commit causes a regression:

commit bea3348eef27e6044b6161fd04c3152215f96411
Author: Stephen Hemminger <shemminger@linux-foundation.org>
Date:   Wed Oct 3 16:41:36 2007 -0700

    [NET]: Make NAPI polling independent of struct net_device objects.


Here is Oops message:

Unable to handle kernel paging request at 0000000000100108 RIP: 
 [<ffffffff80421d59>] net_rx_action+0x169/0x1c0
PGD f384d067 PUD 0 
Oops: 0002 [1] SMP 
CPU 0 
Modules linked in: 8021q tcp_bic netconsole evdev joydev sg st sr_mod ohci_hcd i2c_amd756 i2c_amd8111 i2c_core ipv6 tg3 usbhid usbcore ff_memless dm_mod ext3 jbd sata_sil libata sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.23-rc9 #2
RIP: 0010:[<ffffffff80421d59>]  [<ffffffff80421d59>] net_rx_action+0x169/0x1c0
RSP: 0018:ffffffff80847eb0  EFLAGS: 00010046
RAX: 0000000000200200 RBX: ffff8100f63ef750 RCX: 0000000000000246
RDX: 0000000000100100 RSI: ffffc20002e60204 RDI: ffff8100f63ef6c0
RBP: ffffffff80847ef0 R08: ffff810178b72fd8 R09: 0000000000000001
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000040
R13: 0000000000000040 R14: ffff810003f51640 R15: 0000000000000000
FS:  0000000040331960(0000) GS:ffffffff805d7000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000100108 CR3: 00000000f45b6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff806de000, task ffffffff8058b540)
Stack:  ffffffff80847eb0 000000ec80847eb0 00000001000b17ec 0000000000000009
 ffffffff805e10b0 ffffffff8083eee0 0000000000000000 0000000000000009
 ffffffff80847f30 ffffffff8023d935 0000000000000000 0000000000000046
Call Trace:
 <IRQ>  [<ffffffff8023d935>] __do_softirq+0x75/0x100
 [<ffffffff8020cb8c>] call_softirq+0x1c/0x30
 [<ffffffff8020dd89>] do_softirq+0x49/0xb0
 [<ffffffff8023d4e5>] irq_exit+0x45/0x50
 [<ffffffff8020e053>] do_IRQ+0x83/0x100
 [<ffffffff8020a310>] default_idle+0x0/0x50
 [<ffffffff8020bf11>] ret_from_intr+0x0/0xb
 <EOI>  [<ffffffff8020a33d>] default_idle+0x2d/0x50
 [<ffffffff8020aa72>] enter_idle+0x22/0x30
 [<ffffffff8020ab0c>] cpu_idle+0x8c/0xd0
 [<ffffffff804af6ec>] rest_init+0x5c/0x70
 [<ffffffff806e8baf>] start_kernel+0x28f/0x300
 [<ffffffff806e8140>] _sinittext+0x140/0x180


Code: 48 89 42 08 48 89 10 49 8b 46 08 4c 89 33 49 89 5e 08 48 89 
RIP  [<ffffffff80421d59>] net_rx_action+0x169/0x1c0
 RSP <ffffffff80847eb0>
CR2: 0000000000100108
Kernel panic - not syncing: Aiee, killing interrupt handler!


RIP points at __list_del (net_rx_action -> list_move_tail -> __list_del).

$ objdump -S net/core/dev.o | less
	  <snip>
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
     b42:       48 8b 43 08             mov    0x8(%rbx),%rax
     b46:       48 8b 13                mov    (%rbx),%rdx
        next->prev = prev;
     b49:       48 89 42 08             mov    %rax,0x8(%rdx)  <====
        prev->next = next;
     b4d:       48 89 10                mov    %rdx,(%rax)
     b50:       49 8b 46 08             mov    0x8(%r14),%rax
     b54:       4c 89 33                mov    %r14,(%rbx)
     b57:       49 89 5e 08             mov    %rbx,0x8(%r14)
     b5b:       48 89 18                mov    %rbx,(%rax)


What is happen and what information do you need to fix it?

Thanks in advance,
Ryousei Takano

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-11 13:51         ` Regression in net-2.6.24? TAKANO Ryousei
@ 2007-10-11 23:48           ` David Miller
  2007-10-11 23:55             ` Stephen Hemminger
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2007-10-11 23:48 UTC (permalink / raw)
  To: takano; +Cc: netdev, ilpo.jarvinen, shemminger, mchan

From: TAKANO Ryousei <takano@axe-inc.co.jp>
Date: Thu, 11 Oct 2007 22:51:46 +0900 (JST)

> Modules linked in: 8021q tcp_bic netconsole evdev joydev sg st sr_mod ohci_hcd i2c_amd756 i2c_amd8111 i2c_core ipv6 tg3 usbhid usbcore ff_memless dm_mod ext3 jbd sata_sil libata sd_mod scsi_mod
 ...
> RIP points at __list_del (net_rx_action -> list_move_tail -> __list_del).

There is a contract between the driver's ->poll() method and
net_rx_action() in that it is assumed that if the entire quota has
been used up, the driver will not perform a netif_rx_complete().

It seems that in a corner case the tg3 driver, which you appear to be
using, will not abide by this rule.  That corner case is when the card
has exactly "budget" receive packets pending.  In such a case
tg3_has_work() will be false, and we will RX complete when work_done
>= budget, which violates the above mentioned rule.

Can you see if the following test patch makes the crash go away?

Michael, I know you're not pleased with this patch and neither am
I.  It might be better to just strictly RX complete when
(work < budget) and if tg3_has_work(), trigger a HW interrupt.

Alternatively we could loop in tg3_poll() until either budget
is exhausted or tg3_has_work() returns false.  Actually, this sounds
like a cleaner scheme the more I think about it.

BNX2 likely has a similar issue.

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index d2b30fb..8c27962 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3599,7 +3599,7 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 		sblk->status &= ~SD_STATUS_UPDATED;
 
 	/* if no more work, tell net stack and NIC we're done */
-	if (!tg3_has_work(tp)) {
+	if ((work_done < budget) && !tg3_has_work(tp)) {
 		netif_rx_complete(netdev, napi);
 		tg3_restart_ints(tp);
 	}


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-11 23:48           ` David Miller
@ 2007-10-11 23:55             ` Stephen Hemminger
  2007-10-12  0:17               ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2007-10-11 23:55 UTC (permalink / raw)
  To: David Miller; +Cc: takano, netdev, ilpo.jarvinen, mchan

On Thu, 11 Oct 2007 16:48:27 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: TAKANO Ryousei <takano@axe-inc.co.jp>
> Date: Thu, 11 Oct 2007 22:51:46 +0900 (JST)
> 
> > Modules linked in: 8021q tcp_bic netconsole evdev joydev sg st sr_mod ohci_hcd i2c_amd756 i2c_amd8111 i2c_core ipv6 tg3 usbhid usbcore ff_memless dm_mod ext3 jbd sata_sil libata sd_mod scsi_mod
>  ...
> > RIP points at __list_del (net_rx_action -> list_move_tail -> __list_del).
> 
> There is a contract between the driver's ->poll() method and
> net_rx_action() in that it is assumed that if the entire quota has
> been used up, the driver will not perform a netif_rx_complete().
> 
> It seems that in a corner case the tg3 driver, which you appear to be
> using, will not abide by this rule.  That corner case is when the card
> has exactly "budget" receive packets pending.  In such a case
> tg3_has_work() will be false, and we will RX complete when work_done
> >= budget, which violates the above mentioned rule.
> 
> Can you see if the following test patch makes the crash go away?
> 
> Michael, I know you're not pleased with this patch and neither am
> I.  It might be better to just strictly RX complete when
> (work < budget) and if tg3_has_work(), trigger a HW interrupt.
> 
> Alternatively we could loop in tg3_poll() until either budget
> is exhausted or tg3_has_work() returns false.  Actually, this sounds
> like a cleaner scheme the more I think about it.
> 
> BNX2 likely has a similar issue.

sky2 as well.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-11 23:55             ` Stephen Hemminger
@ 2007-10-12  0:17               ` David Miller
  2007-10-12  0:31                 ` Stephen Hemminger
  2007-10-12  1:14                 ` David Miller
  0 siblings, 2 replies; 29+ messages in thread
From: David Miller @ 2007-10-12  0:17 UTC (permalink / raw)
  To: shemminger; +Cc: takano, netdev, ilpo.jarvinen, mchan

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 11 Oct 2007 16:55:48 -0700

> On Thu, 11 Oct 2007 16:48:27 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > Alternatively we could loop in tg3_poll() until either budget
> > is exhausted or tg3_has_work() returns false.  Actually, this sounds
> > like a cleaner scheme the more I think about it.
> > 
> > BNX2 likely has a similar issue.
> 
> sky2 as well.

Thanks for the heads up Stephen.

Here is a patch that implements the looping idea in tg3, bnx2, and
sky2.

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index bbfbdaf..b015d52 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2633,15 +2633,11 @@ bnx2_has_work(struct bnx2 *bp)
 	return 0;
 }
 
-static int
-bnx2_poll(struct napi_struct *napi, int budget)
+static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
 {
-	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
-	struct net_device *dev = bp->dev;
 	struct status_block *sblk = bp->status_blk;
 	u32 status_attn_bits = sblk->status_attn_bits;
 	u32 status_attn_bits_ack = sblk->status_attn_bits_ack;
-	int work_done = 0;
 
 	if ((status_attn_bits & STATUS_ATTN_EVENTS) !=
 	    (status_attn_bits_ack & STATUS_ATTN_EVENTS)) {
@@ -2660,27 +2656,43 @@ bnx2_poll(struct napi_struct *napi, int budget)
 		bnx2_tx_int(bp);
 
 	if (bp->status_blk->status_rx_quick_consumer_index0 != bp->hw_rx_cons)
-		work_done = bnx2_rx_int(bp, budget);
+		work_done += bnx2_rx_int(bp, budget - work_done);
 
-	bp->last_status_idx = bp->status_blk->status_idx;
-	rmb();
+	return work_done;
+}
+
+static int bnx2_poll(struct napi_struct *napi, int budget)
+{
+	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
+	int work_done = 0;
+
+	while (1) {
+		work_done += bnx2_poll_work(bp, work_done, budget);
 
-	if (!bnx2_has_work(bp)) {
-		netif_rx_complete(dev, napi);
-		if (likely(bp->flags & USING_MSI_FLAG)) {
+		if (unlikely(work_done >= budget))
+			break;
+
+		if (likely(!bnx2_has_work(bp))) {
+			bp->last_status_idx = bp->status_blk->status_idx;
+			rmb();
+
+			netif_rx_complete(bp->dev, napi);
+			if (likely(bp->flags & USING_MSI_FLAG)) {
+				REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
+				       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+				       bp->last_status_idx);
+				return 0;
+			}
 			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+			       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
 			       bp->last_status_idx);
-			return 0;
-		}
-		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
-		       bp->last_status_idx);
 
-		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-		       bp->last_status_idx);
+			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
+			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+			       bp->last_status_idx);
+			break;
+		}
 	}
 
 	return work_done;
diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index fe0e756..25da238 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -2605,33 +2605,41 @@ static void sky2_err_intr(struct sky2_hw *hw, u32 status)
 static int sky2_poll(struct napi_struct *napi, int work_limit)
 {
 	struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi);
-	u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
-	int work_done;
+	int work_done = 0;
 
-	if (unlikely(status & Y2_IS_ERROR))
-		sky2_err_intr(hw, status);
+	while (1) {
+		u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
 
-	if (status & Y2_IS_IRQ_PHY1)
-		sky2_phy_intr(hw, 0);
+		if (unlikely(status & Y2_IS_ERROR))
+			sky2_err_intr(hw, status);
 
-	if (status & Y2_IS_IRQ_PHY2)
-		sky2_phy_intr(hw, 1);
+		if (status & Y2_IS_IRQ_PHY1)
+			sky2_phy_intr(hw, 0);
 
-	work_done = sky2_status_intr(hw, work_limit);
+		if (status & Y2_IS_IRQ_PHY2)
+			sky2_phy_intr(hw, 1);
 
-	/* More work? */
-	if (hw->st_idx == sky2_read16(hw, STAT_PUT_IDX)) {
-		/* Bug/Errata workaround?
-		 * Need to kick the TX irq moderation timer.
-		 */
-		if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
-			sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
-			sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
-		}
+		work_done += sky2_status_intr(hw, work_limit - work_done);
 
-		napi_complete(napi);
-		sky2_read32(hw, B0_Y2_SP_LISR);
+		if (unlikely(work_done >= work_limit))
+			break;
+
+		/* More work? */
+		if (likely(hw->st_idx == sky2_read16(hw, STAT_PUT_IDX))) {
+			/* Bug/Errata workaround?
+			 * Need to kick the TX irq moderation timer.
+			 */
+			if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
+				sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
+				sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
+			}
+
+			napi_complete(napi);
+			sky2_read32(hw, B0_Y2_SP_LISR);
+			break;
+		}
 	}
+
 	return work_done;
 }
 
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index d2b30fb..ae08dda 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3555,12 +3555,9 @@ next_pkt_nopost:
 	return received;
 }
 
-static int tg3_poll(struct napi_struct *napi, int budget)
+static int tg3_poll_work(struct tg3 *tp, int work_done, int budget)
 {
-	struct tg3 *tp = container_of(napi, struct tg3, napi);
-	struct net_device *netdev = tp->dev;
 	struct tg3_hw_status *sblk = tp->hw_status;
-	int work_done = 0;
 
 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
@@ -3578,11 +3575,8 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 	/* run TX completion thread */
 	if (sblk->idx[0].tx_consumer != tp->tx_cons) {
 		tg3_tx(tp);
-		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING)) {
-			netif_rx_complete(netdev, napi);
-			schedule_work(&tp->reset_task);
+		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
 			return 0;
-		}
 	}
 
 	/* run RX thread, within the bounds set by NAPI.
@@ -3590,21 +3584,46 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 	 * code synchronizes with tg3->napi.poll()
 	 */
 	if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr)
-		work_done = tg3_rx(tp, budget);
+		work_done += tg3_rx(tp, budget - work_done);
 
-	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
-		tp->last_tag = sblk->status_tag;
-		rmb();
-	} else
-		sblk->status &= ~SD_STATUS_UPDATED;
+	return work_done;
+}
 
-	/* if no more work, tell net stack and NIC we're done */
-	if (!tg3_has_work(tp)) {
-		netif_rx_complete(netdev, napi);
-		tg3_restart_ints(tp);
+static int tg3_poll(struct napi_struct *napi, int budget)
+{
+	struct tg3 *tp = container_of(napi, struct tg3, napi);
+	int work_done = 0;
+
+	while (1) {
+		work_done += tg3_poll_work(tp, work_done, budget);
+
+		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
+			goto tx_recovery;
+
+		if (unlikely(work_done >= budget))
+			break;
+
+		if (likely(!tg3_has_work(tp))) {
+			struct tg3_hw_status *sblk = tp->hw_status;
+
+			if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
+				tp->last_tag = sblk->status_tag;
+				rmb();
+			} else
+				sblk->status &= ~SD_STATUS_UPDATED;
+
+			netif_rx_complete(tp->dev, napi);
+			tg3_restart_ints(tp);
+			break;
+		}
 	}
 
 	return work_done;
+
+tx_recovery:
+	netif_rx_complete(tp->dev, napi);
+	schedule_work(&tp->reset_task);
+	return 0;
 }
 
 static void tg3_irq_quiesce(struct tg3 *tp)

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  0:17               ` David Miller
@ 2007-10-12  0:31                 ` Stephen Hemminger
  2007-10-12  0:40                   ` David Miller
  2007-10-12  1:14                 ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2007-10-12  0:31 UTC (permalink / raw)
  To: David Miller; +Cc: takano, netdev, ilpo.jarvinen, mchan

On Thu, 11 Oct 2007 17:17:43 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Thu, 11 Oct 2007 16:55:48 -0700
> 
> > On Thu, 11 Oct 2007 16:48:27 -0700 (PDT)
> > David Miller <davem@davemloft.net> wrote:
> > 
> > > Alternatively we could loop in tg3_poll() until either budget
> > > is exhausted or tg3_has_work() returns false.  Actually, this sounds
> > > like a cleaner scheme the more I think about it.
> > > 
> > > BNX2 likely has a similar issue.
> > 
> > sky2 as well.
> 
> Thanks for the heads up Stephen.
> 
> Here is a patch that implements the looping idea in tg3, bnx2, and
> sky2.
> 
> diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
> index bbfbdaf..b015d52 100644
> --- a/drivers/net/bnx2.c
> +++ b/drivers/net/bnx2.c
> @@ -2633,15 +2633,11 @@ bnx2_has_work(struct bnx2 *bp)
>  	return 0;
>  }
>  
> -static int
> -bnx2_poll(struct napi_struct *napi, int budget)
> +static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
>  {
> -	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
> -	struct net_device *dev = bp->dev;
>  	struct status_block *sblk = bp->status_blk;
>  	u32 status_attn_bits = sblk->status_attn_bits;
>  	u32 status_attn_bits_ack = sblk->status_attn_bits_ack;
> -	int work_done = 0;
>  
>  	if ((status_attn_bits & STATUS_ATTN_EVENTS) !=
>  	    (status_attn_bits_ack & STATUS_ATTN_EVENTS)) {
> @@ -2660,27 +2656,43 @@ bnx2_poll(struct napi_struct *napi, int budget)
>  		bnx2_tx_int(bp);
>  
>  	if (bp->status_blk->status_rx_quick_consumer_index0 != bp->hw_rx_cons)
> -		work_done = bnx2_rx_int(bp, budget);
> +		work_done += bnx2_rx_int(bp, budget - work_done);
>  
> -	bp->last_status_idx = bp->status_blk->status_idx;
> -	rmb();
> +	return work_done;
> +}
> +
> +static int bnx2_poll(struct napi_struct *napi, int budget)
> +{
> +	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
> +	int work_done = 0;
> +
> +	while (1) {
> +		work_done += bnx2_poll_work(bp, work_done, budget);
>  
> -	if (!bnx2_has_work(bp)) {
> -		netif_rx_complete(dev, napi);
> -		if (likely(bp->flags & USING_MSI_FLAG)) {
> +		if (unlikely(work_done >= budget))
> +			break;
> +
> +		if (likely(!bnx2_has_work(bp))) {
> +			bp->last_status_idx = bp->status_blk->status_idx;
> +			rmb();
> +
> +			netif_rx_complete(bp->dev, napi);
> +			if (likely(bp->flags & USING_MSI_FLAG)) {
> +				REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
> +				       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
> +				       bp->last_status_idx);
> +				return 0;
> +			}
>  			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
>  			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
> +			       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
>  			       bp->last_status_idx);
> -			return 0;
> -		}
> -		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
> -		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
> -		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
> -		       bp->last_status_idx);
>  
> -		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
> -		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
> -		       bp->last_status_idx);
> +			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
> +			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
> +			       bp->last_status_idx);
> +			break;
> +		}
>  	}
>  
>  	return work_done;
> diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
> index fe0e756..25da238 100644
> --- a/drivers/net/sky2.c
> +++ b/drivers/net/sky2.c
> @@ -2605,33 +2605,41 @@ static void sky2_err_intr(struct sky2_hw *hw, u32 status)
>  static int sky2_poll(struct napi_struct *napi, int work_limit)
>  {
>  	struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi);
> -	u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
> -	int work_done;
> +	int work_done = 0;
>  
> -	if (unlikely(status & Y2_IS_ERROR))
> -		sky2_err_intr(hw, status);
> +	while (1) {
> +		u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
>  
> -	if (status & Y2_IS_IRQ_PHY1)
> -		sky2_phy_intr(hw, 0);
> +		if (unlikely(status & Y2_IS_ERROR))
> +			sky2_err_intr(hw, status);
>  
> -	if (status & Y2_IS_IRQ_PHY2)
> -		sky2_phy_intr(hw, 1);
> +		if (status & Y2_IS_IRQ_PHY1)
> +			sky2_phy_intr(hw, 0);
>  
> -	work_done = sky2_status_intr(hw, work_limit);
> +		if (status & Y2_IS_IRQ_PHY2)
> +			sky2_phy_intr(hw, 1);
>  
> -	/* More work? */
> -	if (hw->st_idx == sky2_read16(hw, STAT_PUT_IDX)) {
> -		/* Bug/Errata workaround?
> -		 * Need to kick the TX irq moderation timer.
> -		 */
> -		if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
> -			sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
> -			sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
> -		}
> +		work_done += sky2_status_intr(hw, work_limit - work_done);
>  
> -		napi_complete(napi);
> -		sky2_read32(hw, B0_Y2_SP_LISR);
> +		if (unlikely(work_done >= work_limit))
> +			break;
> +
> +		/* More work? */
> +		if (likely(hw->st_idx == sky2_read16(hw, STAT_PUT_IDX))) {
> +			/* Bug/Errata workaround?
> +			 * Need to kick the TX irq moderation timer.
> +			 */
> +			if (sky2_read8(hw, STAT_TX_TIMER_CTRL) == TIM_START) {
> +				sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
> +				sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
> +			}
> +
> +			napi_complete(napi);
> +			sky2_read32(hw, B0_Y2_SP_LISR);
> +			break;
> +		}
>  	}
> +
>  	return work_done;
>  }

You don't need to re-read the status register and process the PHY irq's inside loop.
Try this:

--- a/drivers/net/sky2.c	2007-10-11 13:16:15.000000000 -0700
+++ b/drivers/net/sky2.c	2007-10-11 17:30:29.000000000 -0700
@@ -2606,7 +2606,7 @@ static int sky2_poll(struct napi_struct 
 {
 	struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi);
 	u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
-	int work_done;
+	int work_done = 0;
 
 	if (unlikely(status & Y2_IS_ERROR))
 		sky2_err_intr(hw, status);
@@ -2617,10 +2617,16 @@ static int sky2_poll(struct napi_struct 
 	if (status & Y2_IS_IRQ_PHY2)
 		sky2_phy_intr(hw, 1);
 
-	work_done = sky2_status_intr(hw, work_limit);
+	for(;;) {
+		work_done += sky2_status_intr(hw, work_limit);
+
+		if (work_done >= work_limit)
+			break;
+
+		/* More work? */
+		if (hw->st_idx != sky2_read16(hw, STAT_PUT_IDX))
+			continue;
 
-	/* More work? */
-	if (hw->st_idx == sky2_read16(hw, STAT_PUT_IDX)) {
 		/* Bug/Errata workaround?
 		 * Need to kick the TX irq moderation timer.
 		 */
@@ -2628,10 +2634,13 @@ static int sky2_poll(struct napi_struct 
 			sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
 			sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
 		}
-
+
 		napi_complete(napi);
 		sky2_read32(hw, B0_Y2_SP_LISR);
+		break;
+
 	}
+
 	return work_done;
 }
 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  0:31                 ` Stephen Hemminger
@ 2007-10-12  0:40                   ` David Miller
  2007-10-12  0:50                     ` Stephen Hemminger
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2007-10-12  0:40 UTC (permalink / raw)
  To: shemminger; +Cc: takano, netdev, ilpo.jarvinen, mchan

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 11 Oct 2007 17:31:49 -0700

> You don't need to re-read the status register and process the PHY irq's inside loop.
> Try this:

Are you sure?  What if a PHY interrupt comes in during the loop?

I'm just preserving the semantics of the driver when ->poll()
is invoked multiple times per interrupt.

And I think preserving that makes sense while we're purely
trying to fix this bug, so we don't add some new ones.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  0:40                   ` David Miller
@ 2007-10-12  0:50                     ` Stephen Hemminger
  2007-10-12  1:00                       ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2007-10-12  0:50 UTC (permalink / raw)
  To: David Miller; +Cc: takano, netdev, ilpo.jarvinen, mchan

On Thu, 11 Oct 2007 17:40:26 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Thu, 11 Oct 2007 17:31:49 -0700
> 
> > You don't need to re-read the status register and process the PHY irq's inside loop.
> > Try this:
> 
> Are you sure?  What if a PHY interrupt comes in during the loop?

The interrupt is level triggered, and will rearm.

> 
> I'm just preserving the semantics of the driver when ->poll()
> is invoked multiple times per interrupt.
> 
> And I think preserving that makes sense while we're purely
> trying to fix this bug, so we don't add some new ones.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  0:50                     ` Stephen Hemminger
@ 2007-10-12  1:00                       ` David Miller
  2007-10-12  1:03                         ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2007-10-12  1:00 UTC (permalink / raw)
  To: shemminger; +Cc: takano, netdev, ilpo.jarvinen, mchan

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 11 Oct 2007 17:50:59 -0700

> On Thu, 11 Oct 2007 17:40:26 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Stephen Hemminger <shemminger@linux-foundation.org>
> > Date: Thu, 11 Oct 2007 17:31:49 -0700
> > 
> > > You don't need to re-read the status register and process the PHY irq's inside loop.
> > > Try this:
> > 
> > Are you sure?  What if a PHY interrupt comes in during the loop?
> 
> The interrupt is level triggered, and will rearm.

Fair enough.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  1:00                       ` David Miller
@ 2007-10-12  1:03                         ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-12  1:03 UTC (permalink / raw)
  To: shemminger; +Cc: takano, netdev, ilpo.jarvinen, mchan

From: David Miller <davem@davemloft.net>
Date: Thu, 11 Oct 2007 18:00:31 -0700 (PDT)

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Thu, 11 Oct 2007 17:50:59 -0700
> 
> > On Thu, 11 Oct 2007 17:40:26 -0700 (PDT)
> > David Miller <davem@davemloft.net> wrote:
> > 
> > > From: Stephen Hemminger <shemminger@linux-foundation.org>
> > > Date: Thu, 11 Oct 2007 17:31:49 -0700
> > > 
> > > > You don't need to re-read the status register and process the PHY irq's inside loop.
> > > > Try this:
> > > 
> > > Are you sure?  What if a PHY interrupt comes in during the loop?
> > 
> > The interrupt is level triggered, and will rearm.
> 
> Fair enough.

Actually, your change isn't right for another reason.

You missed the necessary budget reducing logic that I used
in the original changes.  You need to adjust work_limit
like this:

		work_done += sky2_status_intr(hw, work_limit - work_done);

Otherwise if you just pass plain "work_limit", and we do loop, the
driver uses more quota than it really should.

And I've made that above correction to your patch in my tree.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  0:17               ` David Miller
  2007-10-12  0:31                 ` Stephen Hemminger
@ 2007-10-12  1:14                 ` David Miller
  2007-10-12  1:22                   ` Stephen Hemminger
                                     ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: David Miller @ 2007-10-12  1:14 UTC (permalink / raw)
  To: shemminger; +Cc: takano, netdev, ilpo.jarvinen, mchan


Here is what I'm checking into net-2.6 for now:

commit 6f535763165331bb91277d7519b507fed22034e5
Author: David S. Miller <davem@sunset.davemloft.net>
Date:   Thu Oct 11 18:08:29 2007 -0700

    [NET]: Fix NAPI completion handling in some drivers.
    
    In order for the list handling in net_rx_action() to be
    correct, drivers must follow certain rules as stated by
    this comment in net_rx_action():
    
    		/* Drivers must not modify the NAPI state if they
    		 * consume the entire weight.  In such cases this code
    		 * still "owns" the NAPI instance and therefore can
    		 * move the instance around on the list at-will.
    		 */
    
    A few drivers do not do this because they mix the budget checks
    with reading hardware state, resulting in crashes like the one
    reported by takano@axe-inc.co.jp.
    
    BNX2 and TG3 are taken care of here, SKY2 fix is from Stephen
    Hemminger.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index bbfbdaf..d68acce 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -2633,15 +2633,11 @@ bnx2_has_work(struct bnx2 *bp)
 	return 0;
 }
 
-static int
-bnx2_poll(struct napi_struct *napi, int budget)
+static int bnx2_poll_work(struct bnx2 *bp, int work_done, int budget)
 {
-	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
-	struct net_device *dev = bp->dev;
 	struct status_block *sblk = bp->status_blk;
 	u32 status_attn_bits = sblk->status_attn_bits;
 	u32 status_attn_bits_ack = sblk->status_attn_bits_ack;
-	int work_done = 0;
 
 	if ((status_attn_bits & STATUS_ATTN_EVENTS) !=
 	    (status_attn_bits_ack & STATUS_ATTN_EVENTS)) {
@@ -2660,27 +2656,43 @@ bnx2_poll(struct napi_struct *napi, int budget)
 		bnx2_tx_int(bp);
 
 	if (bp->status_blk->status_rx_quick_consumer_index0 != bp->hw_rx_cons)
-		work_done = bnx2_rx_int(bp, budget);
+		work_done += bnx2_rx_int(bp, budget - work_done);
 
-	bp->last_status_idx = bp->status_blk->status_idx;
-	rmb();
+	return work_done;
+}
+
+static int bnx2_poll(struct napi_struct *napi, int budget)
+{
+	struct bnx2 *bp = container_of(napi, struct bnx2, napi);
+	int work_done = 0;
+
+	while (1) {
+		work_done = bnx2_poll_work(bp, work_done, budget);
 
-	if (!bnx2_has_work(bp)) {
-		netif_rx_complete(dev, napi);
-		if (likely(bp->flags & USING_MSI_FLAG)) {
+		if (unlikely(work_done >= budget))
+			break;
+
+		if (likely(!bnx2_has_work(bp))) {
+			bp->last_status_idx = bp->status_blk->status_idx;
+			rmb();
+
+			netif_rx_complete(bp->dev, napi);
+			if (likely(bp->flags & USING_MSI_FLAG)) {
+				REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
+				       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+				       bp->last_status_idx);
+				return 0;
+			}
 			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
 			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+			       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
 			       bp->last_status_idx);
-			return 0;
-		}
-		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-		       BNX2_PCICFG_INT_ACK_CMD_MASK_INT |
-		       bp->last_status_idx);
 
-		REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
-		       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
-		       bp->last_status_idx);
+			REG_WR(bp, BNX2_PCICFG_INT_ACK_CMD,
+			       BNX2_PCICFG_INT_ACK_CMD_INDEX_VALID |
+			       bp->last_status_idx);
+			break;
+		}
 	}
 
 	return work_done;
diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index fe0e756..4e569fa 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -2606,7 +2606,7 @@ static int sky2_poll(struct napi_struct *napi, int work_limit)
 {
 	struct sky2_hw *hw = container_of(napi, struct sky2_hw, napi);
 	u32 status = sky2_read32(hw, B0_Y2_SP_EISR);
-	int work_done;
+	int work_done = 0;
 
 	if (unlikely(status & Y2_IS_ERROR))
 		sky2_err_intr(hw, status);
@@ -2617,10 +2617,16 @@ static int sky2_poll(struct napi_struct *napi, int work_limit)
 	if (status & Y2_IS_IRQ_PHY2)
 		sky2_phy_intr(hw, 1);
 
-	work_done = sky2_status_intr(hw, work_limit);
+	for(;;) {
+		work_done += sky2_status_intr(hw, work_limit);
+
+		if (work_done >= work_limit)
+			break;
+
+		/* More work? */
+		if (hw->st_idx != sky2_read16(hw, STAT_PUT_IDX))
+			continue;
 
-	/* More work? */
-	if (hw->st_idx == sky2_read16(hw, STAT_PUT_IDX)) {
 		/* Bug/Errata workaround?
 		 * Need to kick the TX irq moderation timer.
 		 */
@@ -2631,7 +2637,10 @@ static int sky2_poll(struct napi_struct *napi, int work_limit)
 
 		napi_complete(napi);
 		sky2_read32(hw, B0_Y2_SP_LISR);
+		break;
+
 	}
+
 	return work_done;
 }
 
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index d2b30fb..a402b5c 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3555,12 +3555,9 @@ next_pkt_nopost:
 	return received;
 }
 
-static int tg3_poll(struct napi_struct *napi, int budget)
+static int tg3_poll_work(struct tg3 *tp, int work_done, int budget)
 {
-	struct tg3 *tp = container_of(napi, struct tg3, napi);
-	struct net_device *netdev = tp->dev;
 	struct tg3_hw_status *sblk = tp->hw_status;
-	int work_done = 0;
 
 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
@@ -3578,11 +3575,8 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 	/* run TX completion thread */
 	if (sblk->idx[0].tx_consumer != tp->tx_cons) {
 		tg3_tx(tp);
-		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING)) {
-			netif_rx_complete(netdev, napi);
-			schedule_work(&tp->reset_task);
+		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
 			return 0;
-		}
 	}
 
 	/* run RX thread, within the bounds set by NAPI.
@@ -3590,21 +3584,46 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 	 * code synchronizes with tg3->napi.poll()
 	 */
 	if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr)
-		work_done = tg3_rx(tp, budget);
+		work_done += tg3_rx(tp, budget - work_done);
 
-	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
-		tp->last_tag = sblk->status_tag;
-		rmb();
-	} else
-		sblk->status &= ~SD_STATUS_UPDATED;
+	return work_done;
+}
 
-	/* if no more work, tell net stack and NIC we're done */
-	if (!tg3_has_work(tp)) {
-		netif_rx_complete(netdev, napi);
-		tg3_restart_ints(tp);
+static int tg3_poll(struct napi_struct *napi, int budget)
+{
+	struct tg3 *tp = container_of(napi, struct tg3, napi);
+	int work_done = 0;
+
+	while (1) {
+		work_done = tg3_poll_work(tp, work_done, budget);
+
+		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
+			goto tx_recovery;
+
+		if (unlikely(work_done >= budget))
+			break;
+
+		if (likely(!tg3_has_work(tp))) {
+			struct tg3_hw_status *sblk = tp->hw_status;
+
+			if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
+				tp->last_tag = sblk->status_tag;
+				rmb();
+			} else
+				sblk->status &= ~SD_STATUS_UPDATED;
+
+			netif_rx_complete(tp->dev, napi);
+			tg3_restart_ints(tp);
+			break;
+		}
 	}
 
 	return work_done;
+
+tx_recovery:
+	netif_rx_complete(tp->dev, napi);
+	schedule_work(&tp->reset_task);
+	return 0;
 }
 
 static void tg3_irq_quiesce(struct tg3 *tp)

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  1:14                 ` David Miller
@ 2007-10-12  1:22                   ` Stephen Hemminger
  2007-10-12  1:25                     ` David Miller
  2007-10-12  3:17                   ` Michael Chan
  2007-10-12 10:22                   ` TAKANO Ryousei
  2 siblings, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2007-10-12  1:22 UTC (permalink / raw)
  To: David Miller; +Cc: takano, netdev, ilpo.jarvinen, mchan

On Thu, 11 Oct 2007 18:14:49 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> 
> Here is what I'm checking into net-2.6 for now:
> 
> commit 6f535763165331bb91277d7519b507fed22034e5
> Author: David S. Miller <davem@sunset.davemloft.net>
> Date:   Thu Oct 11 18:08:29 2007 -0700
> 
>     [NET]: Fix NAPI completion handling in some drivers.
>     
>     In order for the list handling in net_rx_action() to be
>     correct, drivers must follow certain rules as stated by
>     this comment in net_rx_action():
>     
>     		/* Drivers must not modify the NAPI state if they
>     		 * consume the entire weight.  In such cases this code
>     		 * still "owns" the NAPI instance and therefore can
>     		 * move the instance around on the list at-will.
>     		 */
>     
>     A few drivers do not do this because they mix the budget checks
>     with reading hardware state, resulting in crashes like the one
>     reported by takano@axe-inc.co.jp.
>     
>     BNX2 and TG3 are taken care of here, SKY2 fix is from Stephen
>     Hemminger.
>     
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>

Please use the version I just sent, it saves another non-cached read.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  1:22                   ` Stephen Hemminger
@ 2007-10-12  1:25                     ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-12  1:25 UTC (permalink / raw)
  To: shemminger; +Cc: takano, netdev, ilpo.jarvinen, mchan

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 11 Oct 2007 18:22:59 -0700

> Please use the version I just sent, it saves another non-cached read.

No, rather, please send me fixed relative to that.

I'm trying to get the net-2.6.24 merge out the door.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  3:17                   ` Michael Chan
@ 2007-10-12  2:40                     ` David Miller
  2007-10-12  8:54                       ` Michael Chan
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2007-10-12  2:40 UTC (permalink / raw)
  To: mchan; +Cc: shemminger, takano, netdev, ilpo.jarvinen

From: "Michael Chan" <mchan@broadcom.com>
Date: Thu, 11 Oct 2007 20:17:16 -0700

> On Thu, 2007-10-11 at 18:14 -0700, David Miller wrote:
> > +       while (1) {
> > +               work_done = tg3_poll_work(tp, work_done, budget);
> > +
> > +               if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
> > +                       goto tx_recovery;
> > +
> > +               if (unlikely(work_done >= budget))
> > +                       break;
> > +
> > +               if (likely(!tg3_has_work(tp))) {
> > +                       struct tg3_hw_status *sblk = tp->hw_status;
> > +
> 
> --> new status block DMA
> 
> > +                       if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
> > +                               tp->last_tag = sblk->status_tag;
> > +                               rmb();
> > +                       } else
> > +                               sblk->status &= ~SD_STATUS_UPDATED;
> 
> We need to read the sblk->status_tag before calling tg3_has_work().  If
> a new status block DMA happens in between (shown above), tp->last_tag
> will get the new tag and we will end up acknowledging work that we
> haven't processed.

Hmmm, the old code didn't do that and seemingly has the same
problem.  Also, if you look at the before-patch code and think
about what it does if we ->poll() multiple times for a single
interrupt the side-effects are essentially the same.

What's the crucial difference?

> I'll go over this some more tonight and will send a patch to refine it.

Thanks Michael.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  1:14                 ` David Miller
  2007-10-12  1:22                   ` Stephen Hemminger
@ 2007-10-12  3:17                   ` Michael Chan
  2007-10-12  2:40                     ` David Miller
  2007-10-12 10:22                   ` TAKANO Ryousei
  2 siblings, 1 reply; 29+ messages in thread
From: Michael Chan @ 2007-10-12  3:17 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, takano, netdev, ilpo.jarvinen

On Thu, 2007-10-11 at 18:14 -0700, David Miller wrote:
> +       while (1) {
> +               work_done = tg3_poll_work(tp, work_done, budget);
> +
> +               if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
> +                       goto tx_recovery;
> +
> +               if (unlikely(work_done >= budget))
> +                       break;
> +
> +               if (likely(!tg3_has_work(tp))) {
> +                       struct tg3_hw_status *sblk = tp->hw_status;
> +

--> new status block DMA

> +                       if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
> +                               tp->last_tag = sblk->status_tag;
> +                               rmb();
> +                       } else
> +                               sblk->status &= ~SD_STATUS_UPDATED;

We need to read the sblk->status_tag before calling tg3_has_work().  If
a new status block DMA happens in between (shown above), tp->last_tag
will get the new tag and we will end up acknowledging work that we
haven't processed.

I'll go over this some more tonight and will send a patch to refine it.
Thanks.

> +
> +                       netif_rx_complete(tp->dev, napi);
> +                       tg3_restart_ints(tp);
> +                       break;
> +               }


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  8:54                       ` Michael Chan
@ 2007-10-12  8:39                         ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-12  8:39 UTC (permalink / raw)
  To: mchan; +Cc: shemminger, takano, netdev, ilpo.jarvinen

From: "Michael Chan" <mchan@broadcom.com>
Date: Fri, 12 Oct 2007 01:54:13 -0700

> On Thu, 2007-10-11 at 19:40 -0700, David Miller wrote:
> > Hmmm, the old code didn't do that and seemingly has the same
> > problem.  Also, if you look at the before-patch code and think
> > about what it does if we ->poll() multiple times for a single
> > interrupt the side-effects are essentially the same.
> > 
> 
> No, the old code before tonight's patch did this:
> 
> if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
> 	tp->last_tag = sblk->status_tag;
> 	rmb();
> }
> 
> before checking for more work.  The rmb() is there to make sure that the
> status tag is read and stored before we check for more work.

No I understand, thanks!

> [TG3]: Refine napi poll loop.
> 
> Need to read and store sblk->status_tag before checking for more work.
> The status tag is later written back to the hardware when enabling
> interrupts to acknowledge how much work has been processed.  If the
> order is reversed, we can end up acknowledging work we haven't
> processed.
> 
> When we detect tx error, it is more correct to return the rx
> work_done so far instead of 0.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

I'll apply this, thanks a lot!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  2:40                     ` David Miller
@ 2007-10-12  8:54                       ` Michael Chan
  2007-10-12  8:39                         ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Chan @ 2007-10-12  8:54 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, takano, netdev, ilpo.jarvinen

On Thu, 2007-10-11 at 19:40 -0700, David Miller wrote:
> From: "Michael Chan" <mchan@broadcom.com>
> Date: Thu, 11 Oct 2007 20:17:16 -0700
>
> > > +               if (likely(!tg3_has_work(tp))) {
> > > +                       struct tg3_hw_status *sblk = tp->hw_status;
> > > +
> > 
> > --> new status block DMA
> > 
> > > +                       if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
> > > +                               tp->last_tag = sblk->status_tag;
> > > +                               rmb();
> > > +                       } else
> > > +                               sblk->status &= ~SD_STATUS_UPDATED;
> > 
> > We need to read the sblk->status_tag before calling tg3_has_work().  If
> > a new status block DMA happens in between (shown above), tp->last_tag
> > will get the new tag and we will end up acknowledging work that we
> > haven't processed.
> 
> Hmmm, the old code didn't do that and seemingly has the same
> problem.  Also, if you look at the before-patch code and think
> about what it does if we ->poll() multiple times for a single
> interrupt the side-effects are essentially the same.
> 

No, the old code before tonight's patch did this:

if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
	tp->last_tag = sblk->status_tag;
	rmb();
}

before checking for more work.  The rmb() is there to make sure that the
status tag is read and stored before we check for more work.

> What's the crucial difference?
> 

This sequence only matters when we eventually terminate and tell the
hardware the last tag we've processed and turn on the interrupt.  If
there's a status block race condition, the hw will know when the tag
written back does not match the latest one and it will generate an
interrupt right away.  The sequence guarantees that the hw will see the
proper tag corresponding to the work processed by the driver.

[TG3]: Refine napi poll loop.

Need to read and store sblk->status_tag before checking for more work.
The status tag is later written back to the hardware when enabling
interrupts to acknowledge how much work has been processed.  If the
order is reversed, we can end up acknowledging work we haven't
processed.

When we detect tx error, it is more correct to return the rx
work_done so far instead of 0.

Signed-off-by: Michael Chan <mchan@broadcom.com>

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 417641a..055cc68 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3576,7 +3576,7 @@ static int tg3_poll_work(struct tg3 *tp, int work_done, int budget)
 	if (sblk->idx[0].tx_consumer != tp->tx_cons) {
 		tg3_tx(tp);
 		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING))
-			return 0;
+			return work_done;
 	}
 
 	/* run RX thread, within the bounds set by NAPI.
@@ -3593,6 +3593,7 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 {
 	struct tg3 *tp = container_of(napi, struct tg3, napi);
 	int work_done = 0;
+	struct tg3_hw_status *sblk = tp->hw_status;
 
 	while (1) {
 		work_done = tg3_poll_work(tp, work_done, budget);
@@ -3603,15 +3604,17 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 		if (unlikely(work_done >= budget))
 			break;
 
-		if (likely(!tg3_has_work(tp))) {
-			struct tg3_hw_status *sblk = tp->hw_status;
-
-			if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
-				tp->last_tag = sblk->status_tag;
-				rmb();
-			} else
-				sblk->status &= ~SD_STATUS_UPDATED;
+		if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
+			/* tp->last_tag is used in tg3_restart_ints() below
+			 * to tell the hw how much work has been processed,
+			 * so we must read it before checking for more work.
+			 */
+			tp->last_tag = sblk->status_tag;
+			rmb();
+		} else
+			sblk->status &= ~SD_STATUS_UPDATED;
 
+		if (likely(!tg3_has_work(tp))) {
 			netif_rx_complete(tp->dev, napi);
 			tg3_restart_ints(tp);
 			break;
@@ -3621,9 +3624,10 @@ static int tg3_poll(struct napi_struct *napi, int budget)
 	return work_done;
 
 tx_recovery:
+	/* work_done is guaranteed to be less than budget. */
 	netif_rx_complete(tp->dev, napi);
 	schedule_work(&tp->reset_task);
-	return 0;
+	return work_done;
 }
 
 static void tg3_irq_quiesce(struct tg3 *tp)





^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12  1:14                 ` David Miller
  2007-10-12  1:22                   ` Stephen Hemminger
  2007-10-12  3:17                   ` Michael Chan
@ 2007-10-12 10:22                   ` TAKANO Ryousei
  2007-10-12 10:56                     ` David Miller
  2 siblings, 1 reply; 29+ messages in thread
From: TAKANO Ryousei @ 2007-10-12 10:22 UTC (permalink / raw)
  To: davem; +Cc: shemminger, netdev, ilpo.jarvinen, mchan

From: David Miller <davem@davemloft.net>
Subject: Re: Regression in net-2.6.24?
Date: Thu, 11 Oct 2007 18:14:49 -0700 (PDT)

> 
> Here is what I'm checking into net-2.6 for now:
> 
> commit 6f535763165331bb91277d7519b507fed22034e5
> Author: David S. Miller <davem@sunset.davemloft.net>
> Date:   Thu Oct 11 18:08:29 2007 -0700
> 
>     [NET]: Fix NAPI completion handling in some drivers.
>     
>     In order for the list handling in net_rx_action() to be
>     correct, drivers must follow certain rules as stated by
>     this comment in net_rx_action():
>     
>     		/* Drivers must not modify the NAPI state if they
>     		 * consume the entire weight.  In such cases this code
>     		 * still "owns" the NAPI instance and therefore can
>     		 * move the instance around on the list at-will.
>     		 */
>     
>     A few drivers do not do this because they mix the budget checks
>     with reading hardware state, resulting in crashes like the one
>     reported by takano@axe-inc.co.jp.
>     
>     BNX2 and TG3 are taken care of here, SKY2 fix is from Stephen
>     Hemminger.
>     
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
I am sorry for late reply.
I confirmed this patch fixes the kernel panic that I reported.
Thanks David and others.

Ryousei Takano

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Regression in net-2.6.24?
  2007-10-12 10:22                   ` TAKANO Ryousei
@ 2007-10-12 10:56                     ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-10-12 10:56 UTC (permalink / raw)
  To: takano; +Cc: shemminger, netdev, ilpo.jarvinen, mchan

From: TAKANO Ryousei <takano@axe-inc.co.jp>
Date: Fri, 12 Oct 2007 19:22:00 +0900 (JST)

> I am sorry for late reply.
> I confirmed this patch fixes the kernel panic that I reported.
> Thanks David and others.

Thank you for testing.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2007-10-12 10:56 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-09 12:19 [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
2007-10-09 12:20 ` [PATCH] [TCP]: Separate lost_retrans loop into own function Ilpo Järvinen
2007-10-09 12:20   ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems Ilpo Järvinen
2007-10-09 12:20     ` [RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases Ilpo Järvinen
2007-10-10  9:44       ` David Miller
2007-10-10  9:44         ` David Miller
2007-10-11  1:55     ` [RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems TAKANO Ryousei
2007-10-11 10:12       ` Ilpo Järvinen
2007-10-11 13:51         ` Regression in net-2.6.24? TAKANO Ryousei
2007-10-11 23:48           ` David Miller
2007-10-11 23:55             ` Stephen Hemminger
2007-10-12  0:17               ` David Miller
2007-10-12  0:31                 ` Stephen Hemminger
2007-10-12  0:40                   ` David Miller
2007-10-12  0:50                     ` Stephen Hemminger
2007-10-12  1:00                       ` David Miller
2007-10-12  1:03                         ` David Miller
2007-10-12  1:14                 ` David Miller
2007-10-12  1:22                   ` Stephen Hemminger
2007-10-12  1:25                     ` David Miller
2007-10-12  3:17                   ` Michael Chan
2007-10-12  2:40                     ` David Miller
2007-10-12  8:54                       ` Michael Chan
2007-10-12  8:39                         ` David Miller
2007-10-12 10:22                   ` TAKANO Ryousei
2007-10-12 10:56                     ` David Miller
2007-10-10  9:45   ` [PATCH] [TCP]: Separate lost_retrans loop into own function David Miller
2007-10-09 13:03 ` [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness Ilpo Järvinen
2007-10-10  9:48   ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).