Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC] restore netdev_priv optimization
From: Benjamin Thery @ 2007-08-20 11:51 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev, ebiederm
In-Reply-To: <20070817.160409.88475506.davem@davemloft.net>

Hi,

David Miller wrote:
> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Fri, 17 Aug 2007 15:40:22 -0700
> 
>> Compile tested only!!!
> 
> Obviously.  The first loopback transmit is guarenteed to crash.
> 
> [...]
> 
> And this also breaks loopback again, which uses a static struct netdev
> in the kernel image, it doesn't use alloc_netdev(), so egress_subqueue
> of loopback will be NULL.

Talking about loopback, don't you think it could be the right time
to make it behave like any other kind of net devices, and allocate it
dynamically.

Having a dynamically allocated loopback could make maintenance easier
(removing special cases).
Also this is something we'll need to support multiple loopbacks for 
example for network namespaces.

Eric Biederman has written a nice patch that does this.
I'm using it on 2.6.23-rc2.

Benjamin

-- 
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

    http://www.bull.com

^ permalink raw reply

* [PATCH 4/5] [TCP]: Discard fuzzy SACK blocks
From: Ilpo Järvinen @ 2007-08-20 13:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen
In-Reply-To: <118761579328-git-send-email-ilpo.jarvinen@helsinki.fi>

From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <ilpo.jarvinen@helsinki.fi>

SACK processing code has been a sort of russian roulette as no
validation of SACK blocks is previously attempted. Besides, it
is not very clear what all kinds of broken SACK blocks really
mean (e.g., one that has start and end sequence numbers
reversed). So now close the roulette once and for all.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   82 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2bf3d57..102aefa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1001,7 +1001,86 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
  *    for retransmitted and already SACKed segment -> reordering..
  * Both of these heuristics are not used in Loss state, when we cannot
  * account for retransmits accurately.
+ *
+ * SACK block validation.
+ * ----------------------
+ *
+ * SACK block range validation checks that the received SACK block fits to
+ * the expected sequence limits, i.e., it is between SND.UNA and SND.NXT.
+ * Note that SND.UNA is not included to the range though being valid because
+ * it means that the receiver is rather inconsistent with itself (reports
+ * SACK reneging when it should advance SND.UNA).
+ *
+ * Implements also blockage to start_seq wrap-around. Problem lies in the
+ * fact that though start_seq (s) is before end_seq (i.e., not reversed),
+ * there's no guarantee that it will be before snd_nxt (n). The problem
+ * happens when start_seq resides between end_seq wrap (e_w) and snd_nxt
+ * wrap (s_w):
+ *
+ *         <- outs wnd ->                          <- wrapzone ->
+ *         u     e      n                         u_w   e_w  s n_w
+ *         |     |      |                          |     |   |  |
+ * |<------------+------+----- TCP seqno space --------------+---------->|
+ * ...-- <2^31 ->|                                           |<--------...
+ * ...---- >2^31 ------>|                                    |<--------...
+ *
+ * Current code wouldn't be vulnerable but it's better still to discard such
+ * crazy SACK blocks. Doing this check for start_seq alone closes somewhat
+ * similar case (end_seq after snd_nxt wrap) as earlier reversed check in
+ * snd_nxt wrap -> snd_una region will then become "well defined", i.e.,
+ * equal to the ideal case (infinite seqno space without wrap caused issues).
+ *
+ * With D-SACK the lower bound is extended to cover sequence space below
+ * SND.UNA down to undo_marker, which is the last point of interest. Yet
+ * again, DSACK block must not to go across snd_una (for the same reason as
+ * for the normal SACK blocks, explained above). But there all simplicity
+ * ends, TCP might receive valid D-SACKs below that. As long as they reside
+ * fully below undo_marker they do not affect behavior in anyway and can
+ * therefore be safely ignored. In rare cases (which are more or less
+ * theoretical ones), the D-SACK will nicely cross that boundary due to skb
+ * fragmentation and packet reordering past skb's retransmission. To consider
+ * them correctly, the acceptable range must be extended even more though
+ * the exact amount is rather hard to quantify. However, tp->max_window can
+ * be used as an exaggerated estimate.
  */
+static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
+				  u32 start_seq, u32 end_seq)
+{
+	/* Too far in future, or reversed (interpretation is ambiguous) */
+	if (after(end_seq, tp->snd_nxt) || !before(start_seq, end_seq))
+		return 0;
+
+	/* Nasty start_seq wrap-around check (see comments above) */
+	if (!before(start_seq, tp->snd_nxt))
+		return 0;
+
+	/* In outstanding window? ...This is valid exit for DSACKs too.
+	 * start_seq == snd_una is non-sensical (see comments above)
+	 */
+	if (after(start_seq, tp->snd_una))
+		return 1;
+
+	if (!is_dsack || !tp->undo_marker)
+		return 0;
+
+	/* ...Then it's D-SACK, and must reside below snd_una completely */
+	if (!after(end_seq, tp->snd_una))
+		return 0;
+
+	if (!before(start_seq, tp->undo_marker))
+		return 1;
+
+	/* Too old */
+	if (!after(end_seq, tp->undo_marker))
+		return 0;
+
+	/* Undo_marker boundary crossing (overestimates a lot). Known already:
+	 *   start_seq < undo_marker and end_seq >= undo_marker.
+	 */
+	return !before(start_seq, end_seq - tp->max_window);
+}
+
+
 static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb,
 			   struct tcp_sack_block_wire *sp, int num_sacks,
 			   u32 prior_snd_una)
@@ -1143,6 +1222,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		int fack_count;
 		int dup_sack = (found_dup_sack && (i == first_sack_index));
 
+		if (!tcp_is_sackblock_valid(tp, dup_sack, start_seq, end_seq))
+			continue;
+
 		skb = cached_skb;
 		fack_count = cached_fack_count;
 
-- 
1.5.0.6


^ permalink raw reply related

* [PATCH net-2.6.24 0/5] TCP: Cleanups & SACK block validation
From: Ilpo Järvinen @ 2007-08-20 13:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi Dave,

Here are couple of patches to net-2.6.24. The first three are trivial
cleanups. The idea to the last two comes from tcp-2.6 but the validator
has been heavily modified (and hopefully improved in the process :-)).
I'm not sure though if checking for the undo_marker boundary crossing
case is a bit over-engineering (inherited from the original version
which already checked for that case). In addition, better names could
be invented for MIBs, suggestions?

--
 i.

^ permalink raw reply

* [PATCH 2/5] [TCP]: tcp_packets_out_inc to tcp_output.c (no callers elsewhere)
From: Ilpo Järvinen @ 2007-08-20 13:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen
In-Reply-To: <1187615793197-git-send-email-ilpo.jarvinen@helsinki.fi>

From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <ilpo.jarvinen@helsinki.fi>

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/net/tcp.h     |   12 ------------
 net/ipv4/tcp_output.c |   12 ++++++++++++
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d586ca..f28f382 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -614,18 +614,6 @@ static inline void tcp_dec_pcount_approx(__u32 *count,
 	tcp_dec_pcount_approx_int(count, tcp_skb_pcount(skb));
 }
 
-static inline void tcp_packets_out_inc(struct sock *sk,
-				       const struct sk_buff *skb)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-	int orig = tp->packets_out;
-
-	tp->packets_out += tcp_skb_pcount(skb);
-	if (!orig)
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
-}
-
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1d65ce1..a61a3e3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -61,6 +61,18 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+static inline void tcp_packets_out_inc(struct sock *sk,
+				       const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int orig = tp->packets_out;
+
+	tp->packets_out += tcp_skb_pcount(skb);
+	if (!orig)
+		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+}
+
 static void update_send_head(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-- 
1.5.0.6


^ permalink raw reply related

* [PATCH 1/5] [TCP]: Remove unnecessary wrapper tcp_packets_out_dec
From: Ilpo Järvinen @ 2007-08-20 13:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen
In-Reply-To: <118761579315-git-send-email-ilpo.jarvinen@helsinki.fi>

From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <ilpo.jarvinen@helsinki.fi>

Makes caller side more obvious, there's no need to have
a wrapper for this oneliner!

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/net/tcp.h     |    6 ------
 net/ipv4/tcp_input.c  |    2 +-
 net/ipv4/tcp_output.c |    2 +-
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7c65989..6d586ca 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -626,12 +626,6 @@ static inline void tcp_packets_out_inc(struct sock *sk,
 					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
 }
 
-static inline void tcp_packets_out_dec(struct tcp_sock *tp, 
-				       const struct sk_buff *skb)
-{
-	tp->packets_out -= tcp_skb_pcount(skb);
-}
-
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96ced89..45ad32c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2548,7 +2548,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p)
 			last_ackt = skb->tstamp;
 		}
 		tcp_dec_pcount_approx(&tp->fackets_out, skb);
-		tcp_packets_out_dec(tp, skb);
+		tp->packets_out -= tcp_skb_pcount(skb);
 		tcp_unlink_write_queue(skb, sk);
 		sk_stream_free_skb(sk, skb);
 		clear_all_retrans_hints(tp);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a367917..1d65ce1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1735,7 +1735,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m
 		 * it is better to underestimate fackets.
 		 */
 		tcp_dec_pcount_approx(&tp->fackets_out, next_skb);
-		tcp_packets_out_dec(tp, next_skb);
+		tp->packets_out -= tcp_skb_pcount(next_skb);
 		sk_stream_free_skb(sk, next_skb);
 	}
 }
-- 
1.5.0.6


^ permalink raw reply related

* [PATCH 5/5] [TCP] MIB: Add counters for discarded SACK blocks
From: Ilpo Järvinen @ 2007-08-20 13:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen
In-Reply-To: <11876157932029-git-send-email-ilpo.jarvinen@helsinki.fi>

From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <ilpo.jarvinen@helsinki.fi>

In DSACK case, some events are not extraordinary, such as packet
duplication generated DSACK. They can arrive easily below
snd_una when undo_marker is not set (TCP being in CA_Open),
counting such DSACKs amoung SACK discards will likely just
mislead if they occur in some scenario when there are other
problems as well. Similarly, excessively delayed packets could
cause "normal" DSACKs. Therefore, separate counters are
allocated for DSACK events.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/snmp.h |    3 +++
 net/ipv4/proc.c      |    3 +++
 net/ipv4/tcp_input.c |   10 +++++++++-
 3 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 802b3a3..d24c554 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -231,6 +231,9 @@ enum
 	LINUX_MIB_TCPABORTONLINGER,		/* TCPAbortOnLinger */
 	LINUX_MIB_TCPABORTFAILED,		/* TCPAbortFailed */
 	LINUX_MIB_TCPMEMORYPRESSURES,		/* TCPMemoryPressures */
+	LINUX_MIB_TCPSACKDISCARD,		/* TCPSACKDiscard */
+	LINUX_MIB_TCPDSACKIGNOREDOLD,		/* TCPSACKIgnoredOld */
+	LINUX_MIB_TCPDSACKIGNOREDNOUNDO,	/* TCPSACKIgnoredNoUndo */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 3b690cf..986d1c8 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -244,6 +244,9 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPAbortOnLinger", LINUX_MIB_TCPABORTONLINGER),
 	SNMP_MIB_ITEM("TCPAbortFailed", LINUX_MIB_TCPABORTFAILED),
 	SNMP_MIB_ITEM("TCPMemoryPressures", LINUX_MIB_TCPMEMORYPRESSURES),
+	SNMP_MIB_ITEM("TCPSACKDiscard", LINUX_MIB_TCPSACKDISCARD),
+	SNMP_MIB_ITEM("TCPDSACKIgnoredOld", LINUX_MIB_TCPDSACKIGNOREDOLD),
+	SNMP_MIB_ITEM("TCPDSACKIgnoredNoUndo", LINUX_MIB_TCPDSACKIGNOREDNOUNDO),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 102aefa..8692d0b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1222,8 +1222,16 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		int fack_count;
 		int dup_sack = (found_dup_sack && (i == first_sack_index));
 
-		if (!tcp_is_sackblock_valid(tp, dup_sack, start_seq, end_seq))
+		if (!tcp_is_sackblock_valid(tp, dup_sack, start_seq, end_seq)) {
+			if (dup_sack) {
+				if (!tp->undo_marker)
+					NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO);
+				else
+					NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD);
+			} else
+				NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
 			continue;
+		}
 
 		skb = cached_skb;
 		fack_count = cached_fack_count;
-- 
1.5.0.6


^ permalink raw reply related

* [PATCH 3/5] [TCP]: Rename tcp_ack_packets_out -> tcp_rearm_rto
From: Ilpo Järvinen @ 2007-08-20 13:16 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen
In-Reply-To: <1187615793583-git-send-email-ilpo.jarvinen@helsinki.fi>

From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <ilpo.jarvinen@helsinki.fi>

Only thing that tiny function does is rearming the RTO (if
necessary), name it accordingly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 45ad32c..2bf3d57 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2406,8 +2406,7 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack,
 /* Restart timer after forward progress on connection.
  * RFC2988 recommends to restart timer to now+rto.
  */
-
-static void tcp_ack_packets_out(struct sock *sk)
+static void tcp_rearm_rto(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2560,7 +2559,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, __s32 *seq_rtt_p)
 			= inet_csk(sk)->icsk_ca_ops;
 
 		tcp_ack_update_rtt(sk, acked, seq_rtt);
-		tcp_ack_packets_out(sk);
+		tcp_rearm_rto(sk);
 
 		if (tcp_is_reno(tp))
 			tcp_remove_reno_sacks(sk, pkts_acked);
-- 
1.5.0.6


^ permalink raw reply related

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Chris Snook @ 2007-08-20 13:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Satyam Sharma, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Ilpo Jarvinen, Paul E. McKenney,
	Stefan Richter, Linux Kernel Mailing List, linux-arch, Netdev,
	Andrew Morton, ak, heiko.carstens, David Miller, schwidefsky,
	wensong, horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl,
	segher
In-Reply-To: <alpine.LFD.0.999.0708170929580.30176@woody.linux-foundation.org>

Linus Torvalds wrote:
> So the only reason to add back "volatile" to the atomic_read() sequence is 
> not to fix bugs, but to _hide_ the bugs better. They're still there, they 
> are just a lot harder to trigger, and tend to be a lot subtler.

What about barrier removal?  With consistent semantics we could optimize a fair 
amount of code.  Whether or not that constitutes "premature" optimization is 
open to debate, but there's no question we could reduce our register wiping in 
some places.

	-- Chris

^ permalink raw reply

* [PATCH] spidernet: fix interrupt reason recognition
From: Ishizaki Kou @ 2007-08-20 13:13 UTC (permalink / raw)
  To: linas; +Cc: netdev, cbe-oss-dev

This patch solves a problem that the spidernet driver sometimes fails
to handle IRQ.

The problem happens because,
- In Cell architecture, interrupts may arrive at an interrupt
  controller, even if they are masked by the setting on registers of
  devices. It happens when interrupt packets are sent just before 
  the interrupts are masked.
- spidernet interrupt handler compares interrupt reasons with
  interrupt masks, so when such interrupts occurs, spidernet interrupt
  handler returns IRQ_NONE.
- When all of interrupt handler return IRQ_NONE, linux kernel disables
  the IRQ and it no longer delivers interrupts to the interrupt handlers.

spidernet doesn't work after above sequence, because it can't receive
interrupts.
 
This patch changes spidernet interrupt handler that it compares
interrupt reason with SPIDER_NET_INTX_MASK_VALUE.

Signed-off-by: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>
---

Linas-san,

Please apply this to 2.6.23. Because this problem is sometimes happens
and we cannot use the ethernet port any more.

And also, please apply the following Arnd-san's patch to fix a problem
that spidernet driver sometimes causes a BUG_ON at open.

 http://patchwork.ozlabs.org/cbe-oss-dev/patch?id=12211


Index: linux-powerpc-git/drivers/net/spider_net.c
===================================================================
--- linux-powerpc-git.orig/drivers/net/spider_net.c	2007-07-19 18:42:02.000000000 +0900
+++ linux-powerpc-git/drivers/net/spider_net.c	2007-08-20 20:52:23.000000000 +0900
@@ -1441,17 +1441,14 @@ static void
 spider_net_handle_error_irq(struct spider_net_card *card, u32 status_reg)
 {
 	u32 error_reg1, error_reg2;
-	u32 mask_reg1, mask_reg2;
 	u32 i;
 	int show_error = 1;
 
 	error_reg1 = spider_net_read_reg(card, SPIDER_NET_GHIINT1STS);
 	error_reg2 = spider_net_read_reg(card, SPIDER_NET_GHIINT2STS);
-	mask_reg1 = spider_net_read_reg(card, SPIDER_NET_GHIINT1MSK);
-	mask_reg2 = spider_net_read_reg(card,SPIDER_NET_GHIINT2MSK);
 
-	error_reg1 &= mask_reg1;
-	error_reg2 &= mask_reg2;
+	error_reg1 &= SPIDER_NET_INT1_MASK_VALUE;
+	error_reg2 &= SPIDER_NET_INT2_MASK_VALUE;
 
 	/* check GHIINT0STS ************************************/
 	if (status_reg)
@@ -1679,11 +1676,10 @@ spider_net_interrupt(int irq, void *ptr)
 {
 	struct net_device *netdev = ptr;
 	struct spider_net_card *card = netdev_priv(netdev);
-	u32 status_reg, mask_reg;
+	u32 status_reg;
 
 	status_reg = spider_net_read_reg(card, SPIDER_NET_GHIINT0STS);
-	mask_reg = spider_net_read_reg(card, SPIDER_NET_GHIINT0MSK);
-	status_reg &= mask_reg;
+	status_reg &= SPIDER_NET_INT0_MASK_VALUE;
 
 	if (!status_reg)
 		return IRQ_NONE;

^ permalink raw reply

* Re: LDD3 pitfalls (was Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures)
From: Chris Snook @ 2007-08-20 13:28 UTC (permalink / raw)
  To: Stefan Richter
  Cc: Jonathan Corbet, Greg Kroah-Hartman, Nick Piggin, paulmck,
	Herbert Xu, Paul Mackerras, Satyam Sharma, Christoph Lameter,
	Linux Kernel Mailing List, linux-arch, Linus Torvalds, netdev,
	Andrew Morton, ak, heiko.carstens, davem, schwidefsky, wensong,
	horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C703C9.9060507@s5r6.in-berlin.de>

Stefan Richter wrote:
> Nick Piggin wrote:
>> Stefan Richter wrote:
>>> Nick Piggin wrote:
>>>
>>>> I don't know why people would assume volatile of atomics. AFAIK, most
>>>> of the documentation is pretty clear that all the atomic stuff can be
>>>> reordered etc. except for those that modify and return a value.
>>>
>>> Which documentation is there?
>> Documentation/atomic_ops.txt
>>
>>
>>> For driver authors, there is LDD3.  It doesn't specifically cover
>>> effects of optimization on accesses to atomic_t.
>>>
>>> For architecture port authors, there is Documentation/atomic_ops.txt.
>>> Driver authors also can learn something from that document, as it
>>> indirectly documents the atomic_t and bitops APIs.
>>>
>> "Semantics and Behavior of Atomic and Bitmask Operations" is
>> pretty direct :)
>>
>> Sure, it says that it's for arch maintainers, but there is no
>> reason why users can't make use of it.
> 
> 
> Note, LDD3 page 238 says:  "It is worth noting that most of the other
> kernel primitives dealing with synchronization, such as spinlock and
> atomic_t operations, also function as memory barriers."
> 
> I don't know about Linux 2.6.10 against which LDD3 was written, but
> currently only _some_ atomic_t operations function as memory barriers.
> 
> Besides, judging from some posts in this thread, saying that atomic_t
> operations dealt with synchronization may not be entirely precise.

atomic_t is often used as the basis for implementing more sophisticated 
synchronization mechanisms, such as rwlocks.  Whether or not they are designed 
for that purpose, the atomic_* operations are de facto synchronization primitives.

	-- Chris

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Herbert Xu @ 2007-08-20 13:32 UTC (permalink / raw)
  To: Chris Snook
  Cc: Linus Torvalds, Nick Piggin, Satyam Sharma, Paul Mackerras,
	Christoph Lameter, Ilpo Jarvinen, Paul E. McKenney,
	Stefan Richter, Linux Kernel Mailing List, linux-arch, Netdev,
	Andrew Morton, ak, heiko.carstens, David Miller, schwidefsky,
	wensong, horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl,
	segher
In-Reply-To: <46C993DF.4080400@redhat.com>

On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:
> Linus Torvalds wrote:
> >So the only reason to add back "volatile" to the atomic_read() sequence is 
> >not to fix bugs, but to _hide_ the bugs better. They're still there, they 
> >are just a lot harder to trigger, and tend to be a lot subtler.
> 
> What about barrier removal?  With consistent semantics we could optimize a 
> fair amount of code.  Whether or not that constitutes "premature" 
> optimization is open to debate, but there's no question we could reduce our 
> register wiping in some places.

If you've been reading all of Linus's emails you should be
thinking about adding memory barriers, and not removing
compiler barriers.

He's just told you that code of the kind

	while (!atomic_read(cond))
		;

	do_something()

probably needs a memory barrier (not just compiler) so that
do_something() doesn't see stale cache content that occured
before cond flipped.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Chris Snook @ 2007-08-20 13:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Herbert Xu, Linus Torvalds, Nick Piggin,
	Paul Mackerras, Segher Boessenkool, heiko.carstens, horms,
	linux-kernel, rpjday, ak, netdev, cfriesen, akpm, jesper.juhl,
	linux-arch, zlynx, satyam, schwidefsky, davem, wensong, wjiang
In-Reply-To: <Pine.LNX.4.64.0708171817050.15427@schroedinger.engr.sgi.com>

Christoph Lameter wrote:
> On Fri, 17 Aug 2007, Paul E. McKenney wrote:
> 
>> On Sat, Aug 18, 2007 at 08:09:13AM +0800, Herbert Xu wrote:
>>> On Fri, Aug 17, 2007 at 04:59:12PM -0700, Paul E. McKenney wrote:
>>>> gcc bugzilla bug #33102, for whatever that ends up being worth.  ;-)
>>> I had totally forgotten that I'd already filed that bug more
>>> than six years ago until they just closed yours as a duplicate
>>> of mine :)
>>>
>>> Good luck in getting it fixed!
>> Well, just got done re-opening it for the third time.  And a local
>> gcc community member advised me not to give up too easily.  But I
>> must admit that I am impressed with the speed that it was identified
>> as duplicate.
>>
>> Should be entertaining!  ;-)
> 
> Right. ROTFL... volatile actually breaks atomic_t instead of making it 
> safe. x++ becomes a register load, increment and a register store. Without 
> volatile we can increment the memory directly. It seems that volatile 
> requires that the variable is loaded into a register first and then 
> operated upon. Understandable when you think about volatile being used to 
> access memory mapped I/O registers where a RMW operation could be 
> problematic.

So, if we want consistent behavior, we're pretty much screwed unless we use 
inline assembler everywhere?

	-- Chris

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Chris Snook @ 2007-08-20 13:38 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linus Torvalds, Nick Piggin, Satyam Sharma, Paul Mackerras,
	Christoph Lameter, Ilpo Jarvinen, Paul E. McKenney,
	Stefan Richter, Linux Kernel Mailing List, linux-arch, Netdev,
	Andrew Morton, ak, heiko.carstens, David Miller, schwidefsky,
	wensong, horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl,
	segher
In-Reply-To: <20070820133207.GA13835@gondor.apana.org.au>

Herbert Xu wrote:
> On Mon, Aug 20, 2007 at 09:15:11AM -0400, Chris Snook wrote:
>> Linus Torvalds wrote:
>>> So the only reason to add back "volatile" to the atomic_read() sequence is 
>>> not to fix bugs, but to _hide_ the bugs better. They're still there, they 
>>> are just a lot harder to trigger, and tend to be a lot subtler.
>> What about barrier removal?  With consistent semantics we could optimize a 
>> fair amount of code.  Whether or not that constitutes "premature" 
>> optimization is open to debate, but there's no question we could reduce our 
>> register wiping in some places.
> 
> If you've been reading all of Linus's emails you should be
> thinking about adding memory barriers, and not removing
> compiler barriers.
> 
> He's just told you that code of the kind
> 
> 	while (!atomic_read(cond))
> 		;
> 
> 	do_something()
> 
> probably needs a memory barrier (not just compiler) so that
> do_something() doesn't see stale cache content that occured
> before cond flipped.

Such code generally doesn't care precisely when it gets the update, just that 
the update is atomic, and it doesn't loop forever.  Regardless, I'm convinced we 
just need to do it all in assembly.

	-- Chris

^ permalink raw reply

* Re: skb_pull_rcsum - Fatal exception in interrupt
From: Herbert Xu @ 2007-08-20 14:24 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: alan, netdev
In-Reply-To: <20070815155431.GA28761@2ka.mipt.ru>

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> Actually if dmesg will show that there is something in fragments, it
> should use pskb_may_pull(). The same bug exist in bridge and vlan, btw,
> so it might be a solution to remove bug_on from skb_pull_rcsum() and
> instead call may_pull?

That would be too easy :) As was the case here, the data pulled
has already been accessed so calling pskb_may_pull in the pulling
function is too late.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 2/4 - rev2] Add new timeval_to_sec function
From: Stephen Hemminger @ 2007-08-20 14:53 UTC (permalink / raw)
  To: Varun Chandramohan; +Cc: davem, netdev, kaber, socketcan, krkumar2, varuncha
In-Reply-To: <20070820134536.d260585a.varunc@linux.vnet.ibm.com>

On Mon, 20 Aug 2007 13:45:36 +0530
Varun Chandramohan <varunc@linux.vnet.ibm.com> wrote:

> A new function for converting timeval to time_t is added in time.h. Its a common function used in different
> places.
> 
> Signed-off-by: Varun Chandramohan <varunc@linux.vnet.ibm.com>
> ---
>  include/linux/time.h |   12 ++++++++++++
>  1 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/time.h b/include/linux/time.h
> index 6a5f503..1faf65c 100644
> --- a/include/linux/time.h
> +++ b/include/linux/time.h
> @@ -149,6 +149,18 @@ static inline s64 timeval_to_ns(const st
>  }
>  
>  /**
> + * timeval_to_sec - Convert timeval to seconds
> + * @tv:         pointer to the timeval variable to be converted
> + *
> + * Returns the seconds representation of timeval parameter.
> + * Note : Here we round up the value. We dont need accuracy.
> + */
> +static inline time_t timeval_to_sec(const struct timeval *tv)
> +{
> +	return (tv->tv_sec + (tv->tv_usec ? 1 : 0));
> +}
> +

Why roundup? Unless there is a requirement in the standard, please just
use the timeval seconds. In which case the inline is unneeded.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply

* Re: Marvell 88E8056 gigabit ethernet controller
From: Stephen Hemminger @ 2007-08-20 15:01 UTC (permalink / raw)
  To: Kevin E; +Cc: linux-kernel, netdev
In-Reply-To: <573746.86745.qm@web38903.mail.mud.yahoo.com>

On Sun, 19 Aug 2007 18:15:47 -0700 (PDT)
Kevin E <kevin360@yahoo.com> wrote:

> Someone wrote me with a solution to try and so far
> it's working.  They suggested I try the driver up on
> Marvell's website but to make sure I powered off the
> machine completely and when it rebooted to not have
> any of the regular kernel drivers for the Marvell
> chipset to load.  They had found that letting the sky2
> load and then unloading the module would mean the
> vendor's driver wouldn't work.
> 
> So I got down the latest driver package they have
> (10.0.5.3).  At first I couldn't get it compiled
> against kernel 2.6.22.3 that I'm running, but I have
> it compiled with the 2.6.21.5 kernel, which is what
> the machine is running now.  And I'm happy to say that
> it's working fine so far.  I've transfered about 4G
> over the link and it's still working fine.
> 
> Since Marvell's driver seems to be working for the
> 88E8056 chipset and from what I've looked at the code
> it's marked as GPL, could it be rolled into the kernel
> for those of us that have 88E8056 chipsets that are
> working to use? 
> 

Submit for code review, and cleanup the resulting comments.
Good luck.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply

* Re: [patch 17/18] 3c59x: fix duplex configuration
From: Martin Buck @ 2007-08-20 15:06 UTC (permalink / raw)
  To: akpm; +Cc: jeff, netdev, klassert, protasnb
In-Reply-To: <200708102105.l7AL5Q2h008995@imap1.linux-foundation.org>

On Fri, Aug 10, 2007 at 02:05:26PM -0700, akpm@linux-foundation.org wrote:
[...]
> diff -puN drivers/net/3c59x.c~3c59x-fix-duplex-configuration drivers/net/3c59x.c
> --- a/drivers/net/3c59x.c~3c59x-fix-duplex-configuration
> +++ a/drivers/net/3c59x.c
> @@ -1559,6 +1559,7 @@ vortex_up(struct net_device *dev)
>  		mii_reg1 = mdio_read(dev, vp->phys[0], MII_BMSR);
>  		mii_reg5 = mdio_read(dev, vp->phys[0], MII_LPA);
>  		vp->partner_flow_ctrl = ((mii_reg5 & 0x0400) != 0);
> +		vp->mii.full_duplex = vp->full_duplex;
>  
>  		vortex_check_media(dev, 1);
>  	}
> _

Sorry for the late reply. I finally managed to get my notebook fixed so
that I could actually test this patch. I can confirm that it fixes my
duplex configuration problem. The steps described in
http://bugzilla.kernel.org/show_bug.cgi?id=8575
now result in a Ethernet chip properly configured for full duplex. Thanks
for the fix!

The only remaining issue I have the 3c59x driver is the time required until
it detects link loss when unplugging the Ethernet cable. At the moment,
this needs up to 60 seconds which makes this feature pretty useless. Other
drivers need 2-5 seconds for this which is roughly what I would have
expected. I've been using the patch below sucessfully for a few weeks now
which brings down this time to 5 seconds. Would be nice if somebody could
apply it.

Thanks,
Martin

--- drivers/net/3c59x.c.orig	2007-08-20 17:01:06.000000000 +0200
+++ drivers/net/3c59x.c	2007-08-20 17:02:38.000000000 +0200
@@ -1726,7 +1726,7 @@
 	struct net_device *dev = (struct net_device *)data;
 	struct vortex_private *vp = netdev_priv(dev);
 	void __iomem *ioaddr = vp->ioaddr;
-	int next_tick = 60*HZ;
+	int next_tick = 5*HZ;
 	int ok = 0;
 	int media_status, old_window;

@@ -1771,9 +1771,6 @@
 		ok = 1;
 	}

-	if (!netif_carrier_ok(dev))
-		next_tick = 5*HZ;
-
 	if (vp->medialock)
 		goto leave_media_alone;

^ permalink raw reply

* [PATCH V4 0/10] net/bonding: ADD IPoIB support for the bonding driver
From: Moni Shoua @ 2007-08-20 15:34 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general

This patch series is the fourth version (see below link to V3) of the 
suggested changes to the bonding driver so it would be able to support 
non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. 

The motivation is to enable the bonding driver on its HA mode to work with 
the IP over Infiniband (IPoIB) driver. With these patches I was able to enslave 
IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with 
fail-over and fail-back working fine. The working environment was the net-2.6 git. 

More over, as IPoIB is also the IB ARP provider for the RDMA CM driver which 
is used by native IB ULPs whose addressing scheme is based on IP (e.g. iSER, 
SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA 
for these ULPs. This holds as when the ULP is informed by the IB HW on the 
failure of the current IB connection, it just need to reconnect, where the 
bonding device will now issue the IB ARP over the active IPoIB slave. 

This series also includes patches to the IPoIB driver that fix some fix 
some neighboring related issues. 

Major changes from the previous version:

1) Addressing the issue of safety when unloading the IPoIB module before
the bonding module
2) style changes

Links to earlier discussion: 

1. A discussion in netdev about bonding support for IPoIB.
http://lists.openwall.net/netdev/2006/11/30/46

2. A discussion in openfabrics regarding changes in the IPoIB that 
enable using it as a slave for bonding.
http://lists.openfabrics.org/pipermail/general/2007-July/038914.html

^ permalink raw reply

* [ofa-general] [PATCH V4 1/10] IB/ipoib: Export call to call_netdevice_notifiers and add new private flag
From: Moni Shoua @ 2007-08-20 15:42 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>

Export the call to raw_notifier_call_chain so modules can send notifications
on netdev events to the netdev_chain.
Add IFF_SLAVE_DETACH to the list of priv_flags for net_device.
This flag is set by a slave that is about to unregisster from the kernel.

Both changes are used in bonding slaves that wish to inform the bonding master
about coming detachment.

Signed-off-by: Moni Shoua <monis@voltaire.com>
---
 include/linux/if.h |    1 +
 net/core/dev.c     |    1 +
 2 files changed, 2 insertions(+)

Index: net-2.6/net/core/dev.c
===================================================================
--- net-2.6.orig/net/core/dev.c	2007-08-15 10:09:02.000000000 +0300
+++ net-2.6/net/core/dev.c	2007-08-15 10:53:00.832543390 +0300
@@ -1148,6 +1148,7 @@ int call_netdevice_notifiers(unsigned lo
 {
 	return raw_notifier_call_chain(&netdev_chain, val, v);
 }
+EXPORT_SYMBOL(call_netdevice_notifiers);
 
 /* When > 0 there are consumers of rx skb time stamps */
 static atomic_t netstamp_needed = ATOMIC_INIT(0);
Index: net-2.6/include/linux/if.h
===================================================================
--- net-2.6.orig/include/linux/if.h	2007-08-20 14:30:39.000000000 +0300
+++ net-2.6/include/linux/if.h	2007-08-20 14:31:06.625174369 +0300
@@ -61,6 +61,7 @@
 #define IFF_MASTER_ALB	0x10		/* bonding master, balance-alb.	*/
 #define IFF_BONDING	0x20		/* bonding master or slave	*/
 #define IFF_SLAVE_NEEDARP 0x40		/* need ARPs for validation	*/
+#define IFF_SLAVE_DETACH 0x80		/* slave is about to unregister */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002

^ permalink raw reply

* Re: 2.6.23-rc3 and SKY2 driver issue
From: James Corey @ 2007-08-20 15:42 UTC (permalink / raw)
  To: Stephen Hemminger, Michal Piotrowski; +Cc: linux-kernel, Netdev
In-Reply-To: <20070816090556.6a5295e6@oldman>


--- Stephen Hemminger
<shemminger@linux-foundation.org> wrote:

> On Thu, 16 Aug 2007 10:25:45 +0200
> "Michal Piotrowski"
> <michal.k.k.piotrowski@gmail.com> wrote:
> 
> > [Adding Stephen and netdev to CC]
> > 
> > On 15/08/07, James Corey <ploversegg@yahoo.com>
> wrote:
> > >
> > > I tried running a D-link gig card on kernel
> 2.6.21.1
> > > and it came up fine, but when I did a sftp of
> > > an linux dvd ISO to it, the interface would lock
> > > up hard with the error
> > >
> > > eth1: hw csum failure.
> > >
> > > Call Trace:
> > >  <IRQ>  [<ffffffff804d3e31>]
> > > __skb_checksum_complete_head+0x46/0x5f
> > >  [<ffffffff804d3e56>]
> __skb_checksum_complete+0xc/0x11
> > >  [<ffffffff805046c7>] tcp_v4_rcv+0x157/0x810
> > >  [<ffffffff804d8a6f>] dev_queue_xmit+0x237/0x260
> > >  [<ffffffff80229990>]
> find_busiest_group+0x252/0x684
> > >  [<ffffffff804ead50>]
> ip_local_deliver+0xca/0x14c
> > >  [<ffffffff804eb24a>] ip_rcv+0x478/0x4ba
> > >  [<ffffffff803ec7d3>] sky2_poll+0x6f9/0x9b9
> > >  [<ffffffff8022bd5d>]
> > > run_rebalance_domains+0x13e/0x408
> > >  [<ffffffff804d877a>] net_rx_action+0xa8/0x166
> > >  [<ffffffff80235d62>] __do_softirq+0x55/0xc3
> > >  [<ffffffff8020a4ec>] call_softirq+0x1c/0x28
> > >  [<ffffffff8020b611>] do_softirq+0x2c/0x7d
> > >  [<ffffffff8020b8cf>] do_IRQ+0x13e/0x15f
> > >  [<ffffffff802086ce>] mwait_idle+0x0/0x46
> > >  [<ffffffff80209871>] ret_from_intr+0x0/0xa
> > >  <EOI>  [<ffffffff80208710>]
> mwait_idle+0x42/0x46
> > >  [<ffffffff80208666>] cpu_idle+0x8c/0xaf
> > >  [<ffffffff8078174e>] start_kernel+0x2ac/0x2b8
> > >  [<ffffffff80781140>] _sinittext+0x140/0x144
> > >
> > >
> > > So I tried running the latest snapshot
> 2.6.23-rc3
> > > and the almost the same thing happens. Only
> > > difference is that now the entire box locks up.
> > > The error is almost the same
> > >
> > > eth1: hw csum failure.
> > >
> > > Call Trace:
> > >  <IRQ>  [<ffffffff804779b6>]
> > > __skb_checksum_complete_head+0x43/0x56
> > >  [<ffffffff804779d5>]
> __skb_checksum_complete+0xc/0x11
> > >  [<ffffffff804a989d>] tcp_v4_rcv+0x14e/0x801
> > >  [<ffffffff8048ff84>]
> ip_local_deliver+0xca/0x14c
> > >  [<ffffffff80490472>] ip_rcv+0x46c/0x4ae
> > >  [<ffffffff880060f9>]
> :sky2:sky2_poll+0x72b/0x9c7
> > >  [<ffffffff8047c934>] net_rx_action+0xa8/0x166
> > >  [<ffffffff80235ced>] __do_softirq+0x55/0xc4
> > >  [<ffffffff8020c5cc>] call_softirq+0x1c/0x28
> > >  [<ffffffff8020d6fd>] do_softirq+0x2c/0x7d
> > >  [<ffffffff8020d9bb>] do_IRQ+0x13e/0x15f
> > >  [<ffffffff8020a780>] mwait_idle+0x0/0x48
> > >  [<ffffffff8020b951>] ret_from_intr+0x0/0xa
> > >  <EOI>  [<ffffffff880063e7>]
> > > :sky2:sky2_xmit_frame+0x0/0x41d
> > >  [<ffffffff8020a7c2>] mwait_idle+0x42/0x48
> > >  [<ffffffff8020a718>] cpu_idle+0xbd/0xe0
> > >  [<ffffffff80704a5a>] start_kernel+0x2ac/0x2b8
> > >  [<ffffffff80704140>] _sinittext+0x140/0x144
> > >
> > >
> > > I see that the new kernel includes some sort of
> > > SKY2 DEBUG stuff. I would be happy to rerun
> > > the test with that turned on, given some minor
> > > direction.
> > >
> > 
> > Regards,
> > Michal
> 
> Please reproduce with a more recent kernel?

Um, I thought 2.6.23rc WAS pretty recent. :-)

I'll check if there is something newer in the
repository now.

-J



       
____________________________________________________________________________________
Take the Internet to Go: Yahoo!Go puts the Internet in your pocket: mail, news, photos & more. 
http://mobile.yahoo.com/go?refer=1GNXIC

^ permalink raw reply

* [ofa-general] [PATCH V4 2/10] IB/ipoib: Notify the world before doing unregister
From: Moni Shoua @ 2007-08-20 15:43 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>

When the bonding device enslaves IPoIB devices it takes pointers to
functions in the ib_ipoib module. This is fine as long as the ib_ipoib
nodule remains loaded while the references to its functions exist.
So, to help bonding do a cleanup on time, when the IPoIB net device is a 
slave of a bonding master, let the master know that the IPoIB device is
about to unregister (but before calling unregister).

Signed-off-by: Moni Shoua <monis@voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-20 14:29:29.522209580 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-20 14:43:03.432162133 +0300
@@ -48,6 +48,7 @@
 #include <linux/in.h>
 
 #include <net/dst.h>
+#include <linux/netdevice.h>
 
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("IP-over-InfiniBand net driver");
@@ -772,6 +773,18 @@ static void ipoib_timeout(struct net_dev
 	/* XXX reset QP, etc. */
 }
 
+static int ipoib_slave_detach(struct net_device *dev)
+{
+	int ret = 0;
+	if (dev->flags & IFF_SLAVE) {
+		dev->priv_flags |= IFF_SLAVE_DETACH;
+		rtnl_lock();
+		ret = call_netdevice_notifiers(NETDEV_CHANGE, dev);
+		rtnl_unlock();
+	}
+	return ret;
+}
+
 static int ipoib_hard_header(struct sk_buff *skb,
 			     struct net_device *dev,
 			     unsigned short type,
@@ -921,6 +934,7 @@ void ipoib_dev_cleanup(struct net_device
 
 	/* Delete any child interfaces first */
 	list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
+		ipoib_slave_detach(cpriv->dev);
 		unregister_netdev(cpriv->dev);
 		ipoib_dev_cleanup(cpriv->dev);
 		free_netdev(cpriv->dev);
@@ -1208,6 +1222,7 @@ static void ipoib_remove_one(struct ib_d
 		ib_unregister_event_handler(&priv->event_handler);
 		flush_scheduled_work();
 
+		ipoib_slave_detach(priv->dev);
 		unregister_netdev(priv->dev);
 		ipoib_dev_cleanup(priv->dev);
 		free_netdev(priv->dev);

^ permalink raw reply

* [ofa-general] [PATCH V4 3/10] IB/ipoib: Bound the net device to the ipoib_neigh structue
From: Moni Shoua @ 2007-08-20 15:44 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>

IPoIB uses a two layer neighboring scheme, such that for each struct neighbour
whose device is an ipoib one, there is a struct ipoib_neigh buddy which is
created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour)
call.

When using the bonding driver, neighbours are created by the net stack on behalf
of the bonding (master) device. On the tx flow the bonding code gets an skb such
that skb->dev points to the master device, it changes this skb to point on the
slave device and calls the slave hard_start_xmit function.

Under this scheme, ipoib_neigh_destructor assumption that for each struct
neighbour it gets, n->dev is an ipoib device and hence netdev_priv(n->dev)
can be casted to struct ipoib_dev_priv is buggy.

To fix it, this patch adds a dev field to struct ipoib_neigh which is used
instead of the struct neighbour dev one, when n->dev->flags has the
IFF_MASTER bit set.

Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |    4 +++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   17 +++++++++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    3 ++-
 3 files changed, 20 insertions(+), 4 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2007-08-15 10:09:00.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib.h	2007-08-15 10:53:52.756348574 +0300
@@ -328,6 +328,7 @@ struct ipoib_neigh {
 	struct sk_buff_head queue;
 
 	struct neighbour   *neighbour;
+	struct net_device *dev;
 
 	struct list_head    list;
 };
@@ -344,7 +345,8 @@ static inline struct ipoib_neigh **to_ip
 				     INFINIBAND_ALEN, sizeof(void *));
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh);
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh,
+				      struct net_device *dev);
 void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh);
 
 extern struct workqueue_struct *ipoib_workqueue;
Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 10:53:28.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 10:53:52.757348397 +0300
@@ -511,7 +511,7 @@ static void neigh_add_path(struct sk_buf
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 
-	neigh = ipoib_neigh_alloc(skb->dst->neighbour);
+	neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev);
 	if (!neigh) {
 		++priv->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
@@ -830,6 +830,17 @@ static void ipoib_neigh_cleanup(struct n
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;
 
+	if (n->dev->flags & IFF_MASTER) {
+		/* n->dev is not an IPoIB device and we have
+			to take priv from elsewhere */
+		neigh = *to_ipoib_neigh(n);
+		if (neigh) {
+			priv = netdev_priv(neigh->dev);
+			ipoib_dbg(priv, "neigh_destructor for bonding device: %s\n",
+				  n->dev->name);
+		} else
+			return;
+	}
 	ipoib_dbg(priv,
 		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
@@ -851,7 +862,8 @@ static void ipoib_neigh_cleanup(struct n
 		ipoib_put_ah(ah);
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
+struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
+				      struct net_device *dev)
 {
 	struct ipoib_neigh *neigh;
 
@@ -860,6 +872,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(st
 		return NULL;
 
 	neigh->neighbour = neighbour;
+	neigh->dev = dev;
 	*to_ipoib_neigh(neighbour) = neigh;
 	skb_queue_head_init(&neigh->queue);
 	ipoib_cm_set(neigh, NULL);
Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-08-15 10:09:00.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-08-15 10:53:52.758348220 +0300
@@ -727,7 +727,8 @@ out:
 		if (skb->dst            &&
 		    skb->dst->neighbour &&
 		    !*to_ipoib_neigh(skb->dst->neighbour)) {
-			struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour);
+			struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour,
+									skb->dev);
 
 			if (neigh) {
 				kref_get(&mcast->ah->ref);

^ permalink raw reply

* [PATCH V4 4/10] IB/ipoib: Verify address handle validity on send
From: Moni Shoua @ 2007-08-20 15:46 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>

When the bonding device senses a carrier loss of its active slave it replaces
that slave with a new one. In between the times when the carrier of an IPoIB
device goes down and ipoib_neigh is destroyed, it is possible that the
bonding driver will send a packet on a new slave that uses an old ipoib_neigh.
This patch detects and prevents this from happenning.

Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

Index: net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- net-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 10:53:52.000000000 +0300
+++ net-2.6/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-08-15 10:54:03.959364640 +0300
@@ -686,9 +686,10 @@ static int ipoib_start_xmit(struct sk_bu
 				goto out;
 			}
 		} else if (neigh->ah) {
-			if (unlikely(memcmp(&neigh->dgid.raw,
+			if (unlikely((memcmp(&neigh->dgid.raw,
 					    skb->dst->neighbour->ha + 4,
-					    sizeof(union ib_gid)))) {
+					    sizeof(union ib_gid))) ||
+						 (neigh->dev != dev))) {
 				spin_lock(&priv->lock);
 				/*
 				 * It's safe to call ipoib_put_ah() inside


^ permalink raw reply

* [PATCH V4 5/10] net/bonding: Enable bonding to enslave non ARPHRD_ETHER
From: Moni Shoua @ 2007-08-20 15:48 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>

This patch changes some of the bond netdevice attributes and functions
to be that of the active slave for the case of the enslaved device not being
of ARPHRD_ETHER type. Basically it overrides those setting done by ether_setup(),
which are netdevice **type** dependent and hence might be not appropriate for
devices of other types. It also enforces mutual exclusion on bonding slaves
from dissimilar ether types, as was concluded over the v1 discussion.

IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a 3 bytes
IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID) of the port this
IPoIB device is bounded to. The QP is a resource created by the IB HW and the
GID is an identifier burned into the HCA (i have omitted here some details which
are not important for the bonding RFC).

Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
---
 drivers/net/bonding/bond_main.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+)

Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:08:59.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:54:13.424688411 +0300
@@ -1237,6 +1237,26 @@ static int bond_compute_features(struct 
 	return 0;
 }
 
+
+static void bond_setup_by_slave(struct net_device *bond_dev,
+				struct net_device *slave_dev)
+{
+	bond_dev->hard_header	        = slave_dev->hard_header;
+	bond_dev->rebuild_header        = slave_dev->rebuild_header;
+	bond_dev->hard_header_cache	= slave_dev->hard_header_cache;
+	bond_dev->header_cache_update   = slave_dev->header_cache_update;
+	bond_dev->hard_header_parse	= slave_dev->hard_header_parse;
+
+	bond_dev->neigh_setup           = slave_dev->neigh_setup;
+
+	bond_dev->type		    = slave_dev->type;
+	bond_dev->hard_header_len   = slave_dev->hard_header_len;
+	bond_dev->addr_len	    = slave_dev->addr_len;
+
+	memcpy(bond_dev->broadcast, slave_dev->broadcast,
+		slave_dev->addr_len);
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1311,6 +1331,25 @@ int bond_enslave(struct net_device *bond
 		goto err_undo_flags;
 	}
 
+	/* set bonding device ether type by slave - bonding netdevices are
+	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
+	 * there is a need to override some of the type dependent attribs/funcs.
+	 *
+	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
+	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
+	 */
+	if (bond->slave_cnt == 0) {
+		if (slave_dev->type != ARPHRD_ETHER)
+			bond_setup_by_slave(bond_dev, slave_dev);
+	} else if (bond_dev->type != slave_dev->type) {
+		printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different "
+			"from other slaves (%d), can not enslave it.\n",
+			slave_dev->name,
+			slave_dev->type, bond_dev->type);
+			res = -EINVAL;
+			goto err_undo_flags;
+	}
+
 	if (slave_dev->set_mac_address == NULL) {
 		printk(KERN_ERR DRV_NAME
 			": %s: Error: The slave device you specified does "


^ permalink raw reply

* [PATCH V4 6/10] net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address()
From: Moni Shoua @ 2007-08-20 15:49 UTC (permalink / raw)
  To: rdreier, davem, fubar; +Cc: netdev, general
In-Reply-To: <46C9B474.5020202@voltaire.com>

This patch allows for enslaving netdevices which do not support
the set_mac_address() function. In that case the bond mac address is the one
of the active slave, where remote peers are notified on the mac address
(neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs
(this is already done by the bonding code).

Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
---
 drivers/net/bonding/bond_main.c |   87 +++++++++++++++++++++++++++-------------
 drivers/net/bonding/bonding.h   |    1 
 2 files changed, 60 insertions(+), 28 deletions(-)

Index: net-2.6/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.orig/drivers/net/bonding/bond_main.c	2007-08-15 10:54:13.000000000 +0300
+++ net-2.6/drivers/net/bonding/bond_main.c	2007-08-15 10:54:41.971632881 +0300
@@ -1095,6 +1095,14 @@ void bond_change_active_slave(struct bon
 		if (new_active) {
 			bond_set_slave_active_flags(new_active);
 		}
+
+		/* when bonding does not set the slave MAC address, the bond MAC
+		 * address is the one of the active slave.
+		 */
+		if (new_active && !bond->do_set_mac_addr)
+			memcpy(bond->dev->dev_addr,  new_active->dev->dev_addr,
+				new_active->dev->addr_len);
+
 		bond_send_gratuitous_arp(bond);
 	}
 }
@@ -1351,13 +1359,22 @@ int bond_enslave(struct net_device *bond
 	}
 
 	if (slave_dev->set_mac_address == NULL) {
-		printk(KERN_ERR DRV_NAME
-			": %s: Error: The slave device you specified does "
-			"not support setting the MAC address. "
-			"Your kernel likely does not support slave "
-			"devices.\n", bond_dev->name);
-  		res = -EOPNOTSUPP;
-		goto err_undo_flags;
+		if (bond->slave_cnt == 0) {
+			printk(KERN_WARNING DRV_NAME
+				": %s: Warning: The first slave device you "
+				"specified does not support setting the MAC "
+				"address. This bond MAC address would be that "
+				"of the active slave.\n", bond_dev->name);
+			bond->do_set_mac_addr = 0;
+		} else if (bond->do_set_mac_addr) {
+			printk(KERN_ERR DRV_NAME
+				": %s: Error: The slave device you specified "
+				"does not support setting the MAC addres,."
+				"but this bond uses this practice. \n"
+				, bond_dev->name);
+			res = -EOPNOTSUPP;
+			goto err_undo_flags;
+		}
 	}
 
 	new_slave = kzalloc(sizeof(struct slave), GFP_KERNEL);
@@ -1378,16 +1395,18 @@ int bond_enslave(struct net_device *bond
 	 */
 	memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);
 
-	/*
-	 * Set slave to master's mac address.  The application already
-	 * set the master's mac address to that of the first slave
-	 */
-	memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
-	addr.sa_family = slave_dev->type;
-	res = dev_set_mac_address(slave_dev, &addr);
-	if (res) {
-		dprintk("Error %d calling set_mac_address\n", res);
-		goto err_free;
+	if (bond->do_set_mac_addr) {
+		/*
+		 * Set slave to master's mac address.  The application already
+		 * set the master's mac address to that of the first slave
+		 */
+		memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
+		addr.sa_family = slave_dev->type;
+		res = dev_set_mac_address(slave_dev, &addr);
+		if (res) {
+			dprintk("Error %d calling set_mac_address\n", res);
+			goto err_free;
+		}
 	}
 
 	res = netdev_set_master(slave_dev, bond_dev);
@@ -1612,9 +1631,11 @@ err_close:
 	dev_close(slave_dev);
 
 err_restore_mac:
-	memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
-	addr.sa_family = slave_dev->type;
-	dev_set_mac_address(slave_dev, &addr);
+	if (bond->do_set_mac_addr) {
+		memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
+		addr.sa_family = slave_dev->type;
+		dev_set_mac_address(slave_dev, &addr);
+	}
 
 err_free:
 	kfree(new_slave);
@@ -1792,10 +1813,12 @@ int bond_release(struct net_device *bond
 	/* close slave before restoring its mac address */
 	dev_close(slave_dev);
 
-	/* restore original ("permanent") mac address */
-	memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
-	addr.sa_family = slave_dev->type;
-	dev_set_mac_address(slave_dev, &addr);
+	if (bond->do_set_mac_addr) {
+		/* restore original ("permanent") mac address */
+		memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+		addr.sa_family = slave_dev->type;
+		dev_set_mac_address(slave_dev, &addr);
+	}
 
 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
 				   IFF_SLAVE_INACTIVE | IFF_BONDING |
@@ -1882,10 +1905,12 @@ static int bond_release_all(struct net_d
 		/* close slave before restoring its mac address */
 		dev_close(slave_dev);
 
-		/* restore original ("permanent") mac address*/
-		memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
-		addr.sa_family = slave_dev->type;
-		dev_set_mac_address(slave_dev, &addr);
+		if (bond->do_set_mac_addr) {
+			/* restore original ("permanent") mac address*/
+			memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+			addr.sa_family = slave_dev->type;
+			dev_set_mac_address(slave_dev, &addr);
+		}
 
 		slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
 					   IFF_SLAVE_INACTIVE);
@@ -3922,6 +3947,9 @@ static int bond_set_mac_address(struct n
 
 	dprintk("bond=%p, name=%s\n", bond, (bond_dev ? bond_dev->name : "None"));
 
+	if (!bond->do_set_mac_addr)
+		return -EOPNOTSUPP;
+
 	if (!is_valid_ether_addr(sa->sa_data)) {
 		return -EADDRNOTAVAIL;
 	}
@@ -4312,6 +4340,9 @@ static int bond_init(struct net_device *
 	bond_create_proc_entry(bond);
 #endif
 
+	/* set do_set_mac_addr to true on startup */
+	bond->do_set_mac_addr = 1;
+
 	list_add_tail(&bond->bond_list, &bond_dev_list);
 
 	return 0;
Index: net-2.6/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.orig/drivers/net/bonding/bonding.h	2007-08-15 10:08:58.000000000 +0300
+++ net-2.6/drivers/net/bonding/bonding.h	2007-08-15 10:55:34.359354833 +0300
@@ -185,6 +185,7 @@ struct bonding {
 	struct   timer_list mii_timer;
 	struct   timer_list arp_timer;
 	s8       kill_timers;
+	s8       do_set_mac_addr;
 	struct   net_device_stats stats;
 #ifdef CONFIG_PROC_FS
 	struct   proc_dir_entry *proc_entry;


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox