Netdev List
* [patch net-2.6.25 04/10][NETNS][IPV6] make the ipv6 sysctl to be a netns subsystem
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery
In-Reply-To: <20080109164533.695191040@localhost.localdomain>

[-- Attachment #1: sysctl/make-ipv6-sysctl-to-be-a-subsystem.patch --]
[-- Type: text/plain, Size: 1654 bytes --]

The initialization of the sysctl for the ipv6 protocol is changed to
a network namespace subsystem. That means when a new network namespace
is created the initialization function for the sysctl will be called.

This does not change the behavior of the sysctl when the kernel is
built with network namespaces disabled.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 net/ipv6/sysctl_net_ipv6.c |   23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

Index: net-2.6.25/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/sysctl_net_ipv6.c
+++ net-2.6.25/net/ipv6/sysctl_net_ipv6.c
@@ -91,10 +91,10 @@ EXPORT_SYMBOL_GPL(net_ipv6_ctl_path);
 
 static struct ctl_table_header *ipv6_sysctl_header;
 
-int ipv6_sysctl_register(void)
+static int ipv6_sysctl_net_init(struct net *net)
 {
-	ipv6_sysctl_header = register_sysctl_paths(net_ipv6_ctl_path,
-						   ipv6_table);
+	ipv6_sysctl_header = register_net_sysctl_table(net, net_ipv6_ctl_path,
+						       ipv6_table);
 	if (!ipv6_sysctl_header)
 		return -ENOMEM;
 
@@ -102,7 +102,22 @@ int ipv6_sysctl_register(void)
 
 }
 
+static void ipv6_sysctl_net_exit(struct net *net)
+{
+	unregister_net_sysctl_table(ipv6_sysctl_header);
+}
+
+static struct pernet_operations ipv6_sysctl_net_ops = {
+	.init = ipv6_sysctl_net_init,
+	.exit = ipv6_sysctl_net_exit,
+};
+
+int ipv6_sysctl_register(void)
+{
+	return register_pernet_subsys(&ipv6_sysctl_net_ops);
+}
+
 void ipv6_sysctl_unregister(void)
 {
-	unregister_sysctl_table(ipv6_sysctl_header);
+	unregister_pernet_subsys(&ipv6_sysctl_net_ops);
 }

-- 
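
Editorial sketch, not part of the posted patch: the userspace C below
models the behaviour the description relies on, namely that a registered
init hook runs for every namespace that already exists and again for each
namespace created later. Only the .init/.exit shape of
struct pernet_operations comes from the patch; the fixed-size arrays,
register_pernet_subsys_sketch(), add_net_sketch(), the fields of
struct net and the demo_* callbacks are assumptions made up for
illustration.

```c
#include <stddef.h>

#define MAX_NETS 8
#define MAX_OPS  8

/* Hypothetical stand-in for the kernel's struct net. */
struct net {
	int id;
	int sysctl_ready;	/* set by the demo init hook */
};

struct pernet_operations {
	int  (*init)(struct net *net);
	void (*exit)(struct net *net);
};

static struct net *nets[MAX_NETS];
static size_t nr_nets;
static struct pernet_operations *ops_list[MAX_OPS];
static size_t nr_ops;

/* Register ops and run ->init on every namespace that already exists. */
static int register_pernet_subsys_sketch(struct pernet_operations *ops)
{
	size_t i;
	int err;

	if (nr_ops == MAX_OPS)
		return -1;
	for (i = 0; i < nr_nets; i++) {
		err = ops->init(nets[i]);
		if (err) {
			/* Undo the namespaces that were already initialised. */
			while (i-- > 0)
				ops->exit(nets[i]);
			return err;
		}
	}
	ops_list[nr_ops++] = ops;
	return 0;
}

/* Create a namespace: run ->init of every registered subsystem on it. */
static int add_net_sketch(struct net *net)
{
	size_t i;
	int err;

	if (nr_nets == MAX_NETS)
		return -1;
	for (i = 0; i < nr_ops; i++) {
		err = ops_list[i]->init(net);
		if (err) {
			while (i-- > 0)
				ops_list[i]->exit(net);
			return err;
		}
	}
	nets[nr_nets++] = net;
	return 0;
}

static int demo_sysctl_net_init(struct net *net)
{
	net->sysctl_ready = 1;
	return 0;
}

static void demo_sysctl_net_exit(struct net *net)
{
	net->sysctl_ready = 0;
}

static struct pernet_operations demo_sysctl_net_ops = {
	.init = demo_sysctl_net_init,
	.exit = demo_sysctl_net_exit,
};
```

Registering demo_sysctl_net_ops while one namespace exists marks that
namespace as ready; a namespace added afterwards gets the same treatment
without any further call from the subsystem.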

* [patch net-2.6.25 01/10][NETNS][IPV6] make ipv6_sysctl_register to return a value
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery
In-Reply-To: <20080109164533.695191040@localhost.localdomain>

[-- Attachment #1: sysctl/ipv6-sysctl-register-return-value.patch --]
[-- Type: text/plain, Size: 2035 bytes --]

This patch makes ipv6_sysctl_register() return a value, so the
af_inet6 init function can now catch and handle an error from the
sysctl initialization.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/ipv6.h         |    2 +-
 net/ipv6/af_inet6.c        |    5 ++++-
 net/ipv6/sysctl_net_ipv6.c |    9 +++++++--
 3 files changed, 12 insertions(+), 4 deletions(-)

Index: net-2.6.25/include/net/ipv6.h
===================================================================
--- net-2.6.25.orig/include/net/ipv6.h
+++ net-2.6.25/include/net/ipv6.h
@@ -622,7 +622,7 @@ static inline int snmp6_unregister_dev(s
 extern ctl_table ipv6_route_table[];
 extern ctl_table ipv6_icmp_table[];
 
-extern void ipv6_sysctl_register(void);
+extern int ipv6_sysctl_register(void);
 extern void ipv6_sysctl_unregister(void);
 #endif
 
Index: net-2.6.25/net/ipv6/af_inet6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/af_inet6.c
+++ net-2.6.25/net/ipv6/af_inet6.c
@@ -783,7 +783,9 @@ static int __init inet6_init(void)
 	 */
 
 #ifdef CONFIG_SYSCTL
-	ipv6_sysctl_register();
+	err = ipv6_sysctl_register();
+	if (err)
+		goto sysctl_fail;
 #endif
 	err = icmpv6_init(&inet6_family_ops);
 	if (err)
@@ -897,6 +899,7 @@ ndisc_fail:
 icmp_fail:
 #ifdef CONFIG_SYSCTL
 	ipv6_sysctl_unregister();
+sysctl_fail:
 #endif
 	cleanup_ipv6_mibs();
 out_unregister_sock:
Index: net-2.6.25/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/sysctl_net_ipv6.c
+++ net-2.6.25/net/ipv6/sysctl_net_ipv6.c
@@ -91,10 +91,15 @@ EXPORT_SYMBOL_GPL(net_ipv6_ctl_path);
 
 static struct ctl_table_header *ipv6_sysctl_header;
 
-void ipv6_sysctl_register(void)
+int ipv6_sysctl_register(void)
 {
 	ipv6_sysctl_header = register_sysctl_paths(net_ipv6_ctl_path,
-			ipv6_table);
+						   ipv6_table);
+	if (!ipv6_sysctl_header)
+		return -ENOMEM;
+
+	return 0;
+
 }
 
 void ipv6_sysctl_unregister(void)

-- 
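
Editorial sketch, not part of the posted patch: the goto-based unwind
that the new sysctl_fail label plugs into can be modelled in userspace C
as below. Each label undoes only the steps that succeeded before the
failure. step_init(), step_exit(), fail_at and steps_done are
hypothetical helpers; the comments say which call from the patch each
step loosely stands in for.

```c
#include <errno.h>

static int steps_done;	/* bitmask of steps currently set up */
static int fail_at;	/* hypothetical knob: step that should fail, 0 = none */

static int step_init(int n)
{
	if (fail_at == n)
		return -ENOMEM;
	steps_done |= 1 << n;
	return 0;
}

static void step_exit(int n)
{
	steps_done &= ~(1 << n);
}

static int stack_init_sketch(void)
{
	int err;

	err = step_init(1);	/* earlier setup, undone by cleanup_ipv6_mibs() */
	if (err)
		goto out;
	err = step_init(2);	/* stands in for ipv6_sysctl_register() */
	if (err)
		goto sysctl_fail;
	err = step_init(3);	/* stands in for icmpv6_init() */
	if (err)
		goto icmp_fail;
	return 0;

icmp_fail:
	step_exit(2);		/* sysctl was registered, so unregister it */
sysctl_fail:
	step_exit(1);		/* sysctl never registered, skip its cleanup */
out:
	return err;
}
```

With fail_at set to 3 the function returns -ENOMEM and leaves
steps_done at zero, which is the property the extra label preserves once
the sysctl registration itself can fail.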

* [patch net-2.6.25 06/10][NETNS][IPV6] make bindv6only sysctl per namespace
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery
In-Reply-To: <20080109164533.695191040@localhost.localdomain>

[-- Attachment #1: sysctl/move-bindv6only-to-netns.patch --]
[-- Type: text/plain, Size: 2924 bytes --]

This patch moves the bindv6only sysctl to the network namespace
structure. Until the ipv6 protocol itself is per namespace, the value
used is always the one from the initial network namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/ipv6.h         |    1 -
 include/net/netns/ipv6.h   |    1 +
 net/ipv6/af_inet6.c        |    5 ++---
 net/ipv6/sysctl_net_ipv6.c |    4 +++-
 4 files changed, 6 insertions(+), 5 deletions(-)

Index: net-2.6.25/include/net/ipv6.h
===================================================================
--- net-2.6.25.orig/include/net/ipv6.h
+++ net-2.6.25/include/net/ipv6.h
@@ -109,7 +109,6 @@ struct frag_hdr {
 #include <net/sock.h>
 
 /* sysctls */
-extern int sysctl_ipv6_bindv6only;
 extern int sysctl_mld_max_msf;
 
 extern struct ctl_path net_ipv6_ctl_path[];
Index: net-2.6.25/include/net/netns/ipv6.h
===================================================================
--- net-2.6.25.orig/include/net/netns/ipv6.h
+++ net-2.6.25/include/net/netns/ipv6.h
@@ -11,6 +11,7 @@ struct netns_sysctl_ipv6 {
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header *table;
 #endif
+	int bindv6only;
 };
 
 struct netns_ipv6 {
Index: net-2.6.25/net/ipv6/af_inet6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/af_inet6.c
+++ net-2.6.25/net/ipv6/af_inet6.c
@@ -66,8 +66,6 @@ MODULE_AUTHOR("Cast of dozens");
 MODULE_DESCRIPTION("IPv6 protocol stack for Linux");
 MODULE_LICENSE("GPL");
 
-int sysctl_ipv6_bindv6only __read_mostly;
-
 /* The inetsw6 table contains everything that inet6_create needs to
  * build a new socket.
  */
@@ -193,7 +191,7 @@ lookup_protocol:
 	np->mcast_hops	= -1;
 	np->mc_loop	= 1;
 	np->pmtudisc	= IPV6_PMTUDISC_WANT;
-	np->ipv6only	= sysctl_ipv6_bindv6only;
+	np->ipv6only	= init_net.ipv6.sysctl.bindv6only;
 
 	/* Init the ipv4 part of the socket since we can have sockets
 	 * using v6 API for ipv4.
@@ -721,6 +719,7 @@ static void cleanup_ipv6_mibs(void)
 
 static int inet6_net_init(struct net *net)
 {
+	net->ipv6.sysctl.bindv6only = 0;
 	return 0;
 }
 
Index: net-2.6.25/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/sysctl_net_ipv6.c
+++ net-2.6.25/net/ipv6/sysctl_net_ipv6.c
@@ -35,7 +35,7 @@ static ctl_table ipv6_table_template[] =
 	{
 		.ctl_name	= NET_IPV6_BINDV6ONLY,
 		.procname	= "bindv6only",
-		.data		= &sysctl_ipv6_bindv6only,
+		.data		= &init_net.ipv6.sysctl.bindv6only,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
@@ -116,6 +116,8 @@ static int ipv6_sysctl_net_init(struct n
      	ipv6_table[0].child = ipv6_route_table;
      	ipv6_table[1].child = ipv6_icmp_table;
 
+	ipv6_table[2].data = &net->ipv6.sysctl.bindv6only;
+
 	net->ipv6.sysctl.table = register_net_sysctl_table(net, net_ipv6_ctl_path,
 							   ipv6_table);
 	if (!net->ipv6.sysctl.table)

-- 
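
Editorial sketch, not part of the posted patch: the last hunk repoints
a table entry's .data at the namespace's own variable. The userspace C
below shows that idea under the assumption that each namespace works on
its own copy of a template table (the hunk header mentions
ipv6_table_template, but the copy itself is not visible in this mail).
All names ending in _sketch, and table_template, are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much reduced stand-in for a ctl_table entry. */
struct ctl_entry_sketch {
	const char *procname;
	int *data;
	int mode;
};

/* Hypothetical stand-in for the per-namespace ipv6 sysctl state. */
struct netns_sketch {
	int bindv6only;
	struct ctl_entry_sketch *table;
};

static struct netns_sketch init_ns_sketch;

/* The template points at the initial namespace, as in the patch. */
static const struct ctl_entry_sketch table_template[] = {
	{ "bindv6only", &init_ns_sketch.bindv6only, 0644 },
	{ NULL, NULL, 0 },
};

/* Copy the template and repoint .data at this namespace's variable. */
static int ns_sysctl_init_sketch(struct netns_sketch *ns)
{
	struct ctl_entry_sketch *t = malloc(sizeof(table_template));

	if (!t)
		return -1;
	memcpy(t, table_template, sizeof(table_template));
	t[0].data = &ns->bindv6only;
	ns->table = t;
	return 0;
}

static void ns_sysctl_exit_sketch(struct netns_sketch *ns)
{
	free(ns->table);
	ns->table = NULL;
}
```

A write through one namespace's table entry then changes only that
namespace's bindv6only and leaves the initial namespace and its
siblings untouched.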

* [patch net-2.6.25 03/10][NETNS][IPV6] add ipv6 structure for netns
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery
In-Reply-To: <20080109164533.695191040@localhost.localdomain>

[-- Attachment #1: add-ipv6-for-netns.patch --]
[-- Type: text/plain, Size: 1248 bytes --]

Like the ipv4 part, this patch adds an ipv6 structure in the net structure
to aggregate the different resources to make ipv6 per namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/net_namespace.h |    4 ++++
 include/net/netns/ipv6.h    |   10 ++++++++++
 2 files changed, 14 insertions(+)

Index: net-2.6.25/include/net/net_namespace.h
===================================================================
--- net-2.6.25.orig/include/net/net_namespace.h
+++ net-2.6.25/include/net/net_namespace.h
@@ -11,6 +11,7 @@
 #include <net/netns/unix.h>
 #include <net/netns/packet.h>
 #include <net/netns/ipv4.h>
+#include <net/netns/ipv6.h>
 
 struct proc_dir_entry;
 struct net_device;
@@ -48,6 +49,9 @@ struct net {
 	struct netns_packet	packet;
 	struct netns_unix	unx;
 	struct netns_ipv4	ipv4;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	struct netns_ipv6	ipv6;
+#endif
 };
 
 #ifdef CONFIG_NET
Index: net-2.6.25/include/net/netns/ipv6.h
===================================================================
--- /dev/null
+++ net-2.6.25/include/net/netns/ipv6.h
@@ -0,0 +1,10 @@
+/*
+ * ipv6 in net namespaces
+ */
+
+#ifndef __NETNS_IPV6_H__
+#define __NETNS_IPV6_H__
+
+struct netns_ipv6 {
+};
+#endif

-- 

* [patch net-2.6.25 08/10][NETNS][IPV6] make mld_max_msf readonly in other namespaces
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery
In-Reply-To: <20080109164533.695191040@localhost.localdomain>

[-- Attachment #1: make-mld_max_msf-readonly.patch --]
[-- Type: text/plain, Size: 1366 bytes --]

The mld_max_msf sysctl protects the system by capping the number of
multicast source filters. Making this variable per namespace is
potentially a problem: if someone inside a namespace sets it to a big
value, that will impact the whole system, including other namespaces.

I don't see any benefit in having it per namespace for now, so in order
to keep a directory entry in a newly created namespace, I make it
read-only when we are not in the initial network namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 net/ipv6/sysctl_net_ipv6.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: net-2.6.25/net/ipv6/sysctl_net_ipv6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/sysctl_net_ipv6.c
+++ net-2.6.25/net/ipv6/sysctl_net_ipv6.c
@@ -122,6 +122,12 @@ static int ipv6_sysctl_net_init(struct n
       	ipv6_table[5].data = &net->ipv6.sysctl.frags.timeout;
     	ipv6_table[6].data = &net->ipv6.sysctl.frags.secret_interval;
 
+ 	/* We don't want this value to be per namespace, it should be global
+	   to all namespaces, so make it read-only when we are not in the
+	   init network namespace */
+    	if (net != &init_net)
+    		ipv6_table[7].mode = 0444;
+
 	net->ipv6.sysctl.table = register_net_sysctl_table(net, net_ipv6_ctl_path,
 							   ipv6_table);
 	if (!net->ipv6.sysctl.table)

-- 
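
Editorial sketch, not part of the posted patch: the effect of the 0444
mode can be modelled in userspace C as below. The knob stays visible in
every namespace, but only the initial namespace may change it.
struct ns_sketch, init_ns, global_knob_mode() and knob_write() are
hypothetical names, and the permission check is a simplification of
what the sysctl layer does.

```c
#include <errno.h>

/* Hypothetical stand-in for struct net. */
struct ns_sketch {
	int id;
};

static struct ns_sketch init_ns;	/* stands in for init_net */

/* Mode of a system-wide knob as exposed inside namespace ns. */
static int global_knob_mode(const struct ns_sketch *ns)
{
	return ns == &init_ns ? 0644 : 0444;
}

/* Emulate a write: refused unless the owner-write bit is set. */
static int knob_write(const struct ns_sketch *ns, int *knob, int val)
{
	if (!(global_knob_mode(ns) & 0200))
		return -EACCES;
	*knob = val;
	return 0;
}
```

A write attempted from a child namespace fails with -EACCES and leaves
the shared value alone, while the same write from the initial namespace
succeeds.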

* [patch net-2.6.25 02/10][NETNS][IPV6] make a subsystem for af_inet6
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery
In-Reply-To: <20080109164533.695191040@localhost.localdomain>

[-- Attachment #1: make-af-inet6-a-subsystem.patch --]
[-- Type: text/plain, Size: 1965 bytes --]

This patch adds a network namespace subsystem for the af_inet6 module.
It does nothing right now, but one of its purposes is to receive the
different sysctl variables in order to initialize them.

When the sysctl variables are moved to the network namespace structure,
they will no longer be initialized as global static variables, so we must
find a place to initialize them. Because sysctl support can be disabled,
it makes no sense to store them in the sysctl_net_ipv6 file.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 net/ipv6/af_inet6.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Index: net-2.6.25/net/ipv6/af_inet6.c
===================================================================
--- net-2.6.25.orig/net/ipv6/af_inet6.c
+++ net-2.6.25/net/ipv6/af_inet6.c
@@ -719,6 +719,21 @@ static void cleanup_ipv6_mibs(void)
 	snmp_mib_free((void **)udplite_stats_in6);
 }
 
+static int inet6_net_init(struct net *net)
+{
+	return 0;
+}
+
+static void inet6_net_exit(struct net *net)
+{
+	return;
+}
+
+static struct pernet_operations inet6_net_ops = {
+	.init = inet6_net_init,
+	.exit = inet6_net_exit,
+};
+
 static int __init inet6_init(void)
 {
 	struct sk_buff *dummy_skb;
@@ -782,6 +797,10 @@ static int __init inet6_init(void)
 	 *	able to communicate via both network protocols.
 	 */
 
+	err = register_pernet_subsys(&inet6_net_ops);
+	if (err)
+		goto register_pernet_fail;
+
 #ifdef CONFIG_SYSCTL
 	err = ipv6_sysctl_register();
 	if (err)
@@ -901,6 +920,8 @@ icmp_fail:
 	ipv6_sysctl_unregister();
 sysctl_fail:
 #endif
+	unregister_pernet_subsys(&inet6_net_ops);
+register_pernet_fail:
 	cleanup_ipv6_mibs();
 out_unregister_sock:
 	sock_unregister(PF_INET6);
@@ -956,6 +977,7 @@ static void __exit inet6_exit(void)
 #ifdef CONFIG_SYSCTL
 	ipv6_sysctl_unregister();
 #endif
+	unregister_pernet_subsys(&inet6_net_ops);
 	cleanup_ipv6_mibs();
 	proto_unregister(&rawv6_prot);
 	proto_unregister(&udplitev6_prot);

-- 

* [patch net-2.6.25 00/10][NETNS][IPV6] make sysctl per namespace - V3
From: Daniel Lezcano @ 2008-01-09 16:45 UTC (permalink / raw)
  To: davem; +Cc: netdev, benjamin.thery

The following patchset makes the ipv6 sysctls handle multiple
network namespaces. Each instance of a network namespace has its own
set of sysctl values, which means the behavior of the ipv6 stack can
differ depending on the sysctl values set up in the different
network namespaces.

Changelog:
	V3 : fixed compilation error when CONFIG_SYSCTL=n,
	     fixed missing initialization when CONFIG_SYSCTL=n

	V2 : make the mld_max_msf variable readonly when we are
	     not in the initial network namespace

	V1 : initial post

-- 

* Re: Linux IPv6 DAD not full conform to RFC 4862 ?
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2008-01-09 16:40 UTC (permalink / raw)
  To: kkeil; +Cc: netdev, yoshfuji
In-Reply-To: <20080110.013857.37616214.yoshfuji@linux-ipv6.org>

In article <20080110.013857.37616214.yoshfuji@linux-ipv6.org> (at Thu, 10 Jan 2008 01:38:57 +0900 (JST)), YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> says:

> - we could have "dad_reaction" interface variable and
>  > 1: disable interface
>  = 1: disable IPv6
>  < 0: ignore (as we do now)

Argh, >0, 0 and <0, maybe.

--yoshfuji

* Re: Linux IPv6 DAD not full conform to RFC 4862 ?
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2008-01-09 16:38 UTC (permalink / raw)
  To: kkeil; +Cc: netdev, yoshfuji
In-Reply-To: <20080109153656.GA16962@pingi.kke.suse.de>

In article <20080109153656.GA16962@pingi.kke.suse.de> (at Wed, 9 Jan 2008 16:36:56 +0100), Karsten Keil <kkeil@suse.de> says:

> So I think we should disable the interface now, if DAD fails on a
> hardware based LLA.

I don't want to do this, at least, unconditionally.

Options (not exclusive):

- we could have "enable_ipv6" interface flag and check it in
  input/output paths
- we could have "dad_reaction" interface variable and
 > 1: disable interface
 = 1: disable IPv6
 < 0: ignore (as we do now)

--yoshfuji
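
Editorial sketch, not from the thread: one possible reading of the
proposed "dad_reaction" variable, using the thresholds from the author's
follow-up correction (>0, 0 and <0). The mapping of 0 to "disable IPv6"
is an assumption, since the correction itself ends with "maybe". The
enum and function names are hypothetical and no such interface is
claimed to exist in the kernel.

```c
/* Hypothetical actions taken when DAD fails on an address. */
enum dad_action_sketch {
	DAD_IGNORE,		/* keep going, as the kernel did at the time */
	DAD_DISABLE_IPV6,	/* stop IPv6 operation on the interface */
	DAD_DISABLE_INTERFACE,	/* take the whole interface down */
};

/* Map the proposed per-interface variable to an action. */
static enum dad_action_sketch dad_reaction_to_action(int dad_reaction)
{
	if (dad_reaction > 0)
		return DAD_DISABLE_INTERFACE;
	if (dad_reaction == 0)
		return DAD_DISABLE_IPV6;
	return DAD_IGNORE;
}
```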

* Re: Linux IPv6 DAD not full conform to RFC 4862 ?
From: Neil Horman @ 2008-01-09 16:17 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20080109153656.GA16962@pingi.kke.suse.de>

On Wed, Jan 09, 2008 at 04:36:56PM +0100, Karsten Keil wrote:
> Hi,
> 
> I tried to run the 1.5.0 Beta2 TAHI Selftest on a recent Linux kernel.
> It fails 6 tests in the Stateless Address Autoconfiguration section.
> These tests are for Duplicate Address Detection (DAD).
> In these tests, a duplicate of the Link Local Address is detected on
> the network. It seems that our current behavior is to log a message
> and not assign this address.
> 
> But RFC 4862 says:
> 
> 5.4.5.  When Duplicate Address Detection Fails
> 
>    A tentative address that is determined to be a duplicate as described
>    above MUST NOT be assigned to an interface, and the node SHOULD log a
>    system management error.
> 
>    If the address is a link-local address formed from an interface
>    identifier based on the hardware address, which is supposed to be
>    uniquely assigned (e.g., EUI-64 for an Ethernet interface), IP
>    operation on the interface SHOULD be disabled.  By disabling IP
>    operation, the node will then:
> 
>    -  not send any IP packets from the interface,
> 
>    -  silently drop any IP packets received on the interface, and
> 
>    -  not forward any IP packets to the interface (when acting as a
>       router or processing a packet with a Routing header).
> 
>    In this case, the IP address duplication probably means duplicate
>    hardware addresses are in use, and trying to recover from it by
>    configuring another IP address will not result in a usable network.
>    In fact, it probably makes things worse by creating problems that are
>    harder to diagnose than just disabling network operation on the
>    interface; the user will see a partially working network where some
>    things work, and other things do not.
> 
>    On the other hand, if the duplicate link-local address is not formed
>    from an interface identifier based on the hardware address, which is
>    supposed to be uniquely assigned, IP operation on the interface MAY
>    be continued.
> 
> 
> So I think we should disable the interface now, if DAD fails on a
> hardware based LLA.
> 

Not sure I agree with that.  I assume that by disable, you mean that we should
clear the IFF_UP flag?  If we do that, and other IP addresses are assigned to
that interface, then your proposal would discontinue the functionality of those
already established addresses, which would be bad.  I could see a DoS scenario
coming out of that as well.  Simply send ndisc NAs for a recently advertised
address, and you could prevent network communication for an entire system.

Reading the section you reference, we do follow all the MUST requirements, and
we log an error.  Given that the disable section is a SHOULD, I think we can at
least be somewhat more restrictive in our implementation.  Perhaps we should
just disable the interface iff the failed address is link-local AND there are no
other functional addresses assigned to the interface.

Neil
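
Editorial sketch, not from the thread: the condition proposed above
(disable only if the failed address is link-local AND no other
functional address is assigned) written as a userspace C predicate.
struct addr_sketch, its "usable" flag and both function names are
hypothetical; the link-local test checks the fe80::/10 prefix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one IPv6 address on an interface. */
struct addr_sketch {
	uint8_t a[16];
	bool usable;	/* passed DAD and is in service */
};

/* True for addresses inside fe80::/10. */
static bool is_link_local(const uint8_t a[16])
{
	return a[0] == 0xfe && (a[1] & 0xc0) == 0x80;
}

/*
 * Disable the interface only when the address that failed DAD is
 * link-local and none of the other addresses is still usable.
 */
static bool should_disable_iface(const struct addr_sketch *failed,
				 const struct addr_sketch *others, size_t n)
{
	size_t i;

	if (!is_link_local(failed->a))
		return false;
	for (i = 0; i < n; i++)
		if (others[i].usable)
			return false;
	return true;
}
```

Under this rule a forged neighbour advertisement for the link-local
address cannot take down an interface that still carries a working
address, which is the DoS concern raised above.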

> -- 
> Karsten Keil
> SuSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH 0/0]: Cassini bug fixes.
From: Laszlo Attila Toth @ 2008-01-09 16:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, bazsi, hidden
In-Reply-To: <20080104.003231.127196736.davem@davemloft.net>

David Miller wrote:
> Over the past day I've put together the following set of bug fixes for
> the Cassini driver.
> 
> At least with my setup it appears to basically work fine, not leak
> memory, and the SKB BUG messages go away too.
> 
> I'll be honest and say that I've devoted a couple days to this work,
> and therefore I have to turn my attention back to other tasks.  As a
> result, it means it will be some time before I can look seriously into
> any feedback folks provide.  And for that I apologize, but this
> already consumed too much of my time.
> 
> I'll be pushing these to Linus and -stable shortly.
> 
> Thanks.
> 

We tested the card; it works well and all previous bugs are gone (truesize
bug messages and memory consumption).

Thank you again.

--
Attila

* Linux IPv6 DAD not full conform to RFC 4862 ?
From: Karsten Keil @ 2008-01-09 15:36 UTC (permalink / raw)
  To: netdev

Hi,

I tried to run the 1.5.0 Beta2 TAHI Selftest on a recent Linux kernel.
It fails 6 tests in the Stateless Address Autoconfiguration section.
These tests are for Duplicate Address Detection (DAD).
In these tests, a duplicate of the Link Local Address is detected on
the network. It seems that our current behavior is to log a message
and not assign this address.

But RFC 4862 says:

5.4.5.  When Duplicate Address Detection Fails

   A tentative address that is determined to be a duplicate as described
   above MUST NOT be assigned to an interface, and the node SHOULD log a
   system management error.

   If the address is a link-local address formed from an interface
   identifier based on the hardware address, which is supposed to be
   uniquely assigned (e.g., EUI-64 for an Ethernet interface), IP
   operation on the interface SHOULD be disabled.  By disabling IP
   operation, the node will then:

   -  not send any IP packets from the interface,

   -  silently drop any IP packets received on the interface, and

   -  not forward any IP packets to the interface (when acting as a
      router or processing a packet with a Routing header).

   In this case, the IP address duplication probably means duplicate
   hardware addresses are in use, and trying to recover from it by
   configuring another IP address will not result in a usable network.
   In fact, it probably makes things worse by creating problems that are
   harder to diagnose than just disabling network operation on the
   interface; the user will see a partially working network where some
   things work, and other things do not.

   On the other hand, if the duplicate link-local address is not formed
   from an interface identifier based on the hardware address, which is
   supposed to be uniquely assigned, IP operation on the interface MAY
   be continued.


So I think we should disable the interface now, if DAD fails on a
hardware based LLA.

-- 
Karsten Keil
SuSE Labs
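
Editorial sketch, not from the thread: the quoted RFC text treats a
duplicate hardware-based link-local address as a sign of duplicate
hardware addresses. The userspace C below shows why, by building the
link-local address from a 48-bit MAC with the modified EUI-64 rule
(insert ff:fe in the middle and flip the universal/local bit). The
function name lla_from_mac() is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Build fe80::/64 + modified EUI-64 interface identifier from a MAC. */
static void lla_from_mac(const uint8_t mac[6], uint8_t addr[16])
{
	memset(addr, 0, 16);
	addr[0] = 0xfe;			/* fe80::/64 link-local prefix */
	addr[1] = 0x80;
	addr[8] = mac[0] ^ 0x02;	/* flip the universal/local bit */
	addr[9] = mac[1];
	addr[10] = mac[2];
	addr[11] = 0xff;		/* ff:fe inserted between OUI and NIC part */
	addr[12] = 0xfe;
	addr[13] = mac[3];
	addr[14] = mac[4];
	addr[15] = mac[5];
}
```

Because the mapping is deterministic, two interfaces with the same MAC
always derive the same link-local address, so picking another address
does not cure the underlying layer-2 conflict.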

* Re: [PATCH 0/3] bonding: 3 fixes for 2.6.24
From: Andy Gospodarek @ 2008-01-09 15:27 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Krzysztof Oledzki, netdev, Jeff Garzik, David Miller,
	Andy Gospodarek, Herbert Xu
In-Reply-To: <17850.1199865514@death>

On Tue, Jan 08, 2008 at 11:58:34PM -0800, Jay Vosburgh wrote:
> Krzysztof Oledzki <olel@ans.pl> wrote:
> 
> >Fine. Just letting you know that someone tests your patches and everything
> >works, except for the mentioned problem.
> 
> 	And I appreciate it; I just wanted to make sure our many fans
> following along at home didn't misunderstand.
> 
> 	Could you let me know if the patch below makes the lockdep
> warning go away?  This applies on top of the previous three, although it
> should be trivial to do by hand.
> 
> 	I'm still checking to make sure this is safe with regard to
> mutexing the bonding structures, but it would be good to know if it
> eliminates the warning.
> 
> 	-J
> 

Jay,

My initial concern was that a slave device could disappear out from
under us, but it seems like this certainly isn't the case since all
calls to bond_release are protected by rtnl-locks, so I think you are
correct that we are safe.  I'll test this on my setup here and let you
know if I see any problems.

-andy




* Re: Top 10 kernel oopses for the week ending January 5th, 2008
From: Arjan van de Ven @ 2008-01-09 15:28 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev
In-Reply-To: <1199887950.6762.26.camel@johannes.berg>

Johannes Berg wrote:
>> Rank 1: __ieee80211_rx
>> 	Warning at net/mac80211/rx.c:1672
>> 	Reported 6 times (11 total reports)
>> 	Same issue that was ranked 2nd last week
>> 	Johannes has diagnosed this as a driver bug in the iwlwifi drivers
>> 	More info: http://www.kerneloops.org/search.php?search=__ieee80211_rx
> 
> Note that because we don't get the module list for WARN_ON, we don't
> actually know whether all of these instances are from the iwlwifi
> drivers. A few other drivers suffer from the same problem. In one of
> these cases, iwlwifi was contained in the stack trace, but in the common
> case that isn't happening because packet processing is delayed to a
> tasklet.
> 

and fwiw a patch to get this added to WARN_ON was posted by me last week to fix this;
once this goes into 2.6.25-rc this annoyance/hindrance in debugging will be fixed.

* Re: SACK scoreboard
From: John Heffner @ 2008-01-09 14:56 UTC (permalink / raw)
  To: David Miller; +Cc: andi, ilpo.jarvinen, lachlan.andrew, netdev, quetchen
In-Reply-To: <20080108.224144.234253941.davem@davemloft.net>

David Miller wrote:
> From: John Heffner <jheffner@psc.edu>
> Date: Tue, 08 Jan 2008 23:27:08 -0500
> 
>> I also wonder how much of a problem this is (for now, with window sizes 
>> of order 10000 packets.  My understanding is that the biggest problems 
>> arise from O(N^2) time for recovery because every ack was expensive. 
>> Have current tests shown the final ack to be a major source of problems?
> 
> Yes, several people have reported this.

I may have missed some of this.  Does anyone have a link to some recent 
data?

   -John

* Re: [NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
From: Paul E. McKenney @ 2008-01-09 14:43 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, herbert, dipankar, netdev, josh
In-Reply-To: <20080109.063126.68241252.davem@davemloft.net>

On Wed, Jan 09, 2008 at 06:31:26AM -0800, David Miller wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Date: Wed, 9 Jan 2008 06:22:58 -0800
> 
> > On Wed, Jan 09, 2008 at 11:37:27AM +0100, Eric Dumazet wrote:
> > > On Wed, 9 Jan 2008 20:46:37 +1100
> > > Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > 
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index d337706..28484f3 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -283,12 +283,12 @@ static struct rtable *rt_cache_get_first(struct seq_file *seq)
> > >  			break;
> > >  		rcu_read_unlock_bh();
> > >  	}
> > > -	return r;
> > > +	return rcu_dereference(r);
> > >  }
> > 
> > Would it be possible to tag rt_cache_get_first() with an __acquires(RCU)
> > to help out sparse?
> 
> Sparse can't handle conditional locking very well, as is done here.
> There is a seperate thread where Eric reworks how all of this
> locking is done in order to pacify sparse and be able to add the
> __acquires() etc. tags and some of us found it too ugly to
> swallow :-)

Ah!  ;-)

							Thanx, Paul

* Re: [NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
From: David Miller @ 2008-01-09 14:31 UTC (permalink / raw)
  To: paulmck; +Cc: dada1, herbert, dipankar, netdev
In-Reply-To: <20080109142258.GC13714@linux.vnet.ibm.com>

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Wed, 9 Jan 2008 06:22:58 -0800

> On Wed, Jan 09, 2008 at 11:37:27AM +0100, Eric Dumazet wrote:
> > On Wed, 9 Jan 2008 20:46:37 +1100
> > Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index d337706..28484f3 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -283,12 +283,12 @@ static struct rtable *rt_cache_get_first(struct seq_file *seq)
> >  			break;
> >  		rcu_read_unlock_bh();
> >  	}
> > -	return r;
> > +	return rcu_dereference(r);
> >  }
> 
> Would it be possible to tag rt_cache_get_first() with an __acquires(RCU)
> to help out sparse?

Sparse can't handle conditional locking very well, as is done here.
There is a separate thread where Eric reworks how all of this
locking is done in order to pacify sparse and be able to add the
__acquires() etc. tags and some of us found it too ugly to
swallow :-)

* Re: [NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
From: Paul E. McKenney @ 2008-01-09 14:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, davem, dipankar, netdev
In-Reply-To: <20080109113727.50eae500.dada1@cosmosbay.com>

On Wed, Jan 09, 2008 at 11:37:27AM +0100, Eric Dumazet wrote:
> On Wed, 9 Jan 2008 20:46:37 +1100
> Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
> > On Wed, Jan 09, 2008 at 08:38:56AM +0100, Eric Dumazet wrote:
> > > 
> > > I am not sure this is valid, since it will do this:
> > > 
> > > r = rt_hash_table[st->bucket].chain;
> > > if (r)
> > >     return rcu_dereference(r);
> > > 
> > > So the compiler might be dumb enough to dereference
> > > &rt_hash_table[st->bucket].chain two times.
> > 
> > That wouldn't be a problem at all.  The key is to add a barrier between
> > reading the pointer:
> > 
> > 	r = rt_hash_table[st->bucket].chain
> > 
> > and dereferencing it later, e.g.,
> > 
> > 	r->u.dst.rt_next
> > 
> > The barrier is there so that when we dereference r we don't read
> > stale cache that was there before the memory at r was initialised.
> > How many times you read the pointer value before the barrier is
> > irrelevant to the effectiveness of the barrier preceding the
> > dereference.

Agreed -- as long as you don't try to dereference the pointer before
passing it through rcu_dereference(), and as long as the initial
fetch of the pointer, the rcu_dereference(), and the actual dereferencing
of the pointer are all within the same RCU read-side critical section.

> You are absolutely right Herbert, so I changed the patch to:
> 
> [NET] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache
> 
> In rt_cache_get_next(), there is no need to guard seq->private with
> rcu_dereference(), since seq is private to the thread running this
> function. Reading seq->private once (as guaranteed by rcu_dereference())
> or several times, if the compiler really is dumb enough, won't change
> the result.
> 
> But we miss real spots where rcu_dereference() is needed, both in
> rt_cache_get_first() and rt_cache_get_next().
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index d337706..28484f3 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -283,12 +283,12 @@ static struct rtable *rt_cache_get_first(struct seq_file *seq)
>  			break;
>  		rcu_read_unlock_bh();
>  	}
> -	return r;
> +	return rcu_dereference(r);
>  }

Would it be possible to tag rt_cache_get_first() with an __acquires(RCU)
to help out sparse?

>  static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
>  {
> -	struct rt_cache_iter_state *st = rcu_dereference(seq->private);
> +	struct rt_cache_iter_state *st = seq->private;
> 
>  	r = r->u.dst.rt_next;
>  	while (!r) {
> @@ -298,7 +298,7 @@ static struct rtable *rt_cache_get_next(struct seq_file *seq, struct rtable *r)
>  		rcu_read_lock_bh();
>  		r = rt_hash_table[st->bucket].chain;
>  	}
> -	return r;
> +	return rcu_dereference(r);
>  }

Ditto for rt_cache_get_next()?

>  static struct rtable *rt_cache_get_idx(struct seq_file *seq, loff_t pos)

There would need to be a __releases(RCU) somewhere -- possibly
in rt_cache_seq_stop(), but need to defer to you guys on this one.

						Thanx, Paul
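
Editorial sketch, not from the thread: a userspace analogy for the
ordering point discussed above, using C11 atomics instead of the kernel
RCU API. The writer fully initialises a node before publishing the
pointer, and the reader loads the pointer with ordering before it
dereferences it. Acquire ordering is used here as a stronger stand-in
for the dependency ordering that rcu_dereference() relies on, and the
sketch does not model read-side critical sections or grace periods.
struct node, chain_head, publish() and read_first() are hypothetical
names.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int value;
	struct node *next;
};

/* Head of a singly linked chain, shared between writer and readers. */
static _Atomic(struct node *) chain_head;

/*
 * Writer: fill in the node first, then publish it with release
 * ordering so a reader that sees the pointer also sees the contents.
 */
static void publish(struct node *n, int value)
{
	n->value = value;
	n->next = atomic_load_explicit(&chain_head, memory_order_relaxed);
	atomic_store_explicit(&chain_head, n, memory_order_release);
}

/*
 * Reader: load the pointer with acquire ordering, then dereference.
 * How often the pointer is read before this load does not matter;
 * what matters is the ordering between this load and the dereference.
 */
static int read_first(int fallback)
{
	struct node *r = atomic_load_explicit(&chain_head,
					      memory_order_acquire);

	return r ? r->value : fallback;
}
```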

* Re: FW:  ccid2/ccid3 oopses
From: Gerrit Renker @ 2008-01-09 14:17 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, devzero, dccp, netdev
In-Reply-To: <20080109140211.GA9857@ghostprotocols.net>

| > >> the easiest way to reproduce is:
| > >> 
| > >> while true;do modprobe dccp_ccid2/3;modprobe -r dccp_ccid2/3;done
| > >> after short time, the kernel oopses (messages below)
| > >> 
<snip>
| 
| Gerrit, the control socket isn't attached to any CCID module, so the
| CCID modules should be safe to remove, and IIRC they were safe to
| unload.
| 
Ah, right. I have misread the email. And can confirm the above: running
the for-loop at the top of the message (60 seconds uninterrupted for
CCID2,3 each) brought no oopses.
So maybe the cause triggering this oops is somewhere else.

^ permalink raw reply

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
From: Johannes Berg @ 2008-01-09 14:12 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, NetDev
In-Reply-To: <477FF149.4070609@linux.intel.com>


> Rank 1: __ieee80211_rx
> 	Warning at net/mac80211/rx.c:1672
> 	Reported 6 times (11 total reports)
> 	Same issue that was ranked 2nd last week
> 	Johannes has diagnosed this as a driver bug in the iwlwifi drivers
> 	More info: http://www.kerneloops.org/search.php?search=__ieee80211_rx

Note that because we don't get the module list for WARN_ON, we don't
actually know whether all of these instances are from the iwlwifi
drivers. A few other drivers suffer from the same problem. In one of
these cases, iwlwifi was contained in the stack trace, but in the common
case that isn't happening because packet processing is delayed to a
tasklet.

johannes

^ permalink raw reply

* Re: FW:  ccid2/ccid3 oopses
From: Arnaldo Carvalho de Melo @ 2008-01-09 14:02 UTC (permalink / raw)
  To: Gerrit Renker, devzero, dccp, netdev
In-Reply-To: <20080109122827.GC4461@gerrit.erg.abdn.ac.uk>

Em Wed, Jan 09, 2008 at 12:28:27PM +0000, Gerrit Renker escreveu:
> Roland, -
> 
> >> apparently, i got crashes when loading/unloading other driver modules just
> >> after ccid2 or ccid3 had been loaded/unloaded _once_ (have not used them at
> >> all, just modprobe module;modprobe -r module)
> >> 
> <snip>
> >> the easiest way to reproduce is:
> >> 
> >> while true;do modprobe dccp_ccid2/3;modprobe -r dccp_ccid2/3;done
> >> after short time, the kernel oopses (messages below)
> >> 
> >> i`m not sure if this is worth to be filed at kernel bugzilla, so i`m contacting
> >> you personally first.
> >>
> The issue is known: once loaded, the DCCP modules cannot be unloaded
> without causing a crash like the one you have observed. This is because
> dccp_ipv{4,6} use control sockets which need to be released
> before the module can be unloaded.
> If the control sockets are not released, crashes will always
> result.
> In earlier versions of DCCP there was a kernel option known as "unload hack",
> which conditionally inserted 
> 	sock_release(dccp_v{4,6}_ctl_socket);
> in 
> 	dccp_v{4,6}_exit()
> 
> However, as the name says, it is a hack since there are other issues to 
> be considered:
> 	* sockets in timewait state
> 	* other wait states (e.g. half-open connections)
> 	* memory which has not been released
> 	* module dependencies
> 
> With regard to the latter, I am normally using the Unload Hack and
> release modules in the following order:
> 
> 	dccp_probe => dccp_ccid2 => dccp_ccid3 => dccp_tfrc_lib =>
>         dccp_ipv6  => dccp_ipv4  => dccp_diag  => dccp
> 
> Long story short
>  * the CCID/DCCP modules cannot currently be unloaded safely
>  * maybe we should disable module unloading for the mainline kernel
>  * if anyone is interested in using the unload hack, here is the old patch
>    http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/Unload_Hack.diff

Gerrit, the control socket isn't attached to any CCID module, so the
CCID modules should be safe to remove, and IIRC they were safe to
unload.

The unload hack was for something else, for the core DCCP modules. We
can't unload because there are refcounts held by the control sock, so
the unload hack would just destroy the control sock and thus the module
refcount would reach zero and it could then be unloaded.

I've been consistently sidetracked with work (huh :-)) and
couldn't look at this issue, but the CCID modules should be safe to
unload.

- Arnaldo

^ permalink raw reply

* Re: SACK scoreboard
From: Andi Kleen @ 2008-01-09 14:02 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Andi Kleen, David Miller, jheffner, ilpo.jarvinen, lachlan.andrew,
	netdev, quetchen
In-Reply-To: <20080109094725.GA22140@2ka.mipt.ru>

> Postponing freeing of the skb has major drawbacks. Some time ago I

Yes, the trick would be to make sure that it also does not tie up
too much memory. e.g. it would need some throttling at least.

Also the fast path of kmem_cache_free() is actually not that
much different from just putting something on a list so perhaps
it would not make that much difference.

-Andi
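
A rough userspace sketch of the idea discussed above: postpone freeing
by putting buffers on a list, but throttle how much memory the list may
tie up. Everything here is hypothetical (demo_buf stands in for an skb,
and the byte cap is an arbitrary policy); it is not a proposal for the
actual TCP code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb; only what the sketch needs. */
struct demo_buf {
	struct demo_buf *next;
	size_t size;
};

/* Deferred-free queue with a byte cap acting as the throttle. */
struct demo_defer_queue {
	struct demo_buf *head;
	size_t pending_bytes;
	size_t limit_bytes;
};

/*
 * Queue a buffer for later freeing. If queuing it would exceed the
 * cap, free it right away instead. Returns 1 if deferred, 0 if freed.
 */
static int demo_defer_free(struct demo_defer_queue *q, struct demo_buf *b)
{
	if (q->pending_bytes + b->size > q->limit_bytes) {
		free(b);
		return 0;
	}
	b->next = q->head;
	q->head = b;
	q->pending_bytes += b->size;
	return 1;
}

/* Free everything queued so far; returns the number of buffers freed. */
static size_t demo_defer_flush(struct demo_defer_queue *q)
{
	size_t freed = 0;

	while (q->head) {
		struct demo_buf *b = q->head;

		q->head = b->next;
		free(b);
		freed++;
	}
	q->pending_bytes = 0;
	return freed;
}
```

Note that the deferred path is a couple of pointer writes, which is the
comparison being made with the fast path of kmem_cache_free().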

^ permalink raw reply

* Re: SACK scoreboard
From: Ilpo Järvinen @ 2008-01-09 12:55 UTC (permalink / raw)
  To: John Heffner; +Cc: Andi Kleen, David Miller, lachlan.andrew, Netdev, quetchen
In-Reply-To: <47844D1C.1060706@psc.edu>

On Tue, 8 Jan 2008, John Heffner wrote:

> Andi Kleen wrote:
> > David Miller <davem@davemloft.net> writes:
> > > The big problem is that recovery from even a single packet loss in a
> > > window makes us run kfree_skb() for all the packets in a full
> > > window's worth of data when recovery completes.
> > 
> > Why exactly is it a problem to free them all at once? Are you worried
> > about kernel preemption latencies?
> 
> I also wonder how much of a problem this is (for now, with window sizes of
> order 10000 packets).  My understanding is that the biggest problems arise from
> O(N^2) time for recovery because every ack was expensive. Have current 
> tests shown the final ack to be a major source of problems?

This thread got started because I tried to solve the other latencies but 
realized that it helps very little because this latency spike would 
have remained unsolved, and it happens in one of the most common cases.

-- 
 i.

^ permalink raw reply

* Re: FW:  ccid2/ccid3 oopses
From: Gerrit Renker @ 2008-01-09 12:28 UTC (permalink / raw)
  To: devzero; +Cc: dccp, netdev
In-Reply-To: <93680347@web.de>

Roland, -

>> apparently, i got crashes when loading/unloading other driver modules just
>> after ccid2 or ccid3 had been loaded/unloaded _once_ (have not used them at
>> all, just modprobe module;modprobe -r module)
>> 
<snip>
>> the easiest way to reproduce is:
>> 
>> while true;do modprobe dccp_ccid2/3;modprobe -r dccp_ccid2/3;done
>> after short time, the kernel oopses (messages below)
>> 
>> i`m not sure if this is worth to be filed at kernel bugzilla, so i`m contacting
>> you personally first.
>>
The issue is known: once loaded, the DCCP modules cannot be unloaded
without causing a crash like the one you have observed. This is because
dccp_ipv{4,6} use control sockets which need to be released
before the module can be unloaded.
If the control sockets are not released, crashes will always
result.
In earlier versions of DCCP there was a kernel option known as "unload hack",
which conditionally inserted 
	sock_release(dccp_v{4,6}_ctl_socket);
in 
	dccp_v{4,6}_exit()

However, as the name says, it is a hack since there are other issues to 
be considered:
	* sockets in timewait state
	* other wait states (e.g. half-open connections)
	* memory which has not been released
	* module dependencies

With regard to the latter, I am normally using the Unload Hack and
release modules in the following order:

	dccp_probe => dccp_ccid2 => dccp_ccid3 => dccp_tfrc_lib =>
        dccp_ipv6  => dccp_ipv4  => dccp_diag  => dccp

Long story short
 * the CCID/DCCP modules cannot currently be unloaded safely
 * maybe we should disable module unloading for the mainline kernel
 * if anyone is interested in using the unload hack, here is the old patch
   http://www.erg.abdn.ac.uk/users/gerrit/dccp/testing_dccp/Unload_Hack.diff

Please feel free to come back on this issue
Gerrit

^ permalink raw reply

* Re: [PATCH net-2.6.25] [IPVS] Added include for ip_vs.h for ctl_path (build was broken)
From: David Miller @ 2008-01-09 11:57 UTC (permalink / raw)
  To: ramirose; +Cc: netdev
In-Reply-To: <eb3ff54b0801090333r4c1770dakd65f61e356aa0304@mail.gmail.com>

From: "Rami Rosen" <ramirose@gmail.com>
Date: Wed, 9 Jan 2008 13:33:49 +0200

> Hi,
>    The build was broken with this error:
> 	
>   In file included from net/ipv4/ipvs/ip_vs_rr.c:27:
>   include/net/ip_vs.h:857: error: array type has incomplete element type
>   make[3]: *** [net/ipv4/ipvs/ip_vs_rr.o] Error 1
> 
> 	This was due to a missing include of the header file for ctl_path.
> 	
> 	This patch adds #include <linux/sysctl.h> to ip_vs.h to avoid it
> 
> Signed-off-by: Rami Rosen <ramirose@gmail.com>

Applied, thanks.
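
For context, the compiler error quoted above is what GCC reports when
an array is declared whose element type is only forward-declared. The
following userspace sketch shows the shape of the problem and the fix.
The struct layout and the demo_ctl_path name are assumptions made for
illustration; in the kernel the real definition comes from
<linux/sysctl.h>.

```c
/*
 * Stand-in for the definition that the missing header provides.
 * The field names are an assumption for this sketch.
 */
struct ctl_path {
	const char *procname;
	int ctl_name;
};

/*
 * With the full definition visible, declaring an array of unknown
 * length is fine. With only "struct ctl_path;" in scope, GCC rejects
 * this line with "array type has incomplete element type".
 */
extern const struct ctl_path demo_ctl_path[];

/* Hypothetical contents; the numeric ids are placeholders. */
const struct ctl_path demo_ctl_path[] = {
	{ "net",  1 },
	{ "ipv4", 2 },
	{ "vs",   3 },
	{ 0, 0 },		/* terminator */
};
```

Including the defining header in ip_vs.h itself, rather than in every
.c file that uses it, keeps the header usable regardless of include
order.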

^ permalink raw reply

