Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next when it opens] net: 8021q: Add pr_fmt
From: Joe Perches @ 2011-05-26 20:58 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Ben Greear, David S. Miller, netdev, linux-kernel
In-Reply-To: <20110526.145601.735416831290933990.davem@davemloft.net>

Use the current logging style.

Add #define pr_fmt and remove embedded prefix from formats.

Not converting the current pr_<level> uses to netdev_<level>
because all the output here is nicely prefaced with "8021q: ".

Remove __func__ use from proc registration failure message.

Signed-off-by: Joe Perches <joe@perches.com>
---
 net/8021q/vlan.c     |   15 ++++++++-------
 net/8021q/vlan_dev.c |    4 +++-
 net/8021q/vlanproc.c |    4 +++-
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index c7a581a..cfa9afe 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -18,6 +18,8 @@
  *		2 of the License, or (at your option) any later version.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/capability.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
@@ -149,13 +151,13 @@ int vlan_check_real_dev(struct net_device *real_dev, u16 vlan_id)
 	const struct net_device_ops *ops = real_dev->netdev_ops;
 
 	if (real_dev->features & NETIF_F_VLAN_CHALLENGED) {
-		pr_info("8021q: VLANs not supported on %s\n", name);
+		pr_info("VLANs not supported on %s\n", name);
 		return -EOPNOTSUPP;
 	}
 
 	if ((real_dev->features & NETIF_F_HW_VLAN_FILTER) &&
 	    (!ops->ndo_vlan_rx_add_vid || !ops->ndo_vlan_rx_kill_vid)) {
-		pr_info("8021q: Device %s has buggy VLAN hw accel\n", name);
+		pr_info("Device %s has buggy VLAN hw accel\n", name);
 		return -EOPNOTSUPP;
 	}
 
@@ -344,13 +346,12 @@ static void __vlan_device_event(struct net_device *dev, unsigned long event)
 	case NETDEV_CHANGENAME:
 		vlan_proc_rem_dev(dev);
 		if (vlan_proc_add_dev(dev) < 0)
-			pr_warning("8021q: failed to change proc name for %s\n",
-					dev->name);
+			pr_warn("failed to change proc name for %s\n",
+				dev->name);
 		break;
 	case NETDEV_REGISTER:
 		if (vlan_proc_add_dev(dev) < 0)
-			pr_warning("8021q: failed to add proc entry for %s\n",
-					dev->name);
+			pr_warn("failed to add proc entry for %s\n", dev->name);
 		break;
 	case NETDEV_UNREGISTER:
 		vlan_proc_rem_dev(dev);
@@ -374,7 +375,7 @@ static int vlan_device_event(struct notifier_block *unused, unsigned long event,
 	if ((event == NETDEV_UP) &&
 	    (dev->features & NETIF_F_HW_VLAN_FILTER) &&
 	    dev->netdev_ops->ndo_vlan_rx_add_vid) {
-		pr_info("8021q: adding VLAN 0 to HW filter on device %s\n",
+		pr_info("adding VLAN 0 to HW filter on device %s\n",
 			dev->name);
 		dev->netdev_ops->ndo_vlan_rx_add_vid(dev, 0);
 	}
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index f247f5b..4225e24 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -20,6 +20,8 @@
  *		2 of the License, or (at your option) any later version.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/skbuff.h>
@@ -55,7 +57,7 @@ static int vlan_dev_rebuild_header(struct sk_buff *skb)
 		return arp_find(veth->h_dest, skb);
 #endif
 	default:
-		pr_debug("%s: unable to resolve type %X addresses.\n",
+		pr_debug("%s: unable to resolve type %X addresses\n",
 			 dev->name, ntohs(veth->h_vlan_encapsulated_proto));
 
 		memcpy(veth->h_source, dev->dev_addr, ETH_ALEN);
diff --git a/net/8021q/vlanproc.c b/net/8021q/vlanproc.c
index d940c49..016d7f4 100644
--- a/net/8021q/vlanproc.c
+++ b/net/8021q/vlanproc.c
@@ -17,6 +17,8 @@
  * Jan 20, 1998        Ben Greear     Initial Version
  *****************************************************************************/
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/module.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
@@ -155,7 +157,7 @@ int __net_init vlan_proc_init(struct net *net)
 	return 0;
 
 err:
-	pr_err("%s: can't create entry in proc filesystem!\n", __func__);
+	pr_err("can't create entry in proc filesystem!\n");
 	vlan_proc_cleanup(net);
 	return -ENOBUFS;
 }
-- 
1.7.5.rc3.dirty

^ permalink raw reply related

* [PATCH] iproute2: Add processless network namespace support.
From: Eric W. Biederman @ 2011-05-26 20:58 UTC (permalink / raw)
  To: shemminger; +Cc: netdev, Linux Containers


The goal of this code change is to implement a mechanism such that it is
simple to work with a kernel that is using multiple network namespaces
at once.

This comes in handy for interacting with vpns where there may be rfc1918
address overlaps, and different policies default routes, name servers
and the like.

Configuration specific to a network namespace that would ordinarily be
stored under /etc/ is stored under /etc/netns/<name>.  For example if
the dns server configuration is different for your vpn you would create
a file /etc/netns/myvpn/resolv.conf.

File descriptors that can be used to manipulate a network namespace can
be created by opening /var/run/netns/<NAME>.

This adds the following commands to iproute.
ip netns add NAME
ip netns delete NAME
ip netns monitor
ip netns list
ip netns exec NAME cmd ....
ip link set DEV netns NAME

ip netns exec exists to cater the vast majority of programs that only
know how to operate in a single network namespace.  ip netns exec
changes the default network namespace, creates a new mount namespace,
remounts /sys and bind mounts netns specific configuration files to
their standard locations.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/if_link.h |    1 +
 ip/Makefile             |    2 +-
 ip/ip.c                 |    4 +-
 ip/ip_common.h          |    2 +
 ip/iplink.c             |    8 +-
 ip/ipnetns.c            |  314 +++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/ip.8           |   56 +++++++++
 7 files changed, 383 insertions(+), 4 deletions(-)
 create mode 100644 ip/ipnetns.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index e4a3a2d..304c44f 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -136,6 +136,7 @@ enum {
 	IFLA_PORT_SELF,
 	IFLA_AF_SPEC,
 	IFLA_GROUP,		/* Group the device belongs to */
+	IFLA_NET_NS_FD,
 	__IFLA_MAX
 };
 
diff --git a/ip/Makefile b/ip/Makefile
index 6054e8a..2ee4e7c 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -1,4 +1,4 @@
-IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
+IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
diff --git a/ip/ip.c b/ip/ip.c
index b127d57..7f0c468 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -44,7 +44,8 @@ static void usage(void)
 "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n"
 "       ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | addr | addrlabel | route | rule | neigh | ntable |\n"
-"                   tunnel | tuntap | maddr | mroute | mrule | monitor | xfrm }\n"
+"                   tunnel | tuntap | maddr | mroute | mrule | monitor | xfrm |\n"
+"                   netns }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | link } |\n"
 "                    -l[oops] { maximum-addr-flush-attempts } |\n"
@@ -80,6 +81,7 @@ static const struct cmd {
 	{ "xfrm",	do_xfrm },
 	{ "mroute",	do_multiroute },
 	{ "mrule",	do_multirule },
+	{ "netns",	do_netns },
 	{ "help",	do_help },
 	{ 0 }
 };
diff --git a/ip/ip_common.h b/ip/ip_common.h
index a114186..5e5fb76 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -38,6 +38,7 @@ extern int do_ipmonitor(int argc, char **argv);
 extern int do_multiaddr(int argc, char **argv);
 extern int do_multiroute(int argc, char **argv);
 extern int do_multirule(int argc, char **argv);
+extern int do_netns(int argc, char **argv);
 extern int do_xfrm(int argc, char **argv);
 
 static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
@@ -64,6 +65,7 @@ struct link_util
 };
 
 struct link_util *get_link_kind(const char *kind);
+int get_netns_fd(const char *name);
 
 #ifndef	INFINITY_LIFE_TIME
 #define     INFINITY_LIFE_TIME      0xFFFFFFFFU
diff --git a/ip/iplink.c b/ip/iplink.c
index 48c0254..e5325a6 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -67,6 +67,7 @@ void iplink_usage(void)
 	fprintf(stderr, "	                  [ broadcast LLADDR ]\n");
 	fprintf(stderr, "	                  [ mtu MTU ]\n");
 	fprintf(stderr, "	                  [ netns PID ]\n");
+	fprintf(stderr, "	                  [ netns NAME ]\n");
 	fprintf(stderr, "			  [ alias NAME ]\n");
 	fprintf(stderr, "	                  [ vf NUM [ mac LLADDR ]\n");
 	fprintf(stderr, "				   [ vlan VLANID [ qos VLAN-QOS ] ]\n");
@@ -304,9 +305,12 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
                         NEXT_ARG();
                         if (netns != -1)
                                 duparg("netns", *argv);
-                        if (get_integer(&netns, *argv, 0))
+			if ((netns = get_netns_fd(*argv)) >= 0)
+				addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_FD, &netns, 4);
+			else if (get_integer(&netns, *argv, 0) == 0)
+				addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_PID, &netns, 4);
+			else
                                 invarg("Invalid \"netns\" value\n", *argv);
-                        addattr_l(&req->n, sizeof(*req), IFLA_NET_NS_PID, &netns, 4);
 		} else if (strcmp(*argv, "multicast") == 0) {
 			NEXT_ARG();
 			req->i.ifi_change |= IFF_MULTICAST;
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
new file mode 100644
index 0000000..db7007c
--- /dev/null
+++ b/ip/ipnetns.c
@@ -0,0 +1,314 @@
+#define _ATFILE_SOURCE
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <sys/inotify.h>
+#include <sys/mount.h>
+#include <sys/param.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <string.h>
+#include <sched.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "utils.h"
+#include "ip_common.h"
+
+#define NETNS_RUN_DIR "/var/run/netns"
+#define NETNS_ETC_DIR "/etc/netns"
+
+#ifndef CLONE_NEWNET
+#define CLONE_NEWNET 0x40000000	/* New network namespace (lo, device, names sockets, etc) */
+#endif
+
+#ifndef MNT_DETACH
+#define MNT_DETACH	0x00000002	/* Just detach from the tree */
+#endif /* MNT_DETACH */
+
+static int setns(int fd, int nstype)
+{
+#ifdef __NR_setns
+	return syscall(__NR_setns, fd, nstype);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+
+static int touch(const char *path, mode_t mode)
+{
+	int fd;
+	fd = open(path, O_RDONLY|O_CREAT, mode);
+	if (fd < 0)
+		return -1;
+	close(fd);
+	return 0;
+}
+
+static void usage(void) __attribute__((noreturn));
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip netns list\n");
+	fprintf(stderr, "       ip netns add NAME\n");
+	fprintf(stderr, "       ip netns delete NAME\n");
+	fprintf(stderr, "       ip netns exec NAME cmd ...\n");
+	fprintf(stderr, "       ip netns monitor\n");
+	exit(-1);
+}
+
+int get_netns_fd(const char *name)
+{
+	char pathbuf[MAXPATHLEN];
+	const char *path, *ptr;
+
+	path = name;
+	ptr = strchr(name, '/');
+	if (!ptr) {
+		snprintf(pathbuf, sizeof(pathbuf), "%s/%s",
+			NETNS_RUN_DIR, name );
+		path = pathbuf;
+	}
+	return open(path, O_RDONLY);
+}
+
+static int netns_list(int argc, char **argv)
+{
+	struct dirent *entry;
+	DIR *dir;
+
+	dir = opendir(NETNS_RUN_DIR);
+	if (!dir)
+		return 0;
+
+	while ((entry = readdir(dir)) != NULL) {
+		if (strcmp(entry->d_name, ".") == 0)
+			continue;
+		if (strcmp(entry->d_name, "..") == 0)
+			continue;
+		printf("%s\n", entry->d_name);
+	}
+	closedir(dir);
+	return 0;
+}
+
+static void bind_etc(const char *name)
+{
+	char etc_netns_path[MAXPATHLEN];
+	char netns_name[MAXPATHLEN];
+	char etc_name[MAXPATHLEN];
+	struct dirent *entry;
+	DIR *dir;
+
+	snprintf(etc_netns_path, sizeof(etc_netns_path), "%s/%s", NETNS_ETC_DIR, name);
+	dir = opendir(etc_netns_path);
+	if (!dir)
+		return;
+
+	while ((entry = readdir(dir)) != NULL) {
+		if (strcmp(entry->d_name, ".") == 0)
+			continue;
+		if (strcmp(entry->d_name, "..") == 0)
+			continue;
+		snprintf(netns_name, sizeof(netns_name), "%s/%s", etc_netns_path, entry->d_name);
+		snprintf(etc_name, sizeof(etc_name), "/etc/%s", entry->d_name);
+		if (mount(netns_name, etc_name, "none", MS_BIND, NULL) < 0) {
+			fprintf(stderr, "Bind %s -> %s failed: %s\n",
+				netns_name, etc_name, strerror(errno));
+		}
+	}
+	closedir(dir);
+}
+
+static int netns_exec(int argc, char **argv)
+{
+	/* Setup the proper environment for apps that are not netns
+	 * aware, and execute a program in that environment.
+	 */
+	const char *name, *cmd;
+	char net_path[MAXPATHLEN];
+	int netns;
+
+	if (argc < 1) {
+		fprintf(stderr, "No netns name specified\n");
+		return -1;
+	}
+	if (argc < 2) {
+		fprintf(stderr, "No cmd specified\n");
+		return -1;
+	}
+	name = argv[0];
+	cmd = argv[1];
+	snprintf(net_path, sizeof(net_path), "%s/%s", NETNS_RUN_DIR, name);
+	netns = open(net_path, O_RDONLY);
+	if (netns < 0) {
+		fprintf(stderr, "Cannot open network namespace: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	if (setns(netns, CLONE_NEWNET) < 0) {
+		fprintf(stderr, "seting the network namespace failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+
+	if (unshare(CLONE_NEWNS) < 0) {
+		fprintf(stderr, "unshare failed: %s\n", strerror(errno));
+		return -1;
+	}
+	/* Mount a version of /sys that describes the network namespace */
+	if (umount2("/sys", MNT_DETACH) < 0) {
+		fprintf(stderr, "umount of /sys failed: %s\n", strerror(errno));
+		return -1;
+	}
+	if (mount(name, "/sys", "sysfs", 0, NULL) < 0) {
+		fprintf(stderr, "mount of /sys failed: %s\n",strerror(errno));
+		return -1;
+	}
+
+	/* Setup bind mounts for config files in /etc */
+	bind_etc(name);
+
+	if (execvp(cmd, argv + 1)  < 0)
+		fprintf(stderr, "exec of %s failed: %s\n",
+			cmd, strerror(errno));
+	exit(-1);
+}
+
+static int netns_delete(int argc, char **argv)
+{
+	const char *name;
+	char netns_path[MAXPATHLEN];
+
+	if (argc < 1) {
+		fprintf(stderr, "No netns name specified\n");
+		return -1;
+	}
+
+	name = argv[0];
+	snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+	umount2(netns_path, MNT_DETACH);
+	if (unlink(netns_path) < 0) {
+		fprintf(stderr, "Cannot remove %s: %s\n",
+			netns_path, strerror(errno));
+		return -1;
+	}
+	return 0;
+}
+
+static int netns_add(int argc, char **argv)
+{
+	/* This function creates a new network namespace and
+	 * a new mount namespace and bind them into a well known
+	 * location in the filesystem based on the name provided.
+	 *
+	 * The mount namespace is created so that any necessary
+	 * userspace tweaks like remounting /sys, or bind mounting
+	 * a new /etc/resolv.conf can be shared between uers.
+	 */
+	char netns_path[MAXPATHLEN];
+	const char *name;
+
+	if (argc < 1) {
+		fprintf(stderr, "No netns name specified\n");
+		return -1;
+	}
+	name = argv[0];
+
+	snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+
+	/* Create the base netns directory if it doesn't exist */
+	mkdir(NETNS_RUN_DIR, S_IRWXU|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH);
+
+	/* Create the filesystem state */
+	if (touch(netns_path, 0) < 0) {
+		fprintf(stderr, "Could not create %s: %s\n",
+			netns_path, strerror(errno));
+		goto out_delete;
+	}
+	if (unshare(CLONE_NEWNET) < 0) {
+		fprintf(stderr, "Failed to create a new network namespace: %s\n",
+			strerror(errno));
+		goto out_delete;
+	}
+
+	/* Bind the netns last so I can watch for it */
+	if (mount("/proc/self/ns/net", netns_path, "none", MS_BIND, NULL) < 0) {
+		fprintf(stderr, "Bind /proc/self/ns/net -> %s failed: %s\n",
+			netns_path, strerror(errno));
+		goto out_delete;
+	}
+	return 0;
+out_delete:
+	netns_delete(argc, argv);
+	exit(-1);
+	return -1;
+}
+
+
+static int netns_monitor(int argc, char **argv)
+{
+	char buf[4096];
+	struct inotify_event *event;
+	int fd;
+	fd = inotify_init();
+	if (fd < 0) {
+		fprintf(stderr, "inotify_init failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	if (inotify_add_watch(fd, NETNS_RUN_DIR, IN_CREATE | IN_DELETE) < 0) {
+		fprintf(stderr, "inotify_add_watch failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+	for(;;) {
+		ssize_t len = read(fd, buf, sizeof(buf));
+		if (len < 0) {
+			fprintf(stderr, "read failed: %s\n",
+				strerror(errno));
+			return -1;
+		}
+		for (event = (struct inotify_event *)buf;
+		     (char *)event < &buf[len];
+		     event = (struct inotify_event *)((char *)event + sizeof(*event) + event->len)) {
+			if (event->mask & IN_CREATE)
+				printf("add %s\n", event->name);
+			if (event->mask & IN_DELETE)
+				printf("delete %s\n", event->name);
+		}
+	}
+	return 0;
+}
+
+int do_netns(int argc, char **argv)
+{
+	if (argc < 1)
+		return netns_list(0, NULL);
+
+	if ((matches(*argv, "list") == 0) || (matches(*argv, "show") == 0) ||
+	    (matches(*argv, "lst") == 0))
+		return netns_list(argc-1, argv+1);
+
+	if (matches(*argv, "help") == 0)
+		usage();
+
+	if (matches(*argv, "add") == 0)
+		return netns_add(argc-1, argv+1);
+
+	if (matches(*argv, "delete") == 0)
+		return netns_delete(argc-1, argv+1);
+
+	if (matches(*argv, "exec") == 0)
+		return netns_exec(argc-1, argv+1);
+
+	if (matches(*argv, "monitor") == 0)
+		return netns_monitor(argc-1, argv+1);
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip netns help\".\n", *argv);
+	exit(-1);
+}
diff --git a/man/man8/ip.8 b/man/man8/ip.8
index c5248ef..1935dc5 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -85,6 +85,9 @@ ip \- show / manipulate routing, devices, policy routing and tunnels
 .B  netns
 .IR PID " |"
 .br
+.B  netns
+.IR NETNSNAME " |"
+.br
 .B alias
 .IR NAME  " |"
 .br
@@ -162,6 +165,17 @@ tentative " | " deprecated " | " dadfailed " | " temporary " ]"
 .BR "ip addrlabel" " { " list " | " flush " }"
 
 .ti -8
+.BR "ip netns" " { " list " | " monitor " } "
+
+.ti -8
+.BR "ip netns" " { " add " | " delete " } "
+.I NETNSNAME
+
+.ti -8
+.BR "ip netns exec "
+.I NETNSNAME command ...
+
+.ti -8
 .BR "ip route" " { "
 .BR list " | " flush " } "
 .I  SELECTOR
@@ -1006,6 +1020,11 @@ move the device to the network namespace associated with the process
 .IR "PID".
 
 .TP
+.BI netns " NETNSNAME"
+move the device to the network namespace associated with name
+.IR "NETNSNAME".
+
+.TP
 .BI alias " NAME"
 give the device a symbolic name for easy reference.
 
@@ -2470,6 +2489,43 @@ at any time.
 It prepends the history with the state snapshot dumped at the moment
 of starting.
 
+.SH ip netns - process network namespace management
+
+A network namespace is logically another copy of the network stack,
+with it's own routes, firewall rules, and network devices.
+
+By convention a named network namespace is an object at
+.BR "/var/run/netns/" NAME
+that can be opened.  The file descriptor resulting from opening
+.BR "/var/run/netns/" NAME 
+refers to the specified network namespace.  Holding that file
+descriptor open keeps the network namespace alive.  The file
+descriptor can be used with the
+.B setns(2)
+system call to change the network namespace associated with a task.
+
+The convention for network namespace aware applications is to look
+for global network configuration files first in
+.BR "/etc/netns/" NAME "/"
+then in
+.BR "/etc/".
+For example, if you want a different version of
+.BR /etc/resolv.conf
+for a network namespace used to isolate your vpn you would name it
+.BR /etc/netns/myvpn/resolv.conf.
+
+.B ip netns exec
+automates handling of this configuration, file convention for network
+namespace unaware applications, by creating a mount namespace and
+bind mounting all of the per network namespace configure files into
+their traditional location in /etc.
+
+.SS ip netns list - show all of the named network namespaces
+.SS ip netns monitor - report when network namespace names are created and destroyed
+.SS ip netns add NAME - create a new named network namespace
+.SS ip netns delete NAME - delete the name of a network namespace
+.SS ip netns exec NAME cmd ... - Run cmd in the named network namespace
+
 .SH ip xfrm - setting xfrm
 xfrm is an IP framework, which can transform format of the datagrams,
 .br
-- 
1.7.5.1.217.g4e3aa


^ permalink raw reply related

* Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Eric Dumazet @ 2011-05-26 20:55 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, mst, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1306441442.5180.51.camel@localhost.localdomain>

Le jeudi 26 mai 2011 à 13:24 -0700, Shirley Ma a écrit :

> I could reduce callback pointer by moving it to *arg, but not desc, this
> indicates that which buffer DMA hasn't done yet in *arg.


I guess you dont need to use skb itself to hold all your states ?

I understand its convenient for you, but I believe its worth the pain to
use only one pointer to a (small) object where you put all your stuff.

Some machines alloc/free millions of skbs per second. 

If/when most skb uses are for userspace zero-copy buffers we can embbed
your small object in skb itself ;)

^ permalink raw reply

* [PATCH 2/3] net: Add linux/sysctl.h includes where needed.
From: David Miller @ 2011-05-26 20:43 UTC (permalink / raw)
  To: netdev; +Cc: mingo


Several networking headers were depending upon the implicit
linux/sysctl.h include they get when including linux/net.h

Add explicit includes.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/netfilter.h   |    1 +
 include/net/net_namespace.h |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 7fa95df..857f502 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -13,6 +13,7 @@
 #endif
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <linux/sysctl.h>
 
 /* Responses from hook functions. */
 #define NF_DROP 0
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 3ae4919..07a60e8 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -7,6 +7,7 @@
 #include <asm/atomic.h>
 #include <linux/workqueue.h>
 #include <linux/list.h>
+#include <linux/sysctl.h>
 
 #include <net/netns/core.h>
 #include <net/netns/mib.h>
-- 
1.7.4.4


^ permalink raw reply related

* [PATCH 3/3] net: Kill ratelimit.h dependency in linux/net.h
From: David Miller @ 2011-05-26 20:43 UTC (permalink / raw)
  To: netdev; +Cc: mingo


Ingo Molnar noticed that we have this unnecessary ratelimit.h
dependency in linux/net.h, which hid compilation problems from
people doing builds only with CONFIG_NET enabled.

Move this stuff out to a seperate net/net_ratelimit.h file and
include that in the only two places where this thing is needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/net.h        |    6 ------
 net/core/sysctl_net_core.c |    1 +
 net/core/utils.c           |    1 +
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 1da55e9..b299230 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -289,11 +289,5 @@ extern int kernel_sock_shutdown(struct socket *sock,
 	MODULE_ALIAS("net-pf-" __stringify(pf) "-proto-" __stringify(proto) \
 		     "-type-" __stringify(type))
 
-#ifdef CONFIG_SYSCTL
-#include <linux/sysctl.h>
-#include <linux/ratelimit.h>
-extern struct ratelimit_state net_ratelimit_state;
-#endif
-
 #endif /* __KERNEL__ */
 #endif	/* _LINUX_NET_H */
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index a829e3f..77a65f0 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -17,6 +17,7 @@
 
 #include <net/ip.h>
 #include <net/sock.h>
+#include <net/net_ratelimit.h>
 
 #ifdef CONFIG_RPS
 static int rps_sock_flow_sysctl(ctl_table *table, int write,
diff --git a/net/core/utils.c b/net/core/utils.c
index 2012bc7..386e263f 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -27,6 +27,7 @@
 #include <linux/ratelimit.h>
 
 #include <net/sock.h>
+#include <net/net_ratelimit.h>
 
 #include <asm/byteorder.h>
 #include <asm/system.h>
-- 
1.7.4.4


^ permalink raw reply related

* [PATCH 1/3] net: Kill ether_table[] declaration.
From: David Miller @ 2011-05-26 20:43 UTC (permalink / raw)
  To: netdev; +Cc: mingo


This got missed back in 2006 when Jes Sorensen deleted
net/ethernet/sysctl_net_ether.c

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/if_ether.h |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/include/linux/if_ether.h b/include/linux/if_ether.h
index 0f1325d..0065ffd 100644
--- a/include/linux/if_ether.h
+++ b/include/linux/if_ether.h
@@ -132,10 +132,6 @@ static inline struct ethhdr *eth_hdr(const struct sk_buff *skb)
 
 int eth_header_parse(const struct sk_buff *skb, unsigned char *haddr);
 
-#ifdef CONFIG_SYSCTL
-extern struct ctl_table ether_table[];
-#endif
-
 int mac_pton(const char *s, u8 *mac);
 extern ssize_t sysfs_format_mac(char *buf, const unsigned char *addr, int len);
 
-- 
1.7.4.4


^ permalink raw reply related

* Re: [PATCH V6 0/4 net-next] macvtap/vhost TX zero-copy support
From: Shirley Ma @ 2011-05-26 20:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <20110526202836.GA1951@redhat.com>

On Thu, 2011-05-26 at 23:28 +0300, Michael S. Tsirkin wrote:
> On Thu, May 26, 2011 at 01:00:20PM -0700, Shirley Ma wrote:
> > 3. Add sleep in vhost shutting down instead of busy-wait for
> outstanding
> >    DMAs.
> 
> I still think this is not much better. We need to use a
> completion structure and wait on it instead.
> If this gets blocked thinkably a tx watchdog can fire and save us
> from blocking forver :) 

Ok, I can add a completion structure here.

Thanks
Shirley

^ permalink raw reply

* Re: [PATCH V6 0/4 net-next] macvtap/vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-26 20:28 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1306440020.5180.47.camel@localhost.localdomain>

On Thu, May 26, 2011 at 01:00:20PM -0700, Shirley Ma wrote:
> 3. Add sleep in vhost shutting down instead of busy-wait for outstanding
>    DMAs.

I still think this is not much better. We need to use a
completion structure and wait on it instead.
If this gets blocked thinkably a tx watchdog can fire and save us
from blocking forver :)

^ permalink raw reply

* Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Shirley Ma @ 2011-05-26 20:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, mst, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1306440247.2543.14.camel@edumazet-laptop>

On Thu, 2011-05-26 at 22:04 +0200, Eric Dumazet wrote:
> Le jeudi 26 mai 2011 à 12:36 -0700, Shirley Ma a écrit :
> > This patch adds userspace buffers support in skb shared info. A new 
> > struct skb_ubuf_info is needed to maintain the userspace buffers
> > argument and index, a callback is used to notify userspace to
> release
> > the buffers once lower device has done DMA (Last reference to that
> skb
> > has gone).
> > 
> > If there is any userspace apps to reference these userspace buffers,
> > then these userspaces buffers will be copied into kernel. This way
> we
> > can prevent userspace apps to hold these userspace buffers too long.
> > 
> > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> > ---
> > 
> >  include/linux/skbuff.h |   26 +++++++++++++++
> >  net/core/skbuff.c      |   80
> ++++++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 104 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index d0ae90a..025de5c 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -189,6 +189,18 @@ enum {
> >       SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
> >  };
> >  
> > +/*
> > + * The callback notifies userspace to release buffers when skb DMA
> is done in
> > + * lower device, the skb last reference should be 0 when calling
> this.
> > + * The desc is used to track userspace buffer index.
> > + */
> > +struct skb_ubuf_info {
> > +     /* support buffers allocation from userspace */
> > +     void            (*callback)(struct sk_buff *);
> > +     void            *arg;
> > +     size_t          desc;
> > +};
> 
> Thats 24 bytes on each skb...   desc for example is not used in this
> patch (yes, its used later in patch 4/4)
> 
> But still... could you instead use one pointer only in skb ?

I could reduce callback pointer by moving it to *arg, but not desc, this
indicates that which buffer DMA hasn't done yet in *arg.

> 
> > +
> >  /* This data is invariant across clones and lives at
> >   * the end of the header data, ie. at skb->end.
> >   */
> > @@ -211,6 +223,10 @@ struct skb_shared_info {
> >       /* Intermediate layers must ensure that destructor_arg
> >        * remains valid until skb destructor */
> >       void *          destructor_arg;
> > +
> > +     /* DMA mapping from/to userspace buffers */
> > +     struct skb_ubuf_info ubuf;
> > +
> >       /* must be last field, see pskb_expand_head() */
> >       skb_frag_t      frags[MAX_SKB_FRAGS];
> >  };
> > @@ -2261,5 +2277,15 @@ static inline void
> skb_checksum_none_assert(struct sk_buff *skb)
> >  }
> >  
> >  bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
> > +
> > +/*
> > + *   skb_ubuf - is the buffer from userspace
> > + *   @skb: buffer to check
> > + */
> > +static inline int skb_ubuf(const struct sk_buff *skb)
> > +{
> > +     return (skb_shinfo(skb)->ubuf.callback != NULL);
> 
>         return skb_shinfo(skb)->ubuf.callback != NULL;
> 
> > +}
> > +
> >  #endif       /* __KERNEL__ */
> >  #endif       /* _LINUX_SKBUFF_H */
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 7ebeed0..890447c 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size,
> gfp_t gfp_mask,
> >       shinfo = skb_shinfo(skb);
> >       memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> >       atomic_set(&shinfo->dataref, 1);
> > +     shinfo->ubuf.callback = NULL;
> > +     shinfo->ubuf.arg = NULL;
> 
> if you put ubuf ptr before dataref, no need to add this (the memset()
> clear all shared_info up to dataref)

Ok, will remove it.

> >       kmemcheck_annotate_variable(shinfo->destructor_arg);
> >  
> >       if (fclone) {
> > @@ -328,6 +330,14 @@ static void skb_release_data(struct sk_buff
> *skb)
> >                               put_page(skb_shinfo(skb)->frags[i].page);
> >               }
> >  
> > +             /*
> > +              * if skb buf is from userspace, we need to notify the
> caller
> > +              * the lower device DMA has done;
> > +              */
> > +             if (skb_ubuf(skb)) {
> > +                     skb_shinfo(skb)->ubuf.callback(skb);
> > +                     skb_shinfo(skb)->ubuf.callback = NULL;
> > +             }
> >               if (skb_has_frag_list(skb))
> >                       skb_drop_fraglist(skb);
> >  
> > @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int
> skb_size)
> >       if (irqs_disabled())
> >               return false;
> >  
> > +     if (skb_ubuf(skb))
> > +             return false;
> > +
> >       if (skb_is_nonlinear(skb) || skb->fclone !=
> SKB_FCLONE_UNAVAILABLE)
> >               return false;
> >  
> > @@ -572,6 +585,7 @@ static struct sk_buff *__skb_clone(struct
> sk_buff *n, struct sk_buff *skb)
> >       atomic_set(&n->users, 1);
> >  
> >       atomic_inc(&(skb_shinfo(skb)->dataref));
> > +     skb_shinfo(skb)->ubuf.callback = NULL;
> >       skb->cloned = 1;
> >  
> >       return n;
> > @@ -595,6 +609,48 @@ struct sk_buff *skb_morph(struct sk_buff *dst,
> struct sk_buff *src)
> >  }
> >  EXPORT_SYMBOL_GPL(skb_morph);
> >  
> > +/* skb frags copy userspace buffers to kernel */
> > +static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
> > +{
> > +     int i;
> > +     int num_frags = skb_shinfo(skb)->nr_frags;
> > +     struct page *page, *head = NULL;
> > +
> > +     for (i = 0; i < num_frags; i++) {
> > +             u8 *vaddr;
> > +             skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> > +
> > +             page = alloc_page(GFP_ATOMIC);
> > +             if (!page) {
> > +                     while (head) {
> > +                             put_page(head);
> > +                             head = (struct page *)head->private;
> > +                     }
> > +                     return -ENOMEM;
> > +             }
> > +             vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
> > +             memcpy(page_address(page), vaddr + f->page_offset,
> f->size);
> > +             kunmap_skb_frag(vaddr);
> > +             page->private = (unsigned long)head;
> > +             head = page;
> > +     }
> > +
> > +     /* skb frags release userspace buffers */
> > +     for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> > +             put_page(skb_shinfo(skb)->frags[i].page);
> > +     skb_shinfo(skb)->ubuf.callback(skb);
> > +     skb_shinfo(skb)->ubuf.callback = NULL;
> > +
> > +     /* skb frags point to kernel buffers */
> > +     for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
> > +             skb_shinfo(skb)->frags[i - 1].page_offset = 0;
> > +             skb_shinfo(skb)->frags[i - 1].page = head;
> > +             head = (struct page *)head->private;
> > +     }
> > +     return 0;
> > +}
> > +
> > +
> >  /**
> >   *   skb_clone       -       duplicate an sk_buff
> >   *   @skb: buffer to clone
> > @@ -613,6 +669,11 @@ struct sk_buff *skb_clone(struct sk_buff *skb,
> gfp_t gfp_mask)
> >  {
> >       struct sk_buff *n;
> >  
> > +     if (skb_ubuf(skb)) {
> > +             if (skb_copy_ubufs(skb, gfp_mask))
> > +                     return NULL;
> > +     }
> > +
> >       n = skb + 1;
> >       if (skb->fclone == SKB_FCLONE_ORIG &&
> >           n->fclone == SKB_FCLONE_UNAVAILABLE) {
> > @@ -730,6 +791,12 @@ struct sk_buff *pskb_copy(struct sk_buff *skb,
> gfp_t gfp_mask)
> >       if (skb_shinfo(skb)->nr_frags) {
> >               int i;
> >  
> > +             if (skb_ubuf(skb)) {
> > +                     if (skb_copy_ubufs(skb, gfp_mask)) {
> > +                             kfree(n);
> > +                             goto out;
> > +                     }
> > +             }
> >               for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> >                       skb_shinfo(n)->frags[i] =
> skb_shinfo(skb)->frags[i];
> >                       get_page(skb_shinfo(n)->frags[i].page);
> > @@ -743,6 +810,7 @@ struct sk_buff *pskb_copy(struct sk_buff *skb,
> gfp_t gfp_mask)
> >       }
> >  
> >       copy_skb_header(n, skb);
> > +
> 
> please dont add new lines like this in your patch

checkpatch doesn't check this kind of line, will remove it.

> >  out:
> >       return n;
> >  }
> > @@ -787,7 +855,6 @@ int pskb_expand_head(struct sk_buff *skb, int
> nhead, int ntail,
> >               fastpath = true;
> >       else {
> >               int delta = skb->nohdr ? (1 << SKB_DATAREF_SHIFT) +
> 1 : 1;
> > -
> 
> and dont delete this one as well.
> 
> >               fastpath = atomic_read(&skb_shinfo(skb)->dataref) ==
> delta;
> >       }
> >  
> > @@ -818,14 +885,19 @@ int pskb_expand_head(struct sk_buff *skb, int
> nhead, int ntail,
> >       if (fastpath) {
> >               kfree(skb->head);
> >       } else {
> > +             /* copy this zero copy skb frags */
> > +             if (skb_ubuf(skb)) {
> > +                     if (skb_copy_ubufs(skb, gfp_mask))
> > +                             goto nofrags;
> > +             }
> >               for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> >                       get_page(skb_shinfo(skb)->frags[i].page);
> > -
> 
> ditto
> 
> >               if (skb_has_frag_list(skb))
> >                       skb_clone_fraglist(skb);
> >  
> >               skb_release_data(skb);
> >       }
> > +
> 
> ditto
> 
> >       off = (data + nhead) - skb->head;
> >  
> >       skb->head     = data;
> > @@ -852,6 +924,8 @@ adjust_others:
> >       atomic_set(&skb_shinfo(skb)->dataref, 1);
> >       return 0;
> >  
> > +nofrags:
> > +     kfree(data);
> >  nodata:
> >       return -ENOMEM;
> >  }
> > @@ -1353,6 +1427,8 @@ int skb_copy_bits(const struct sk_buff *skb,
> int offset, void *to, int len)
> >               }
> >               start = end;
> >       }
> > +     skb_shinfo(skb)->ubuf.callback = NULL;
> > +
> >       if (!len)
> >               return 0;
> >  
> > 
> > 
> 
> 

^ permalink raw reply

* Re: [PATCH] af-packet: Add flag to distinguish VID 0 from no-vlan.
From: David Miller @ 2011-05-26 20:19 UTC (permalink / raw)
  To: greearb; +Cc: netdev
In-Reply-To: <1306366595-731-1-git-send-email-greearb@candelatech.com>

From: greearb@candelatech.com
Date: Wed, 25 May 2011 16:36:35 -0700

> From: Ben Greear <greearb@candelatech.com>
> 
> Currently, user-space cannot determine if a 0 tcp_vlan_tci
> means there is no VLAN tag or the VLAN ID was zero.
> 
> Add flag to make this explicit.  User-space can check for
> TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards
> compatible. Older could would have just checked for tp_vlan_tci,
> so it will work no worse than before.
> 
> Signed-off-by: Ben Greear <greearb@candelatech.com>

Please do this in tpacket_rcv() too, otherwise the feature
bit won't be available to mmap() users of AF_PACKET.

Thanks.

^ permalink raw reply

* Re: [BUG] netconsole broken by scheduler updates
From: David Miller @ 2011-05-26 20:18 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: laurent.riffard, netdev
In-Reply-To: <1306403308.1200.44.camel@twins>

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu, 26 May 2011 11:48:28 +0200

> Also, are you very _very_ sure that lockdep message is a false positive?

I suspect these warnings are entirely specific to using netconsole
with the forcedeth driver.

There is some really scary stuff in there, to say the least.

Just have a look at nv_do_nic_poll().  That's the first time I've
seen a networking driver use the {disable,enable}_irq_lockdep()
interfaces.

It's also using netif_tx_unlock_bh() in the "recovery" case which
in netpoll will cause a local_bh_enable() with cpu interrupts
disabled.  Which is not allowed.

Much of the forcedeth driver's locking seems suspect to me.

^ permalink raw reply

* Re: [ANNOUNCE]: Release of iptables-1.4.11
From: Jan Engelhardt @ 2011-05-26 20:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Netfilter Development Mailinglist, NetDev,
	netfilter-announce, 'netfilter@vger.kernel.org'
In-Reply-To: <1306434537.2543.4.camel@edumazet-laptop>

On Thursday 2011-05-26 20:28, Eric Dumazet wrote:

>Le jeudi 26 mai 2011 à 18:53 +0200, Patrick McHardy a écrit :
>> The netfilter coreteam presents:
>> 
>>     iptables version 1.4.10
>> 
>> the iptables release for the 2.6.39 kernels. Due to some mistakes
>> on my side we didn't have a release for longer than expected, so
>> this contains a rather large number of changes.
>> 
>> Changes include:
>> 
>
>...
>> - a new iptables option "-C" to check for existance of a rules
>
>Nice, but this still loads modules...

So does iptables -S (and -L).  :)


^ permalink raw reply

* Re: [PATCH v3]net:8021q:vlan.c Fix pr_info to just give the vlan fullname and version.
From: Justin P. Mattock @ 2011-05-26 20:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, joe, greearb
In-Reply-To: <20110526.145601.735416831290933990.davem@davemloft.net>

On 05/26/2011 11:56 AM, David Miller wrote:
> From: "Justin P. Mattock"<justinmattock@gmail.com>
> Date: Mon, 23 May 2011 23:43:48 -0700
>
>> The below patch removes vlan_buggyright and vlan_copyright from vlan_proto_init,
>> so that it prints out just the fullname of vlan and the version number.
>>
>> before:
>>
>> [   30.438203] 802.1Q VLAN Support v1.8 Ben Greear<greearb@candelatech.com>
>> [   30.441542] All bugs added by David S. Miller<davem@redhat.com>
>>
>> after:
>>
>> [   31.513910] 802.1Q VLAN Support v1.8
>>
>> Signed-off-by: Justin P. Mattock<justinmattock@gmail.com>
>
> Applied, thanks Justin.
>

cool..

Justin P. Mattock

^ permalink raw reply

* Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Eric Dumazet @ 2011-05-26 20:04 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, mst, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <1306438593.5180.33.camel@localhost.localdomain>

Le jeudi 26 mai 2011 à 12:36 -0700, Shirley Ma a écrit :
> This patch adds userspace buffers support in skb shared info. A new 
> struct skb_ubuf_info is needed to maintain the userspace buffers
> argument and index, a callback is used to notify userspace to release
> the buffers once lower device has done DMA (Last reference to that skb
> has gone).
> 
> If there is any userspace apps to reference these userspace buffers,
> then these userspaces buffers will be copied into kernel. This way we
> can prevent userspace apps to hold these userspace buffers too long.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  include/linux/skbuff.h |   26 +++++++++++++++
>  net/core/skbuff.c      |   80 ++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 104 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d0ae90a..025de5c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -189,6 +189,18 @@ enum {
>  	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
>  };
>  
> +/*
> + * The callback notifies userspace to release buffers when skb DMA is done in
> + * lower device, the skb last reference should be 0 when calling this.
> + * The desc is used to track userspace buffer index.
> + */
> +struct skb_ubuf_info {
> +	/* support buffers allocation from userspace */
> +	void		(*callback)(struct sk_buff *);
> +	void		*arg;
> +	size_t		desc;
> +};

Thats 24 bytes on each skb...   desc for example is not used in this
patch (yes, its used later in patch 4/4)

But still... could you instead use one pointer only in skb ?



> +
>  /* This data is invariant across clones and lives at
>   * the end of the header data, ie. at skb->end.
>   */
> @@ -211,6 +223,10 @@ struct skb_shared_info {
>  	/* Intermediate layers must ensure that destructor_arg
>  	 * remains valid until skb destructor */
>  	void *		destructor_arg;
> +
> +	/* DMA mapping from/to userspace buffers */
> +	struct skb_ubuf_info ubuf;
> +
>  	/* must be last field, see pskb_expand_head() */
>  	skb_frag_t	frags[MAX_SKB_FRAGS];
>  };
> @@ -2261,5 +2277,15 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
>  }
>  
>  bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
> +
> +/*
> + *	skb_ubuf - is the buffer from userspace
> + *	@skb: buffer to check
> + */
> +static inline int skb_ubuf(const struct sk_buff *skb)
> +{
> +	return (skb_shinfo(skb)->ubuf.callback != NULL);

	return skb_shinfo(skb)->ubuf.callback != NULL;

> +}
> +
>  #endif	/* __KERNEL__ */
>  #endif	/* _LINUX_SKBUFF_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7ebeed0..890447c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	shinfo = skb_shinfo(skb);
>  	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>  	atomic_set(&shinfo->dataref, 1);
> +	shinfo->ubuf.callback = NULL;
> +	shinfo->ubuf.arg = NULL;

if you put ubuf ptr before dataref, no need to add this (the memset()
clear all shared_info up to dataref)

>  	kmemcheck_annotate_variable(shinfo->destructor_arg);
>  
>  	if (fclone) {
> @@ -328,6 +330,14 @@ static void skb_release_data(struct sk_buff *skb)
>  				put_page(skb_shinfo(skb)->frags[i].page);
>  		}
>  
> +		/*
> +		 * if skb buf is from userspace, we need to notify the caller
> +		 * the lower device DMA has done;
> +		 */
> +		if (skb_ubuf(skb)) {
> +			skb_shinfo(skb)->ubuf.callback(skb);
> +			skb_shinfo(skb)->ubuf.callback = NULL;
> +		}
>  		if (skb_has_frag_list(skb))
>  			skb_drop_fraglist(skb);
>  
> @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
>  	if (irqs_disabled())
>  		return false;
>  
> +	if (skb_ubuf(skb))
> +		return false;
> +
>  	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
>  		return false;
>  
> @@ -572,6 +585,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
>  	atomic_set(&n->users, 1);
>  
>  	atomic_inc(&(skb_shinfo(skb)->dataref));
> +	skb_shinfo(skb)->ubuf.callback = NULL;
>  	skb->cloned = 1;
>  
>  	return n;
> @@ -595,6 +609,48 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
>  }
>  EXPORT_SYMBOL_GPL(skb_morph);
>  
> +/* skb frags copy userspace buffers to kernel */
> +static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
> +{
> +	int i;
> +	int num_frags = skb_shinfo(skb)->nr_frags;
> +	struct page *page, *head = NULL;
> +
> +	for (i = 0; i < num_frags; i++) {
> +		u8 *vaddr;
> +		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> +
> +		page = alloc_page(GFP_ATOMIC);
> +		if (!page) {
> +			while (head) {
> +				put_page(head);
> +				head = (struct page *)head->private;
> +			}
> +			return -ENOMEM;
> +		}
> +		vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
> +		memcpy(page_address(page), vaddr + f->page_offset, f->size);
> +		kunmap_skb_frag(vaddr);
> +		page->private = (unsigned long)head;
> +		head = page;
> +	}
> +
> +	/* skb frags release userspace buffers */
> +	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> +		put_page(skb_shinfo(skb)->frags[i].page);
> +	skb_shinfo(skb)->ubuf.callback(skb);
> +	skb_shinfo(skb)->ubuf.callback = NULL;
> +
> +	/* skb frags point to kernel buffers */
> +	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
> +		skb_shinfo(skb)->frags[i - 1].page_offset = 0;
> +		skb_shinfo(skb)->frags[i - 1].page = head;
> +		head = (struct page *)head->private;
> +	}
> +	return 0;
> +}
> +
> +
>  /**
>   *	skb_clone	-	duplicate an sk_buff
>   *	@skb: buffer to clone
> @@ -613,6 +669,11 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
>  {
>  	struct sk_buff *n;
>  
> +	if (skb_ubuf(skb)) {
> +		if (skb_copy_ubufs(skb, gfp_mask))
> +			return NULL;
> +	}
> +
>  	n = skb + 1;
>  	if (skb->fclone == SKB_FCLONE_ORIG &&
>  	    n->fclone == SKB_FCLONE_UNAVAILABLE) {
> @@ -730,6 +791,12 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
>  	if (skb_shinfo(skb)->nr_frags) {
>  		int i;
>  
> +		if (skb_ubuf(skb)) {
> +			if (skb_copy_ubufs(skb, gfp_mask)) {
> +				kfree(n);
> +				goto out;
> +			}
> +		}
>  		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>  			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
>  			get_page(skb_shinfo(n)->frags[i].page);
> @@ -743,6 +810,7 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
>  	}
>  
>  	copy_skb_header(n, skb);
> +

please dont add new lines like this in your patch

>  out:
>  	return n;
>  }
> @@ -787,7 +855,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		fastpath = true;
>  	else {
>  		int delta = skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1;
> -

and dont delete this one as well.

>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>  	}
>  
> @@ -818,14 +885,19 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	if (fastpath) {
>  		kfree(skb->head);
>  	} else {
> +		/* copy this zero copy skb frags */
> +		if (skb_ubuf(skb)) {
> +			if (skb_copy_ubufs(skb, gfp_mask))
> +				goto nofrags;
> +		}
>  		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>  			get_page(skb_shinfo(skb)->frags[i].page);
> -

ditto

>  		if (skb_has_frag_list(skb))
>  			skb_clone_fraglist(skb);
>  
>  		skb_release_data(skb);
>  	}
> +

ditto

>  	off = (data + nhead) - skb->head;
>  
>  	skb->head     = data;
> @@ -852,6 +924,8 @@ adjust_others:
>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>  	return 0;
>  
> +nofrags:
> +	kfree(data);
>  nodata:
>  	return -ENOMEM;
>  }
> @@ -1353,6 +1427,8 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
>  		}
>  		start = end;
>  	}
> +	skb_shinfo(skb)->ubuf.callback = NULL;
> +
>  	if (!len)
>  		return 0;
>  
> 
> 

Thanks

^ permalink raw reply

* [PATCH V6 0/4 net-next] macvtap/vhost TX zero-copy support
From: Shirley Ma @ 2011-05-26 20:00 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

This patchset add supports for TX zero-copy between guest and host
kernel through vhost. It significantly reduces CPU utilization on the
local host on which the guest is located (It reduced about 50% CPU usage
for single stream test on the host, while 4K message size BW has
increased about 50%). The patchset is based on previous submission and
comments from the community regarding when/how to handle guest kernel
buffers to be released. This is the simplest approach I can think of
after comparing with several other solutions.

This patchset has integrated V3 review comments from community: 

1. Add more comments on how to use device ZEROCOPY flag;

2. Change device ZEROCOPY to available bit 31

3. Fix skb header linear allocation when virtio_net GSO is not enabled

It has integrated V4 review comments from MST and Sridhar:

1. In vhost, using socket poll wake up for outstanding DMAs

2. Add detailed comments for vhost_zerocopy_signal_used call

3. Add sleep in vhost shutting down instead of busy-wait for outstanding
   DMAs.

4. Copy small packets, don't do zero-copy callback in mavtap, mark it's
   DMA done in vhost

5. change zerocopy to bool in macvtap.

It integrates V5 review comments from MST and 


Michał Mirosław <mirqus@gmail.com>

1. Prevent userspace apps from holding skb userspace buffers by copying
userspace buffers to kernel in skb_clone, skb_copy, pskb_copy,
pskb_expand_head.

2. It is also used HIGHDMA, SG feature bits to enable ZEROCOPY to remove
the dependency of a new feature bit, we can add it later when new
feature bit is available.

This patchset includes:

1/4: Add a new sock zero-copy flag, SOCK_ZEROCOPY;

2/4: Add a new struct skb_ubuf_info in skb_share_info for userspace
buffers release callback when lower device DMA has done for that skb,
which is the last reference count gone; Or whenever skb_clone, skb_copy,
pskb_copy, pskb_expand_head get call from tcpdump, filtering, these
userspace
buffers will be copied into kernel ... we don't want userspace apps to
hold
userspace buffers too long.

3/4: Add vhost zero-copy callback in vhost when skb last refcnt is gone;
add vhost_zerocopy_signal_used to notify guest to release TX skb
buffers.

4/4: Add macvtap zero-copy in lower device when sending packet is
greater than 256 bytes.

The patchset is built against most recent net-next linux 2.6.39-rc7. It
has passed netperf/netserver multiple streams stress test, tcpdump
suspended test, dynamically SG change test.

Single TCP_STREAM 120 secs test results over ixgbe 10Gb NIC results:

Message BW(Gb/s)qemu-kvm (NumCPU)vhost-net(NumCPU) PerfTop irq/s
4K      7408.57         92.1%           22.6%           1229
4K(Orig)4913.17         118.1%          84.1%           2086    
8K      9129.90         89.3%           23.3%           1141
8K(Orig)7094.55         115.9%          84.7%           2157
16K     9178.81         89.1%           23.3%           1139
16K(Orig)8927.1         118.7%          83.4%           2262
64K     9171.43         88.4%           24.9%           1253
64K(Orig)9085.85        115.9%          82.4%           2229

For message size less or equal than 2K, there is a known KVM guest TX
overrun issue. With this zero-copy patch, the issue becomes more severe,
guest io_exits has tripled than before, so the performance is not good.
Once the TX overrun problem has been addressed, I will retest the small
message size performance.

 drivers/net/macvtap.c          |  132
++++++++++++++++++++++++++++++++++++---
 drivers/vhost/net.c            |   44 +++++++++++++-
 drivers/vhost/vhost.c          |   49 +++++++++++++++
 drivers/vhost/vhost.h          |   13 ++++
 include/linux/netdevice.h      |   10 +++
 include/linux/skbuff.h         |   26 ++++++++
 include/net/sock.h             |    1 +
 net/core/skbuff.c              |   81 ++++++++++++++++++++++++-
 8 files changed, 345 insertions(+), 17 deletions(-)



^ permalink raw reply

* Re: [PATCH] sctp compilation failed due to lack of sctp_local_addr_free
From: David Miller @ 2011-05-26 19:58 UTC (permalink / raw)
  To: dmitry; +Cc: bruce.w.allan, netdev, difrost.kernel
In-Reply-To: <1306436821.28745.4.camel@lb-tlvb-dmitry>

From: "Dmitry Kravkov" <dmitry@broadcom.com>
Date: Thu, 26 May 2011 22:07:01 +0300

> On Thu, 2011-05-26 at 11:44 -0700, David Miller wrote:
>> From: "Allan, Bruce W" <bruce.w.allan@intel.com>
>> Date: Thu, 26 May 2011 11:40:34 -0700
>> 
>> >>-----Original Message-----
>> >>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>> >>Behalf Of David Miller
>> >>Sent: Thursday, May 26, 2011 10:39 AM
>> >>To: dmitry@broadcom.com
>> >>Cc: netdev@vger.kernel.org; difrost.kernel@gmail.com
>> >>Subject: Re: [PATCH] sctp compilation failed due to lack of sctp_local_addr_free
>> >>
>> >>
>> >>This has been fixed for days.
>> > 
>> > Fixed in net-2.6, yes, but not yet in net-next-2.6
>> 
>> net-next-2.6 is not active and is just a dummy tree for Stephen
>> Rothwell to pull into -next
>> 
>> You should never be referencing or using net-next-2.6 during the
>> merge window when new features changes are not being accepted.
>> Only net-2.6 is active during this time.
>> 
> Hmm, so how similar will be net-2.6 and net-next-2.6 when re-opened?
> You apply a lot of fixes during this time...

When it opens up I just update net-next-2.6 to whatever is in net-2.6
at the time.

^ permalink raw reply

* [PATCH V6 4/4 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-26 19:56 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

Hello Michael,

Let me anything I might miss from your preview's review.

Thanks
Shirley

-----------

This patch maintains the outstanding userspace buffers in the 
sequence it is delivered to vhost. The outstanding userspace buffers 
will be marked as done once the lower device buffers DMA has finished. 
This is monitored through last reference of kfree_skb callback. Two
buffer index are used for this purpose.

The vhost passes the userspace buffers info to lower device skb 
through message control. Since there will be some done DMAs when
entering vhost handle_tx. The worse case is all buffers in the vq are
in pending/done status, so we need to notify guest to release DMA done 
buffers first before get any new buffers from the vq.

Signed-off-by: Shirley <xma@us.ibm.com>
---

 drivers/vhost/net.c   |   44 +++++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |   13 +++++++++++++
 3 files changed, 105 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..b27ba64 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,10 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+/* MAX number of TX used buffers for outstanding zerocopy */
+#define VHOST_MAX_PEND 128
+#define VHOST_GOODCOPY_LEN 256
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -129,6 +133,7 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	struct skb_ubuf_info pend;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
+		/* Release DMAs done buffers first */
+		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
+			vhost_zerocopy_signal_used(vq, false);
+
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
+			/* If more outstanding DMAs, queue the work */
+			if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
+				tx_poll_start(net, sock);
+				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				break;
+			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
 				continue;
@@ -188,6 +203,24 @@ static void handle_tx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
+		/* use msg_control to pass vhost zerocopy ubuf info to skb */
+		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+			vq->heads[vq->upend_idx].id = head;
+			if (len < VHOST_GOODCOPY_LEN)
+				/* copy don't need to wait for DMA done */
+				vq->heads[vq->upend_idx].len =
+							VHOST_DMA_DONE_LEN;
+			else {
+				vq->heads[vq->upend_idx].len = len;
+				pend.callback = vhost_zerocopy_callback;
+				pend.arg = vq;
+				pend.desc = vq->upend_idx;
+				msg.msg_control = &pend;
+				msg.msg_controllen = sizeof(pend);
+			}
+			atomic_inc(&vq->refcnt);
+			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -198,12 +231,21 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
+		/* if upend_idx is full, then wait for free more */
+/*
+		if (unlikely(vq->upend_idx == vq->done_idx)) {
+			tx_poll_start(net, sock);
+			set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				break;
+		}
+*/
 	}
 
 	mutex_unlock(&vq->mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..0424c97 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	vq->upend_idx = 0;
+	vq->done_idx = 0;
+	atomic_set(&vq->refcnt, 0);
 }
 
 static int vhost_worker(void *data)
@@ -385,16 +388,49 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	return 0;
 }
 
+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
+{
+	int i, j = 0;
+
+	for (i = vq->done_idx; i != vq->upend_idx; i = ++i % UIO_MAXIOV) {
+		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
+			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+			vhost_add_used_and_signal(vq->dev, vq,
+						  vq->heads[i].id, 0);
+			++j;
+		} else
+			break;
+	}
+	if (j) {
+		vq->done_idx = i;
+		atomic_sub(j, &vq->refcnt);
+	}
+}
+
 /* Caller should have device mutex */
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
+	unsigned long begin = jiffies;
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
 			vhost_poll_stop(&dev->vqs[i].poll);
 			vhost_poll_flush(&dev->vqs[i].poll);
 		}
+		/* Wait for all lower device DMAs done, then notify virtio_net
+		 * or just notify it without waiting for all DMA done here ?
+		 * in case of DMAs never done for some reason */
+		if (atomic_read(&dev->vqs[i].refcnt)) {
+			/* how long should we wait ? */
+			msleep(1000);
+			vhost_zerocopy_signal_used(&dev->vqs[i], true);
+		}
 		if (dev->vqs[i].error_ctx)
 			eventfd_ctx_put(dev->vqs[i].error_ctx);
 		if (dev->vqs[i].error)
@@ -603,6 +639,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 
 	mutex_lock(&vq->mutex);
 
+	/* clean up lower device outstanding DMAs, before setting ring */
+	if (atomic_read(&vq->refcnt))
+		vhost_zerocopy_signal_used(vq, true);
+
 	switch (ioctl) {
 	case VHOST_SET_VRING_NUM:
 		/* Resizing ring with an active backend?
@@ -1416,3 +1456,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
+
+void vhost_zerocopy_callback(struct sk_buff *skb)
+{
+	int idx = skb_shinfo(skb)->ubuf.desc;
+	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
+
+	/* set len = 1 to mark this desc buffers done DMA */
+	vq->heads[idx].len = VHOST_DMA_DONE_LEN;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..d0e7ac6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,6 +13,11 @@
 #include <linux/virtio_ring.h>
 #include <asm/atomic.h>
 
+/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
+ * done */
+#define VHOST_DMA_DONE_LEN	1
+#define VHOST_DMA_CLEAR_LEN	0
+
 struct vhost_device;
 
 struct vhost_work;
@@ -108,6 +113,12 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	/* vhost zerocopy support */
+	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+	/* last used idx for outstanding DMA zerocopy buffers */
+	int upend_idx;
+	/* first used idx for DMA done zerocopy buffers */
+	int done_idx;
 };
 
 struct vhost_dev {
@@ -154,6 +165,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(struct sk_buff *skb);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \

^ permalink raw reply related

* Re: Kernel crash after using new Intel NIC (igb)
From: Eric Dumazet @ 2011-05-26 19:47 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Maximilian Engelhardt, linux-kernel, netdev, StuStaNet Vorstand
In-Reply-To: <4DDEAA3C.7020502@fb.com>

Le jeudi 26 mai 2011 à 12:30 -0700, Arun Sharma a écrit :
> On 5/24/11 11:35 PM, Eric Dumazet wrote:
> 
> >> Another possibility is to do the list_empty() check twice. Once without
> >> taking the lock and again with the spinlock held.
> >>
> >
> > Why ?
> >
> 
> Part of the problem is that I don't have a precise understanding of the 
> race condition that's causing the list to become corrupted.
> 
> All I know is that doing it under the lock fixes it. If it's slowing 
> things down, we do a check outside the lock (since it's cheap). But if 
> we get the wrong answer, we verify it again under the lock.
> 

You dont get the problem. Problem is : We can do the empty() test only
if protected by the lock.

If not locked, result can be wrong. [ false positive or negative ]


> > list_del_init(&p->unused); (done under lock of course) is safe, you can
> > call it twice, no problem.
> 
> Doing it twice is not a problem. But doing it when we shouldn't be doing 
> it could be the problem.
> 
> The list modification under unused_peers.lock looks generally safe. But 
> the control flow (based on refcnt) done outside the lock might have races.
> 

"might" is not a good word when dealing with this ;)

> Eg: inet_putpeer() might find the refcnt go to zero, but before it adds 
> it to the unused list, another thread may be doing inet_getpeer() and 
> set refcnt to 1. In the end, we end up with a node that's potentially in 
> use, but ends up on the unused list.
> 

Did you test my fix ?

Its doing the right thing : Using refcnt as the only marker to say if
the item must be removed from unused list (and lock the central lock
protecting this list only when needed)

Since we already must do an atomic operation on refcnt, using
atomic_inc_return [ or similar full barrier op ] is enough to tell us
the truth.

^ permalink raw reply

* [PATCH V6 3/4]macvtap: macvtap TX zero-copy support
From: Shirley Ma @ 2011-05-26 19:42 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

Only when buffer size is greater than GOODCOPY_LEN (256), macvtap
enables zero-copy.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 drivers/net/macvtap.c |  132 ++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 121 insertions(+), 11 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..97ad224 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
  */
 static dev_t macvtap_major;
 #define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN 256
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 {
 	struct net *net = current->nsproxy->net_ns;
 	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvtap_queue *q;
 	int err;
 
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
+	/*
+	 * so far only KVM virtio_net uses macvtap, enable zero copy between
+	 * guest kernel and host kernel when lower device supports highdma
+	 * and sg
+	 */
+	if (vlan) {
+		if (vlan->lowerdev->features & (NETIF_F_HIGHDMA | NETIF_F_SG))
+			sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+	}
+
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
 		sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
 	return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+	int size, offset1 = 0;
+	int i = 0;
+	skb_frag_t *f;
+
+	/* Skip over from offset */
+	while (count && (offset >= from->iov_len)) {
+		offset -= from->iov_len;
+		++from;
+		--count;
+	}
+
+	/* copy up to skb headlen */
+	while (count && (copy > 0)) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+			++from;
+			--count;
+		}
+		copy -= size;
+		offset1 += size;
+		offset = 0;
+	}
+
+	if (len == offset1)
+		return 0;
+
+	while (count--) {
+		struct page *page[MAX_SKB_FRAGS];
+		int num_pages;
+		unsigned long base;
+
+		len = from->iov_len - offset1;
+		if (!len) {
+			offset1 = 0;
+			++from;
+			continue;
+		}
+		base = (unsigned long)from->iov_base + offset1;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if ((num_pages != size) ||
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+			/* put_page is in skb free */
+			return -EFAULT;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += len;
+		atomic_add(len, &skb->sk->sk_wmem_alloc);
+		while (len) {
+			f = &skb_shinfo(skb)->frags[i];
+			f->page = page[i];
+			f->page_offset = base & ~PAGE_MASK;
+			f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			base += f->size;
+			len -= f->size;
+			i++;
+		}
+		offset1 = 0;
+		++from;
+	}
+	return 0;
+}
+
 /*
  * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
  * be shared with the tun/tap driver.
@@ -515,16 +601,18 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
 
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
-				const struct iovec *iv, size_t count,
-				int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+				const struct iovec *iv, unsigned long total_len,
+				size_t count, int noblock)
 {
 	struct sk_buff *skb;
 	struct macvlan_dev *vlan;
-	size_t len = count;
+	unsigned long len = total_len;
 	int err;
 	struct virtio_net_hdr vnet_hdr = { 0 };
 	int vnet_hdr_len = 0;
+	int copylen;
+	bool zerocopy = false;
 
 	if (q->flags & IFF_VNET_HDR) {
 		vnet_hdr_len = q->vnet_hdr_sz;
@@ -552,12 +640,30 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	if (unlikely(len < ETH_HLEN))
 		goto err;
 
-	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
-				noblock, &err);
+	if (m && m->msg_control)
+		zerocopy = true;
+
+	if (zerocopy) {
+		/* There are 256 bytes to be copied in skb, so there is enough
+		 * room for skb expand head in case it is used.
+		 * The rest buffer is mapped from userspace.
+		 */
+		copylen = vnet_hdr.hdr_len;
+		if (!copylen)
+			copylen = GOODCOPY_LEN;
+	} else
+		copylen = len;
+
+	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+				vnet_hdr.hdr_len, noblock, &err);
 	if (!skb)
 		goto err;
 
-	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+	if (zerocopy)
+		err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+	else
+		err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+						   len);
 	if (err)
 		goto err_kfree;
 
@@ -573,13 +679,17 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 
 	rcu_read_lock_bh();
 	vlan = rcu_dereference_bh(q->vlan);
+	/* copy skb_ubuf_info for callback when skb has no error */
+	if (zerocopy)
+		memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+		       sizeof(struct skb_ubuf_info));
 	if (vlan)
 		macvlan_start_xmit(skb, vlan->dev);
 	else
 		kfree_skb(skb);
 	rcu_read_unlock_bh();
 
-	return count;
+	return total_len;
 
 err_kfree:
 	kfree_skb(skb);
@@ -601,8 +711,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = file->private_data;
 
-	result = macvtap_get_user(q, iv, iov_length(iv, count),
-			      file->f_flags & O_NONBLOCK);
+	result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+				  file->f_flags & O_NONBLOCK);
 	return result;
 }
 
@@ -815,7 +925,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
 			   struct msghdr *m, size_t total_len)
 {
 	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
-	return macvtap_get_user(q, m->msg_iov, total_len,
+	return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
 			    m->msg_flags & MSG_DONTWAIT);
 }
 

^ permalink raw reply related

* [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Shirley Ma @ 2011-05-26 19:36 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

This patch adds userspace buffers support in skb shared info. A new 
struct skb_ubuf_info is needed to maintain the userspace buffers
argument and index, a callback is used to notify userspace to release
the buffers once lower device has done DMA (Last reference to that skb
has gone).

If there is any userspace apps to reference these userspace buffers,
then these userspaces buffers will be copied into kernel. This way we
can prevent userspace apps to hold these userspace buffers too long.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/skbuff.h |   26 +++++++++++++++
 net/core/skbuff.c      |   80 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d0ae90a..025de5c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,6 +189,18 @@ enum {
 	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
 };
 
+/*
+ * The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the skb last reference should be 0 when calling this.
+ * The desc is used to track userspace buffer index.
+ */
+struct skb_ubuf_info {
+	/* support buffers allocation from userspace */
+	void		(*callback)(struct sk_buff *);
+	void		*arg;
+	size_t		desc;
+};
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -211,6 +223,10 @@ struct skb_shared_info {
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	/* DMA mapping from/to userspace buffers */
+	struct skb_ubuf_info ubuf;
+
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
@@ -2261,5 +2277,15 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
 }
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+/*
+ *	skb_ubuf - is the buffer from userspace
+ *	@skb: buffer to check
+ */
+static inline int skb_ubuf(const struct sk_buff *skb)
+{
+	return (skb_shinfo(skb)->ubuf.callback != NULL);
+}
+
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ebeed0..890447c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
+	shinfo->ubuf.callback = NULL;
+	shinfo->ubuf.arg = NULL;
 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 
 	if (fclone) {
@@ -328,6 +330,14 @@ static void skb_release_data(struct sk_buff *skb)
 				put_page(skb_shinfo(skb)->frags[i].page);
 		}
 
+		/*
+		 * if skb buf is from userspace, we need to notify the caller
+		 * the lower device DMA has done;
+		 */
+		if (skb_ubuf(skb)) {
+			skb_shinfo(skb)->ubuf.callback(skb);
+			skb_shinfo(skb)->ubuf.callback = NULL;
+		}
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
 
@@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (irqs_disabled())
 		return false;
 
+	if (skb_ubuf(skb))
+		return false;
+
 	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
 		return false;
 
@@ -572,6 +585,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	atomic_set(&n->users, 1);
 
 	atomic_inc(&(skb_shinfo(skb)->dataref));
+	skb_shinfo(skb)->ubuf.callback = NULL;
 	skb->cloned = 1;
 
 	return n;
@@ -595,6 +609,48 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+/* skb frags copy userspace buffers to kernel */
+static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	int i;
+	int num_frags = skb_shinfo(skb)->nr_frags;
+	struct page *page, *head = NULL;
+
+	for (i = 0; i < num_frags; i++) {
+		u8 *vaddr;
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+
+		page = alloc_page(GFP_ATOMIC);
+		if (!page) {
+			while (head) {
+				put_page(head);
+				head = (struct page *)head->private;
+			}
+			return -ENOMEM;
+		}
+		vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
+		memcpy(page_address(page), vaddr + f->page_offset, f->size);
+		kunmap_skb_frag(vaddr);
+		page->private = (unsigned long)head;
+		head = page;
+	}
+
+	/* skb frags release userspace buffers */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+		put_page(skb_shinfo(skb)->frags[i].page);
+	skb_shinfo(skb)->ubuf.callback(skb);
+	skb_shinfo(skb)->ubuf.callback = NULL;
+
+	/* skb frags point to kernel buffers */
+	for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
+		skb_shinfo(skb)->frags[i - 1].page_offset = 0;
+		skb_shinfo(skb)->frags[i - 1].page = head;
+		head = (struct page *)head->private;
+	}
+	return 0;
+}
+
+
 /**
  *	skb_clone	-	duplicate an sk_buff
  *	@skb: buffer to clone
@@ -613,6 +669,11 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	struct sk_buff *n;
 
+	if (skb_ubuf(skb)) {
+		if (skb_copy_ubufs(skb, gfp_mask))
+			return NULL;
+	}
+
 	n = skb + 1;
 	if (skb->fclone == SKB_FCLONE_ORIG &&
 	    n->fclone == SKB_FCLONE_UNAVAILABLE) {
@@ -730,6 +791,12 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
 	if (skb_shinfo(skb)->nr_frags) {
 		int i;
 
+		if (skb_ubuf(skb)) {
+			if (skb_copy_ubufs(skb, gfp_mask)) {
+				kfree(n);
+				goto out;
+			}
+		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
 			get_page(skb_shinfo(n)->frags[i].page);
@@ -743,6 +810,7 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
 	}
 
 	copy_skb_header(n, skb);
+
 out:
 	return n;
 }
@@ -787,7 +855,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = true;
 	else {
 		int delta = skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1;
-
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
@@ -818,14 +885,19 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	if (fastpath) {
 		kfree(skb->head);
 	} else {
+		/* copy this zero copy skb frags */
+		if (skb_ubuf(skb)) {
+			if (skb_copy_ubufs(skb, gfp_mask))
+				goto nofrags;
+		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			get_page(skb_shinfo(skb)->frags[i].page);
-
 		if (skb_has_frag_list(skb))
 			skb_clone_fraglist(skb);
 
 		skb_release_data(skb);
 	}
+
 	off = (data + nhead) - skb->head;
 
 	skb->head     = data;
@@ -852,6 +924,8 @@ adjust_others:
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;
 
+nofrags:
+	kfree(data);
 nodata:
 	return -ENOMEM;
 }
@@ -1353,6 +1427,8 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 		}
 		start = end;
 	}
+	skb_shinfo(skb)->ubuf.callback = NULL;
+
 	if (!len)
 		return 0;
 

^ permalink raw reply related

* [PATCH V6 1/4 net-next] sock.h: Add a new sock zero-copy flag
From: Shirley Ma @ 2011-05-26 19:32 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/net/sock.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 01810a3..ab09097 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -562,6 +562,7 @@ enum sock_flags {
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
+	SOCK_ZEROCOPY, /* buffers from userspace */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)



^ permalink raw reply related

* [PATCH V6 0/4]
From: Shirley Ma @ 2011-05-26 19:29 UTC (permalink / raw)
  To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
  Cc: netdev, kvm, linux-kernel

This patchset add supports for TX zero-copy between guest and host
kernel through vhost. It significantly reduces CPU utilization on the
local host on which the guest is located (It reduced about 50% CPU usage
for single stream test on the host, while 4K message size BW has
increased about 50%). The patchset is based on previous submission and
comments from the community regarding when/how to handle guest kernel
buffers to be released. This is the simplest approach I can think of
after comparing with several other solutions.

This patchset has integrated V3 review comments from community: 

1. Add more comments on how to use device ZEROCOPY flag;

2. Change device ZEROCOPY to available bit 31

3. Fix skb header linear allocation when virtio_net GSO is not enabled

It has integrated V4 review comments from MST and Sridhar:

1. In vhost, using socket poll wake up for outstanding DMAs

2. Add detailed comments for vhost_zerocopy_signal_used call

3. Add sleep in vhost shutting down instead of busy-wait for outstanding
   DMAs.

4. Copy small packets, don't do zero-copy callback in mavtap, mark it's
   DMA done in vhost

5. change zerocopy to bool in macvtap.

It integrates V5 review comments from MST and 


Michał Mirosław <mirqus@gmail.com>

1. Prevent userspace apps from holding skb userspace buffers by copying
userspace buffers to kernel in skb_clone, skb_copy, pskb_copy,
pskb_expand_head.

2. It is also used HIGHDMA, SG feature bits to enable ZEROCOPY to remove
the dependency of a new feature bit, we can add it later when new
feature bit is available.

This patchset includes:

1/4: Add a new sock zero-copy flag, SOCK_ZEROCOPY;

2/4: Add a new struct skb_ubuf_info in skb_share_info for userspace
buffers release callback when lower device DMA has done for that skb,
which is the last reference count gone; Or whenever skb_clone, skb_copy,
pskb_copy, pskb_expand_head get call from tcpdump, filtering, these userspace
buffers will be copied into kernel ... we don't want userspace apps to hold
userspace buffers too long.

3/4: Add vhost zero-copy callback in vhost when skb last refcnt is gone;
add vhost_zerocopy_signal_used to notify guest to release TX skb
buffers.

4/4: Add macvtap zero-copy in lower device when sending packet is
greater than 256 bytes.

The patchset is built against most recent net-next linux 2.6.39-rc7. It
has passed netperf/netserver multiple streams stress test, tcpdump
suspended test, dynamically SG change test.

Single TCP_STREAM 120 secs test results over ixgbe 10Gb NIC results:

Message BW(Gb/s)qemu-kvm (NumCPU)vhost-net(NumCPU) PerfTop irq/s
4K      7408.57         92.1%           22.6%           1229
4K(Orig)4913.17         118.1%          84.1%           2086    
8K      9129.90         89.3%           23.3%           1141
8K(Orig)7094.55         115.9%          84.7%           2157
16K     9178.81         89.1%           23.3%           1139
16K(Orig)8927.1         118.7%          83.4%           2262
64K     9171.43         88.4%           24.9%           1253
64K(Orig)9085.85        115.9%          82.4%           2229

For message size less or equal than 2K, there is a known KVM guest TX
overrun issue. With this zero-copy patch, the issue becomes more severe,
guest io_exits has tripled than before, so the performance is not good.
Once the TX overrun problem has been addressed, I will retest the small
message size performance.

 drivers/net/macvtap.c          |  132 ++++++++++++++++++++++++++++++++++++---
 drivers/vhost/net.c            |   44 +++++++++++++-
 drivers/vhost/vhost.c          |   49 +++++++++++++++
 drivers/vhost/vhost.h          |   13 ++++
 include/linux/netdevice.h      |   10 +++
 include/linux/skbuff.h         |   26 ++++++++
 include/net/sock.h             |    1 +
 net/core/skbuff.c              |   81 ++++++++++++++++++++++++-
 8 files changed, 345 insertions(+), 17 deletions(-)



^ permalink raw reply

* Re: Kernel crash after using new Intel NIC (igb)
From: Arun Sharma @ 2011-05-26 19:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Maximilian Engelhardt, linux-kernel, netdev, StuStaNet Vorstand
In-Reply-To: <1306305331.3305.22.camel@edumazet-laptop>

On 5/24/11 11:35 PM, Eric Dumazet wrote:

>> Another possibility is to do the list_empty() check twice. Once without
>> taking the lock and again with the spinlock held.
>>
>
> Why ?
>

Part of the problem is that I don't have a precise understanding of the 
race condition that's causing the list to become corrupted.

All I know is that doing it under the lock fixes it. If it's slowing 
things down, we do a check outside the lock (since it's cheap). But if 
we get the wrong answer, we verify it again under the lock.

> list_del_init(&p->unused); (done under lock of course) is safe, you can
> call it twice, no problem.

Doing it twice is not a problem. But doing it when we shouldn't be doing 
it could be the problem.

The list modification under unused_peers.lock looks generally safe. But 
the control flow (based on refcnt) done outside the lock might have races.

Eg: inet_putpeer() might find the refcnt go to zero, but before it adds 
it to the unused list, another thread may be doing inet_getpeer() and 
set refcnt to 1. In the end, we end up with a node that's potentially in 
use, but ends up on the unused list.

  -Arun

^ permalink raw reply

* Re: [patch] net/core/filter.c: Fix build error
From: Ingo Molnar @ 2011-05-26 19:12 UTC (permalink / raw)
  To: David Miller; +Cc: joe, greearb, linux-kernel, linux-arch, arnd, netdev
In-Reply-To: <20110526190939.GC3476@elte.hu>


* Ingo Molnar <mingo@elte.hu> wrote:

> So no, i don't think your patch is the real solution. [...]

s/real/complete

Note, your patch solves a real problem: the ratelimited WARN_ON()s 
should not be in the generic bug.h header, that's just broken.

I just don't think this is the only problem: the other problem was 
what hid the bug in the first place, the spurious ratelimit.h 
inclusion via net.h.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-26 19:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michał Mirosław, Ben Hutchings, David Miller,
	Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
	linux-kernel
In-Reply-To: <20110526084916.GA17928@redhat.com>

On Thu, 2011-05-26 at 11:49 +0300, Michael S. Tsirkin wrote:
> > 
> > > - SG support
> > > - HIGHDMA support (on arches where this makes sense)
> > 
> > This can be checked by device flags.
> 
> OK, but pls note that SG can get turned off dynamically.

Tested the patch w/i SG dynmically on/off and tcpdump suspended.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox