Netdev List
* [PATCH net-next 4/4] cnic: Handle RAMROD_CMD_ID_CLOSE error.
From: Michael Chan @ 2012-06-28  1:08 UTC
  To: davem; +Cc: netdev
In-Reply-To: <1340845704-12580-3-git-send-email-mchan@broadcom.com>

From: Eddie Wai <eddie.wai@broadcom.com>

If firmware returns error status, proceed to close the iSCSI connection.
Update version to 2.5.11.

Signed-off-by: Eddie Wai <eddie.wai@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/cnic.c    |    9 +++++++++
 drivers/net/ethernet/broadcom/cnic_if.h |    4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index ec43df1..f897306 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -3953,6 +3953,15 @@ static void cnic_cm_process_kcqe(struct cnic_dev *dev, struct kcqe *kcqe)
 		cnic_cm_upcall(cp, csk, opcode);
 		break;
 
+	case L5CM_RAMROD_CMD_ID_CLOSE:
+		if (l4kcqe->status != 0) {
+			netdev_warn(dev->netdev, "RAMROD CLOSE compl with "
+				    "status 0x%x\n", l4kcqe->status);
+			opcode = L4_KCQE_OPCODE_VALUE_CLOSE_COMP;
+			/* Fall through */
+		} else {
+			break;
+		}
 	case L4_KCQE_OPCODE_VALUE_RESET_RECEIVED:
 	case L4_KCQE_OPCODE_VALUE_CLOSE_COMP:
 	case L4_KCQE_OPCODE_VALUE_RESET_COMP:
diff --git a/drivers/net/ethernet/broadcom/cnic_if.h b/drivers/net/ethernet/broadcom/cnic_if.h
index d63d455..54f68f0 100644
--- a/drivers/net/ethernet/broadcom/cnic_if.h
+++ b/drivers/net/ethernet/broadcom/cnic_if.h
@@ -14,8 +14,8 @@
 
 #include "bnx2x/bnx2x_mfw_req.h"
 
-#define CNIC_MODULE_VERSION	"2.5.10"
-#define CNIC_MODULE_RELDATE	"March 21, 2012"
+#define CNIC_MODULE_VERSION	"2.5.11"
+#define CNIC_MODULE_RELDATE	"June 27, 2012"
 
 #define CNIC_ULP_RDMA		0
 #define CNIC_ULP_ISCSI		1
-- 
1.7.1
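
For readers following the control flow, the new case can be sketched outside the driver as below. This is a minimal userspace illustration, not the cnic code: the enum values and the helper name effective_opcode() are hypothetical stand-ins, and only the decision (a non-zero status is handled like a close completion, a zero status needs nothing further in this path) is taken from the patch.

```c
#include <stdint.h>

/* Hypothetical opcode values for illustration only; the real constants
 * are defined in the cnic/bnx2x headers. */
enum kcqe_opcode {
	OP_NONE         = 0,	/* nothing further to do in this path */
	OP_RAMROD_CLOSE = 1,	/* stands in for L5CM_RAMROD_CMD_ID_CLOSE */
	OP_CLOSE_COMP   = 2,	/* stands in for L4_KCQE_OPCODE_VALUE_CLOSE_COMP */
};

/*
 * Return the opcode the close path should act on.  A RAMROD CLOSE
 * completion with a non-zero status is treated as CLOSE_COMP so the
 * connection is still closed; a clean completion needs no action here.
 * Every other opcode is passed through unchanged.
 */
static enum kcqe_opcode effective_opcode(enum kcqe_opcode opcode,
					 uint32_t status)
{
	switch (opcode) {
	case OP_RAMROD_CLOSE:
		if (status == 0)
			return OP_NONE;
		return OP_CLOSE_COMP;
	default:
		return opcode;
	}
}
```

Returning the remapped opcode from a helper avoids the conditional fall-through used in the patch; which form reads better inside cnic_cm_process_kcqe() is a matter of taste.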


* [PATCH net-next 2/4] cnic: Read bnx2x function number from internal register
From: Michael Chan @ 2012-06-28  1:08 UTC
  To: davem; +Cc: netdev
In-Reply-To: <1340845704-12580-1-git-send-email-mchan@broadcom.com>

From: Eddie Wai <eddie.wai@broadcom.com>

so that it will work on any hypervisor.

Signed-off-by: Eddie Wai <eddie.wai@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/cnic.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 31b05ad..5980443 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -4988,8 +4988,14 @@ static int cnic_start_bnx2x_hw(struct cnic_dev *dev)
 	cp->port_mode = CHIP_PORT_MODE_NONE;
 
 	if (BNX2X_CHIP_IS_E2_PLUS(cp->chip_id)) {
-		u32 val = CNIC_RD(dev, MISC_REG_PORT4MODE_EN_OVWR);
+		u32 val;
+
+		pci_read_config_dword(dev->pcidev, PCICFG_ME_REGISTER, &val);
+		cp->func = (u8) ((val & ME_REG_ABS_PF_NUM) >>
+				 ME_REG_ABS_PF_NUM_SHIFT);
+		func = CNIC_FUNC(cp);
 
+		val = CNIC_RD(dev, MISC_REG_PORT4MODE_EN_OVWR);
 		if (!(val & 1))
 			val = CNIC_RD(dev, MISC_REG_PORT4MODE_EN);
 		else
-- 
1.7.1
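
The mask-and-shift step that derives the function number can be sketched in isolation as follows. The field position used here (three bits starting at bit 16) is an assumption made for the example; the real ME_REG_ABS_PF_NUM and ME_REG_ABS_PF_NUM_SHIFT definitions come from the bnx2x register headers, and the EX_ names and abs_pf_num() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical field layout: a 3-bit function number at bits 16..18.
 * The real mask and shift are defined by the bnx2x register headers. */
#define EX_ABS_PF_NUM_SHIFT	16
#define EX_ABS_PF_NUM_MASK	(UINT32_C(7) << EX_ABS_PF_NUM_SHIFT)

/* Isolate the field with the mask, then shift it down to bit 0. */
static uint8_t abs_pf_num(uint32_t me_reg)
{
	return (uint8_t)((me_reg & EX_ABS_PF_NUM_MASK) >> EX_ABS_PF_NUM_SHIFT);
}
```

Bits outside the mask do not affect the result, so the value read from the register can be passed in unmodified.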


* [PATCH net-next 1/4] cnic: Fix occasional NULL pointer dereference during reboot.
From: Michael Chan @ 2012-06-28  1:08 UTC
  To: davem; +Cc: netdev

We register with bnx2x before we allocate ctx_tbl structure, so it is
possible for bnx2x to call cnic_ctl before the structure is allocated.
This can sometimes cause NULL pointer dereference of cp->ctx_tbl.  We
fix this by adding simple checking for valid state before proceeding.
The cnic_ctl call is RCU protected so we don't have to deal with race
conditions.

Because of the additional checking, we need to finish the shutdown
before clearing the CNIC_UP flag.

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/cnic.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 0e9be2b..31b05ad 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -291,6 +291,9 @@ static int cnic_get_l5_cid(struct cnic_local *cp, u32 cid, u32 *l5_cid)
 {
 	u32 i;
 
+	if (!cp->ctx_tbl)
+		return -EINVAL;
+
 	for (i = 0; i < cp->max_cid_space; i++) {
 		if (cp->ctx_tbl[i].cid == cid) {
 			*l5_cid = i;
@@ -3220,6 +3223,9 @@ static int cnic_ctl(void *data, struct cnic_ctl_info *info)
 		u32 l5_cid;
 		struct cnic_local *cp = dev->cnic_priv;
 
+		if (!test_bit(CNIC_F_CNIC_UP, &dev->flags))
+			break;
+
 		if (cnic_get_l5_cid(cp, cid, &l5_cid) == 0) {
 			struct cnic_context *ctx = &cp->ctx_tbl[l5_cid];
 
@@ -4253,8 +4259,6 @@ static int cnic_cm_shutdown(struct cnic_dev *dev)
 	struct cnic_local *cp = dev->cnic_priv;
 	int i;
 
-	cp->stop_cm(dev);
-
 	if (!cp->csk_tbl)
 		return 0;
 
@@ -5290,6 +5294,7 @@ static void cnic_stop_hw(struct cnic_dev *dev)
 			i++;
 		}
 		cnic_shutdown_rings(dev);
+		cp->stop_cm(dev);
 		clear_bit(CNIC_F_CNIC_UP, &dev->flags);
 		RCU_INIT_POINTER(cp->ulp_ops[CNIC_ULP_L4], NULL);
 		synchronize_rcu();
-- 
1.7.1
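
The guard added to cnic_get_l5_cid() can be shown with a small standalone sketch. The struct and function names below (ex_context, ex_local, ex_get_l5_cid) are hypothetical reductions of the driver structures, kept only to show that a lookup against a table that has not been allocated yet fails cleanly and leaves the output untouched.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical versions of the driver structures. */
struct ex_context {
	uint32_t cid;
};

struct ex_local {
	struct ex_context *ctx_tbl;	/* NULL until the table is allocated */
	uint32_t max_cid_space;
};

/* Look up cid in the table; fail cleanly if the table does not exist yet. */
static int ex_get_l5_cid(const struct ex_local *cp, uint32_t cid,
			 uint32_t *l5_cid)
{
	uint32_t i;

	if (!cp->ctx_tbl)
		return -EINVAL;

	for (i = 0; i < cp->max_cid_space; i++) {
		if (cp->ctx_tbl[i].cid == cid) {
			*l5_cid = i;
			return 0;
		}
	}
	return -EINVAL;
}
```

The sketch covers only the table check; the CNIC_F_CNIC_UP test and the RCU protection described in the commit message are not modelled.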


* [PATCH net-next 2/2] bnx2: Add missing netif_tx_disable() in bnx2_close()
From: Michael Chan @ 2012-06-28  1:08 UTC
  To: davem; +Cc: netdev
In-Reply-To: <1340845704-12580-5-git-send-email-mchan@broadcom.com>

to stop all tx queues.  Update version to 2.2.3.

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index e6116ec..9eb7624 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -58,8 +58,8 @@
 #include "bnx2_fw.h"
 
 #define DRV_MODULE_NAME		"bnx2"
-#define DRV_MODULE_VERSION	"2.2.2"
-#define DRV_MODULE_RELDATE	"June 16, 2012"
+#define DRV_MODULE_VERSION	"2.2.3"
+#define DRV_MODULE_RELDATE	"June 27, 2012"
 #define FW_MIPS_FILE_06		"bnx2/bnx2-mips-06-6.2.3.fw"
 #define FW_RV2P_FILE_06		"bnx2/bnx2-rv2p-06-6.0.15.fw"
 #define FW_MIPS_FILE_09		"bnx2/bnx2-mips-09-6.2.1b.fw"
@@ -6703,6 +6703,7 @@ bnx2_close(struct net_device *dev)
 
 	bnx2_disable_int_sync(bp);
 	bnx2_napi_disable(bp);
+	netif_tx_disable(dev);
 	del_timer_sync(&bp->timer);
 	bnx2_shutdown_chip(bp);
 	bnx2_free_irq(bp);
-- 
1.7.1


* [PATCH net-next 3/4] cnic: Remove uio mem[0].
From: Michael Chan @ 2012-06-28  1:08 UTC
  To: davem; +Cc: netdev
In-Reply-To: <1340845704-12580-2-git-send-email-mchan@broadcom.com>

This memory region is no longer used.  Userspace gets the BAR address
directly from sysfs.

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/cnic.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 5980443..ec43df1 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -1063,10 +1063,7 @@ static int cnic_init_uio(struct cnic_dev *dev)
 
 	uinfo = &udev->cnic_uinfo;
 
-	uinfo->mem[0].addr = dev->netdev->base_addr;
-	uinfo->mem[0].internal_addr = dev->regview;
-	uinfo->mem[0].size = dev->netdev->mem_end - dev->netdev->mem_start;
-	uinfo->mem[0].memtype = UIO_MEM_PHYS;
+	uinfo->mem[0].memtype = UIO_MEM_NONE;
 
 	if (test_bit(CNIC_F_BNX2_CLASS, &dev->flags)) {
 		uinfo->mem[1].addr = (unsigned long) cp->status_blk.gen &
-- 
1.7.1


* [PATCH net-next 1/2] bnx2: Add "fall through" comments
From: Michael Chan @ 2012-06-28  1:08 UTC
  To: davem; +Cc: netdev
In-Reply-To: <1340845704-12580-4-git-send-email-mchan@broadcom.com>

to indicate that the missing break statements are intended.

Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 9b69a62..e6116ec 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -1972,22 +1972,26 @@ bnx2_remote_phy_event(struct bnx2 *bp)
 		switch (speed) {
 			case BNX2_LINK_STATUS_10HALF:
 				bp->duplex = DUPLEX_HALF;
+				/* fall through */
 			case BNX2_LINK_STATUS_10FULL:
 				bp->line_speed = SPEED_10;
 				break;
 			case BNX2_LINK_STATUS_100HALF:
 				bp->duplex = DUPLEX_HALF;
+				/* fall through */
 			case BNX2_LINK_STATUS_100BASE_T4:
 			case BNX2_LINK_STATUS_100FULL:
 				bp->line_speed = SPEED_100;
 				break;
 			case BNX2_LINK_STATUS_1000HALF:
 				bp->duplex = DUPLEX_HALF;
+				/* fall through */
 			case BNX2_LINK_STATUS_1000FULL:
 				bp->line_speed = SPEED_1000;
 				break;
 			case BNX2_LINK_STATUS_2500HALF:
 				bp->duplex = DUPLEX_HALF;
+				/* fall through */
 			case BNX2_LINK_STATUS_2500FULL:
 				bp->line_speed = SPEED_2500;
 				break;
-- 
1.7.1
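
A reduced sketch of the pattern being annotated, for readers unfamiliar with it: the HALF cases set the duplex and then deliberately share the speed assignment of the following FULL case. The enum and struct names are hypothetical, and the full-duplex default at the top is an assumption of this example rather than something shown in the hunk.

```c
/* Hypothetical, reduced link-status codes and state. */
enum ex_link { EX_10HALF, EX_10FULL, EX_100HALF, EX_100FULL };
enum ex_duplex { EX_DUPLEX_HALF, EX_DUPLEX_FULL };

struct ex_state {
	int line_speed;
	enum ex_duplex duplex;
};

static void ex_decode(struct ex_state *st, enum ex_link speed)
{
	/* Assumed default for this sketch; the HALF cases override it. */
	st->duplex = EX_DUPLEX_FULL;

	switch (speed) {
	case EX_10HALF:
		st->duplex = EX_DUPLEX_HALF;
		/* fall through */
	case EX_10FULL:
		st->line_speed = 10;
		break;
	case EX_100HALF:
		st->duplex = EX_DUPLEX_HALF;
		/* fall through */
	case EX_100FULL:
		st->line_speed = 100;
		break;
	}
}
```

Without the comments, a reader or a static checker cannot tell whether the missing break after each HALF case is intended.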


* [PATCH 00/02] iproute2: Add support for new tunnel type VTI.
From: Saurabh @ 2012-06-28  1:01 UTC
  To: netdev



Resubmitting after taking into account review comments:
The VTI tunnel is applicable to esp, ah and ipcomp.

Introduction:
Virtual tunnel interface (VTI) is a way to represent policy-based IPsec tunnels as virtual interfaces in Linux. This is similar to Cisco's VTI (virtual tunnel interface) and Juniper's representation of a secure tunnel (st.xx). The advantage of representing an IPsec tunnel as an interface is that IPsec tunnels can be plugged into the routing protocol infrastructure of a router. It therefore becomes possible to influence the packet path by toggling the link state of the tunnel or through routing metrics.

Overview:
The Linux kernel does not natively support IPsec as an interface. A secure interface also assumes an IPsec policy 4-tuple of {dst-ip-any, src-ip-any, dst-port-any, src-port-any}. Applying this 4-tuple in Linux would result in all traffic matching the IPsec policy, so a tunnel distinguisher is needed. The Linux kernel skbuff has an fwmark, which is used for policy-based routing (PBR). Linux kernel version 2.6.35 enhanced the SPD/SADB to use the fwmark as part of the IPsec policy, and strongSwan introduced support for this kernel feature in version 4.5.0. We can therefore use the fwmark as the distinguisher for the tunnel interface. We can also create a lightweight tunnel kernel module (vti) to give the notion of an interface to the rest of the kernel routing system. The tunnel module does not do any encapsulation/decapsulation; the kernel's xfrm modules still do the ESP encryption/decryption.

Enhancement to iproute2:
Add support to configure and display VTI tunnel using ioctl and rtnetlink.

Usage:
ip tunnel add sti15 mode vti remote 12.0.0.1 local 12.0.0.3 ikey 15
or
ip link add sti15 type vti key 15 remote 12.0.0.1 local 12.0.0.3

Sample strongswan config would be:
conn peer-12.0.0.1-tunnel-1
   left=12.0.0.3
   right=12.0.0.1
   leftsubnet=0.0.0.0/0
   rightsubnet=0.0.0.0/0
   ike=aes128-sha1-modp1024!
   ikelifetime=28800s
   keyingtries=%forever
   esp=aes128-sha1!
   keylife=3600s
   rekeymargin=540s
   type=tunnel
   pfs=yes
   compress=no
   authby=secret
   auto=start
   mark_in=0xf
   mark_out=0xf
   keyexchange=ikev1


Also you need the iptables rule for ingress esp and udp-4500 packets:
-A PREROUTING -s 12.0.0.1/32 -d 12.0.0.3/32 -p esp -j MARK --set-xmark 0xf/0xffffffff
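
The usage above pairs key 15 with mark 0xf, which are the same number. A small sketch of the arithmetic involved may help: set_xmark() follows the documented behaviour of the MARK target's --set-xmark option (clear the bits in the mask, then XOR in the value), while struct ex_mark and mark_matches() are a hypothetical, simplified model of a value/mask selector and are not taken from the kernel sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* MARK --set-xmark value/mask: clear the bits in mask, then XOR in value. */
static uint32_t set_xmark(uint32_t mark, uint32_t value, uint32_t mask)
{
	return (mark & ~mask) ^ value;
}

/* Hypothetical policy selector: a mark value plus the bits to compare. */
struct ex_mark {
	uint32_t v;
	uint32_t m;
};

/* A packet matches when its mark, restricted to the mask, equals v. */
static bool mark_matches(const struct ex_mark *sel, uint32_t fwmark)
{
	return (fwmark & sel->m) == sel->v;
}
```

With a mask of 0xffffffff the rule overwrites the whole mark, so every ESP packet from the peer carries exactly 0xf regardless of any earlier marking.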

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>

---


* [PATCH 01/02] iproute2: VTI support for ip tunnel command.
From: Saurabh @ 2012-06-28  1:01 UTC
  To: netdev



Configure VTI using 'ip tunnel'.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>

---
diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index 38ccd87..0cf6cf8 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -33,7 +33,7 @@ static void usage(void) __attribute__((noreturn));
 static void usage(void)
 {
 	fprintf(stderr, "Usage: ip tunnel { add | change | del | show | prl | 6rd } [ NAME ]\n");
-	fprintf(stderr, "          [ mode { ipip | gre | sit | isatap } ] [ remote ADDR ] [ local ADDR ]\n");
+	fprintf(stderr, "          [ mode { ipip | gre | sit | isatap | vti } ] [ remote ADDR ] [ local ADDR ]\n");
 	fprintf(stderr, "          [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]\n");
 	fprintf(stderr, "          [ prl-default ADDR ] [ prl-nodefault ADDR ] [ prl-delete ADDR ]\n");
 	fprintf(stderr, "          [ 6rd-prefix ADDR ] [ 6rd-relay_prefix ADDR ] [ 6rd-reset ]\n");
@@ -94,6 +94,13 @@ static int parse_args(int argc, char **argv, int cmd, struct ip_tunnel_parm *p)
 				}
 				p->iph.protocol = IPPROTO_IPV6;
 				isatap++;
+			} else if (strcmp(*argv, "vti") == 0) {
+				if (p->iph.protocol && p->iph.protocol != IPPROTO_IPIP) {
+					fprintf(stderr, "You managed to ask for more than one tunnel mode.\n");
+					exit(-1);
+				}
+				p->iph.protocol = IPPROTO_IPIP;
+				p->i_flags |= VTI_ISVTI;
 			} else {
 				fprintf(stderr,"Cannot guess tunnel mode.\n");
 				exit(-1);
@@ -220,6 +227,9 @@ static int parse_args(int argc, char **argv, int cmd, struct ip_tunnel_parm *p)
 		else if (memcmp(p->name, "isatap", 6) == 0) {
 			p->iph.protocol = IPPROTO_IPV6;
 			isatap++;
+		} else if (memcmp(p->name, "vti", 3) == 0) {
+			p->iph.protocol = IPPROTO_IPIP;
+			p->i_flags |= VTI_ISVTI;
 		}
 	}
 
@@ -269,13 +279,16 @@ static int do_add(int cmd, int argc, char **argv)
 
 	switch (p.iph.protocol) {
 	case IPPROTO_IPIP:
-		return tnl_add_ioctl(cmd, "tunl0", p.name, &p);
+		if (p.i_flags != VTI_ISVTI)
+			return tnl_add_ioctl(cmd, "tunl0", p.name, &p);
+		else
+			return tnl_add_ioctl(cmd, "ip_vti0", p.name, &p);
 	case IPPROTO_GRE:
 		return tnl_add_ioctl(cmd, "gre0", p.name, &p);
 	case IPPROTO_IPV6:
 		return tnl_add_ioctl(cmd, "sit0", p.name, &p);
 	default:
-		fprintf(stderr, "cannot determine tunnel mode (ipip, gre or sit)\n");
+		fprintf(stderr, "cannot determine tunnel mode (ipip, gre, vti or sit)\n");
 		return -1;
 	}
 	return -1;
@@ -290,7 +303,10 @@ static int do_del(int argc, char **argv)
 
 	switch (p.iph.protocol) {
 	case IPPROTO_IPIP:
-		return tnl_del_ioctl("tunl0", p.name, &p);
+		if (p.i_flags != VTI_ISVTI)
+			return tnl_del_ioctl("tunl0", p.name, &p);
+		else
+			return tnl_del_ioctl("ip_vti0", p.name, &p);
 	case IPPROTO_GRE:
 		return tnl_del_ioctl("gre0", p.name, &p);
 	case IPPROTO_IPV6:
@@ -479,7 +495,10 @@ static int do_show(int argc, char **argv)
 
 	switch (p.iph.protocol) {
 	case IPPROTO_IPIP:
-		err = tnl_get_ioctl(p.name[0] ? p.name : "tunl0", &p);
+		if (p.i_flags != VTI_ISVTI)
+			err = tnl_get_ioctl(p.name[0] ? p.name : "tunl0", &p);
+		else
+			err = tnl_get_ioctl(p.name[0] ? p.name : "ip_vti0", &p);
 		break;
 	case IPPROTO_GRE:
 		err = tnl_get_ioctl(p.name[0] ? p.name : "gre0", &p);


* [PATCH 02/02] iproute2: VTI support for ip link command.
From: Saurabh @ 2012-06-28  1:01 UTC
  To: netdev



Support for VTI via rt netlink.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>

---
diff --git a/ip/Makefile b/ip/Makefile
index e029ea1..6a518f8 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o ipnetns.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
     iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-    iplink_macvlan.o iplink_macvtap.o ipl2tp.o
+    iplink_macvlan.o iplink_macvtap.o ipl2tp.o link_vti.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/link_vti.c b/ip/link_vti.c
new file mode 100644
index 0000000..385f435
--- /dev/null
+++ b/ip/link_vti.c
@@ -0,0 +1,245 @@
+/*
+ * link_vti.c	VTI driver module
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Herbert Xu <herbert@gondor.apana.org.au>
+ *          Saurabh Mohan <saurabh.mohan@vyatta.com> Modified link_gre.c for VTI
+ */
+
+#include <string.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+
+#include <linux/ip.h>
+#include <linux/if_tunnel.h>
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+#include "tunnel.h"
+
+
+static void usage(void) __attribute__((noreturn));
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip link { add | set | change | replace | del } NAME\n");
+	fprintf(stderr, "          type { vti } [ remote ADDR ] [ local ADDR ]\n");
+	fprintf(stderr, "          [ [i|o]key KEY ]\n");
+	fprintf(stderr, "          [ dev PHYS_DEV ]\n");
+	fprintf(stderr, "\n");
+	fprintf(stderr, "Where: NAME := STRING\n");
+	fprintf(stderr, "       ADDR := { IP_ADDRESS }\n");
+	fprintf(stderr, "       KEY  := { DOTTED_QUAD | NUMBER }\n");
+	exit(-1);
+}
+
+static int vti_parse_opt(struct link_util *lu, int argc, char **argv,
+			 struct nlmsghdr *n)
+{
+	struct {
+		struct nlmsghdr n;
+		struct ifinfomsg i;
+		char buf[1024];
+	} req;
+	struct ifinfomsg *ifi = (struct ifinfomsg *)(n + 1);
+	struct rtattr *tb[IFLA_MAX + 1];
+	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
+	struct rtattr *vtiinfo[IFLA_VTI_MAX + 1];
+	unsigned ikey = 0;
+	unsigned okey = 0;
+	unsigned saddr = 0;
+	unsigned daddr = 0;
+	unsigned link = 0;
+	int len;
+
+	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
+		memset(&req, 0, sizeof(req));
+
+		req.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
+		req.n.nlmsg_flags = NLM_F_REQUEST;
+		req.n.nlmsg_type = RTM_GETLINK;
+		req.i.ifi_family = preferred_family;
+		req.i.ifi_index = ifi->ifi_index;
+
+		if (rtnl_talk(&rth, &req.n, 0, 0, &req.n) < 0) {
+get_failed:
+			fprintf(stderr,
+				"Failed to get existing tunnel info.\n");
+			return -1;
+		}
+
+		len = req.n.nlmsg_len;
+		len -= NLMSG_LENGTH(sizeof(*ifi));
+		if (len < 0)
+			goto get_failed;
+
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+
+		if (!tb[IFLA_LINKINFO])
+			goto get_failed;
+
+		parse_rtattr_nested(linkinfo, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+		if (!linkinfo[IFLA_INFO_DATA])
+			goto get_failed;
+
+		parse_rtattr_nested(vtiinfo, IFLA_VTI_MAX,
+				    linkinfo[IFLA_INFO_DATA]);
+
+		if (vtiinfo[IFLA_VTI_IKEY])
+			ikey = *(__u32 *)RTA_DATA(vtiinfo[IFLA_VTI_IKEY]);
+
+		if (vtiinfo[IFLA_VTI_OKEY])
+			okey = *(__u32 *)RTA_DATA(vtiinfo[IFLA_VTI_OKEY]);
+
+		if (vtiinfo[IFLA_VTI_LOCAL])
+			saddr = *(__u32 *)RTA_DATA(vtiinfo[IFLA_VTI_LOCAL]);
+
+		if (vtiinfo[IFLA_VTI_REMOTE])
+			daddr = *(__u32 *)RTA_DATA(vtiinfo[IFLA_VTI_REMOTE]);
+
+		if (vtiinfo[IFLA_VTI_LINK])
+			link = *(__u32 *)RTA_DATA(vtiinfo[IFLA_VTI_LINK]);
+	}
+
+	while (argc > 0) {
+		if (!matches(*argv, "key")) {
+			unsigned uval;
+
+			NEXT_ARG();
+			if (strchr(*argv, '.'))
+				uval = get_addr32(*argv);
+			else {
+				if (get_unsigned(&uval, *argv, 0) < 0) {
+					fprintf(stderr,
+						"Invalid value for \"key\"\n");
+					exit(-1);
+				}
+				uval = htonl(uval);
+			}
+
+			ikey = okey = uval;
+		} else if (!matches(*argv, "ikey")) {
+			unsigned uval;
+
+			NEXT_ARG();
+			if (strchr(*argv, '.'))
+				uval = get_addr32(*argv);
+			else {
+				if (get_unsigned(&uval, *argv, 0) < 0) {
+					fprintf(stderr, "invalid value of \"ikey\"\n");
+					exit(-1);
+				}
+				uval = htonl(uval);
+			}
+			ikey = uval;
+		} else if (!matches(*argv, "okey")) {
+			unsigned uval;
+
+			NEXT_ARG();
+			if (strchr(*argv, '.'))
+				uval = get_addr32(*argv);
+			else {
+				if (get_unsigned(&uval, *argv, 0) < 0) {
+					fprintf(stderr, "invalid value of \"okey\"\n");
+					exit(-1);
+				}
+				uval = htonl(uval);
+			}
+			okey = uval;
+		} else if (!matches(*argv, "remote")) {
+			NEXT_ARG();
+			if (!strcmp(*argv, "any")) {
+				fprintf(stderr, "invalid value of \"remote\"\n");
+				exit(-1);
+			} else {
+				daddr = get_addr32(*argv);
+			}
+		} else if (!matches(*argv, "local")) {
+			NEXT_ARG();
+			if (!strcmp(*argv, "any")) {
+				fprintf(stderr, "invalid value of \"local\"\n");
+				exit(-1);
+			} else {
+				saddr = get_addr32(*argv);
+			}
+		} else if (!matches(*argv, "dev")) {
+			NEXT_ARG();
+			link = if_nametoindex(*argv);
+			if (link == 0)
+				exit(-1);
+		} else
+			usage();
+		argc--; argv++;
+	}
+
+	addattr32(n, 1024, IFLA_VTI_IKEY, ikey);
+	addattr32(n, 1024, IFLA_VTI_OKEY, okey);
+	addattr_l(n, 1024, IFLA_VTI_LOCAL, &saddr, 4);
+	addattr_l(n, 1024, IFLA_VTI_REMOTE, &daddr, 4);
+	if (link)
+		addattr32(n, 1024, IFLA_VTI_LINK, link);
+
+	return 0;
+}
+
+static void vti_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	char s1[1024];
+	char s2[64];
+	const char *local = "any";
+	const char *remote = "any";
+
+	if (!tb)
+		return;
+
+	if (tb[IFLA_VTI_REMOTE]) {
+		unsigned addr = *(__u32 *)RTA_DATA(tb[IFLA_VTI_REMOTE]);
+
+		if (addr)
+			remote = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
+	}
+
+	fprintf(f, "remote %s ", remote);
+
+	if (tb[IFLA_VTI_LOCAL]) {
+		unsigned addr = *(__u32 *)RTA_DATA(tb[IFLA_VTI_LOCAL]);
+
+		if (addr)
+			local = format_host(AF_INET, 4, &addr, s1, sizeof(s1));
+	}
+
+	fprintf(f, "local %s ", local);
+
+	if (tb[IFLA_VTI_LINK] && *(__u32 *)RTA_DATA(tb[IFLA_VTI_LINK])) {
+		unsigned link = *(__u32 *)RTA_DATA(tb[IFLA_VTI_LINK]);
+		const char *n = if_indextoname(link, s2);
+
+		if (n)
+			fprintf(f, "dev %s ", n);
+		else
+			fprintf(f, "dev %u ", link);
+	}
+
+	if (tb[IFLA_VTI_IKEY]) {
+		inet_ntop(AF_INET, RTA_DATA(tb[IFLA_VTI_IKEY]), s2, sizeof(s2));
+		fprintf(f, "ikey %s ", s2);
+	}
+
+	if (tb[IFLA_VTI_OKEY]) {
+		inet_ntop(AF_INET, RTA_DATA(tb[IFLA_VTI_OKEY]), s2, sizeof(s2));
+		fprintf(f, "okey %s ", s2);
+	}
+}
+
+struct link_util vti_link_util = {
+	.id = "vti",
+	.maxattr = IFLA_VTI_MAX,
+	.parse_opt = vti_parse_opt,
+	.print_opt = vti_print_opt,
+};
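
The KEY := { DOTTED_QUAD | NUMBER } handling in vti_parse_opt() can be sketched as a standalone helper. This is an approximation under stated assumptions: ex_parse_key() is a hypothetical name, inet_pton() and strtoul() stand in for iproute2's get_addr32() and get_unsigned(), and errors are returned instead of calling exit().

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse KEY := { DOTTED_QUAD | NUMBER } into network byte order.
 * Returns 0 on success, -1 on malformed input.
 */
static int ex_parse_key(const char *arg, uint32_t *key_be)
{
	struct in_addr a;
	unsigned long v;
	char *end;

	if (strchr(arg, '.')) {
		if (inet_pton(AF_INET, arg, &a) != 1)
			return -1;
		*key_be = a.s_addr;	/* already in network byte order */
		return 0;
	}

	errno = 0;
	v = strtoul(arg, &end, 0);	/* base 0 accepts decimal, hex, octal */
	if (errno || end == arg || *end != '\0' || v > UINT32_MAX)
		return -1;

	*key_be = htonl((uint32_t)v);
	return 0;
}
```

Because the key is stored in network byte order and vti_print_opt() formats it with inet_ntop(), a key given as 15 would be displayed as 0.0.0.15.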


* [net-next PATCH 00/02] net/ipv4: Add support for new tunnel type VTI.
From: Saurabh @ 2012-06-28  1:02 UTC
  To: netdev



Resubmitting after taking into account review comments:
The VTI tunnel is applicable to esp, ah and ipcomp.

Introduction:
Virtual tunnel interface (VTI) is a way to represent policy-based IPsec tunnels as virtual interfaces in Linux. This is similar to Cisco's VTI (virtual tunnel interface) and Juniper's representation of a secure tunnel (st.xx). The advantage of representing an IPsec tunnel as an interface is that IPsec tunnels can be plugged into the routing protocol infrastructure of a router. It therefore becomes possible to influence the packet path by toggling the link state of the tunnel or through routing metrics.

Overview:
The Linux kernel does not natively support IPsec as an interface. A secure interface also assumes an IPsec policy 4-tuple of {dst-ip-any, src-ip-any, dst-port-any, src-port-any}. Applying this 4-tuple in Linux would result in all traffic matching the IPsec policy, so a tunnel distinguisher is needed. The Linux kernel skbuff has an fwmark, which is used for policy-based routing (PBR). Linux kernel version 2.6.35 enhanced the SPD/SADB to use the fwmark as part of the IPsec policy, and strongSwan introduced support for this kernel feature in version 4.5.0. We can therefore use the fwmark as the distinguisher for the tunnel interface. We can also create a lightweight tunnel kernel module (vti) to give the notion of an interface to the rest of the kernel routing system. The tunnel module does not do any encapsulation/decapsulation; the kernel's xfrm modules still do the ESP encryption/decryption.

Usage:
ip tunnel add sti15 mode vti remote 12.0.0.1 local 12.0.0.3 ikey 15
or
ip link add sti15 type vti key 15 remote 12.0.0.1 local 12.0.0.3

Sample strongswan config would be:
conn peer-12.0.0.1-tunnel-1
   left=12.0.0.3
   right=12.0.0.1
   leftsubnet=0.0.0.0/0
   rightsubnet=0.0.0.0/0
   ike=aes128-sha1-modp1024!
   ikelifetime=28800s
   keyingtries=%forever
   esp=aes128-sha1!
   keylife=3600s
   rekeymargin=540s
   type=tunnel
   pfs=yes
   compress=no
   authby=secret
   auto=start
   mark_in=0xf
   mark_out=0xf
   keyexchange=ikev1


Also you need the iptables rule for ingress esp and udp-4500 packets:
-A PREROUTING -s 12.0.0.1/32 -d 12.0.0.3/32 -p esp -j MARK --set-xmark 0xf/0xffffffff


Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---


* [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Saurabh @ 2012-06-28  1:02 UTC
  To: netdev



Add a hook in the rx path of xfrm4_mode_tunnel for the VTI tunnel module.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e0a55df..04214c0 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1475,6 +1475,8 @@ extern int xfrm4_output(struct sk_buff *skb);
 extern int xfrm4_output_finish(struct sk_buff *skb);
 extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family);
 extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler, unsigned short family);
+extern int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler);
+extern int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler);
 extern int xfrm6_extract_header(struct sk_buff *skb);
 extern int xfrm6_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi);
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index ed4bf11..4fc2944 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -15,6 +15,68 @@
 #include <net/ip.h>
 #include <net/xfrm.h>
 
+/*
+ * Informational hook. The decap is still done here.
+ */
+static struct xfrm_tunnel __rcu *rcv_notify_handlers __read_mostly;
+static DEFINE_MUTEX(xfrm4_mode_tunnel_input_mutex);
+
+int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+
+	int ret = -EEXIST;
+	int priority = handler->priority;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+
+	for (pprev = &rcv_notify_handlers;
+		(t = rcu_dereference_protected(*pprev,
+		lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+		pprev = &t->next) {
+		if (t->priority > priority)
+			break;
+		if (t->priority == priority)
+			goto err;
+
+	}
+
+	handler->next = *pprev;
+	rcu_assign_pointer(*pprev, handler);
+
+	ret = 0;
+
+err:
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);
+
+int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -ENOENT;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+	for (pprev = &rcv_notify_handlers;
+		(t = rcu_dereference_protected(*pprev,
+		lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+		pprev = &t->next) {
+		if (t == handler) {
+			*pprev = handler->next;
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	synchronize_net();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_deregister);
+
 static inline void ipip_ecn_decapsulate(struct sk_buff *skb)
 {
 	struct iphdr *inner_iph = ipip_hdr(skb);
@@ -64,8 +126,14 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb)
 	return 0;
 }
 
+#define for_each_input_rcu(head, handler)	\
+	for (handler = rcu_dereference(head);	\
+		handler != NULL;		\
+		handler = rcu_dereference(handler->next))  \
+
 static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 {
+	struct xfrm_tunnel *handler;
 	int err = -EINVAL;
 
 	if (XFRM_MODE_SKB_CB(skb)->protocol != IPPROTO_IPIP)
@@ -74,6 +142,10 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 		goto out;
 
+	/* The handlers do not consume the skb. */
+	for_each_input_rcu(rcv_notify_handlers, handler)
+		handler->handler(skb);
+
 	if (skb_cloned(skb) &&
 	    (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
 		goto out;
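
The registration logic keeps the handler list sorted by priority and refuses duplicates. A single-threaded sketch of just that list manipulation follows; it deliberately omits the mutex and the RCU accessors used in the patch, and the ex_ names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

struct ex_handler {
	struct ex_handler *next;
	int priority;
};

/* List head; entries are kept in ascending priority order. */
static struct ex_handler *ex_handlers;

/* Insert h before the first entry with a higher priority. */
static int ex_register(struct ex_handler *h)
{
	struct ex_handler **pprev, *t;

	for (pprev = &ex_handlers; (t = *pprev) != NULL; pprev = &t->next) {
		if (t->priority > h->priority)
			break;
		if (t->priority == h->priority)
			return -EEXIST;	/* one handler per priority */
	}

	h->next = *pprev;
	*pprev = h;
	return 0;
}

/* Unlink h if it is on the list. */
static int ex_deregister(struct ex_handler *h)
{
	struct ex_handler **pprev, *t;

	for (pprev = &ex_handlers; (t = *pprev) != NULL; pprev = &t->next) {
		if (t == h) {
			*pprev = h->next;
			return 0;
		}
	}
	return -ENOENT;
}
```

Walking with a pointer to the previous link (pprev) lets insertion and removal at the head use the same code as anywhere else in the list, which is why the patch needs no special case for an empty list.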


* [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: Saurabh @ 2012-06-28  1:02 UTC
  To: netdev



New VTI tunnel kernel module, Kconfig and Makefile changes.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 16b92d0..5efff60 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -80,4 +80,18 @@ enum {
 
 #define IFLA_GRE_MAX	(__IFLA_GRE_MAX - 1)
 
+/* VTI-mode i_flags */
+#define VTI_ISVTI 0x0001
+
+enum {
+	IFLA_VTI_UNSPEC,
+	IFLA_VTI_LINK,
+	IFLA_VTI_IKEY,
+	IFLA_VTI_OKEY,
+	IFLA_VTI_LOCAL,
+	IFLA_VTI_REMOTE,
+	__IFLA_VTI_MAX,
+};
+
+#define IFLA_VTI_MAX	(__IFLA_VTI_MAX - 1)
 #endif /* _IF_TUNNEL_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 20f1cb5..8e5083d 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -310,6 +310,21 @@ config SYN_COOKIES
 
 	  If unsure, say N.
 
+config NET_IPVTI
+	tristate "Virtual (secure) IP: tunneling"
+	select INET_TUNNEL
+	depends on INET_XFRM_MODE_TUNNEL
+	---help---
+	Tunneling means encapsulating data of one protocol type within
+	another protocol and sending it over a channel that understands the
+	encapsulating protocol. This particular tunneling driver implements
+	encapsulation of IP within IP-ESP. This can be used with xfrm to give
+	the notion of a secure tunnel and then use routing protocol on top.
+
+	Saying Y to this option will produce one module ( = code which can
+	be inserted in and removed from the running kernel whenever you
+	want). Most people won't need this and can say N.
+
 config INET_AH
 	tristate "IP: AH transformation"
 	select XFRM_ALGO
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ff75d3b..3999ce9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IP_MROUTE) += ipmr.o
 obj-$(CONFIG_NET_IPIP) += ipip.o
 obj-$(CONFIG_NET_IPGRE_DEMUX) += gre.o
 obj-$(CONFIG_NET_IPGRE) += ip_gre.o
+obj-$(CONFIG_NET_IPVTI) += ip_vti.o
 obj-$(CONFIG_SYN_COOKIES) += syncookies.o
 obj-$(CONFIG_INET_AH) += ah4.o
 obj-$(CONFIG_INET_ESP) += esp4.o
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
new file mode 100644
index 0000000..052a25e
--- /dev/null
+++ b/net/ipv4/ip_vti.c
@@ -0,0 +1,968 @@
+/*
+ *	Linux NET3:	IP/IP protocol decoder modified to support virtual tunnel interface
+ *
+ *	Authors:
+ *		Saurabh Mohan (saurabh.mohan@vyatta.com) 05/07/2012
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ */
+
+/*
+   This version of net/ipv4/ip_vti.c is cloned of net/ipv4/ipip.c
+
+   For comments look at net/ipv4/ip_gre.c --ANK
+ */
+
+
+#include <linux/capability.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_arp.h>
+#include <linux/mroute.h>
+#include <linux/init.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/if_ether.h>
+
+#include <net/sock.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/ipip.h>
+#include <net/inet_ecn.h>
+#include <net/xfrm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define HASH_SIZE  16
+#define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
+
+static struct rtnl_link_ops vti_link_ops __read_mostly;
+
+static int vti_net_id __read_mostly;
+struct vti_net {
+	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_wc[1];
+	struct ip_tunnel **tunnels[4];
+
+	struct net_device *fb_tunnel_dev;
+};
+
+static int vti_fb_tunnel_init(struct net_device *dev);
+static int vti_tunnel_init(struct net_device *dev);
+static void vti_tunnel_setup(struct net_device *dev);
+static void vti_dev_free(struct net_device *dev);
+static int vti_tunnel_bind_dev(struct net_device *dev);
+
+/*
+ * Locking : hash tables are protected by RCU and RTNL
+ */
+
+#define for_each_ip_tunnel_rcu(start) \
+	for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
+
+/* often modified stats are per cpu, other are shared (netdev->stats) */
+struct pcpu_tstats {
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
+	struct	u64_stats_sync	syncp;
+};
+
+#define VTI_XMIT(stats1, stats2) do {				\
+	int err;						\
+	int pkt_len = skb->len;					\
+	err = dst_output(skb);					\
+	if (net_xmit_eval(err) == 0) {				\
+		(stats1)->tx_bytes += pkt_len;			\
+		(stats1)->tx_packets++;				\
+	} else {						\
+		(stats2)->tx_errors++;				\
+		(stats2)->tx_aborted_errors++;			\
+	}							\
+} while (0)
+
+
+static struct rtnl_link_stats64 *vti_get_stats64(struct net_device *dev,
+					       struct rtnl_link_stats64 *tot)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_tstats *tstats = per_cpu_ptr(dev->tstats, i);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_bh(&tstats->syncp);
+			rx_packets = tstats->rx_packets;
+			tx_packets = tstats->tx_packets;
+			rx_bytes = tstats->rx_bytes;
+			tx_bytes = tstats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&tstats->syncp, start));
+
+		tot->rx_packets += rx_packets;
+		tot->tx_packets += tx_packets;
+		tot->rx_bytes   += rx_bytes;
+		tot->tx_bytes   += tx_bytes;
+	}
+
+	tot->multicast = dev->stats.multicast;
+	tot->rx_crc_errors = dev->stats.rx_crc_errors;
+	tot->rx_fifo_errors = dev->stats.rx_fifo_errors;
+	tot->rx_length_errors = dev->stats.rx_length_errors;
+	tot->rx_errors = dev->stats.rx_errors;
+	tot->tx_fifo_errors = dev->stats.tx_fifo_errors;
+	tot->tx_carrier_errors = dev->stats.tx_carrier_errors;
+	tot->tx_dropped = dev->stats.tx_dropped;
+	tot->tx_aborted_errors = dev->stats.tx_aborted_errors;
+	tot->tx_errors = dev->stats.tx_errors;
+
+	return tot;
+}
+
+static struct ip_tunnel *vti_tunnel_lookup(struct net *net,
+					 __be32 remote, __be32 local)
+{
+	unsigned h0 = HASH(remote);
+	unsigned h1 = HASH(local);
+	struct ip_tunnel *t;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_r_l[h0 ^ h1])
+		if (local == t->parms.iph.saddr &&
+		    remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+	for_each_ip_tunnel_rcu(ipn->tunnels_r[h0])
+		if (remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_l[h1])
+		if (local == t->parms.iph.saddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_wc[0])
+		if (t && (t->dev->flags&IFF_UP))
+			return t;
+	return NULL;
+}
+
+static struct ip_tunnel **__vti_bucket(struct vti_net *ipn,
+				     struct ip_tunnel_parm *parms)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	unsigned h = 0;
+	int prio = 0;
+
+	if (remote) {
+		prio |= 2;
+		h ^= HASH(remote);
+	}
+	if (local) {
+		prio |= 1;
+		h ^= HASH(local);
+	}
+	return &ipn->tunnels[prio][h];
+}
+
+static inline struct ip_tunnel **vti_bucket(struct vti_net *ipn,
+					  struct ip_tunnel *t)
+{
+	return __vti_bucket(ipn, &t->parms);
+}
+
+static void vti_tunnel_unlink(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp;
+	struct ip_tunnel *iter;
+
+	for (tp = vti_bucket(ipn, t);
+	     (iter = rtnl_dereference(*tp)) != NULL;
+	     tp = &iter->next) {
+		if (t == iter) {
+			rcu_assign_pointer(*tp, t->next);
+			break;
+		}
+	}
+}
+
+static void vti_tunnel_link(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp = vti_bucket(ipn, t);
+
+	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
+	rcu_assign_pointer(*tp, t);
+}
+
+static struct ip_tunnel *vti_tunnel_locate(struct net *net,
+					 struct ip_tunnel_parm *parms,
+					 int create)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	struct ip_tunnel *t, *nt;
+	struct ip_tunnel __rcu **tp;
+	struct net_device *dev;
+	char name[IFNAMSIZ];
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for (tp = __vti_bucket(ipn, parms);
+	     (t = rtnl_dereference(*tp)) != NULL;
+	     tp = &t->next) {
+		if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr)
+			return t;
+	}
+	if (!create)
+		return NULL;
+
+	if (parms->name[0])
+		strlcpy(name, parms->name, IFNAMSIZ);
+	else
+		strcpy(name, "vti%d");
+
+	dev = alloc_netdev(sizeof(*t), name, vti_tunnel_setup);
+	if (dev == NULL)
+		return NULL;
+
+	dev_net_set(dev, net);
+
+	nt = netdev_priv(dev);
+	nt->parms = *parms;
+	dev->rtnl_link_ops = &vti_link_ops;
+
+	vti_tunnel_bind_dev(dev);
+
+	if (register_netdevice(dev) < 0)
+		goto failed_free;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+	return nt;
+
+ failed_free:
+	free_netdev(dev);
+	return NULL;
+}
+
+static void vti_tunnel_uninit(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	if (dev == ipn->fb_tunnel_dev)
+		RCU_INIT_POINTER(ipn->tunnels_wc[0], NULL);
+	else
+		vti_tunnel_unlink(ipn, netdev_priv(dev));
+	dev_put(dev);
+}
+
+static int vti_err(struct sk_buff *skb, u32 info)
+{
+
+	/* All the routers (except for Linux) return only
+	 * 8 bytes of packet payload. It means, that precise relaying of
+	 * ICMP in the real Internet is absolutely infeasible.
+	 */
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	const int type = icmp_hdr(skb)->type;
+	const int code = icmp_hdr(skb)->code;
+	struct ip_tunnel *t;
+	int err;
+
+	switch (type) {
+	default:
+	case ICMP_PARAMETERPROB:
+		return 0;
+
+	case ICMP_DEST_UNREACH:
+		switch (code) {
+		case ICMP_SR_FAILED:
+		case ICMP_PORT_UNREACH:
+			/* Impossible event. */
+			return 0;
+		case ICMP_FRAG_NEEDED:
+			/* Soft state for pmtu is maintained by IP core. */
+			return 0;
+		default:
+			/* All others are translated to HOST_UNREACH. */
+			break;
+		}
+		break;
+	case ICMP_TIME_EXCEEDED:
+		if (code != ICMP_EXC_TTL)
+			return 0;
+		break;
+	}
+
+	err = -ENOENT;
+
+	rcu_read_lock();
+	t = vti_tunnel_lookup(dev_net(skb->dev), iph->daddr, iph->saddr);
+	if (t == NULL || t->parms.iph.daddr == 0)
+		goto out;
+
+	err = 0;
+	if (t->parms.iph.ttl == 0 && type == ICMP_TIME_EXCEEDED)
+		goto out;
+
+	if (time_before(jiffies, t->err_time + IPTUNNEL_ERR_TIMEO))
+		t->err_count++;
+	else
+		t->err_count = 1;
+	t->err_time = jiffies;
+out:
+	rcu_read_unlock();
+	return err;
+}
+
+/*
+ * We don't digest the packet, therefore let the packet pass.
+ */
+static int vti_rcv(struct sk_buff *skb)
+{
+	struct ip_tunnel *tunnel;
+	const struct iphdr *iph = ip_hdr(skb);
+
+	rcu_read_lock();
+	tunnel = vti_tunnel_lookup(dev_net(skb->dev), iph->saddr, iph->daddr);
+	if (tunnel != NULL) {
+		struct pcpu_tstats *tstats;
+
+		tstats = this_cpu_ptr(tunnel->dev->tstats);
+		tstats->rx_packets++;
+		tstats->rx_bytes += skb->len;
+
+		skb->dev = tunnel->dev;
+		rcu_read_unlock();
+		/* We do not eat the packet here therefore return 1 */
+		return 1;
+	}
+	rcu_read_unlock();
+
+	return -1;
+}
+
+/*
+ *	This function assumes it is being called from dev_queue_xmit()
+ *	and that skb is filled properly by that function.
+ */
+
+static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct pcpu_tstats *tstats;
+	struct net_device_stats *stats = &tunnel->dev->stats;
+	struct iphdr  *tiph = &tunnel->parms.iph;
+	u8     tos = tunnel->parms.iph.tos;
+	struct rtable *rt;		/* Route to the other host */
+	struct net_device *tdev;	/* Device to other host */
+	struct iphdr  *old_iph = ip_hdr(skb);
+	__be32 dst = tiph->daddr;
+	struct flowi4 fl4;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		goto tx_error;
+
+	if (tos&1)
+		tos = old_iph->tos;
+
+	if (!dst) {
+		/* NBMA tunnel */
+		rt = skb_rtable(skb);
+		if (rt == NULL) {
+			stats->tx_fifo_errors++;
+			goto tx_error;
+		}
+		dst = rt->rt_gateway;
+		if (dst == 0)
+			goto tx_error_icmp;
+	}
+
+	memset(&fl4, 0, sizeof(fl4));
+	flowi4_init_output(&fl4, tunnel->parms.link,
+		htonl(tunnel->parms.i_key), RT_TOS(tos), RT_SCOPE_UNIVERSE,
+		IPPROTO_IPIP, 0,
+		dst, tiph->saddr, 0, 0);
+	rt = ip_route_output_key(dev_net(dev), &fl4);
+	if (IS_ERR(rt)) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+#ifdef CONFIG_XFRM
+		/* if there is no transform then this tunnel is not functional. */
+		if (!rt->dst.xfrm) {
+			stats->tx_carrier_errors++;
+			goto tx_error_icmp;
+		}
+#endif
+	tdev = rt->dst.dev;
+
+	if (tdev == dev) {
+		ip_rt_put(rt);
+		stats->collisions++;
+		goto tx_error;
+
+	}
+
+
+	if (tunnel->err_count > 0) {
+		if (time_before(jiffies,
+				tunnel->err_time + IPTUNNEL_ERR_TIMEO)) {
+			tunnel->err_count--;
+			dst_link_failure(skb);
+		} else
+			tunnel->err_count = 0;
+	}
+
+
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+	nf_reset(skb);
+	skb->dev = skb_dst(skb)->dev;
+
+	tstats = this_cpu_ptr(dev->tstats);
+	VTI_XMIT(tstats, &dev->stats);
+	return NETDEV_TX_OK;
+
+tx_error_icmp:
+	dst_link_failure(skb);
+tx_error:
+	stats->tx_errors++;
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int vti_tunnel_bind_dev(struct net_device *dev)
+{
+	struct net_device *tdev = NULL;
+	struct ip_tunnel *tunnel;
+	struct iphdr *iph;
+
+	tunnel = netdev_priv(dev);
+	iph = &tunnel->parms.iph;
+
+	if (iph->daddr) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+		memset(&fl4, 0, sizeof(fl4));
+		flowi4_init_output(&fl4, tunnel->parms.link,
+				htonl(tunnel->parms.i_key), RT_TOS(iph->tos), RT_SCOPE_UNIVERSE,
+				IPPROTO_IPIP, 0,
+				iph->daddr, iph->saddr, 0, 0);
+		rt = ip_route_output_key(dev_net(dev), &fl4);
+		if (!IS_ERR(rt)) {
+			tdev = rt->dst.dev;
+			ip_rt_put(rt);
+		}
+		dev->flags |= IFF_POINTOPOINT;
+	}
+
+	if (!tdev && tunnel->parms.link)
+		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+
+	if (tdev) {
+		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
+		dev->mtu = tdev->mtu;
+	}
+	dev->iflink = tunnel->parms.link;
+	return dev->mtu;
+}
+
+static int
+vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+	int err = 0;
+	struct ip_tunnel_parm p;
+	struct ip_tunnel *t;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	switch (cmd) {
+	case SIOCGETTUNNEL:
+		t = NULL;
+		if (dev == ipn->fb_tunnel_dev) {
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p))) {
+				err = -EFAULT;
+				break;
+			}
+			t = vti_tunnel_locate(net, &p, 0);
+		}
+		if (t == NULL)
+			t = netdev_priv(dev);
+		memcpy(&p, &t->parms, sizeof(p));
+		p.i_flags |= GRE_KEY;
+		p.o_flags |= GRE_KEY;
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
+			err = -EFAULT;
+		break;
+
+	case SIOCADDTUNNEL:
+	case SIOCCHGTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		err = -EFAULT;
+		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+			goto done;
+
+		err = -EINVAL;
+		if (p.iph.version != 4 || p.iph.protocol != IPPROTO_IPIP ||
+		    p.iph.ihl != 5 || (p.iph.frag_off&htons(~IP_DF)))
+			goto done;
+		if (p.iph.ttl)
+			p.iph.frag_off |= htons(IP_DF);
+
+		t = vti_tunnel_locate(net, &p, cmd == SIOCADDTUNNEL);
+
+		if (dev != ipn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+			if (t != NULL) {
+				if (t->dev != dev) {
+					err = -EEXIST;
+					break;
+				}
+			} else {
+				if (((dev->flags&IFF_POINTOPOINT) && !p.iph.daddr) ||
+				    (!(dev->flags&IFF_POINTOPOINT) && p.iph.daddr)) {
+					err = -EINVAL;
+					break;
+				}
+				t = netdev_priv(dev);
+				vti_tunnel_unlink(ipn, t);
+				synchronize_net();
+				t->parms.iph.saddr = p.iph.saddr;
+				t->parms.iph.daddr = p.iph.daddr;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				t->parms.iph.protocol = IPPROTO_IPIP;
+				memcpy(dev->dev_addr, &p.iph.saddr, 4);
+				memcpy(dev->broadcast, &p.iph.daddr, 4);
+				vti_tunnel_link(ipn, t);
+				netdev_state_change(dev);
+			}
+		}
+
+		if (t) {
+			err = 0;
+			if (cmd == SIOCCHGTUNNEL) {
+				t->parms.iph.ttl = p.iph.ttl;
+				t->parms.iph.tos = p.iph.tos;
+				t->parms.iph.frag_off = p.iph.frag_off;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				if (t->parms.link != p.link) {
+					t->parms.link = p.link;
+					vti_tunnel_bind_dev(dev);
+					netdev_state_change(dev);
+				}
+			}
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &t->parms, sizeof(p)))
+				err = -EFAULT;
+		} else
+			err = (cmd == SIOCADDTUNNEL ? -ENOBUFS : -ENOENT);
+		break;
+
+	case SIOCDELTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		if (dev == ipn->fb_tunnel_dev) {
+			err = -EFAULT;
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+				goto done;
+			err = -ENOENT;
+
+			t = vti_tunnel_locate(net, &p, 0);
+			if (t == NULL)
+				goto done;
+			err = -EPERM;
+			if (t->dev == ipn->fb_tunnel_dev)
+				goto done;
+			dev = t->dev;
+		}
+		unregister_netdevice(dev);
+		err = 0;
+		break;
+
+	default:
+		err = -EINVAL;
+	}
+
+done:
+	return err;
+}
+
+static int vti_tunnel_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < 68 || new_mtu > 0xFFF8)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops vti_netdev_ops = {
+	.ndo_init	= vti_tunnel_init,
+	.ndo_uninit	= vti_tunnel_uninit,
+	.ndo_start_xmit	= vti_tunnel_xmit,
+	.ndo_do_ioctl	= vti_tunnel_ioctl,
+	.ndo_change_mtu	= vti_tunnel_change_mtu,
+	.ndo_get_stats64  = vti_get_stats64,
+};
+
+static void vti_dev_free(struct net_device *dev)
+{
+	free_percpu(dev->tstats);
+	free_netdev(dev);
+}
+
+static void vti_tunnel_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &vti_netdev_ops;
+	dev->destructor		= vti_dev_free;
+
+	dev->type		= ARPHRD_TUNNEL;
+	dev->hard_header_len	= LL_MAX_HEADER + sizeof(struct iphdr);
+	dev->mtu		= ETH_DATA_LEN;
+	dev->flags		= IFF_NOARP;
+	dev->iflink		= 0;
+	dev->addr_len		= 4;
+	dev->features		|= NETIF_F_NETNS_LOCAL;
+	dev->features		|= NETIF_F_LLTX;
+	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
+}
+
+static int vti_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
+	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __net_init vti_fb_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct iphdr *iph = &tunnel->parms.iph;
+	struct vti_net *ipn = net_generic(dev_net(dev), vti_net_id);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	iph->version		= 4;
+	iph->protocol		= IPPROTO_IPIP;
+	iph->ihl		= 5;
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	dev_hold(dev);
+	rcu_assign_pointer(ipn->tunnels_wc[0], tunnel);
+	return 0;
+}
+
+static struct xfrm_tunnel vti_handler __read_mostly = {
+	.handler	=	vti_rcv,
+	.err_handler	=	vti_err,
+	.priority	=	1,
+};
+
+static void vti_destroy_tunnels(struct vti_net *ipn, struct list_head *head)
+{
+	int prio;
+
+	for (prio = 1; prio < 4; prio++) {
+		int h;
+		for (h = 0; h < HASH_SIZE; h++) {
+			struct ip_tunnel *t;
+
+			t = rtnl_dereference(ipn->tunnels[prio][h]);
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = rtnl_dereference(t->next);
+			}
+		}
+	}
+}
+
+static int __net_init vti_init_net(struct net *net)
+{
+	int err;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	ipn->tunnels[0] = ipn->tunnels_wc;
+	ipn->tunnels[1] = ipn->tunnels_l;
+	ipn->tunnels[2] = ipn->tunnels_r;
+	ipn->tunnels[3] = ipn->tunnels_r_l;
+
+	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+					   "ip_vti0",
+					   vti_tunnel_setup);
+	if (!ipn->fb_tunnel_dev) {
+		err = -ENOMEM;
+		goto err_alloc_dev;
+	}
+	dev_net_set(ipn->fb_tunnel_dev, net);
+
+	err = vti_fb_tunnel_init(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	ipn->fb_tunnel_dev->rtnl_link_ops = &vti_link_ops;
+
+	err = register_netdev(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	return 0;
+
+err_reg_dev:
+	vti_dev_free(ipn->fb_tunnel_dev);
+err_alloc_dev:
+	/* nothing */
+	return err;
+}
+
+static void __net_exit vti_exit_net(struct net *net)
+{
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	LIST_HEAD(list);
+
+	rtnl_lock();
+	vti_destroy_tunnels(ipn, &list);
+	unregister_netdevice_many(&list);
+	rtnl_unlock();
+}
+
+static struct pernet_operations vti_net_ops = {
+	.init = vti_init_net,
+	.exit = vti_exit_net,
+	.id   = &vti_net_id,
+	.size = sizeof(struct vti_net),
+};
+
+static int vti_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	return 0;
+}
+
+static void vti_netlink_parms(struct nlattr *data[],
+				struct ip_tunnel_parm *parms)
+{
+	memset(parms, 0, sizeof(*parms));
+
+	parms->iph.protocol = IPPROTO_IPIP;
+
+	if (!data)
+		return;
+
+	if (data[IFLA_VTI_LINK])
+		parms->link = nla_get_u32(data[IFLA_VTI_LINK]);
+
+	if (data[IFLA_VTI_IKEY])
+		parms->i_key = nla_get_be32(data[IFLA_VTI_IKEY]);
+
+	if (data[IFLA_VTI_OKEY])
+		parms->o_key = nla_get_be32(data[IFLA_VTI_OKEY]);
+
+	if (data[IFLA_VTI_LOCAL])
+		parms->iph.saddr = nla_get_be32(data[IFLA_VTI_LOCAL]);
+
+	if (data[IFLA_VTI_REMOTE])
+		parms->iph.daddr = nla_get_be32(data[IFLA_VTI_REMOTE]);
+
+}
+
+static int vti_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[],
+			 struct nlattr *data[])
+{
+	struct ip_tunnel *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	int mtu;
+	int err;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &nt->parms);
+
+	if (vti_tunnel_locate(net, &nt->parms, 0))
+		return -EEXIST;
+
+	mtu = vti_tunnel_bind_dev(dev);
+	if (!tb[IFLA_MTU])
+		dev->mtu = mtu;
+
+	err = register_netdevice(dev);
+	if (err)
+		goto out;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+
+out:
+	return err;
+	return 0;
+}
+
+static int vti_changelink(struct net_device *dev, struct nlattr *tb[],
+			    struct nlattr *data[])
+{
+	struct ip_tunnel *t, *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	struct ip_tunnel_parm p;
+	int mtu;
+
+	if (dev == ipn->fb_tunnel_dev)
+		return -EINVAL;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &p);
+
+	t = vti_tunnel_locate(net, &p, 0);
+
+	if (t) {
+		if (t->dev != dev)
+			return -EEXIST;
+	} else {
+		t = nt;
+
+		vti_tunnel_unlink(ipn, t);
+		t->parms.iph.saddr = p.iph.saddr;
+		t->parms.iph.daddr = p.iph.daddr;
+		t->parms.i_key = p.i_key;
+		t->parms.o_key = p.o_key;
+		if (dev->type != ARPHRD_ETHER) {
+			memcpy(dev->dev_addr, &p.iph.saddr, 4);
+			memcpy(dev->broadcast, &p.iph.daddr, 4);
+		}
+		vti_tunnel_link(ipn, t);
+		netdev_state_change(dev);
+	}
+
+	if (t->parms.link != p.link) {
+		t->parms.link = p.link;
+		mtu = vti_tunnel_bind_dev(dev);
+		if (!tb[IFLA_MTU])
+			dev->mtu = mtu;
+		netdev_state_change(dev);
+	}
+
+	return 0;
+}
+
+static size_t vti_get_size(const struct net_device *dev)
+{
+	return
+		/* IFLA_VTI_LINK */
+		nla_total_size(4) +
+		/* IFLA_VTI_IKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_OKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_LOCAL */
+		nla_total_size(4) +
+		/* IFLA_VTI_REMOTE */
+		nla_total_size(4) +
+		0;
+}
+
+static int vti_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+	struct ip_tunnel_parm *p = &t->parms;
+
+	nla_put_u32(skb, IFLA_VTI_LINK, p->link);
+	nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key);
+	nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key);
+	nla_put_be32(skb, IFLA_VTI_LOCAL, p->iph.saddr);
+	nla_put_be32(skb, IFLA_VTI_REMOTE, p->iph.daddr);
+
+	return 0;
+}
+
+static const struct nla_policy vti_policy[IFLA_VTI_MAX + 1] = {
+	[IFLA_VTI_LINK]		= { .type = NLA_U32 },
+	[IFLA_VTI_IKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_OKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VTI_REMOTE]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+};
+
+static struct rtnl_link_ops vti_link_ops __read_mostly = {
+	.kind		= "vti",
+	.maxtype	= IFLA_VTI_MAX,
+	.policy		= vti_policy,
+	.priv_size	= sizeof(struct ip_tunnel),
+	.setup		= vti_tunnel_setup,
+	.validate	= vti_tunnel_validate,
+	.newlink	= vti_newlink,
+	.changelink	= vti_changelink,
+	.get_size	= vti_get_size,
+	.fill_info	= vti_fill_info,
+};
+
+static int __init vti_init(void)
+{
+	int err;
+
+	pr_info("IPv4 over IPSec tunneling driver\n");
+
+	err = register_pernet_device(&vti_net_ops);
+	if (err < 0)
+		return err;
+	err = xfrm4_mode_tunnel_input_register(&vti_handler);
+	if (err < 0) {
+		unregister_pernet_device(&vti_net_ops);
+		pr_info("vti init: can't register tunnel\n");
+	}
+
+	err = rtnl_link_register(&vti_link_ops);
+	if (err < 0)
+		goto rtnl_link_failed;
+
+	return err;
+
+rtnl_link_failed:
+	xfrm4_mode_tunnel_input_deregister(&vti_handler);
+	unregister_pernet_device(&vti_net_ops);
+	return err;
+}
+
+static void __exit vti_fini(void)
+{
+	rtnl_link_unregister(&vti_link_ops);
+	if (xfrm4_mode_tunnel_input_deregister(&vti_handler))
+		pr_info("vti close: can't deregister tunnel\n");
+
+	unregister_pernet_device(&vti_net_ops);
+}
+
+module_init(vti_init);
+module_exit(vti_fini);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK("vti");
+MODULE_ALIAS_NETDEV("ip_vti0");
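
The bucket selection done by __vti_bucket() in the patch above can be
tried outside the kernel. The following is only a user-space sketch under
stated assumptions, not the patch's code: vti_bucket_sketch() and
struct bucket_sketch are hypothetical names, and plain uint32_t values
stand in for __be32, so the bucket index depends on the byte order
assumed for the inputs.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SIZE 16
/* Same folding as the patch's HASH(): XOR the address with itself
 * shifted right by four bits, then keep the low four bits. */
#define HASH(addr) ((((uint32_t)(addr)) ^ (((uint32_t)(addr)) >> 4)) & 0xF)

/* Hypothetical result type: which table (prio) and which slot (h). */
struct bucket_sketch {
	int prio;		/* 0 = wildcard, 1 = local, 2 = remote, 3 = both */
	unsigned int h;		/* slot inside the chosen table */
};

/* Mirrors the prio/h computation of __vti_bucket(); a zero address
 * means "not configured" and contributes nothing. */
static struct bucket_sketch vti_bucket_sketch(uint32_t remote, uint32_t local)
{
	struct bucket_sketch b = { 0, 0 };

	if (remote) {
		b.prio |= 2;
		b.h ^= HASH(remote);
	}
	if (local) {
		b.prio |= 1;
		b.h ^= HASH(local);
	}
	assert(b.h < HASH_SIZE);
	return b;
}
```

With neither address set the result is prio 0, slot 0, which corresponds
to the single-entry wildcard table that holds the fallback device.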

^ permalink raw reply related

* Re: [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: David Miller @ 2012-06-28  1:19 UTC (permalink / raw)
  To: saurabh.mohan; +Cc: netdev
In-Reply-To: <20120628010218.GA4056@debian-saurabh-64.vyatta.com>

From: Saurabh <saurabh.mohan@vyatta.com>
Date: Wed, 27 Jun 2012 18:02:18 -0700

> +static int vti_err(struct sk_buff *skb, u32 info)

In net-next, individual ICMP error handlers must explicitly
handle PMTU messages.

Yours does not.
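
To illustrate the point, here is a rough user-space sketch of an error
handler that treats "fragmentation needed" as its own case instead of
returning early. It is not kernel code: classify_icmp_err(),
handle_icmp_err() and struct tunnel_sketch are hypothetical names, the
ICMP numbers are the standard protocol values, and info is assumed to
carry the next-hop MTU for that message. In the kernel the PMTU update
would go through the routing layer rather than a struct field.

```c
#include <assert.h>
#include <stdint.h>

/* Standard ICMP type and code values. */
enum {
	ICMP_DEST_UNREACH  = 3,
	ICMP_TIME_EXCEEDED = 11,
	ICMP_PARAMETERPROB = 12,
	ICMP_PORT_UNREACH  = 3,	/* code under ICMP_DEST_UNREACH */
	ICMP_FRAG_NEEDED   = 4,	/* code under ICMP_DEST_UNREACH */
	ICMP_SR_FAILED     = 5,	/* code under ICMP_DEST_UNREACH */
	ICMP_EXC_TTL       = 0,	/* code under ICMP_TIME_EXCEEDED */
};

enum err_action { ERR_IGNORE, ERR_UPDATE_PMTU, ERR_COUNT };

/* Decide what to do with an ICMP error; PMTU is an explicit case. */
static enum err_action classify_icmp_err(int type, int code)
{
	switch (type) {
	case ICMP_DEST_UNREACH:
		if (code == ICMP_FRAG_NEEDED)
			return ERR_UPDATE_PMTU;
		if (code == ICMP_SR_FAILED || code == ICMP_PORT_UNREACH)
			return ERR_IGNORE;
		return ERR_COUNT;
	case ICMP_TIME_EXCEEDED:
		return code == ICMP_EXC_TTL ? ERR_COUNT : ERR_IGNORE;
	default:
		return ERR_IGNORE;
	}
}

/* Hypothetical per-tunnel state for the sketch. */
struct tunnel_sketch {
	uint32_t pmtu;
	unsigned int err_count;
};

static int handle_icmp_err(struct tunnel_sketch *t, int type, int code,
			   uint32_t info)
{
	switch (classify_icmp_err(type, code)) {
	case ERR_UPDATE_PMTU:
		t->pmtu = info;		/* record the reported next-hop MTU */
		break;
	case ERR_COUNT:
		t->err_count++;		/* soft error accounting */
		break;
	case ERR_IGNORE:
		break;
	}
	return 0;
}
```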

^ permalink raw reply

* Re: [PATCH] can: flexcan: use be32_to_cpup to handle the value of dt entry
From: Hui Wang @ 2012-06-28  1:54 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Shawn Guo, davem, netdev, linux-can, Hui Wang
In-Reply-To: <4FEAEFC1.4060104@pengutronix.de>

Marc Kleine-Budde wrote:
> On 06/27/2012 01:26 PM, Shawn Guo wrote:
>   
>> On 27 June 2012 17:27, Marc Kleine-Budde <mkl@pengutronix.de> wrote:
>>     
>>> From: Hui Wang <jason77.wang@gmail.com>
>>>
>>> The freescale arm i.MX series platform can support this driver, and
>>> usually the arm cpu works in the little endian mode by default, while
>>> device tree entry value is stored in big endian format, we should use
>>> be32_to_cpup() to handle them, after modification, it can work well
>>> both on the le cpu and be cpu.
>>>
>>>       
>> I'm wondering if you want to just use of_property_read_u32() to make
>> it a little bit easier.
>>     
>
> Even better. Hui, can you send an updated patch?
>   
OK.

Regards,
Hui.
> Marc
>
>   
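
For reference, the endianness issue discussed in this thread can be shown
with a small user-space sketch. It is not the driver code:
be32_cell_to_cpu() is a hypothetical helper that does by hand what
be32_to_cpup() is described as doing above, and the clock value in the
note below is only an example.

```c
#include <assert.h>
#include <stdint.h>

/* A device tree property cell is four bytes, most significant first.
 * Assembling the value byte by byte gives the same result on little
 * endian and big endian CPUs. */
static uint32_t be32_cell_to_cpu(const uint8_t cell[4])
{
	return ((uint32_t)cell[0] << 24) |
	       ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8)  |
	        (uint32_t)cell[3];
}
```

Reading the raw cell through a plain u32 pointer on a little endian CPU
would instead yield the byte-swapped value, which is the bug the patch
addresses. For example, the bytes 01 6e 36 00 decode to 24000000.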


^ permalink raw reply

* linux-next: manual merge of the wireless-next tree with the net-next tree
From: Stephen Rothwell @ 2012-06-28  2:40 UTC (permalink / raw)
  To: John W. Linville
  Cc: linux-next, linux-kernel, Joe Perches, Franky Lin, David Miller,
	netdev

[-- Attachment #1: Type: text/plain, Size: 488 bytes --]

Hi John,

Today's linux-next merge of the wireless-next tree got a conflict in
drivers/net/wireless/brcm80211/brcmfmac/dhd_sdio.c between commit
2c208890c6d4 ("wireless: Remove casts to same type") from the net-next
tree and commit d610cde30b00 ("brcmfmac: use firmware data buffer
directly for nvram") from the wireless-next tree.

The latter removed the code modified by the former, so I used the latter.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
From: Jason Wang @ 2012-06-28  3:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, netdev, linux-kernel, krkumar2, tahm, akong, davem,
	shemminger, mashirle
In-Reply-To: <20120627084431.GC15406@redhat.com>

On 06/27/2012 04:44 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2012 at 01:16:30PM +0800, Jason Wang wrote:
>> On 06/26/2012 06:42 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 26, 2012 at 11:42:17AM +0800, Jason Wang wrote:
>>>> On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>>>>>> This patch adds multiqueue support for tap device. This is done by abstracting
>>>>>> each queue as a file/socket and allowing multiple sockets to be attached to the
>>>>>> tuntap device (an array of tun_file were stored in the tun_struct). Userspace
>>>>>> could write and read from those files to do the parallel packet
>>>>>> sending/receiving.
>>>>>>
>>>>>> Unlike the previous single queue implementation, the socket and device were
>>>>>> loosely coupled, each of them were allowed to go away first. In order to let the
>>>>>> tx path lockless, netif_tx_lock_bh() is replaced by RCU/NETIF_F_LLTX to
>>>>>> synchronize between data path and system call.
>>>>> Don't use LLTX/RCU. It's not worth it.
>>>>> Use something like netif_set_real_num_tx_queues.
>>>>>
>>>>>> The tx queue is selected first by the recorded rxq index of an skb; if
>>>>>> there is none, it is chosen by rx hashing (skb_get_rxhash()).
>>>>>>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> Interestingly macvtap switched to hashing first:
>>>>> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
>>>>> (the commit log is corrupted but see what it
>>>>> does in the patch).
>>>>> Any idea why?
>>>> Yes, so tap should be changed to behave the same as macvtap. I remember
>>>> the reason we do that is to make sure the packets of a single flow are
>>>> queued to a fixed socket/virtqueue, as 10g cards like ixgbe
>>>> choose the rx queue for a flow based on the last tx queue where the
>>>> packets of that flow came from. So if we use the recorded rx queue in
>>>> macvtap, the queue index of a flow would change as the vhost thread
>>>> moves among processors.
>>> Hmm. OTOH if you override this, if TX is sent from VCPU0, RX might land
>>> on VCPU1 in the guest, which is not good, right?
>> Yes, but better than making the rx moves between vcpus when we use
>> recorded rx queue.
> Why isn't this a problem with native TCP?
> I think what happens is one of the following:
> - moving between CPUs is more expensive with tun
>    because it can queue so much data on xmit
> - scheduler makes very bad decisions about VCPUs
>    bouncing them around all the time

For a usual native TCP/host process, since it reads and writes tcp
sockets, it makes sense to move rx to the processor the process moves
to. But vhost does not do tcp stuff, and ixgbe would still move rx
when the vhost process moves, and we can't even make sure the vhost
process that handles rx is running on the processor that handles the rx
interrupt.

> Could we isolate which it is? Does the problem
> still happen if you pin VCPUs to host cpus?
> If not it's the queue depth.

It may not help, as tun does not record the vcpu/queue that sent the
stream, so it can't transmit the packets back to the same vcpu/queue.
>> Flow steering is needed to make sure the tx and
>> rx on the same vcpu.
> That involves IPI between processes, so it might be
> very expensive for kvm.
>
>>>> But while testing tun/tap, one interesting thing I found is that even
>>>> though ixgbe has recorded the queue index during rx, it seems to be lost
>>>> when tap tries to transmit skbs to userspace.
>>> dev_pick_tx does this I think but ndo_select_queue
>>> should be able to get it without trouble.
>>>
>>>
>>>>>> ---
>>>>>>   drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
>>>>>>   1 files changed, 232 insertions(+), 139 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 8233b0a..5c26757 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -107,6 +107,8 @@ struct tap_filter {
>>>>>>   	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
>>>>>>   };
>>>>>>
>>>>>> +#define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
>>>>> Why the limit? I am guessing you copied this from macvtap?
>>>>> This is problematic for a number of reasons:
>>>>> 	- will not play well with migration
>>>>> 	- will not work well for a large guest
>>>>>
>>>>> Yes, macvtap needs to be fixed too.
>>>>>
>>>>> I am guessing what it is trying to prevent is queueing
>>>>> up a huge number of packets?
>>>>> So just divide the default tx queue limit by the # of queues.
>>>>>
>>>>> And by the way, for MQ applications maybe we can finally
>>>>> ignore tx queue altogether and limit the total number
>>>>> of bytes queued?
>>>>> To avoid regressions we can make it large like 64M/# queues.
>>>>> Could be a separate patch I think, and for a single queue
>>>>> might need a compatible mode though I am not sure.
>>>>>
>>>>>> +
>>>>>>   struct tun_file {
>>>>>>   	struct sock sk;
>>>>>>   	struct socket socket;
>>>>>> @@ -114,16 +116,18 @@ struct tun_file {
>>>>>>   	int vnet_hdr_sz;
>>>>>>   	struct tap_filter txflt;
>>>>>>   	atomic_t count;
>>>>>> -	struct tun_struct *tun;
>>>>>> +	struct tun_struct __rcu *tun;
>>>>>>   	struct net *net;
>>>>>>   	struct fasync_struct *fasync;
>>>>>>   	unsigned int flags;
>>>>>> +	u16 queue_index;
>>>>>>   };
>>>>>>
>>>>>>   struct tun_sock;
>>>>>>
>>>>>>   struct tun_struct {
>>>>>> -	struct tun_file		*tfile;
>>>>>> +	struct tun_file		*tfiles[MAX_TAP_QUEUES];
>>>>>> +	unsigned int            numqueues;
>>>>>>   	unsigned int 		flags;
>>>>>>   	uid_t			owner;
>>>>>>   	gid_t			group;
>>>>>> @@ -138,80 +142,159 @@ struct tun_struct {
>>>>>>   #endif
>>>>>>   };
>>>>>>
>>>>>> -static int tun_attach(struct tun_struct *tun, struct file *file)
>>>>>> +static DEFINE_SPINLOCK(tun_lock);
>>>>>> +
>>>>>> +/*
>>>>>> + * tun_get_queue(): calculate the queue index
>>>>>> + *     - if skbs comes from mq nics, we can just borrow
>>>>>> + *     - if not, calculate from the hash
>>>>>> + */
>>>>>> +static struct tun_file *tun_get_queue(struct net_device *dev,
>>>>>> +				      struct sk_buff *skb)
>>>>>>   {
>>>>>> -	struct tun_file *tfile = file->private_data;
>>>>>> -	int err;
>>>>>> +	struct tun_struct *tun = netdev_priv(dev);
>>>>>> +	struct tun_file *tfile = NULL;
>>>>>> +	int numqueues = tun->numqueues;
>>>>>> +	__u32 rxq;
>>>>>>
>>>>>> -	ASSERT_RTNL();
>>>>>> +	BUG_ON(!rcu_read_lock_held());
>>>>>>
>>>>>> -	netif_tx_lock_bh(tun->dev);
>>>>>> +	if (!numqueues)
>>>>>> +		goto out;
>>>>>>
>>>>>> -	err = -EINVAL;
>>>>>> -	if (tfile->tun)
>>>>>> +	if (numqueues == 1) {
>>>>>> +		tfile = rcu_dereference(tun->tfiles[0]);
>>>>> Instead of hacks like this, you can ask for an MQ
>>>>> flag to be set in SETIFF. Then you won't need to
>>>>> handle attach/detach at random times.
>>>>> And most of the scary num_queues checks can go away.
>>>>> You can then also ask userspace about the max # of queues
>>>>> to expect if you want to save some memory.
>>>>>
>>>>>
>>>>>>   		goto out;
>>>>>> +	}
>>>>>>
>>>>>> -	err = -EBUSY;
>>>>>> -	if (tun->tfile)
>>>>>> +	if (likely(skb_rx_queue_recorded(skb))) {
>>>>>> +		rxq = skb_get_rx_queue(skb);
>>>>>> +
>>>>>> +		while (unlikely(rxq >= numqueues))
>>>>>> +			rxq -= numqueues;
>>>>>> +
>>>>>> +		tfile = rcu_dereference(tun->tfiles[rxq]);
>>>>>>   		goto out;
>>>>>> +	}
>>>>>>
>>>>>> -	err = 0;
>>>>>> -	tfile->tun = tun;
>>>>>> -	tun->tfile = tfile;
>>>>>> -	netif_carrier_on(tun->dev);
>>>>>> -	dev_hold(tun->dev);
>>>>>> -	sock_hold(&tfile->sk);
>>>>>> -	atomic_inc(&tfile->count);
>>>>>> +	/* Check if we can use flow to select a queue */
>>>>>> +	rxq = skb_get_rxhash(skb);
>>>>>> +	if (rxq) {
>>>>>> +		u32 idx = ((u64)rxq * numqueues) >> 32;
>>>>> This completely confuses me. What's the logic here?
>>>>> How do we even know it's in range?
>>>>>
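
On the range question, here is a minimal user-space sketch of the multiply-shift
reduction that the quoted line appears to use. Reading the 32-bit hash as the
fraction hash / 2^32 in [0, 1) and multiplying by the queue count gives a value
in [0, numqueues), so the index stays in range provided the count is non-zero
and does not change between the computation and the array access. The helper
names scale_hash() and fold_rxq() are hypothetical and are not part of the
patch; the sketch says nothing about whether the selected slot is still
populated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: map a 32-bit hash onto
 * [0, n) without a division.  hash / 2^32 lies in [0, 1), so
 * (hash * n) / 2^32 lies in [0, n). */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Hypothetical helper: the recorded-rx-queue path folds the index by
 * repeated subtraction, which yields the same result as rxq % n. */
static uint32_t fold_rxq(uint32_t rxq, uint32_t n)
{
	assert(n > 0);
	while (rxq >= n)
		rxq -= n;
	return rxq;
}
```

For 16 queues the largest hash, 0xffffffff, scales to index 15 and 0x80000000
scales to index 8, so the top of the hash range lands on the last queue rather
than past it.
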
>>>>>> +		tfile = rcu_dereference(tun->tfiles[idx]);
>>>>>> +		goto out;
>>>>>> +	}
>>>>>>
>>>>>> +	tfile = rcu_dereference(tun->tfiles[0]);
>>>>>>   out:
>>>>>> -	netif_tx_unlock_bh(tun->dev);
>>>>>> -	return err;
>>>>>> +	return tfile;
>>>>>>   }
>>>>>>
>>>>>> -static void __tun_detach(struct tun_struct *tun)
>>>>>> +static int tun_detach(struct tun_file *tfile, bool clean)
>>>>>>   {
>>>>>> -	struct tun_file *tfile = tun->tfile;
>>>>>> -	/* Detach from net device */
>>>>>> -	netif_tx_lock_bh(tun->dev);
>>>>>> -	netif_carrier_off(tun->dev);
>>>>>> -	tun->tfile = NULL;
>>>>>> -	netif_tx_unlock_bh(tun->dev);
>>>>>> -
>>>>>> -	/* Drop read queue */
>>>>>> -	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
>>>>>> -
>>>>>> -	/* Drop the extra count on the net device */
>>>>>> -	dev_put(tun->dev);
>>>>>> -}
>>>>>> +	struct tun_struct *tun;
>>>>>> +	struct net_device *dev = NULL;
>>>>>> +	bool destroy = false;
>>>>>>
>>>>>> -static void tun_detach(struct tun_struct *tun)
>>>>>> -{
>>>>>> -	rtnl_lock();
>>>>>> -	__tun_detach(tun);
>>>>>> -	rtnl_unlock();
>>>>>> -}
>>>>>> +	spin_lock(&tun_lock);
>>>>>>
>>>>>> -static struct tun_struct *__tun_get(struct tun_file *tfile)
>>>>>> -{
>>>>>> -	struct tun_struct *tun = NULL;
>>>>>> +	tun = rcu_dereference_protected(tfile->tun,
>>>>>> +					lockdep_is_held(&tun_lock));
>>>>>> +	if (tun) {
>>>>>> +		u16 index = tfile->queue_index;
>>>>>> +		BUG_ON(index >= tun->numqueues);
>>>>>> +		dev = tun->dev;
>>>>>> +
>>>>>> +		rcu_assign_pointer(tun->tfiles[index],
>>>>>> +				   tun->tfiles[tun->numqueues - 1]);
>>>>>> +		tun->tfiles[index]->queue_index = index;
>>>>>> +		rcu_assign_pointer(tfile->tun, NULL);
>>>>>> +		--tun->numqueues;
>>>>>> +		sock_put(&tfile->sk);
>>>>>>
>>>>>> -	if (atomic_inc_not_zero(&tfile->count))
>>>>>> -		tun = tfile->tun;
>>>>>> +		if (tun->numqueues == 0 && !(tun->flags & TUN_PERSIST))
>>>>>> +			destroy = true;
>>>>> Please don't use flags like that. Use dedicated labels and goto there on error.
>>>>>
>>>>>
>>>>>> +	}
>>>>>>
>>>>>> -	return tun;
>>>>>> +	spin_unlock(&tun_lock);
>>>>>> +
>>>>>> +	synchronize_rcu();
>>>>>> +	if (clean)
>>>>>> +		sock_put(&tfile->sk);
>>>>>> +
>>>>>> +	if (destroy) {
>>>>>> +		rtnl_lock();
>>>>>> +		if (dev->reg_state == NETREG_REGISTERED)
>>>>>> +			unregister_netdevice(dev);
>>>>>> +		rtnl_unlock();
>>>>>> +	}
>>>>>> +
>>>>>> +	return 0;
>>>>>>   }
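
The detach path quoted above keeps the attached queues packed at the front of
the array by moving the last entry into the vacated slot. A minimal user-space
sketch of that bookkeeping follows, with no locking or RCU. All names here
(struct queue_slot, struct queue_table, MAX_QUEUES, table_add(),
table_remove()) are hypothetical and chosen for illustration only. Unlike the
quoted code, the sketch also clears the stale tail pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES 16	/* hypothetical bound for this sketch */

struct queue_slot {
	unsigned int index;	/* position of this slot in slots[] */
};

struct queue_table {
	struct queue_slot *slots[MAX_QUEUES];
	unsigned int count;	/* slots[0..count-1] are populated */
};

/* Append a slot; returns 0 on success, -1 if the table is full. */
static int table_add(struct queue_table *t, struct queue_slot *s)
{
	if (t->count == MAX_QUEUES)
		return -1;
	s->index = t->count;
	t->slots[t->count++] = s;
	return 0;
}

/* Remove a slot in O(1): the last entry takes over the vacated index,
 * so the populated entries stay contiguous. */
static void table_remove(struct queue_table *t, struct queue_slot *s)
{
	unsigned int idx = s->index;

	assert(t->count > 0 && idx < t->count && t->slots[idx] == s);
	t->slots[idx] = t->slots[t->count - 1];
	t->slots[idx]->index = idx;
	t->slots[t->count - 1] = NULL;
	t->count--;
}
```

The cost of this layout is that a queue's index is not stable: removing one
queue renumbers whichever queue was last.
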
>>>>>>
>>>>>> -static struct tun_struct *tun_get(struct file *file)
>>>>>> +static void tun_detach_all(struct net_device *dev)
>>>>>>   {
>>>>>> -	return __tun_get(file->private_data);
>>>>>> +	struct tun_struct *tun = netdev_priv(dev);
>>>>>> +	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
>>>>>> +	int i, j = 0;
>>>>>> +
>>>>>> +	spin_lock(&tun_lock);
>>>>>> +
>>>>>> +	for (i = 0; i < MAX_TAP_QUEUES && tun->numqueues; i++) {
>>>>>> +		tfile = rcu_dereference_protected(tun->tfiles[i],
>>>>>> +						lockdep_is_held(&tun_lock));
>>>>>> +		BUG_ON(!tfile);
>>>>>> +		wake_up_all(&tfile->wq.wait);
>>>>>> +		tfile_list[j++] = tfile;
>>>>>> +		rcu_assign_pointer(tfile->tun, NULL);
>>>>>> +		--tun->numqueues;
>>>>>> +	}
>>>>>> +	BUG_ON(tun->numqueues != 0);
>>>>>> +	/* guarantee that any future tun_attach will fail */
>>>>>> +	tun->numqueues = MAX_TAP_QUEUES;
>>>>>> +	spin_unlock(&tun_lock);
>>>>>> +
>>>>>> +	synchronize_rcu();
>>>>>> +	for (--j; j >= 0; j--)
>>>>>> +		sock_put(&tfile_list[j]->sk);
>>>>>>   }
>>>>>>
>>>>>> -static void tun_put(struct tun_struct *tun)
>>>>>> +static int tun_attach(struct tun_struct *tun, struct file *file)
>>>>>>   {
>>>>>> -	struct tun_file *tfile = tun->tfile;
>>>>>> +	struct tun_file *tfile = file->private_data;
>>>>>> +	int err;
>>>>>> +
>>>>>> +	ASSERT_RTNL();
>>>>>> +
>>>>>> +	spin_lock(&tun_lock);
>>>>>>
>>>>>> -	if (atomic_dec_and_test(&tfile->count))
>>>>>> -		tun_detach(tfile->tun);
>>>>>> +	err = -EINVAL;
>>>>>> +	if (rcu_dereference_protected(tfile->tun, lockdep_is_held(&tun_lock)))
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	err = -EBUSY;
>>>>>> +	if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	if (tun->numqueues == MAX_TAP_QUEUES)
>>>>>> +		goto out;
>>>>>> +
>>>>>> +	err = 0;
>>>>>> +	tfile->queue_index = tun->numqueues;
>>>>>> +	rcu_assign_pointer(tfile->tun, tun);
>>>>>> +	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
>>>>>> +	sock_hold(&tfile->sk);
>>>>>> +	tun->numqueues++;
>>>>>> +
>>>>>> +	if (tun->numqueues == 1)
>>>>>> +		netif_carrier_on(tun->dev);
>>>>>> +
>>>>>> +	/* device is allowed to go away first, so no need to hold extra
>>>>>> +	 * refcnt. */
>>>>>> +
>>>>>> +out:
>>>>>> +	spin_unlock(&tun_lock);
>>>>>> +	return err;
>>>>>>   }
>>>>>>
>>>>>>   /* TAP filtering */
>>>>>> @@ -331,16 +414,7 @@ static const struct ethtool_ops tun_ethtool_ops;
>>>>>>   /* Net device detach from fd. */
>>>>>>   static void tun_net_uninit(struct net_device *dev)
>>>>>>   {
>>>>>> -	struct tun_struct *tun = netdev_priv(dev);
>>>>>> -	struct tun_file *tfile = tun->tfile;
>>>>>> -
>>>>>> -	/* Inform the methods they need to stop using the dev.
>>>>>> -	 */
>>>>>> -	if (tfile) {
>>>>>> -		wake_up_all(&tfile->wq.wait);
>>>>>> -		if (atomic_dec_and_test(&tfile->count))
>>>>>> -			__tun_detach(tun);
>>>>>> -	}
>>>>>> +	tun_detach_all(dev);
>>>>>>   }
>>>>>>
>>>>>>   /* Net device open. */
>>>>>> @@ -360,10 +434,10 @@ static int tun_net_close(struct net_device *dev)
>>>>>>   /* Net device start xmit */
>>>>>>   static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>   {
>>>>>> -	struct tun_struct *tun = netdev_priv(dev);
>>>>>> -	struct tun_file *tfile = tun->tfile;
>>>>>> +	struct tun_file *tfile = NULL;
>>>>>>
>>>>>> -	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
>>>>>> +	rcu_read_lock();
>>>>>> +	tfile = tun_get_queue(dev, skb);
>>>>>>
>>>>>>   	/* Drop packet if interface is not attached */
>>>>>>   	if (!tfile)
>>>>>> @@ -381,7 +455,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>
>>>>>>   	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
>>>>>>   	>= dev->tx_queue_len) {
>>>>>> -		if (!(tun->flags & TUN_ONE_QUEUE)) {
>>>>>> +		if (!(tfile->flags & TUN_ONE_QUEUE) &&
>>>>> Which patch moved flags from tun to tfile?
>>>>>
>>>>>> +		    !(tfile->flags & TUN_TAP_MQ)) {
>>>>>>   			/* Normal queueing mode. */
>>>>>>   			/* Packet scheduler handles dropping of further packets. */
>>>>>>   			netif_stop_queue(dev);
>>>>>> @@ -390,7 +465,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>   			 * error is more appropriate. */
>>>>>>   			dev->stats.tx_fifo_errors++;
>>>>>>   		} else {
>>>>>> -			/* Single queue mode.
>>>>>> +			/* Single queue mode or multi queue mode.
>>>>>>   			 * Driver handles dropping of all packets itself. */
>>>>> Please don't do this. Stop the queue on overrun as appropriate.
>>>>> ONE_QUEUE is a legacy hack.
>>>>>
>>>>> BTW we really should stop queue before we start dropping packets,
>>>>> but that can be a separate patch.
>>>>>
>>>>>>   			goto drop;
>>>>>>   		}
>>>>>> @@ -408,9 +483,11 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>>>   		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>>>>>   	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
>>>>>>   				   POLLRDNORM | POLLRDBAND);
>>>>>> +	rcu_read_unlock();
>>>>>>   	return NETDEV_TX_OK;
>>>>>>
>>>>>>   drop:
>>>>>> +	rcu_read_unlock();
>>>>>>   	dev->stats.tx_dropped++;
>>>>>>   	kfree_skb(skb);
>>>>>>   	return NETDEV_TX_OK;
>>>>>> @@ -527,16 +604,22 @@ static void tun_net_init(struct net_device *dev)
>>>>>>   static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>>>>>   {
>>>>>>   	struct tun_file *tfile = file->private_data;
>>>>>> -	struct tun_struct *tun = __tun_get(tfile);
>>>>>> +	struct tun_struct *tun = NULL;
>>>>>>   	struct sock *sk;
>>>>>>   	unsigned int mask = 0;
>>>>>>
>>>>>> -	if (!tun)
>>>>>> +	if (!tfile)
>>>>>>   		return POLLERR;
>>>>>>
>>>>>> -	sk = tfile->socket.sk;
>>>>>> +	rcu_read_lock();
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>> +	if (!tun) {
>>>>>> +		rcu_read_unlock();
>>>>>> +		return POLLERR;
>>>>>> +	}
>>>>>> +	rcu_read_unlock();
>>>>>>
>>>>>> -	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
>>>>>> +	sk =&tfile->sk;
>>>>>>
>>>>>>   	poll_wait(file,&tfile->wq.wait, wait);
>>>>>>
>>>>>> @@ -548,10 +631,12 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>>>>>   	     sock_writeable(sk)))
>>>>>>   		mask |= POLLOUT | POLLWRNORM;
>>>>>>
>>>>>> -	if (tun->dev->reg_state != NETREG_REGISTERED)
>>>>>> +	rcu_read_lock();
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>> +	if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
>>>>>>   		mask = POLLERR;
>>>>>> +	rcu_read_unlock();
>>>>>>
>>>>>> -	tun_put(tun);
>>>>>>   	return mask;
>>>>>>   }
>>>>>>
>>>>>> @@ -708,9 +793,12 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>>>>>   		skb_shinfo(skb)->gso_segs = 0;
>>>>>>   	}
>>>>>>
>>>>>> -	tun = __tun_get(tfile);
>>>>>> -	if (!tun)
>>>>>> +	rcu_read_lock();
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>> +	if (!tun) {
>>>>>> +		rcu_read_unlock();
>>>>>>   		return -EBADFD;
>>>>>> +	}
>>>>>>
>>>>>>   	switch (tfile->flags&    TUN_TYPE_MASK) {
>>>>>>   	case TUN_TUN_DEV:
>>>>>> @@ -720,26 +808,30 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>>>>>   		skb->protocol = eth_type_trans(skb, tun->dev);
>>>>>>   		break;
>>>>>>   	}
>>>>>> -
>>>>>> -	netif_rx_ni(skb);
>>>>>>   	tun->dev->stats.rx_packets++;
>>>>>>   	tun->dev->stats.rx_bytes += len;
>>>>>> -	tun_put(tun);
>>>>>> +	rcu_read_unlock();
>>>>>> +
>>>>>> +	netif_rx_ni(skb);
>>>>>> +
>>>>>>   	return count;
>>>>>>
>>>>>>   err_free:
>>>>>>   	count = -EINVAL;
>>>>>>   	kfree_skb(skb);
>>>>>>   err:
>>>>>> -	tun = __tun_get(tfile);
>>>>>> -	if (!tun)
>>>>>> +	rcu_read_lock();
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>> +	if (!tun) {
>>>>>> +		rcu_read_unlock();
>>>>>>   		return -EBADFD;
>>>>>> +	}
>>>>>>
>>>>>>   	if (drop)
>>>>>>   		tun->dev->stats.rx_dropped++;
>>>>>>   	if (error)
>>>>>>   		tun->dev->stats.rx_frame_errors++;
>>>>>> -	tun_put(tun);
>>>>>> +	rcu_read_unlock();
>>>>>>   	return count;
>>>>>>   }
>>>>>>
>>>>>> @@ -833,12 +925,13 @@ static ssize_t tun_put_user(struct tun_file *tfile,
>>>>>>   	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
>>>>>>   	total += skb->len;
>>>>>>
>>>>>> -	tun = __tun_get(tfile);
>>>>>> +	rcu_read_lock();
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>>   	if (tun) {
>>>>>>   		tun->dev->stats.tx_packets++;
>>>>>>   		tun->dev->stats.tx_bytes += len;
>>>>>> -		tun_put(tun);
>>>>>>   	}
>>>>>> +	rcu_read_unlock();
>>>>>>
>>>>>>   	return total;
>>>>>>   }
>>>>>> @@ -869,28 +962,31 @@ static ssize_t tun_do_read(struct tun_file *tfile,
>>>>>>   				break;
>>>>>>   			}
>>>>>>
>>>>>> -			tun = __tun_get(tfile);
>>>>>> +			rcu_read_lock();
>>>>>> +			tun = rcu_dereference(tfile->tun);
>>>>>>   			if (!tun) {
>>>>>> -				ret = -EIO;
>>>>>> +				ret = -EBADFD;
>>>>> BADFD is for when you get passed something like -1 fd.
>>>>> Here fd is OK, it's just in a bad state so you can not do IO.
>>>>>
>>>>>
>>>>>> +				rcu_read_unlock();
>>>>>>   				break;
>>>>>>   			}
>>>>>>   			if (tun->dev->reg_state != NETREG_REGISTERED) {
>>>>>>   				ret = -EIO;
>>>>>> -				tun_put(tun);
>>>>>> +				rcu_read_unlock();
>>>>>>   				break;
>>>>>>   			}
>>>>>> -			tun_put(tun);
>>>>>> +			rcu_read_unlock();
>>>>>>
>>>>>>   			/* Nothing to read, let's sleep */
>>>>>>   			schedule();
>>>>>>   			continue;
>>>>>>   		}
>>>>>>
>>>>>> -		tun = __tun_get(tfile);
>>>>>> +		rcu_read_lock();
>>>>>> +		tun = rcu_dereference(tfile->tun);
>>>>>>   		if (tun) {
>>>>>>   			netif_wake_queue(tun->dev);
>>>>>> -			tun_put(tun);
>>>>>>   		}
>>>>>> +		rcu_read_unlock();
>>>>>>
>>>>>>   		ret = tun_put_user(tfile, skb, iv, len);
>>>>>>   		kfree_skb(skb);
>>>>>> @@ -1038,6 +1134,9 @@ static int tun_flags(struct tun_struct *tun)
>>>>>>   	if (tun->flags&    TUN_VNET_HDR)
>>>>>>   		flags |= IFF_VNET_HDR;
>>>>>>
>>>>>> +	if (tun->flags&    TUN_TAP_MQ)
>>>>>> +		flags |= IFF_MULTI_QUEUE;
>>>>>> +
>>>>>>   	return flags;
>>>>>>   }
>>>>>>
>>>>>> @@ -1097,8 +1196,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>>>   		err = tun_attach(tun, file);
>>>>>>   		if (err<    0)
>>>>>>   			return err;
>>>>>> -	}
>>>>>> -	else {
>>>>>> +	} else {
>>>>>>   		char *name;
>>>>>>   		unsigned long flags = 0;
>>>>>>
>>>>>> @@ -1142,6 +1240,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>>>   		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>>>>>   			TUN_USER_FEATURES;
>>>>>>   		dev->features = dev->hw_features;
>>>>>> +		if (ifr->ifr_flags&    IFF_MULTI_QUEUE)
>>>>>> +			dev->features |= NETIF_F_LLTX;
>>>>>>
>>>>>>   		err = register_netdevice(tun->dev);
>>>>>>   		if (err<    0)
>>>>>> @@ -1154,7 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>>>
>>>>>>   		err = tun_attach(tun, file);
>>>>>>   		if (err<    0)
>>>>>> -			goto failed;
>>>>>> +			goto err_free_dev;
>>>>>>   	}
>>>>>>
>>>>>>   	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
>>>>>> @@ -1174,6 +1274,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>>>   	else
>>>>>>   		tun->flags&= ~TUN_VNET_HDR;
>>>>>>
>>>>>> +	if (ifr->ifr_flags&    IFF_MULTI_QUEUE)
>>>>>> +		tun->flags |= TUN_TAP_MQ;
>>>>>> +	else
>>>>>> +		tun->flags&= ~TUN_TAP_MQ;
>>>>>> +
>>>>>>   	/* Cache flags from tun device */
>>>>>>   	tfile->flags = tun->flags;
>>>>>>   	/* Make sure persistent devices do not get stuck in
>>>>>> @@ -1187,7 +1292,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>>>
>>>>>>   err_free_dev:
>>>>>>   	free_netdev(dev);
>>>>>> -failed:
>>>>>>   	return err;
>>>>>>   }
>>>>>>
>>>>>> @@ -1264,38 +1368,40 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>>>>>   				(unsigned int __user*)argp);
>>>>>>   	}
>>>>>>
>>>>>> -	rtnl_lock();
>>>>>> -
>>>>>> -	tun = __tun_get(tfile);
>>>>>> -	if (cmd == TUNSETIFF&&    !tun) {
>>>>>> +	ret = 0;
>>>>>> +	if (cmd == TUNSETIFF) {
>>>>>> +		rtnl_lock();
>>>>>>   		ifr.ifr_name[IFNAMSIZ-1] = '\0';
>>>>>> -
>>>>>>   		ret = tun_set_iff(tfile->net, file,&ifr);
>>>>>> -
>>>>>> +		rtnl_unlock();
>>>>>>   		if (ret)
>>>>>> -			goto unlock;
>>>>>> -
>>>>>> +			return ret;
>>>>>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>>>>> -			ret = -EFAULT;
>>>>>> -		goto unlock;
>>>>>> +			return -EFAULT;
>>>>>> +		return ret;
>>>>>>   	}
>>>>>>
>>>>>> +	rtnl_lock();
>>>>>> +
>>>>>> +	rcu_read_lock();
>>>>>> +
>>>>>>   	ret = -EBADFD;
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>>   	if (!tun)
>>>>>>   		goto unlock;
>>>>>> +	else
>>>>>> +		ret = 0;
>>>>>>
>>>>>> -	tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
>>>>>> -
>>>>>> -	ret = 0;
>>>>>>   	switch (cmd) {
>>>>>>   	case TUNGETIFF:
>>>>>>   		ret = tun_get_iff(current->nsproxy->net_ns, tun,&ifr);
>>>>>> +		rcu_read_unlock();
>>>>>>   		if (ret)
>>>>>> -			break;
>>>>>> +			goto out;
>>>>>>
>>>>>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>>>>>   			ret = -EFAULT;
>>>>>> -		break;
>>>>>> +		goto out;
>>>>>>
>>>>>>   	case TUNSETNOCSUM:
>>>>>>   		/* Disable/Enable checksum */
>>>>>> @@ -1357,9 +1463,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>>>>>   		/* Get hw address */
>>>>>>   		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
>>>>>>   		ifr.ifr_hwaddr.sa_family = tun->dev->type;
>>>>>> +		rcu_read_unlock();
>>>>>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>>>>>   			ret = -EFAULT;
>>>>>> -		break;
>>>>>> +		goto out;
>>>>>>
>>>>>>   	case SIOCSIFHWADDR:
>>>>>>   		/* Set hw address */
>>>>>> @@ -1375,9 +1482,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>>>>>   	}
>>>>>>
>>>>>>   unlock:
>>>>>> +	rcu_read_unlock();
>>>>>> +out:
>>>>>>   	rtnl_unlock();
>>>>>> -	if (tun)
>>>>>> -		tun_put(tun);
>>>>>>   	return ret;
>>>>>>   }
>>>>>>
>>>>>> @@ -1517,6 +1624,11 @@ out:
>>>>>>   	return ret;
>>>>>>   }
>>>>>>
>>>>>> +static void tun_sock_destruct(struct sock *sk)
>>>>>> +{
>>>>>> +	skb_queue_purge(&sk->sk_receive_queue);
>>>>>> +}
>>>>>> +
>>>>>>   static int tun_chr_open(struct inode *inode, struct file * file)
>>>>>>   {
>>>>>>   	struct net *net = current->nsproxy->net_ns;
>>>>>> @@ -1540,6 +1652,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>>>>>   	sock_init_data(&tfile->socket,&tfile->sk);
>>>>>>
>>>>>>   	tfile->sk.sk_write_space = tun_sock_write_space;
>>>>>> +	tfile->sk.sk_destruct = tun_sock_destruct;
>>>>>>   	tfile->sk.sk_sndbuf = INT_MAX;
>>>>>>   	file->private_data = tfile;
>>>>>>
>>>>>> @@ -1549,31 +1662,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>>>>>   static int tun_chr_close(struct inode *inode, struct file *file)
>>>>>>   {
>>>>>>   	struct tun_file *tfile = file->private_data;
>>>>>> -	struct tun_struct *tun;
>>>>>> -
>>>>>> -	tun = __tun_get(tfile);
>>>>>> -	if (tun) {
>>>>>> -		struct net_device *dev = tun->dev;
>>>>>> -
>>>>>> -		tun_debug(KERN_INFO, tun, "tun_chr_close\n");
>>>>>> -
>>>>>> -		__tun_detach(tun);
>>>>>> -
>>>>>> -		/* If desirable, unregister the netdevice. */
>>>>>> -		if (!(tun->flags&    TUN_PERSIST)) {
>>>>>> -			rtnl_lock();
>>>>>> -			if (dev->reg_state == NETREG_REGISTERED)
>>>>>> -				unregister_netdevice(dev);
>>>>>> -			rtnl_unlock();
>>>>>> -		}
>>>>>>
>>>>>> -		/* drop the reference that netdevice holds */
>>>>>> -		sock_put(&tfile->sk);
>>>>>> -
>>>>>> -	}
>>>>>> -
>>>>>> -	/* drop the reference that file holds */
>>>>>> -	sock_put(&tfile->sk);
>>>>>> +	tun_detach(tfile, true);
>>>>>>
>>>>>>   	return 0;
>>>>>>   }
>>>>>> @@ -1700,14 +1790,17 @@ static void tun_cleanup(void)
>>>>>>    * holding a reference to the file for as long as the socket is in use. */
>>>>>>   struct socket *tun_get_socket(struct file *file)
>>>>>>   {
>>>>>> -	struct tun_struct *tun;
>>>>>> +	struct tun_struct *tun = NULL;
>>>>>>   	struct tun_file *tfile = file->private_data;
>>>>>>   	if (file->f_op !=&tun_fops)
>>>>>>   		return ERR_PTR(-EINVAL);
>>>>>> -	tun = tun_get(file);
>>>>>> -	if (!tun)
>>>>>> +	rcu_read_lock();
>>>>>> +	tun = rcu_dereference(tfile->tun);
>>>>>> +	if (!tun) {
>>>>>> +		rcu_read_unlock();
>>>>>>   		return ERR_PTR(-EBADFD);
>>>>>> -	tun_put(tun);
>>>>>> +	}
>>>>>> +	rcu_read_unlock();
>>>>>>   	return&tfile->socket;
>>>>>>   }
>>>>>>   EXPORT_SYMBOL_GPL(tun_get_socket);

^ permalink raw reply

* Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
From: Jason Wang @ 2012-06-28  3:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, netdev, linux-kernel, krkumar2, tahm, akong, davem,
	shemminger, mashirle, Eric Dumazet
In-Reply-To: <20120627082635.GB15406@redhat.com>

On 06/27/2012 04:26 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2012 at 01:59:37PM +0800, Jason Wang wrote:
>> On 06/26/2012 07:54 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 26, 2012 at 01:52:57PM +0800, Jason Wang wrote:
>>>> On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>>>>>> This patch adds multiqueue support for tap device. This is done by abstracting
>>>>>> each queue as a file/socket and allowing multiple sockets to be attached to the
>>>>>> tuntap device (an array of tun_file were stored in the tun_struct). Userspace
>>>>>> could write and read from those files to do the parallel packet
>>>>>> sending/receiving.
>>>>>>
>>>>>> Unlike the previous single queue implementation, the socket and device were
>>>>>> loosely coupled, each of them were allowed to go away first. In order to let the
>>>>>> tx path lockless, netif_tx_lock_bh() is replaced by RCU/NETIF_F_LLTX to
>>>>>> synchronize between data path and system call.
>>>>> Don't use LLTX/RCU. It's not worth it.
>>>>> Use something like netif_set_real_num_tx_queues.
>>>>>
>>>> For LLTX, maybe it's better to convert it to alloc_netdev_mq() to
>>>> let the kernel see all queues and make the queue stopping and
>>>> per-queue stats easier.
>>>> RCU is used to handle the attaching/detaching when tun/tap is
>>>> sending and receiving packets, which looks reasonable to me.
>>> Yes but do we have to allow this? How about we always ask
>>> userspace to attach to all active queues?
>> Attaching/detaching is a method to activate/deactivate a queue. If all
>> queues were kept attached, we would need another method or flag to mark
>> a queue as activated/deactivated, and would still need to synchronize
>> with the data path.
> This is what I am trying to say: use an interface flag for
> multiqueue. When it is set activate all queues attached.
> When unset deactivate all queues except the default one.
>
>
>>>> Not
>>>> sure netif_set_real_num_tx_queues() can help in this situation.
>>> Check it out.
>>>
>>>>>> The tx queue is selected first from the recorded rxq index of the skb; if
>>>>>> there is none, it is chosen by rx hashing (skb_get_rxhash()).
>>>>>>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> Interestingly macvtap switched to hashing first:
>>>>> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
>>>>> (the commit log is corrupted but see what it
>>>>> does in the patch).
>>>>> Any idea why?
>>>>>
>>>>>> ---
>>>>>>   drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
>>>>>>   1 files changed, 232 insertions(+), 139 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 8233b0a..5c26757 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -107,6 +107,8 @@ struct tap_filter {
>>>>>>   	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
>>>>>>   };
>>>>>>
>>>>>> +#define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
>>>>> Why the limit? I am guessing you copied this from macvtap?
>>>>> This is problematic for a number of reasons:
>>>>> 	- will not play well with migration
>>>>> 	- will not work well for a large guest
>>>>>
>>>>> Yes, macvtap needs to be fixed too.
>>>>>
>>>>> I am guessing what it is trying to prevent is queueing
>>>>> up a huge number of packets?
>>>>> So just divide the default tx queue limit by the # of queues.
>>>> Not sure,
>>>> other reasons I can guess:
>>>> - to prevent storing a large array of pointers in tun_struct or macvlan_dev.
>>> OK so with the limit of e.g. 1024 we'd allocate at most
>>> 2 pages of memory. This doesn't look too bad. 1024 is probably a
>>> high enough limit: modern hypervisors seem to support on the order
>>> of 100-200 CPUs so this leaves us some breathing space
>>> if we want to match a queue per guest CPU.
>>> Of course we need to limit the packets per queue
>>> in such a setup more aggressively. 1000 packets * 1000 queues
>>> * 64K per packet is too much.
>>>
>>>> - it may not be suitable to allow the number of virtqueues greater
>>>> than the number of physical queues in the card
>>> Maybe for macvtap, here we have no idea which card we
>>> are working with and how many queues it has.
>>>
>>>>> And by the way, for MQ applications maybe we can finally
>>>>> ignore tx queue altogether and limit the total number
>>>>> of bytes queued?
>>>>> To avoid regressions we can make it large like 64M/# queues.
>>>>> Could be a separate patch I think, and for a single queue
>>>>> might need a compatible mode though I am not sure.
>>>> Could you explain more about this?
>>>> Did you mean to have a total
>>>> sndbuf for all sockets that attached to tun/tap?
>>> Consider that we currently limit the # of
>>> packets queued at tun for xmit to userspace.
>>> Some limit is needed but # of packets sounds
>>> very silly - limiting the total memory
>>> might be more reasonable.
>>>
>>> In case of multiqueue, we really care about
>>> total # of packets or total memory, but a simple
>>> approximation could be to divide the allocation
>>> between active queues equally.
>> A possible method is to divide the TUN_READQ_SIZE by #queues, but
>> keep it at least equal to the vring size (256).
> I would not enforce any limit actually.
> Simply divide by # of queues, and
> fail if userspace tries to attach > queue size packets.
>
> With 1000 queues this is 64Mbyte worst case as is.
> If someone wants to allow userspace to drink
> 256 times as much that is 16Giga byte per
> single device, let the user tweak tx queue len.
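
To put numbers on the estimates above, here is a minimal sketch, assuming a
device-wide limit of 1000 packets (the figure used earlier in this thread) and
a 64 KiB maximum packet size. The helper names worst_case_bytes() and
per_queue_limit() are hypothetical; nothing here is read from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case bytes queued across all queues of one device.
 * All parameters are illustrative inputs supplied by the caller. */
static uint64_t worst_case_bytes(uint64_t pkts_per_queue, uint64_t nqueues,
				 uint64_t max_pkt_bytes)
{
	return pkts_per_queue * nqueues * max_pkt_bytes;
}

/* Per-queue share when a device-wide packet limit is divided equally. */
static uint64_t per_queue_limit(uint64_t total_pkts, uint64_t nqueues)
{
	assert(nqueues > 0);
	return total_pkts / nqueues;
}
```

With 1000 queues, per_queue_limit(1000, 1000) is 1 packet, and
worst_case_bytes(1, 1000, 65536) is 65,536,000 bytes, which is roughly the
64 Mbyte quoted above. At 256 packets per queue the same device could hold
16,777,216,000 bytes (about 16 Gbyte), and at an undivided 1000 packets per
queue about 65.5 Gbyte.
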
>
>
>
>>> qdisc also queues some packets, that logic is
>>> using # of packets anyway. So either make that
>>> 1000/# queues, or even set to 0 as Eric once
>>> suggested.
>>>
>>>>>> +
>>>>>>   struct tun_file {
>>>>>>   	struct sock sk;
>>>>>>   	struct socket socket;
>>>>>> @@ -114,16 +116,18 @@ struct tun_file {
>>>>>>   	int vnet_hdr_sz;
>>>>>>   	struct tap_filter txflt;
>>>>>>   	atomic_t count;
>>>>>> -	struct tun_struct *tun;
>>>>>> +	struct tun_struct __rcu *tun;
>>>>>>   	struct net *net;
>>>>>>   	struct fasync_struct *fasync;
>>>>>>   	unsigned int flags;
>>>>>> +	u16 queue_index;
>>>>>>   };
>>>>>>
>>>>>>   struct tun_sock;
>>>>>>
>>>>>>   struct tun_struct {
>>>>>> -	struct tun_file		*tfile;
>>>>>> +	struct tun_file		*tfiles[MAX_TAP_QUEUES];
>>>>>> +	unsigned int            numqueues;
>>>>>>   	unsigned int 		flags;
>>>>>>   	uid_t			owner;
>>>>>>   	gid_t			group;
>>>>>> @@ -138,80 +142,159 @@ struct tun_struct {
>>>>>>   #endif
>>>>>>   };
>>>>>>
>>>>>> -static int tun_attach(struct tun_struct *tun, struct file *file)
>>>>>> +static DEFINE_SPINLOCK(tun_lock);
>>>>>> +
>>>>>> +/*
>>>>>> + * tun_get_queue(): calculate the queue index
>>>>>> + *     - if skbs comes from mq nics, we can just borrow
>>>>>> + *     - if not, calculate from the hash
>>>>>> + */
>>>>>> +static struct tun_file *tun_get_queue(struct net_device *dev,
>>>>>> +				      struct sk_buff *skb)
>>>>>>   {
>>>>>> -	struct tun_file *tfile = file->private_data;
>>>>>> -	int err;
>>>>>> +	struct tun_struct *tun = netdev_priv(dev);
>>>>>> +	struct tun_file *tfile = NULL;
>>>>>> +	int numqueues = tun->numqueues;
>>>>>> +	__u32 rxq;
>>>>>>
>>>>>> -	ASSERT_RTNL();
>>>>>> +	BUG_ON(!rcu_read_lock_held());
>>>>>>
>>>>>> -	netif_tx_lock_bh(tun->dev);
>>>>>> +	if (!numqueues)
>>>>>> +		goto out;
>>>>>>
>>>>>> -	err = -EINVAL;
>>>>>> -	if (tfile->tun)
>>>>>> +	if (numqueues == 1) {
>>>>>> +		tfile = rcu_dereference(tun->tfiles[0]);
>>>>> Instead of hacks like this, you can ask for an MQ
>>>>> flag to be set in SETIFF. Then you won't need to
>>>>> handle attach/detach at random times.
>>>> Consider a user switching between an sq guest and an mq guest: qemu would
>>>> attach or detach the fd, which the kernel could not anticipate.
>>> Can't userspace keep it attached always, just deactivate MQ?
>>>
>>>>> And most of the scary num_queues checks can go away.
>>>> Even if we have an MQ flag, userspace could still attach just one queue
>>>> to the device.
>>> I think we allow too much flexibility if we let
>>> userspace detach a random queue.
>> The point is to give tun/tap the same flexibility as macvtap.
>> Macvtap allows adding/deleting queues at any time, and it's very easy to
>> add detach/attach to macvtap. So we can easily use almost the same
>> ioctls to activate/deactivate a queue at any time for both tap and
>> macvtap.
> Yes but userspace does not do this in practice:
> it decides how many queues and just activates them all.

The considerations here, I think, are:

- We export file descriptors to userspace, so any of the files could
be closed at any time, which the kernel cannot anticipate.
- It makes it easy to let tap and macvtap have the same ioctls.
>
>
[...]

^ permalink raw reply

* Re: [PATCH net-next 1/2] bnx2: Add "fall through" comments
From: David Miller @ 2012-06-28  4:28 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1340845704-12580-5-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Wed, 27 Jun 2012 18:08:23 -0700

> to indicate that the missing break statements are intended.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/2] bnx2: Add missing netif_tx_disable() in bnx2_close()
From: David Miller @ 2012-06-28  4:28 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1340845704-12580-6-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Wed, 27 Jun 2012 18:08:24 -0700

> to stop all tx queues.  Update version to 2.2.3.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 1/4] cnic: Fix occasional NULL pointer dereference during reboot.
From: David Miller @ 2012-06-28  4:28 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1340845704-12580-1-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Wed, 27 Jun 2012 18:08:19 -0700

> We register with bnx2x before we allocate ctx_tbl structure, so it is
> possible for bnx2x to call cnic_ctl before the structure is allocated.
> This can sometimes cause NULL pointer dereference of cp->ctx_tbl.  We
> fix this by adding simple checking for valid state before proceeding.
> The cnic_ctl call is RCU protected so we don't have to deal with race
> conditions.
> 
> Because of the additional checking, we need to finish the shutdown
> before clearing the CNIC_UP flag.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/4] cnic: Read bnx2x function number from internal register
From: David Miller @ 2012-06-28  4:28 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1340845704-12580-2-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Wed, 27 Jun 2012 18:08:20 -0700

> From: Eddie Wai <eddie.wai@broadcom.com>
> 
> so that it will work on any hypervisor.
> 
> Signed-off-by: Eddie Wai <eddie.wai@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/4] cnic: Remove uio mem[0].
From: David Miller @ 2012-06-28  4:29 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1340845704-12580-3-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Wed, 27 Jun 2012 18:08:21 -0700

> This memory region is no longer used.  Userspace gets the BAR address
> directly from sysfs.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 4/4] cnic: Handle RAMROD_CMD_ID_CLOSE error.
From: David Miller @ 2012-06-28  4:29 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1340845704-12580-4-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Wed, 27 Jun 2012 18:08:22 -0700

> From: Eddie Wai <eddie.wai@broadcom.com>
> 
> If firmware returns error status, proceed to close the iSCSI connection.
> Update version to 2.5.11.
> 
> Signed-off-by: Eddie Wai <eddie.wai@broadcom.com>
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

* Re: [PATCH v2 0/4] netdev/phy: 10G PHY support.
From: David Miller @ 2012-06-28  4:29 UTC (permalink / raw)
  To: ddaney.cavm
  Cc: grant.likely, rob.herring, devicetree-discuss, netdev,
	linux-kernel, linux-mips, afleming, david.daney
In-Reply-To: <1340818418-10382-1-git-send-email-ddaney.cavm@gmail.com>

From: David Daney <ddaney.cavm@gmail.com>
Date: Wed, 27 Jun 2012 10:33:34 -0700

> From: David Daney <david.daney@cavium.com>
> 
> The only non-cosmetic change from v1 is to pass an additional argument
> to get_phy_device() that indicates that the PHY uses 802.3 clause 45
> signaling.  Previously I had been using a high-order bit of the addr
> parameter for this.
> 
> There are also changes from v1 in the code and comment formatting.
> These should now be closer to what David Miller prefers.

Applied, but I had to add the following warning fixup:

--------------------
phy: Fix warning in get_phy_device().

drivers/net/phy/phy_device.c: In function ‘get_phy_device’:
drivers/net/phy/phy_device.c:340:14: warning: ‘phy_id’ may be used uninitialized in this function [-Wmaybe-uninitialized]

GCC can't see that when get_phy_id() returns zero we always initialize
phy_id, and that's the only path where we use it.

Initialize phy_id to zero to shut it up.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/phy/phy_device.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index ef4cdee..47e02e7 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -327,9 +327,9 @@ static int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id,
  */
 struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45)
 {
-	struct phy_device *dev = NULL;
-	u32 phy_id;
 	struct phy_c45_device_ids c45_ids = {0};
+	struct phy_device *dev = NULL;
+	u32 phy_id = 0;
 	int r;
 
 	r = get_phy_id(bus, addr, &phy_id, is_c45, &c45_ids);
-- 
1.7.10.2
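
The warning pattern described in the changelog above can be reproduced in
isolation. The sketch below is illustrative only: demo_read_id() and
demo_lookup() are hypothetical names, not the phy_device.c functions. The
helper writes its output parameter only on success, and the caller reads
the value only on that same path, which the compiler may fail to prove.
Whether GCC actually warns depends on its version and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes *id only on success (return value 0). */
static int demo_read_id(int addr, uint32_t *id)
{
	assert(id != NULL);

	if (addr < 0)
		return -1; /* failure: *id is left untouched */

	*id = 0x1234u + (uint32_t)addr;
	return 0;
}

static uint32_t demo_lookup(int addr)
{
	/* Initialized only to quiet -Wmaybe-uninitialized. */
	uint32_t id = 0;

	if (demo_read_id(addr, &id))
		return 0; /* error path: id is never read */

	return id; /* success path: id was set by the helper */
}
```

Dropping the "= 0" initializer does not change behaviour here, since id is
read only after the helper has succeeded; the initializer exists purely to
silence the diagnostic.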


* Re: [patch net-next] virtio_net: allow to change mac when iface is running
From: David Miller @ 2012-06-28  4:30 UTC (permalink / raw)
  To: jpirko; +Cc: netdev, virtualization, brouer, mst
In-Reply-To: <1340810866-1017-1-git-send-email-jpirko@redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Wed, 27 Jun 2012 17:27:46 +0200

> Signed-off-by: Jiri Pirko <jpirko@redhat.com>

Applied, but this seriously makes eth_mac_addr() completely useless.

Technically, every eth_mac_addr() user in a software/virtual device
should behave the way virtio_net does now.

It therefore probably makes sense to add a boolean arg which, when true,
elides the netif_running() check, and then fix up and audit every caller.
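
A rough sketch of what such a boolean argument could look like follows. It
is a userspace mock under stated assumptions, not the kernel's
eth_mac_addr(): struct demo_netdev, demo_set_mac() and the allow_running
parameter are hypothetical names, the error values are placeholders, and
the address validity check is reduced to rejecting multicast addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ALEN 6

/* Hypothetical stand-in for a net device; not struct net_device. */
struct demo_netdev {
	bool running;
	unsigned char addr[DEMO_ALEN];
};

/*
 * When allow_running is true the "interface is up" check is skipped,
 * which is what a software/virtual device would request.
 */
static int demo_set_mac(struct demo_netdev *dev, const unsigned char *mac,
			bool allow_running)
{
	assert(dev != NULL && mac != NULL);

	if (!allow_running && dev->running)
		return -1; /* placeholder for a "device busy" error */
	if (mac[0] & 0x01)
		return -2; /* multicast bit set: not a valid station address */

	memcpy(dev->addr, mac, DEMO_ALEN);
	return 0;
}
```

A hardware driver would pass false and keep the existing behaviour, while
a virtual device would pass true to allow the change on a running
interface.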

