Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH net-next] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Simon Horman @ 2012-06-29  0:34 UTC (permalink / raw)
  To: Xiaotian Feng
  Cc: netdev, lvs-devel, netfilter-devel, netfilter, coreteam,
	linux-kernel, Xiaotian Feng, Wensong Zhang, Julian Anastasov,
	Pablo Neira Ayuso, Patrick McHardy, David S. Miller
In-Reply-To: <1340890587-8169-1-git-send-email-xtfeng@gmail.com>

On Thu, Jun 28, 2012 at 09:36:27PM +0800, Xiaotian Feng wrote:
> We met a kernel panic in 2.6.32.43 kernel:
> 
> [2680191.848044] IPVS: ip_vs_conn_hash(): request for already hashed, called from run_timer_softirq+0x175/0x1d0
> <snip>
> [2680311.849009] general protection fault: 0000 [#1] SMP
> [2680311.853001] RIP: 0010:[<ffffffff815f155c>]  [<ffffffff815f155c>] ip_vs_conn_expire+0xdc/0x2f0
> [2680311.853001] RSP: 0018:ffff880028303e70  EFLAGS: 00010202
> [2680311.853001] RAX: dead000000200200 RBX: ffff8801aad00b80 RCX: 0000000000001d90
> [2680311.853001] RDX: dead000000100100 RSI: 000000004fd59800 RDI: ffff8801aad00c08
> <snip>
> [2680311.853001] Call Trace:
> [2680311.853001]  <IRQ>
> [2680311.853001]  [<ffffffff815f1480>] ? ip_vs_conn_expire+0x0/0x2f0
> [2680311.853001]  [<ffffffff8104e2a5>] run_timer_softirq+0x175/0x1d0
> [2680311.853001]  [<ffffffff81021a48>] ? lapic_next_event+0x18/0x20
> [2680311.853001]  [<ffffffff81049a13>] __do_softirq+0xb3/0x150
> [2680311.853001]  [<ffffffff8100cc5c>] call_softirq+0x1c/0x30
> [2680311.853001]  [<ffffffff8100ea9a>] do_softirq+0x4a/0x80
> [2680311.853001]  [<ffffffff81049957>] irq_exit+0x77/0x80
> [2680311.853001]  [<ffffffff81021f2c>] smp_apic_timer_interrupt+0x6c/0xa0
> [2680311.853001]  [<ffffffff8100c633>] apic_timer_interrupt+0x13/0x20
> [2680311.853001]  <EOI>
> [2680311.853001]  [<ffffffff81013b52>] ? mwait_idle+0x52/0x70
> [2680311.853001]  [<ffffffff8100a7b0>] ? enter_idle+0x20/0x30
> [2680311.853001]  [<ffffffff8100ac62>] ? cpu_idle+0x52/0x80
> [2680311.853001]  [<ffffffff816d504d>] ? start_secondary+0x19d/0x280
> 
> rax and rdx is LIST_POISON1 and LIST_POISON2, so kernel is list_del() on an already deleted
> connection and result the general protect fault.
> 
> The "request for already hashed" warning, told us someone might change the connection flags
> incorrectly, like described in commit aea9d711, it changes the connection flags, but doesn't
> put the connection back to the list. So ip_vs_conn_hash() throw a warning and return.
> Later, when ip_vs_conn_expire fire again, ip_vs_conn_unhash() will find the HASHED connection
> and list_del() it, then kernel panic happened.
> 
> After code review, the only chance that kernel change connection flag without protection is
> in ip_vs_ftp_init_conn().

Thanks for being thorough.

> Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
> Cc: Wensong Zhang <wensong@linux-vs.org>
> Cc: Simon Horman <horms@verge.net.au>
> Cc: Julian Anastasov <ja@ssi.bg>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: "David S. Miller" <davem@davemloft.net> 

Acked-by: Simon Horman <horms@verge.net.au>

Pablo, can you handle this directly through your tree?
I think it also needs to go to -stable.

> ---
>  net/netfilter/ipvs/ip_vs_ftp.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index b20b29c..c2bc264 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -65,8 +65,10 @@ static int ip_vs_ftp_pasv;
>  static int
>  ip_vs_ftp_init_conn(struct ip_vs_app *app, struct ip_vs_conn *cp)
>  {
> +	spin_lock(&cp->lock);
>  	/* We use connection tracking for the command connection */
>  	cp->flags |= IP_VS_CONN_F_NFCT;
> +	spin_unlock(&cp->lock);
>  	return 0;
>  }

^ permalink raw reply

* [GIT] net --> net-next merge
From: David Miller @ 2012-06-29  0:52 UTC (permalink / raw)
  To: netdev; +Cc: sjur.brandeland

I just did said merge, but there was a really difficult conflict to
resolve in the caif-hsi driver.

Sjur please send me any necessary fixups and please be more mindful in
the future of the incredible merge pain you put me through when you
have such fundamentally overlapping changes like that and don't
provide me with a sample merge resolution like other people do.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next 1/4] tcp: fix inet6_csk_route_req() for link-local addresses
From: David Miller @ 2012-06-29  0:55 UTC (permalink / raw)
  To: ncardwell; +Cc: netdev, edumazet
In-Reply-To: <1340922861-11684-1-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 28 Jun 2012 15:34:18 -0700

> Fix inet6_csk_route_req() to use as the flowi6_oif the treq->iif,
> which is correctly fixed up in tcp_v6_conn_request() to handle the
> case of link-local addresses. This brings it in line with the
> tcp_v6_send_synack() code, which is already correctly using the
> treq->iif in this way.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/4] tcp: pass fl6 to inet6_csk_route_req()
From: David Miller @ 2012-06-29  0:55 UTC (permalink / raw)
  To: ncardwell; +Cc: netdev, edumazet
In-Reply-To: <1340922861-11684-2-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 28 Jun 2012 15:34:19 -0700

> This commit changes inet_csk_route_req() so that it uses a pointer to
> a struct flowi6, rather than allocating its own on the stack. This
> brings its behavior in line with its IPv4 cousin,
> inet_csk_route_req(), and allows a follow-on patch to fix a dst leak.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/4] tcp: use inet6_csk_route_req() in tcp_v6_send_synack()
From: David Miller @ 2012-06-29  0:55 UTC (permalink / raw)
  To: ncardwell; +Cc: netdev, edumazet
In-Reply-To: <1340922861-11684-3-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 28 Jun 2012 15:34:20 -0700

> With the recent change (earlier in this patch series) to set
> flowi6_oif to treq->iif in inet6_csk_route_req(), the dst lookup in
> these two functions is now identical, so tcp_v6_send_synack() can now
> just call inet6_csk_route_req(), to reduce code duplication and keep
> things closer to the IPv4 side, which is structured this way.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 4/4] tcp: plug dst leak in tcp_v6_conn_request()
From: David Miller @ 2012-06-29  0:55 UTC (permalink / raw)
  To: ncardwell; +Cc: netdev, edumazet
In-Reply-To: <1340922861-11684-4-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Thu, 28 Jun 2012 15:34:21 -0700

> The code in tcp_v6_conn_request() was implicitly assuming that
> tcp_v6_send_synack() would take care of dst_release(), much as
> tcp_v4_send_synack() already does. This resulted in
> tcp_v6_conn_request() leaking a dst if sysctl_tw_recycle is enabled.
> 
> This commit restructures tcp_v6_send_synack() so that it accepts a dst
> pointer and takes care of releasing the dst that is passed in, to plug
> the leak and avoid future surprises by bringing the IPv6 behavior in
> line with the IPv4 side.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: Use NLMSG_DEFAULT_SIZE in combination with nlmsg_new()
From: David Miller @ 2012-06-29  0:57 UTC (permalink / raw)
  To: tgraf
  Cc: netdev, jpirko, dbaryshkov, slapin, johannes, lauro.venancio,
	aloisio.almeida, sameo
In-Reply-To: <caecbaa2f181c17908fbd61264e1acb9c5cb1fe1.1340891841.git.tgraf@suug.ch>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 28 Jun 2012 15:57:45 +0200

> Using NLMSG_GOODSIZE results in multiple pages being used as
> nlmsg_new() will automatically add the size of the netlink
> header to the payload thus exceeding the page limit.
> 
> NLMSG_DEFAULT_SIZE takes this into account.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

Applied, thanks Thomas.

^ permalink raw reply

* [net-next PATCH 00/02] net/ipv4: Add support for new tunnel type VTI.
From: Saurabh @ 2012-06-29  0:52 UTC (permalink / raw)
  To: netdev

Incorporated David and Steffen's comments.
Resubmitting after taking into account review comments:
The VTI tunnel is applicable to esp, ah and ipcomp.

Introduction:
Virtual tunnel interface is a way to represent policy based IPsec tunnels as virtual interfaces in linux. This is similar to Cisco's VTI (virtual tunnel interface) and Juniper's representaion of secure tunnel (st.xx). The advantage of representing an IPsec tunnel as an interface is that it is possible to plug Ipsec tunnels into the routing protocol infrastructure of a router. Therefore it becomes possible to influence the packet path by toggling the link state of the tunnel or based on routing metrics.

Overview:
Natively linux kernel does not support ipsec as an interface. Also secure interface assume a ipsec policy 4 tupple of {dst-ip-any, src-ip-any, dst-port-any, src-port-any}. Applying this 4 tuple in linux would result in all traffic matching the ipsec policy. What is needed is a tunnel distinguisher. The linux kernel skbuff has fwmark which is used for policy based routing (PBR). Linux kernel version 2.6.35 enhanced SPD/SADB to use fwmark as part of the IPsec policy. Strongswan has also introduced support for this kernel feature with version 4.5.0. We can therefore use the fwmark as the distinguisher for tunnel interface. We can also create a light weight tunnel kernel module (vti) to give the notion of an interface for rest of the kernel routing system. The tunnel module does not do any enc
 apsulation/decapsulation. The kernel's xfrm modules still do the esp encryption/decryption.

Usage:
ip tunnel add sti15 mode vti remote 12.0.0.1 local 12.0.0.3 ikey 15
or
ip link add sti15 type vti key 15 remote 12.0.0.1 local 12.0.0.3

Sample strongswan config would be:
conn peer-12.0.0.1-tunnel-1
   left=12.0.0.3
   right=12.0.0.1
   leftsubnet=0.0.0.0/0
   rightsubnet=0.0.0.0/0
   ike=aes128-sha1-modp1024!
   ikelifetime=28800s
   keyingtries=%forever
   esp=aes128-sha1!
   keylife=3600s
   rekeymargin=540s
   type=tunnel
   pfs=yes
   compress=no
   authby=secret
   auto=start
   mark_in=0xf
   mark_out=0xf
   keyexchange=ikev1

Also you need the iptables rule for ingress esp and udp-4500 packets:
-A PREROUTING -s 12.0.0.1/32 -d 12.0.0.3/32 -p esp -j MARK --set-xmark 0xf/0xffffffff

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---

^ permalink raw reply

* [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Saurabh @ 2012-06-29  0:52 UTC (permalink / raw)
  To: netdev



Incorporated David and Steffen's comments.
Add hook for rx-path xfmr4_mode_tunnel for VTI tunnel module.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e0a55df..04214c0 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1475,6 +1475,8 @@ extern int xfrm4_output(struct sk_buff *skb);
 extern int xfrm4_output_finish(struct sk_buff *skb);
 extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family);
 extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler, unsigned short family);
+extern int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler);
+extern int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler);
 extern int xfrm6_extract_header(struct sk_buff *skb);
 extern int xfrm6_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi);
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index ed4bf11..4fc2944 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -15,6 +15,68 @@
 #include <net/ip.h>
 #include <net/xfrm.h>
 
+/*
+ * Informational hook. The decap is still done here.
+ */
+static struct xfrm_tunnel __rcu *rcv_notify_handlers __read_mostly;
+static DEFINE_MUTEX(xfrm4_mode_tunnel_input_mutex);
+
+int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+
+	int ret = -EEXIST;
+	int priority = handler->priority;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+
+	for (pprev = &rcv_notify_handlers;
+		(t = rcu_dereference_protected(*pprev,
+		lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+		pprev = &t->next) {
+		if (t->priority > priority)
+			break;
+		if (t->priority == priority)
+			goto err;
+
+	}
+
+	handler->next = *pprev;
+	rcu_assign_pointer(*pprev, handler);
+
+	ret = 0;
+
+err:
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);
+
+int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -ENOENT;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+	for (pprev = &rcv_notify_handlers;
+		(t = rcu_dereference_protected(*pprev,
+		lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+		pprev = &t->next) {
+		if (t == handler) {
+			*pprev = handler->next;
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	synchronize_net();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_deregister);
+
 static inline void ipip_ecn_decapsulate(struct sk_buff *skb)
 {
 	struct iphdr *inner_iph = ipip_hdr(skb);
@@ -64,8 +126,14 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb)
 	return 0;
 }
 
+#define for_each_input_rcu(head, handler)	\
+	for (handler = rcu_dereference(head);	\
+		handler != NULL;		\
+		handler = rcu_dereference(handler->next))  \
+
 static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 {
+	struct xfrm_tunnel *handler;
 	int err = -EINVAL;
 
 	if (XFRM_MODE_SKB_CB(skb)->protocol != IPPROTO_IPIP)
@@ -74,6 +142,10 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 		goto out;
 
+	/* The handlers do not consume the skb. */
+	for_each_input_rcu(rcv_notify_handlers, handler)
+		handler->handler(skb);
+
 	if (skb_cloned(skb) &&
 	    (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
 		goto out;

^ permalink raw reply related

* [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: Saurabh @ 2012-06-29  0:52 UTC (permalink / raw)
  To: netdev



Incorporated David and Steffen's comments.
New VTI tunnel kernel module, Kconfig and Makefile changes.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 16b92d0..5efff60 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -80,4 +80,18 @@ enum {
 
 #define IFLA_GRE_MAX	(__IFLA_GRE_MAX - 1)
 
+/* VTI-mode i_flags */
+#define VTI_ISVTI 0x0001
+
+enum {
+	IFLA_VTI_UNSPEC,
+	IFLA_VTI_LINK,
+	IFLA_VTI_IKEY,
+	IFLA_VTI_OKEY,
+	IFLA_VTI_LOCAL,
+	IFLA_VTI_REMOTE,
+	__IFLA_VTI_MAX,
+};
+
+#define IFLA_VTI_MAX	(__IFLA_VTI_MAX - 1)
 #endif /* _IF_TUNNEL_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 20f1cb5..32d467b 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -310,6 +310,16 @@ config SYN_COOKIES
 
 	  If unsure, say N.
 
+config NET_IPVTI
+	tristate "Virtual (secure) IP: tunneling"
+	select INET_TUNNEL
+	depends on INET_XFRM_MODE_TUNNEL
+	---help---
+	Tunneling means encapsulating data of one protocol type within
+	another protocol and sending it over a channel that understands the
+	Pencapsulating protocol. This can be used with xfrm mode tunnel to give
+	the notion of a secure tunnel for IPSEC and then use routing protocol on top.
+
 config INET_AH
 	tristate "IP: AH transformation"
 	select XFRM_ALGO
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ff75d3b..3999ce9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IP_MROUTE) += ipmr.o
 obj-$(CONFIG_NET_IPIP) += ipip.o
 obj-$(CONFIG_NET_IPGRE_DEMUX) += gre.o
 obj-$(CONFIG_NET_IPGRE) += ip_gre.o
+obj-$(CONFIG_NET_IPVTI) += ip_vti.o
 obj-$(CONFIG_SYN_COOKIES) += syncookies.o
 obj-$(CONFIG_INET_AH) += ah4.o
 obj-$(CONFIG_INET_ESP) += esp4.o
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
new file mode 100644
index 0000000..7aa3662
--- /dev/null
+++ b/net/ipv4/ip_vti.c
@@ -0,0 +1,970 @@
+/*
+ *	Linux NET3:	IP/IP protocol decoder modified to support virtual tunnel interface
+ *
+ *	Authors:
+ *		Saurabh Mohan (saurabh.mohan@vyatta.com) 05/07/2012
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ */
+
+/*
+   This version of net/ipv4/ip_vti.c is cloned of net/ipv4/ipip.c
+
+   For comments look at net/ipv4/ip_gre.c --ANK
+ */
+
+
+#include <linux/capability.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_arp.h>
+#include <linux/mroute.h>
+#include <linux/init.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/if_ether.h>
+
+#include <net/sock.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/ipip.h>
+#include <net/inet_ecn.h>
+#include <net/xfrm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define HASH_SIZE  16
+#define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
+
+static struct rtnl_link_ops vti_link_ops __read_mostly;
+
+static int vti_net_id __read_mostly;
+struct vti_net {
+	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_wc[1];
+	struct ip_tunnel **tunnels[4];
+
+	struct net_device *fb_tunnel_dev;
+};
+
+static int vti_fb_tunnel_init(struct net_device *dev);
+static int vti_tunnel_init(struct net_device *dev);
+static void vti_tunnel_setup(struct net_device *dev);
+static void vti_dev_free(struct net_device *dev);
+static int vti_tunnel_bind_dev(struct net_device *dev);
+
+/*
+ * Locking : hash tables are protected by RCU and RTNL
+ */
+
+#define for_each_ip_tunnel_rcu(start) \
+	for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
+
+/* often modified stats are per cpu, other are shared (netdev->stats) */
+struct pcpu_tstats {
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
+	struct	u64_stats_sync	syncp;
+};
+
+#define VTI_XMIT(stats1, stats2) do {				\
+	int err;						\
+	int pkt_len = skb->len;					\
+	err = dst_output(skb);					\
+	if (net_xmit_eval(err) == 0) {				\
+		(stats1)->tx_bytes += pkt_len;			\
+		(stats1)->tx_packets++;				\
+	} else {						\
+		(stats2)->tx_errors++;				\
+		(stats2)->tx_aborted_errors++;			\
+	}							\
+} while (0)
+
+
+static struct rtnl_link_stats64 *vti_get_stats64(struct net_device *dev,
+					       struct rtnl_link_stats64 *tot)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_tstats *tstats = per_cpu_ptr(dev->tstats, i);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_bh(&tstats->syncp);
+			rx_packets = tstats->rx_packets;
+			tx_packets = tstats->tx_packets;
+			rx_bytes = tstats->rx_bytes;
+			tx_bytes = tstats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&tstats->syncp, start));
+
+		tot->rx_packets += rx_packets;
+		tot->tx_packets += tx_packets;
+		tot->rx_bytes   += rx_bytes;
+		tot->tx_bytes   += tx_bytes;
+	}
+
+	tot->multicast = dev->stats.multicast;
+	tot->rx_crc_errors = dev->stats.rx_crc_errors;
+	tot->rx_fifo_errors = dev->stats.rx_fifo_errors;
+	tot->rx_length_errors = dev->stats.rx_length_errors;
+	tot->rx_errors = dev->stats.rx_errors;
+	tot->tx_fifo_errors = dev->stats.tx_fifo_errors;
+	tot->tx_carrier_errors = dev->stats.tx_carrier_errors;
+	tot->tx_dropped = dev->stats.tx_dropped;
+	tot->tx_aborted_errors = dev->stats.tx_aborted_errors;
+	tot->tx_errors = dev->stats.tx_errors;
+
+	return tot;
+}
+
+static struct ip_tunnel *vti_tunnel_lookup(struct net *net,
+					 __be32 remote, __be32 local)
+{
+	unsigned h0 = HASH(remote);
+	unsigned h1 = HASH(local);
+	struct ip_tunnel *t;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_r_l[h0 ^ h1])
+		if (local == t->parms.iph.saddr &&
+		    remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+	for_each_ip_tunnel_rcu(ipn->tunnels_r[h0])
+		if (remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_l[h1])
+		if (local == t->parms.iph.saddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_wc[0])
+		if (t && (t->dev->flags&IFF_UP))
+			return t;
+	return NULL;
+}
+
+static struct ip_tunnel **__vti_bucket(struct vti_net *ipn,
+				     struct ip_tunnel_parm *parms)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	unsigned h = 0;
+	int prio = 0;
+
+	if (remote) {
+		prio |= 2;
+		h ^= HASH(remote);
+	}
+	if (local) {
+		prio |= 1;
+		h ^= HASH(local);
+	}
+	return &ipn->tunnels[prio][h];
+}
+
+static inline struct ip_tunnel **vti_bucket(struct vti_net *ipn,
+					  struct ip_tunnel *t)
+{
+	return __vti_bucket(ipn, &t->parms);
+}
+
+static void vti_tunnel_unlink(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp;
+	struct ip_tunnel *iter;
+
+	for (tp = vti_bucket(ipn, t);
+	     (iter = rtnl_dereference(*tp)) != NULL;
+	     tp = &iter->next) {
+		if (t == iter) {
+			rcu_assign_pointer(*tp, t->next);
+			break;
+		}
+	}
+}
+
+static void vti_tunnel_link(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp = vti_bucket(ipn, t);
+
+	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
+	rcu_assign_pointer(*tp, t);
+}
+
+static struct ip_tunnel *vti_tunnel_locate(struct net *net,
+					 struct ip_tunnel_parm *parms,
+					 int create)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	struct ip_tunnel *t, *nt;
+	struct ip_tunnel __rcu **tp;
+	struct net_device *dev;
+	char name[IFNAMSIZ];
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for (tp = __vti_bucket(ipn, parms);
+	     (t = rtnl_dereference(*tp)) != NULL;
+	     tp = &t->next) {
+		if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr)
+			return t;
+	}
+	if (!create)
+		return NULL;
+
+	if (parms->name[0])
+		strlcpy(name, parms->name, IFNAMSIZ);
+	else
+		strcpy(name, "vti%d");
+
+	dev = alloc_netdev(sizeof(*t), name, vti_tunnel_setup);
+	if (dev == NULL)
+		return NULL;
+
+	dev_net_set(dev, net);
+
+	nt = netdev_priv(dev);
+	nt->parms = *parms;
+	dev->rtnl_link_ops = &vti_link_ops;
+
+	vti_tunnel_bind_dev(dev);
+
+	if (register_netdevice(dev) < 0)
+		goto failed_free;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+	return nt;
+
+ failed_free:
+	free_netdev(dev);
+	return NULL;
+}
+
+static void vti_tunnel_uninit(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	if (dev == ipn->fb_tunnel_dev)
+		RCU_INIT_POINTER(ipn->tunnels_wc[0], NULL);
+	else
+		vti_tunnel_unlink(ipn, netdev_priv(dev));
+	dev_put(dev);
+}
+
+static int vti_err(struct sk_buff *skb, u32 info)
+{
+
+	/* All the routers (except for Linux) return only
+	 * 8 bytes of packet payload. It means, that precise relaying of
+	 * ICMP in the real Internet is absolutely infeasible.
+	 */
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	const int type = icmp_hdr(skb)->type;
+	const int code = icmp_hdr(skb)->code;
+	struct ip_tunnel *t;
+
+	switch (type) {
+	default:
+	case ICMP_PARAMETERPROB:
+		return 0;
+
+	case ICMP_DEST_UNREACH:
+		switch (code) {
+		case ICMP_SR_FAILED:
+		case ICMP_PORT_UNREACH:
+			/* Impossible event. */
+			return 0;
+		default:
+			/* All others are translated to HOST_UNREACH. */
+			break;
+		}
+		break;
+	case ICMP_TIME_EXCEEDED:
+		if (code != ICMP_EXC_TTL)
+			return 0;
+		break;
+	}
+
+	rcu_read_lock();
+	t = vti_tunnel_lookup(dev_net(skb->dev), iph->daddr, iph->saddr);
+	if (t == NULL)
+		goto out;
+
+	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
+		ipv4_update_pmtu(skb, dev_net(skb->dev), info,
+				 t->parms.link, 0, IPPROTO_IPIP, 0);
+		goto out;
+	}
+
+	if (t->parms.iph.ttl == 0 && type == ICMP_TIME_EXCEEDED)
+		goto out;
+
+	if (time_before(jiffies, t->err_time + IPTUNNEL_ERR_TIMEO))
+		t->err_count++;
+	else
+		t->err_count = 1;
+	t->err_time = jiffies;
+out:
+	rcu_read_unlock();
+	return 0;
+}
+
+/*
+ * We dont digest the packet therefore let the packet pass.
+ */
+static int vti_rcv(struct sk_buff *skb)
+{
+	struct ip_tunnel *tunnel;
+	const struct iphdr *iph = ip_hdr(skb);
+
+	rcu_read_lock();
+	tunnel = vti_tunnel_lookup(dev_net(skb->dev), iph->saddr, iph->daddr);
+	if (tunnel != NULL) {
+		struct pcpu_tstats *tstats;
+
+		tstats = this_cpu_ptr(tunnel->dev->tstats);
+		tstats->rx_packets++;
+		tstats->rx_bytes += skb->len;
+
+		skb->dev = tunnel->dev;
+		rcu_read_unlock();
+		/* We do not eat the packet here therefore return 1 */
+		return 1;
+	}
+	rcu_read_unlock();
+
+	return -1;
+}
+
+/*
+ *	This function assumes it is being called from dev_queue_xmit()
+ *	and that skb is filled properly by that function.
+ */
+
+static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct pcpu_tstats *tstats;
+	struct net_device_stats *stats = &tunnel->dev->stats;
+	struct iphdr  *tiph = &tunnel->parms.iph;
+	u8     tos = tunnel->parms.iph.tos;
+	struct rtable *rt;		/* Route to the other host */
+	struct net_device *tdev;	/* Device to other host */
+	struct iphdr  *old_iph = ip_hdr(skb);
+	__be32 dst = tiph->daddr;
+	struct flowi4 fl4;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		goto tx_error;
+
+	if (tos&1)
+		tos = old_iph->tos;
+
+	if (!dst) {
+		/* NBMA tunnel */
+		rt = skb_rtable(skb);
+		if (rt == NULL) {
+			stats->tx_fifo_errors++;
+			goto tx_error;
+		}
+		dst = rt->rt_gateway;
+		if (dst == 0)
+			goto tx_error_icmp;
+	}
+
+	memset(&fl4, 0, sizeof(fl4));
+	flowi4_init_output(&fl4, tunnel->parms.link,
+		htonl(tunnel->parms.i_key), RT_TOS(tos), RT_SCOPE_UNIVERSE,
+		IPPROTO_IPIP, 0,
+		dst, tiph->saddr, 0, 0);
+	rt = ip_route_output_key(dev_net(dev), &fl4);
+	if (IS_ERR(rt)) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+#ifdef CONFIG_XFRM
+		/*
+		 * if there is no transform then this tunnel is not functional.
+		 * Or if the xfrm is not mode tunnel.
+		 */
+		if (!rt->dst.xfrm || rt->dst.xfrm->props.mode != XFRM_MODE_TUNNEL) {
+			stats->tx_carrier_errors++;
+			goto tx_error_icmp;
+		}
+#endif
+	tdev = rt->dst.dev;
+
+	if (tdev == dev) {
+		ip_rt_put(rt);
+		stats->collisions++;
+		goto tx_error;
+
+	}
+
+
+	if (tunnel->err_count > 0) {
+		if (time_before(jiffies,
+				tunnel->err_time + IPTUNNEL_ERR_TIMEO)) {
+			tunnel->err_count--;
+			dst_link_failure(skb);
+		} else
+			tunnel->err_count = 0;
+	}
+
+
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+	nf_reset(skb);
+	skb->dev = skb_dst(skb)->dev;
+
+	tstats = this_cpu_ptr(dev->tstats);
+	VTI_XMIT(tstats, &dev->stats);
+	return NETDEV_TX_OK;
+
+tx_error_icmp:
+	dst_link_failure(skb);
+tx_error:
+	stats->tx_errors++;
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int vti_tunnel_bind_dev(struct net_device *dev)
+{
+	struct net_device *tdev = NULL;
+	struct ip_tunnel *tunnel;
+	struct iphdr *iph;
+
+	tunnel = netdev_priv(dev);
+	iph = &tunnel->parms.iph;
+
+	if (iph->daddr) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+		memset(&fl4, 0, sizeof(fl4));
+		flowi4_init_output(&fl4, tunnel->parms.link,
+				htonl(tunnel->parms.i_key), RT_TOS(iph->tos), RT_SCOPE_UNIVERSE,
+				IPPROTO_IPIP, 0,
+				iph->daddr, iph->saddr, 0, 0);
+		rt = ip_route_output_key(dev_net(dev), &fl4);
+		if (!IS_ERR(rt)) {
+			tdev = rt->dst.dev;
+			ip_rt_put(rt);
+		}
+		dev->flags |= IFF_POINTOPOINT;
+	}
+
+	if (!tdev && tunnel->parms.link)
+		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+
+	if (tdev) {
+		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
+		dev->mtu = tdev->mtu;
+	}
+	dev->iflink = tunnel->parms.link;
+	return dev->mtu;
+}
+
+static int
+vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+	int err = 0;
+	struct ip_tunnel_parm p;
+	struct ip_tunnel *t;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	switch (cmd) {
+	case SIOCGETTUNNEL:
+		t = NULL;
+		if (dev == ipn->fb_tunnel_dev) {
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p))) {
+				err = -EFAULT;
+				break;
+			}
+			t = vti_tunnel_locate(net, &p, 0);
+		}
+		if (t == NULL)
+			t = netdev_priv(dev);
+		memcpy(&p, &t->parms, sizeof(p));
+		p.i_flags |= GRE_KEY;
+		p.o_flags |= GRE_KEY;
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
+			err = -EFAULT;
+		break;
+
+	case SIOCADDTUNNEL:
+	case SIOCCHGTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		err = -EFAULT;
+		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+			goto done;
+
+		err = -EINVAL;
+		if (p.iph.version != 4 || p.iph.protocol != IPPROTO_IPIP ||
+		    p.iph.ihl != 5 || (p.iph.frag_off&htons(~IP_DF)))
+			goto done;
+		if (p.iph.ttl)
+			p.iph.frag_off |= htons(IP_DF);
+
+		t = vti_tunnel_locate(net, &p, cmd == SIOCADDTUNNEL);
+
+		if (dev != ipn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+			if (t != NULL) {
+				if (t->dev != dev) {
+					err = -EEXIST;
+					break;
+				}
+			} else {
+				if (((dev->flags&IFF_POINTOPOINT) && !p.iph.daddr) ||
+				    (!(dev->flags&IFF_POINTOPOINT) && p.iph.daddr)) {
+					err = -EINVAL;
+					break;
+				}
+				t = netdev_priv(dev);
+				vti_tunnel_unlink(ipn, t);
+				synchronize_net();
+				t->parms.iph.saddr = p.iph.saddr;
+				t->parms.iph.daddr = p.iph.daddr;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				t->parms.iph.protocol = IPPROTO_IPIP;
+				memcpy(dev->dev_addr, &p.iph.saddr, 4);
+				memcpy(dev->broadcast, &p.iph.daddr, 4);
+				vti_tunnel_link(ipn, t);
+				netdev_state_change(dev);
+			}
+		}
+
+		if (t) {
+			err = 0;
+			if (cmd == SIOCCHGTUNNEL) {
+				t->parms.iph.ttl = p.iph.ttl;
+				t->parms.iph.tos = p.iph.tos;
+				t->parms.iph.frag_off = p.iph.frag_off;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				if (t->parms.link != p.link) {
+					t->parms.link = p.link;
+					vti_tunnel_bind_dev(dev);
+					netdev_state_change(dev);
+				}
+			}
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &t->parms, sizeof(p)))
+				err = -EFAULT;
+		} else
+			err = (cmd == SIOCADDTUNNEL ? -ENOBUFS : -ENOENT);
+		break;
+
+	case SIOCDELTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		if (dev == ipn->fb_tunnel_dev) {
+			err = -EFAULT;
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+				goto done;
+			err = -ENOENT;
+
+			t = vti_tunnel_locate(net, &p, 0);
+			if (t == NULL)
+				goto done;
+			err = -EPERM;
+			if (t->dev == ipn->fb_tunnel_dev)
+				goto done;
+			dev = t->dev;
+		}
+		unregister_netdevice(dev);
+		err = 0;
+		break;
+
+	default:
+		err = -EINVAL;
+	}
+
+done:
+	return err;
+}
+
+static int vti_tunnel_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < 68 || new_mtu > 0xFFF8)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops vti_netdev_ops = {
+	.ndo_init	= vti_tunnel_init,
+	.ndo_uninit	= vti_tunnel_uninit,
+	.ndo_start_xmit	= vti_tunnel_xmit,
+	.ndo_do_ioctl	= vti_tunnel_ioctl,
+	.ndo_change_mtu	= vti_tunnel_change_mtu,
+	.ndo_get_stats64  = vti_get_stats64,
+};
+
+static void vti_dev_free(struct net_device *dev)
+{
+	free_percpu(dev->tstats);
+	free_netdev(dev);
+}
+
+static void vti_tunnel_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &vti_netdev_ops;
+	dev->destructor		= vti_dev_free;
+
+	dev->type		= ARPHRD_TUNNEL;
+	dev->hard_header_len	= LL_MAX_HEADER + sizeof(struct iphdr);
+	dev->mtu		= ETH_DATA_LEN;
+	dev->flags		= IFF_NOARP;
+	dev->iflink		= 0;
+	dev->addr_len		= 4;
+	dev->features		|= NETIF_F_NETNS_LOCAL;
+	dev->features		|= NETIF_F_LLTX;
+	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
+}
+
+static int vti_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
+	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __net_init vti_fb_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct iphdr *iph = &tunnel->parms.iph;
+	struct vti_net *ipn = net_generic(dev_net(dev), vti_net_id);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	iph->version		= 4;
+	iph->protocol		= IPPROTO_IPIP;
+	iph->ihl		= 5;
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	dev_hold(dev);
+	rcu_assign_pointer(ipn->tunnels_wc[0], tunnel);
+	return 0;
+}
+
+static struct xfrm_tunnel vti_handler __read_mostly = {
+	.handler	=	vti_rcv,
+	.err_handler	=	vti_err,
+	.priority	=	1,
+};
+
+static void vti_destroy_tunnels(struct vti_net *ipn, struct list_head *head)
+{
+	int prio;
+
+	for (prio = 1; prio < 4; prio++) {
+		int h;
+		for (h = 0; h < HASH_SIZE; h++) {
+			struct ip_tunnel *t;
+
+			t = rtnl_dereference(ipn->tunnels[prio][h]);
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = rtnl_dereference(t->next);
+			}
+		}
+	}
+}
+
+static int __net_init vti_init_net(struct net *net)
+{
+	int err;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	ipn->tunnels[0] = ipn->tunnels_wc;
+	ipn->tunnels[1] = ipn->tunnels_l;
+	ipn->tunnels[2] = ipn->tunnels_r;
+	ipn->tunnels[3] = ipn->tunnels_r_l;
+
+	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+					   "ip_vti0",
+					   vti_tunnel_setup);
+	if (!ipn->fb_tunnel_dev) {
+		err = -ENOMEM;
+		goto err_alloc_dev;
+	}
+	dev_net_set(ipn->fb_tunnel_dev, net);
+
+	err = vti_fb_tunnel_init(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	ipn->fb_tunnel_dev->rtnl_link_ops = &vti_link_ops;
+
+	err = register_netdev(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	return 0;
+
+err_reg_dev:
+	vti_dev_free(ipn->fb_tunnel_dev);
+err_alloc_dev:
+	/* nothing */
+	return err;
+}
+
+static void __net_exit vti_exit_net(struct net *net)
+{
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	LIST_HEAD(list);
+
+	rtnl_lock();
+	vti_destroy_tunnels(ipn, &list);
+	unregister_netdevice_many(&list);
+	rtnl_unlock();
+}
+
+static struct pernet_operations vti_net_ops = {
+	.init = vti_init_net,
+	.exit = vti_exit_net,
+	.id   = &vti_net_id,
+	.size = sizeof(struct vti_net),
+};
+
+static int vti_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	return 0;
+}
+
+static void vti_netlink_parms(struct nlattr *data[],
+				struct ip_tunnel_parm *parms)
+{
+	memset(parms, 0, sizeof(*parms));
+
+	parms->iph.protocol = IPPROTO_IPIP;
+
+	if (!data)
+		return;
+
+	if (data[IFLA_VTI_LINK])
+		parms->link = nla_get_u32(data[IFLA_VTI_LINK]);
+
+	if (data[IFLA_VTI_IKEY])
+		parms->i_key = nla_get_be32(data[IFLA_VTI_IKEY]);
+
+	if (data[IFLA_VTI_OKEY])
+		parms->o_key = nla_get_be32(data[IFLA_VTI_OKEY]);
+
+	if (data[IFLA_VTI_LOCAL])
+		parms->iph.saddr = nla_get_be32(data[IFLA_VTI_LOCAL]);
+
+	if (data[IFLA_VTI_REMOTE])
+		parms->iph.daddr = nla_get_be32(data[IFLA_VTI_REMOTE]);
+
+}
+
+static int vti_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[],
+			 struct nlattr *data[])
+{
+	struct ip_tunnel *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	int mtu;
+	int err;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &nt->parms);
+
+	if (vti_tunnel_locate(net, &nt->parms, 0))
+		return -EEXIST;
+
+	mtu = vti_tunnel_bind_dev(dev);
+	if (!tb[IFLA_MTU])
+		dev->mtu = mtu;
+
+	err = register_netdevice(dev);
+	if (err)
+		goto out;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+
+out:
+	return err;
+	return 0;
+}
+
+static int vti_changelink(struct net_device *dev, struct nlattr *tb[],
+			    struct nlattr *data[])
+{
+	struct ip_tunnel *t, *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	struct ip_tunnel_parm p;
+	int mtu;
+
+	if (dev == ipn->fb_tunnel_dev)
+		return -EINVAL;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &p);
+
+	t = vti_tunnel_locate(net, &p, 0);
+
+	if (t) {
+		if (t->dev != dev)
+			return -EEXIST;
+	} else {
+		t = nt;
+
+		vti_tunnel_unlink(ipn, t);
+		t->parms.iph.saddr = p.iph.saddr;
+		t->parms.iph.daddr = p.iph.daddr;
+		t->parms.i_key = p.i_key;
+		t->parms.o_key = p.o_key;
+		if (dev->type != ARPHRD_ETHER) {
+			memcpy(dev->dev_addr, &p.iph.saddr, 4);
+			memcpy(dev->broadcast, &p.iph.daddr, 4);
+		}
+		vti_tunnel_link(ipn, t);
+		netdev_state_change(dev);
+	}
+
+	if (t->parms.link != p.link) {
+		t->parms.link = p.link;
+		mtu = vti_tunnel_bind_dev(dev);
+		if (!tb[IFLA_MTU])
+			dev->mtu = mtu;
+		netdev_state_change(dev);
+	}
+
+	return 0;
+}
+
+static size_t vti_get_size(const struct net_device *dev)
+{
+	return
+		/* IFLA_VTI_LINK */
+		nla_total_size(4) +
+		/* IFLA_VTI_IKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_OKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_LOCAL */
+		nla_total_size(4) +
+		/* IFLA_VTI_REMOTE */
+		nla_total_size(4) +
+		0;
+}
+
+static int vti_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+	struct ip_tunnel_parm *p = &t->parms;
+
+	nla_put_u32(skb, IFLA_VTI_LINK, p->link);
+	nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key);
+	nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key);
+	nla_put_be32(skb, IFLA_VTI_LOCAL, p->iph.saddr);
+	nla_put_be32(skb, IFLA_VTI_REMOTE, p->iph.daddr);
+
+	return 0;
+}
+
+static const struct nla_policy vti_policy[IFLA_VTI_MAX + 1] = {
+	[IFLA_VTI_LINK]		= { .type = NLA_U32 },
+	[IFLA_VTI_IKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_OKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VTI_REMOTE]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+};
+
+static struct rtnl_link_ops vti_link_ops __read_mostly = {
+	.kind		= "vti",
+	.maxtype	= IFLA_VTI_MAX,
+	.policy		= vti_policy,
+	.priv_size	= sizeof(struct ip_tunnel),
+	.setup		= vti_tunnel_setup,
+	.validate	= vti_tunnel_validate,
+	.newlink	= vti_newlink,
+	.changelink	= vti_changelink,
+	.get_size	= vti_get_size,
+	.fill_info	= vti_fill_info,
+};
+
+static int __init vti_init(void)
+{
+	int err;
+
+	pr_info("IPv4 over IPSec tunneling driver\n");
+
+	err = register_pernet_device(&vti_net_ops);
+	if (err < 0)
+		return err;
+	err = xfrm4_mode_tunnel_input_register(&vti_handler);
+	if (err < 0) {
+		unregister_pernet_device(&vti_net_ops);
+		pr_info(KERN_INFO "vti init: can't register tunnel\n");
+	}
+
+	err = rtnl_link_register(&vti_link_ops);
+	if (err < 0)
+		goto rtnl_link_failed;
+
+	return err;
+
+rtnl_link_failed:
+	xfrm4_mode_tunnel_input_deregister(&vti_handler);
+	unregister_pernet_device(&vti_net_ops);
+	return err;
+}
+
+static void __exit vti_fini(void)
+{
+	rtnl_link_unregister(&vti_link_ops);
+	if (xfrm4_mode_tunnel_input_deregister(&vti_handler))
+		pr_info("vti close: can't deregister tunnel\n");
+
+	unregister_pernet_device(&vti_net_ops);
+}
+
+module_init(vti_init);
+module_exit(vti_fini);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK("vti");
+MODULE_ALIAS_NETDEV("ip_vti0");

^ permalink raw reply related

* Re: [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: David Miller @ 2012-06-29  1:00 UTC (permalink / raw)
  To: saurabh.mohan; +Cc: netdev
In-Reply-To: <20120629005225.GA4494@debian-saurabh-64.vyatta.com>

From: Saurabh <saurabh.mohan@vyatta.com>
Date: Thu, 28 Jun 2012 17:52:25 -0700

> +	for (pprev = &rcv_notify_handlers;
> +		(t = rcu_dereference_protected(*pprev,
> +		lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
> +		pprev = &t->next) {

Please indent this properly:

	for (x;
	     y;
	     z) {

x, y, and z must all line up to the same column.

> +	for (pprev = &rcv_notify_handlers;
> +		(t = rcu_dereference_protected(*pprev,
> +		lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
> +		pprev = &t->next) {

Likewise.

> +#define for_each_input_rcu(head, handler)	\
> +	for (handler = rcu_dereference(head);	\
> +		handler != NULL;		\
> +		handler = rcu_dereference(handler->next))  \

Likewise.

^ permalink raw reply

* Re: [net-next PATCH 00/02] net/ipv4: Add support for new tunnel type VTI.
From: David Miller @ 2012-06-29  1:01 UTC (permalink / raw)
  To: saurabh.mohan; +Cc: netdev
In-Reply-To: <20120629005214.GA4477@debian-saurabh-64.vyatta.com>

From: Saurabh <saurabh.mohan@vyatta.com>
Date: Thu, 28 Jun 2012 17:52:14 -0700

> Virtual tunnel interface is a way to represent policy based IPsec tunnels as virtual interfaces in linux. This is similar to Cisco's VTI (virtual tunnel interface) and Juniper's representaion of secure tunnel (st.xx). The advantage of representing an IPsec tunnel as an interface is that it is possible to plug Ipsec tunnels into the routing protocol infrastructure of a router. Therefore it becomes possible to influence the packet path by toggling the link state of the tunnel or based on routing metrics.

Please format your text paragraphs for ~80 character lines, this is a
single 500+ character line in my email client and painful to read.

^ permalink raw reply

* Re: [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: David Miller @ 2012-06-29  1:07 UTC (permalink / raw)
  To: saurabh.mohan; +Cc: netdev
In-Reply-To: <20120629005231.GA4511@debian-saurabh-64.vyatta.com>

From: Saurabh <saurabh.mohan@vyatta.com>
Date: Thu, 28 Jun 2012 17:52:31 -0700

> +static struct rtnl_link_stats64 *vti_get_stats64(struct net_device *dev,
> +					       struct rtnl_link_stats64 *tot)

Imporperly indentex, all of the argument lines must line up with the first
column after the initial "(" on the first line.

> +static struct ip_tunnel *vti_tunnel_lookup(struct net *net,
> +					 __be32 remote, __be32 local)

Likewise.

> +static struct ip_tunnel **__vti_bucket(struct vti_net *ipn,
> +				     struct ip_tunnel_parm *parms)

Likewise.

> +static inline struct ip_tunnel **vti_bucket(struct vti_net *ipn,
> +					  struct ip_tunnel *t)

Likewise.

> +static struct ip_tunnel *vti_tunnel_locate(struct net *net,
> +					 struct ip_tunnel_parm *parms,
> +					 int create)

Likewise.

> +	for (tp = __vti_bucket(ipn, parms);
> +	     (t = rtnl_dereference(*tp)) != NULL;
> +	     tp = &t->next) {

I find it very amusing that you are able to format this for() statement
correctly, but in the entire patch #1 you were unable to.   It's like
this patch was written by a completely different person.

> +	rcu_read_lock();
> +	t = vti_tunnel_lookup(dev_net(skb->dev), iph->daddr, iph->saddr);
> +	if (t == NULL)
> +		goto out;

You should return -ENOENT, like ipip() does, when the tunnel cannot
be found.

> +	flowi4_init_output(&fl4, tunnel->parms.link,
> +		htonl(tunnel->parms.i_key), RT_TOS(tos), RT_SCOPE_UNIVERSE,
> +		IPPROTO_IPIP, 0,
> +		dst, tiph->saddr, 0, 0);

Indent these arguments properly.

> +		/*
> +		 * if there is no transform then this tunnel is not functional.
> +		 * Or if the xfrm is not mode tunnel.
> +		 */

Format comments:

	/* Like
	 * this.
	 */

Not:

	/*
	 * Like
	 * this.
	 */

> +		flowi4_init_output(&fl4, tunnel->parms.link,
> +				htonl(tunnel->parms.i_key), RT_TOS(iph->tos), RT_SCOPE_UNIVERSE,
> +				IPPROTO_IPIP, 0,
> +				iph->daddr, iph->saddr, 0, 0);

Fix this indentation.

^ permalink raw reply

* Re: [PATCH] Allow receiving packets on the fallback tunnel if they pass sanity checks
From: David Miller @ 2012-06-29  1:09 UTC (permalink / raw)
  To: phil; +Cc: netdev, phild
In-Reply-To: <4FE92E7F.5040803@ipom.com>

From: Phil Dibowitz <phil@ipom.com>
Date: Mon, 25 Jun 2012 20:37:35 -0700

> Sure. Sorry, I just kept Ville's patch description.
> 
> We do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap an extra IP
> header on incoming packets so they can be routed to the backend. In the v4
> tunnel driver, when these packets fall on the default tunl0 device, the
> behavior is to decapsulate them and drop them back on the stack. So our setup
> is that tunl0 has the VIP and eth0 has (obviously) the backend's real address.
> 
> In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same
> behavior - if you didn't have an explicit tunnel setup, it would drop the packet.
> 
> This patch brings that v4 feature to the v6 driver.
> 
> I think that's the level of detail you're looking for, but I'm happy to expand
> on anything in particular. I also break this down in tons of detail here:

This is a lot better, please resubmit the patch with a proper verbose
commit message, and please submit it properly and without any unrelated
content in the message body as described in Documentation/SubmittingPatches

^ permalink raw reply

* [net-next PATCH 00/02] net/ipv4: Add support for new tunnel type VTI.
From: Saurabh @ 2012-06-29  1:29 UTC (permalink / raw)
  To: netdev

Addressed formating feedback from David. 
Incorporated David and Steffen's comments.
Resubmitting after taking into account review comments:
The VTI tunnel is applicable to esp, ah and ipcomp.

Introduction:
Virtual tunnel interface is a way to represent policy based IPsec tunnels as
 virtual interfaces in linux. This is similar to Cisco's VTI (virtual tunnel
 interface) and Juniper's representaion of secure tunnel (st.xx).
 The advantage of representing an IPsec tunnel as an interface is that it is
 possible to plug Ipsec tunnels into the routing protocol infrastructure of a
 router. Therefore it becomes possible to influence the packet path by toggling
 the link state of the tunnel or based on routing metrics.

Overview:
Natively linux kernel does not support ipsec as an interface. Also secure
 interface assume a ipsec policy 4 tupple of {dst-ip-any, src-ip-any,
 dst-port-any, src-port-any}. Applying this 4 tuple in linux would result in
 all traffic matching the ipsec policy. What is needed is a tunnel
 distinguisher. The linux kernel skbuff has fwmark which is used for policy
 based routing (PBR). Linux kernel version 2.6.35 enhanced SPD/SADB to use
 fwmark as part of the IPsec policy. Strongswan has also introduced support for
 this kernel feature with version 4.5.0. We can therefore use the fwmark as the
 distinguisher for tunnel interface. We can also create a light weight tunnel
 kernel module (vti) to give the notion of an interface for rest of the kernel
 routing system. The tunnel module does not do any encapsulation/decapsulation.
 The kernel's xfrm modules still do the esp encryption/decryption.

Usage:
ip tunnel add sti15 mode vti remote 12.0.0.1 local 12.0.0.3 ikey 15
or
ip link add sti15 type vti key 15 remote 12.0.0.1 local 12.0.0.3

Sample strongswan config would be:
conn peer-12.0.0.1-tunnel-1
   left=12.0.0.3
   right=12.0.0.1
   leftsubnet=0.0.0.0/0
   rightsubnet=0.0.0.0/0
   ike=aes128-sha1-modp1024!
   ikelifetime=28800s
   keyingtries=%forever
   esp=aes128-sha1!
   keylife=3600s
   rekeymargin=540s
   type=tunnel
   pfs=yes
   compress=no
   authby=secret
   auto=start
   mark_in=0xf
   mark_out=0xf
   keyexchange=ikev1

Also you need the iptables rule for ingress esp and udp-4500 packets:
-A PREROUTING -s 12.0.0.1/32 -d 12.0.0.3/32 -p esp -j MARK --set-xmark 0xf/0xffffffff

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---

^ permalink raw reply

* [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Saurabh @ 2012-06-29  1:29 UTC (permalink / raw)
  To: netdev



Incorporated David and Steffen's comments.
Add hook for rx-path xfmr4_mode_tunnel for VTI tunnel module.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e0a55df..04214c0 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1475,6 +1475,8 @@ extern int xfrm4_output(struct sk_buff *skb);
 extern int xfrm4_output_finish(struct sk_buff *skb);
 extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family);
 extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler, unsigned short family);
+extern int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler);
+extern int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler);
 extern int xfrm6_extract_header(struct sk_buff *skb);
 extern int xfrm6_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi);
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index ed4bf11..af87030 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -15,6 +15,68 @@
 #include <net/ip.h>
 #include <net/xfrm.h>
 
+/*
+ * Informational hook. The decap is still done here.
+ */
+static struct xfrm_tunnel __rcu *rcv_notify_handlers __read_mostly;
+static DEFINE_MUTEX(xfrm4_mode_tunnel_input_mutex);
+
+int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+
+	int ret = -EEXIST;
+	int priority = handler->priority;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+
+	for (pprev = &rcv_notify_handlers;
+	     (t = rcu_dereference_protected(*pprev,
+	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+	     pprev = &t->next) {
+		if (t->priority > priority)
+			break;
+		if (t->priority == priority)
+			goto err;
+
+	}
+
+	handler->next = *pprev;
+	rcu_assign_pointer(*pprev, handler);
+
+	ret = 0;
+
+err:
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);
+
+int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -ENOENT;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+	for (pprev = &rcv_notify_handlers;
+	     (t = rcu_dereference_protected(*pprev,
+	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+	     pprev = &t->next) {
+		if (t == handler) {
+			*pprev = handler->next;
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	synchronize_net();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_deregister);
+
 static inline void ipip_ecn_decapsulate(struct sk_buff *skb)
 {
 	struct iphdr *inner_iph = ipip_hdr(skb);
@@ -64,8 +126,14 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb)
 	return 0;
 }
 
+#define for_each_input_rcu(head, handler)	\
+	for (handler = rcu_dereference(head);	\
+	     handler != NULL;			\
+	     handler = rcu_dereference(handler->next))
+
 static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 {
+	struct xfrm_tunnel *handler;
 	int err = -EINVAL;
 
 	if (XFRM_MODE_SKB_CB(skb)->protocol != IPPROTO_IPIP)
@@ -74,6 +142,10 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 		goto out;
 
+	/* The handlers do not consume the skb. */
+	for_each_input_rcu(rcv_notify_handlers, handler)
+		handler->handler(skb);
+
 	if (skb_cloned(skb) &&
 	    (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
 		goto out;

^ permalink raw reply related

* [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: Saurabh @ 2012-06-29  1:30 UTC (permalink / raw)
  To: netdev



New VTI tunnel kernel module, Kconfig and Makefile changes.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 16b92d0..5efff60 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -80,4 +80,18 @@ enum {
 
 #define IFLA_GRE_MAX	(__IFLA_GRE_MAX - 1)
 
+/* VTI-mode i_flags */
+#define VTI_ISVTI 0x0001
+
+enum {
+	IFLA_VTI_UNSPEC,
+	IFLA_VTI_LINK,
+	IFLA_VTI_IKEY,
+	IFLA_VTI_OKEY,
+	IFLA_VTI_LOCAL,
+	IFLA_VTI_REMOTE,
+	__IFLA_VTI_MAX,
+};
+
+#define IFLA_VTI_MAX	(__IFLA_VTI_MAX - 1)
 #endif /* _IF_TUNNEL_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 20f1cb5..32d467b 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -310,6 +310,16 @@ config SYN_COOKIES
 
 	  If unsure, say N.
 
+config NET_IPVTI
+	tristate "Virtual (secure) IP: tunneling"
+	select INET_TUNNEL
+	depends on INET_XFRM_MODE_TUNNEL
+	---help---
+	Tunneling means encapsulating data of one protocol type within
+	another protocol and sending it over a channel that understands the
+	Pencapsulating protocol. This can be used with xfrm mode tunnel to give
+	the notion of a secure tunnel for IPSEC and then use routing protocol on top.
+
 config INET_AH
 	tristate "IP: AH transformation"
 	select XFRM_ALGO
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ff75d3b..3999ce9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IP_MROUTE) += ipmr.o
 obj-$(CONFIG_NET_IPIP) += ipip.o
 obj-$(CONFIG_NET_IPGRE_DEMUX) += gre.o
 obj-$(CONFIG_NET_IPGRE) += ip_gre.o
+obj-$(CONFIG_NET_IPVTI) += ip_vti.o
 obj-$(CONFIG_SYN_COOKIES) += syncookies.o
 obj-$(CONFIG_INET_AH) += ah4.o
 obj-$(CONFIG_INET_ESP) += esp4.o
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
new file mode 100644
index 0000000..910250d
--- /dev/null
+++ b/net/ipv4/ip_vti.c
@@ -0,0 +1,975 @@
+/*
+ *	Linux NET3: IP/IP protocol decoder modified to support virtual tunnel interface
+ *
+ *	Authors:
+ *		Saurabh Mohan (saurabh.mohan@vyatta.com) 05/07/2012
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ */
+
+/*
+   This version of net/ipv4/ip_vti.c is cloned of net/ipv4/ipip.c
+
+   For comments look at net/ipv4/ip_gre.c --ANK
+ */
+
+
+#include <linux/capability.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_arp.h>
+#include <linux/mroute.h>
+#include <linux/init.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/if_ether.h>
+
+#include <net/sock.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/ipip.h>
+#include <net/inet_ecn.h>
+#include <net/xfrm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define HASH_SIZE  16
+#define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
+
+static struct rtnl_link_ops vti_link_ops __read_mostly;
+
+static int vti_net_id __read_mostly;
+struct vti_net {
+	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_wc[1];
+	struct ip_tunnel **tunnels[4];
+
+	struct net_device *fb_tunnel_dev;
+};
+
+static int vti_fb_tunnel_init(struct net_device *dev);
+static int vti_tunnel_init(struct net_device *dev);
+static void vti_tunnel_setup(struct net_device *dev);
+static void vti_dev_free(struct net_device *dev);
+static int vti_tunnel_bind_dev(struct net_device *dev);
+
+/*
+ * Locking : hash tables are protected by RCU and RTNL
+ */
+
+#define for_each_ip_tunnel_rcu(start) \
+	for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
+
+/* often modified stats are per cpu, other are shared (netdev->stats) */
+struct pcpu_tstats {
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
+	struct	u64_stats_sync	syncp;
+};
+
+#define VTI_XMIT(stats1, stats2) do {				\
+	int err;						\
+	int pkt_len = skb->len;					\
+	err = dst_output(skb);					\
+	if (net_xmit_eval(err) == 0) {				\
+		(stats1)->tx_bytes += pkt_len;			\
+		(stats1)->tx_packets++;				\
+	} else {						\
+		(stats2)->tx_errors++;				\
+		(stats2)->tx_aborted_errors++;			\
+	}							\
+} while (0)
+
+
+static struct rtnl_link_stats64 *vti_get_stats64(struct net_device *dev,
+						 struct rtnl_link_stats64 *tot)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_tstats *tstats = per_cpu_ptr(dev->tstats, i);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_bh(&tstats->syncp);
+			rx_packets = tstats->rx_packets;
+			tx_packets = tstats->tx_packets;
+			rx_bytes = tstats->rx_bytes;
+			tx_bytes = tstats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&tstats->syncp, start));
+
+		tot->rx_packets += rx_packets;
+		tot->tx_packets += tx_packets;
+		tot->rx_bytes   += rx_bytes;
+		tot->tx_bytes   += tx_bytes;
+	}
+
+	tot->multicast = dev->stats.multicast;
+	tot->rx_crc_errors = dev->stats.rx_crc_errors;
+	tot->rx_fifo_errors = dev->stats.rx_fifo_errors;
+	tot->rx_length_errors = dev->stats.rx_length_errors;
+	tot->rx_errors = dev->stats.rx_errors;
+	tot->tx_fifo_errors = dev->stats.tx_fifo_errors;
+	tot->tx_carrier_errors = dev->stats.tx_carrier_errors;
+	tot->tx_dropped = dev->stats.tx_dropped;
+	tot->tx_aborted_errors = dev->stats.tx_aborted_errors;
+	tot->tx_errors = dev->stats.tx_errors;
+
+	return tot;
+}
+
+static struct ip_tunnel *vti_tunnel_lookup(struct net *net,
+					   __be32 remote, __be32 local)
+{
+	unsigned h0 = HASH(remote);
+	unsigned h1 = HASH(local);
+	struct ip_tunnel *t;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_r_l[h0 ^ h1])
+		if (local == t->parms.iph.saddr &&
+		    remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+	for_each_ip_tunnel_rcu(ipn->tunnels_r[h0])
+		if (remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_l[h1])
+		if (local == t->parms.iph.saddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_wc[0])
+		if (t && (t->dev->flags&IFF_UP))
+			return t;
+	return NULL;
+}
+
+static struct ip_tunnel **__vti_bucket(struct vti_net *ipn,
+				       struct ip_tunnel_parm *parms)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	unsigned h = 0;
+	int prio = 0;
+
+	if (remote) {
+		prio |= 2;
+		h ^= HASH(remote);
+	}
+	if (local) {
+		prio |= 1;
+		h ^= HASH(local);
+	}
+	return &ipn->tunnels[prio][h];
+}
+
+static inline struct ip_tunnel **vti_bucket(struct vti_net *ipn,
+					    struct ip_tunnel *t)
+{
+	return __vti_bucket(ipn, &t->parms);
+}
+
+static void vti_tunnel_unlink(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp;
+	struct ip_tunnel *iter;
+
+	for (tp = vti_bucket(ipn, t);
+	     (iter = rtnl_dereference(*tp)) != NULL;
+	     tp = &iter->next) {
+		if (t == iter) {
+			rcu_assign_pointer(*tp, t->next);
+			break;
+		}
+	}
+}
+
+static void vti_tunnel_link(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp = vti_bucket(ipn, t);
+
+	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
+	rcu_assign_pointer(*tp, t);
+}
+
+static struct ip_tunnel *vti_tunnel_locate(struct net *net,
+					   struct ip_tunnel_parm *parms,
+					   int create)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	struct ip_tunnel *t, *nt;
+	struct ip_tunnel __rcu **tp;
+	struct net_device *dev;
+	char name[IFNAMSIZ];
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for (tp = __vti_bucket(ipn, parms);
+	     (t = rtnl_dereference(*tp)) != NULL;
+	     tp = &t->next) {
+		if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr)
+			return t;
+	}
+	if (!create)
+		return NULL;
+
+	if (parms->name[0])
+		strlcpy(name, parms->name, IFNAMSIZ);
+	else
+		strcpy(name, "vti%d");
+
+	dev = alloc_netdev(sizeof(*t), name, vti_tunnel_setup);
+	if (dev == NULL)
+		return NULL;
+
+	dev_net_set(dev, net);
+
+	nt = netdev_priv(dev);
+	nt->parms = *parms;
+	dev->rtnl_link_ops = &vti_link_ops;
+
+	vti_tunnel_bind_dev(dev);
+
+	if (register_netdevice(dev) < 0)
+		goto failed_free;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+	return nt;
+
+ failed_free:
+	free_netdev(dev);
+	return NULL;
+}
+
+static void vti_tunnel_uninit(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	if (dev == ipn->fb_tunnel_dev)
+		RCU_INIT_POINTER(ipn->tunnels_wc[0], NULL);
+	else
+		vti_tunnel_unlink(ipn, netdev_priv(dev));
+	dev_put(dev);
+}
+
+static int vti_err(struct sk_buff *skb, u32 info)
+{
+
+	/* All the routers (except for Linux) return only
+	 * 8 bytes of packet payload. It means, that precise relaying of
+	 * ICMP in the real Internet is absolutely infeasible.
+	 */
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	const int type = icmp_hdr(skb)->type;
+	const int code = icmp_hdr(skb)->code;
+	struct ip_tunnel *t;
+	int err;
+
+	switch (type) {
+	default:
+	case ICMP_PARAMETERPROB:
+		return 0;
+
+	case ICMP_DEST_UNREACH:
+		switch (code) {
+		case ICMP_SR_FAILED:
+		case ICMP_PORT_UNREACH:
+			/* Impossible event. */
+			return 0;
+		default:
+			/* All others are translated to HOST_UNREACH. */
+			break;
+		}
+		break;
+	case ICMP_TIME_EXCEEDED:
+		if (code != ICMP_EXC_TTL)
+			return 0;
+		break;
+	}
+
+	err = -ENOENT;
+
+	rcu_read_lock();
+	t = vti_tunnel_lookup(dev_net(skb->dev), iph->daddr, iph->saddr);
+	if (t == NULL)
+		goto out;
+
+	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
+		ipv4_update_pmtu(skb, dev_net(skb->dev), info,
+				 t->parms.link, 0, IPPROTO_IPIP, 0);
+		err = 0;
+		goto out;
+	}
+
+	err = 0;
+	if (t->parms.iph.ttl == 0 && type == ICMP_TIME_EXCEEDED)
+		goto out;
+
+	if (time_before(jiffies, t->err_time + IPTUNNEL_ERR_TIMEO))
+		t->err_count++;
+	else
+		t->err_count = 1;
+	t->err_time = jiffies;
+out:
+	rcu_read_unlock();
+	return err;
+}
+
+/*
+ * We dont digest the packet therefore let the packet pass.
+ */
+static int vti_rcv(struct sk_buff *skb)
+{
+	struct ip_tunnel *tunnel;
+	const struct iphdr *iph = ip_hdr(skb);
+
+	rcu_read_lock();
+	tunnel = vti_tunnel_lookup(dev_net(skb->dev), iph->saddr, iph->daddr);
+	if (tunnel != NULL) {
+		struct pcpu_tstats *tstats;
+
+		tstats = this_cpu_ptr(tunnel->dev->tstats);
+		tstats->rx_packets++;
+		tstats->rx_bytes += skb->len;
+
+		skb->dev = tunnel->dev;
+		rcu_read_unlock();
+		/* We do not eat the packet here therefore return 1 */
+		return 1;
+	}
+	rcu_read_unlock();
+
+	return -1;
+}
+
+/*
+ *	This function assumes it is being called from dev_queue_xmit()
+ *	and that skb is filled properly by that function.
+ */
+
+static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct pcpu_tstats *tstats;
+	struct net_device_stats *stats = &tunnel->dev->stats;
+	struct iphdr  *tiph = &tunnel->parms.iph;
+	u8     tos = tunnel->parms.iph.tos;
+	struct rtable *rt;		/* Route to the other host */
+	struct net_device *tdev;	/* Device to other host */
+	struct iphdr  *old_iph = ip_hdr(skb);
+	__be32 dst = tiph->daddr;
+	struct flowi4 fl4;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		goto tx_error;
+
+	if (tos&1)
+		tos = old_iph->tos;
+
+	if (!dst) {
+		/* NBMA tunnel */
+		rt = skb_rtable(skb);
+		if (rt == NULL) {
+			stats->tx_fifo_errors++;
+			goto tx_error;
+		}
+		dst = rt->rt_gateway;
+		if (dst == 0)
+			goto tx_error_icmp;
+	}
+
+	memset(&fl4, 0, sizeof(fl4));
+	flowi4_init_output(&fl4, tunnel->parms.link,
+			   htonl(tunnel->parms.i_key), RT_TOS(tos), RT_SCOPE_UNIVERSE,
+			   IPPROTO_IPIP, 0,
+			   dst, tiph->saddr, 0, 0);
+	rt = ip_route_output_key(dev_net(dev), &fl4);
+	if (IS_ERR(rt)) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+#ifdef CONFIG_XFRM
+		/* if there is no transform then this tunnel is not functional.
+		 * Or if the xfrm is not mode tunnel.
+		 */
+		if (!rt->dst.xfrm ||
+		    rt->dst.xfrm->props.mode != XFRM_MODE_TUNNEL) {
+			stats->tx_carrier_errors++;
+			goto tx_error_icmp;
+		}
+#endif
+	tdev = rt->dst.dev;
+
+	if (tdev == dev) {
+		ip_rt_put(rt);
+		stats->collisions++;
+		goto tx_error;
+
+	}
+
+
+	if (tunnel->err_count > 0) {
+		if (time_before(jiffies,
+				tunnel->err_time + IPTUNNEL_ERR_TIMEO)) {
+			tunnel->err_count--;
+			dst_link_failure(skb);
+		} else
+			tunnel->err_count = 0;
+	}
+
+
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+	nf_reset(skb);
+	skb->dev = skb_dst(skb)->dev;
+
+	tstats = this_cpu_ptr(dev->tstats);
+	VTI_XMIT(tstats, &dev->stats);
+	return NETDEV_TX_OK;
+
+tx_error_icmp:
+	dst_link_failure(skb);
+tx_error:
+	stats->tx_errors++;
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int vti_tunnel_bind_dev(struct net_device *dev)
+{
+	struct net_device *tdev = NULL;
+	struct ip_tunnel *tunnel;
+	struct iphdr *iph;
+
+	tunnel = netdev_priv(dev);
+	iph = &tunnel->parms.iph;
+
+	if (iph->daddr) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+		memset(&fl4, 0, sizeof(fl4));
+		flowi4_init_output(&fl4, tunnel->parms.link,
+				   htonl(tunnel->parms.i_key), RT_TOS(iph->tos), RT_SCOPE_UNIVERSE,
+				   IPPROTO_IPIP, 0,
+				   iph->daddr, iph->saddr, 0, 0);
+		rt = ip_route_output_key(dev_net(dev), &fl4);
+		if (!IS_ERR(rt)) {
+			tdev = rt->dst.dev;
+			ip_rt_put(rt);
+		}
+		dev->flags |= IFF_POINTOPOINT;
+	}
+
+	if (!tdev && tunnel->parms.link)
+		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+
+	if (tdev) {
+		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
+		dev->mtu = tdev->mtu;
+	}
+	dev->iflink = tunnel->parms.link;
+	return dev->mtu;
+}
+
+static int
+vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+	int err = 0;
+	struct ip_tunnel_parm p;
+	struct ip_tunnel *t;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	switch (cmd) {
+	case SIOCGETTUNNEL:
+		t = NULL;
+		if (dev == ipn->fb_tunnel_dev) {
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p))) {
+				err = -EFAULT;
+				break;
+			}
+			t = vti_tunnel_locate(net, &p, 0);
+		}
+		if (t == NULL)
+			t = netdev_priv(dev);
+		memcpy(&p, &t->parms, sizeof(p));
+		p.i_flags |= GRE_KEY;
+		p.o_flags |= GRE_KEY;
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
+			err = -EFAULT;
+		break;
+
+	case SIOCADDTUNNEL:
+	case SIOCCHGTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		err = -EFAULT;
+		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+			goto done;
+
+		err = -EINVAL;
+		if (p.iph.version != 4 || p.iph.protocol != IPPROTO_IPIP ||
+		    p.iph.ihl != 5 || (p.iph.frag_off&htons(~IP_DF)))
+			goto done;
+		if (p.iph.ttl)
+			p.iph.frag_off |= htons(IP_DF);
+
+		t = vti_tunnel_locate(net, &p, cmd == SIOCADDTUNNEL);
+
+		if (dev != ipn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+			if (t != NULL) {
+				if (t->dev != dev) {
+					err = -EEXIST;
+					break;
+				}
+			} else {
+				if (((dev->flags&IFF_POINTOPOINT) && !p.iph.daddr) ||
+				    (!(dev->flags&IFF_POINTOPOINT) && p.iph.daddr)) {
+					err = -EINVAL;
+					break;
+				}
+				t = netdev_priv(dev);
+				vti_tunnel_unlink(ipn, t);
+				synchronize_net();
+				t->parms.iph.saddr = p.iph.saddr;
+				t->parms.iph.daddr = p.iph.daddr;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				t->parms.iph.protocol = IPPROTO_IPIP;
+				memcpy(dev->dev_addr, &p.iph.saddr, 4);
+				memcpy(dev->broadcast, &p.iph.daddr, 4);
+				vti_tunnel_link(ipn, t);
+				netdev_state_change(dev);
+			}
+		}
+
+		if (t) {
+			err = 0;
+			if (cmd == SIOCCHGTUNNEL) {
+				t->parms.iph.ttl = p.iph.ttl;
+				t->parms.iph.tos = p.iph.tos;
+				t->parms.iph.frag_off = p.iph.frag_off;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				if (t->parms.link != p.link) {
+					t->parms.link = p.link;
+					vti_tunnel_bind_dev(dev);
+					netdev_state_change(dev);
+				}
+			}
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &t->parms, sizeof(p)))
+				err = -EFAULT;
+		} else
+			err = (cmd == SIOCADDTUNNEL ? -ENOBUFS : -ENOENT);
+		break;
+
+	case SIOCDELTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		if (dev == ipn->fb_tunnel_dev) {
+			err = -EFAULT;
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+				goto done;
+			err = -ENOENT;
+
+			t = vti_tunnel_locate(net, &p, 0);
+			if (t == NULL)
+				goto done;
+			err = -EPERM;
+			if (t->dev == ipn->fb_tunnel_dev)
+				goto done;
+			dev = t->dev;
+		}
+		unregister_netdevice(dev);
+		err = 0;
+		break;
+
+	default:
+		err = -EINVAL;
+	}
+
+done:
+	return err;
+}
+
+static int vti_tunnel_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < 68 || new_mtu > 0xFFF8)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops vti_netdev_ops = {
+	.ndo_init	= vti_tunnel_init,
+	.ndo_uninit	= vti_tunnel_uninit,
+	.ndo_start_xmit	= vti_tunnel_xmit,
+	.ndo_do_ioctl	= vti_tunnel_ioctl,
+	.ndo_change_mtu	= vti_tunnel_change_mtu,
+	.ndo_get_stats64  = vti_get_stats64,
+};
+
+static void vti_dev_free(struct net_device *dev)
+{
+	free_percpu(dev->tstats);
+	free_netdev(dev);
+}
+
+static void vti_tunnel_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &vti_netdev_ops;
+	dev->destructor		= vti_dev_free;
+
+	dev->type		= ARPHRD_TUNNEL;
+	dev->hard_header_len	= LL_MAX_HEADER + sizeof(struct iphdr);
+	dev->mtu		= ETH_DATA_LEN;
+	dev->flags		= IFF_NOARP;
+	dev->iflink		= 0;
+	dev->addr_len		= 4;
+	dev->features		|= NETIF_F_NETNS_LOCAL;
+	dev->features		|= NETIF_F_LLTX;
+	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
+}
+
+static int vti_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
+	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __net_init vti_fb_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct iphdr *iph = &tunnel->parms.iph;
+	struct vti_net *ipn = net_generic(dev_net(dev), vti_net_id);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	iph->version		= 4;
+	iph->protocol		= IPPROTO_IPIP;
+	iph->ihl		= 5;
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	dev_hold(dev);
+	rcu_assign_pointer(ipn->tunnels_wc[0], tunnel);
+	return 0;
+}
+
+static struct xfrm_tunnel vti_handler __read_mostly = {
+	.handler	=	vti_rcv,
+	.err_handler	=	vti_err,
+	.priority	=	1,
+};
+
+static void vti_destroy_tunnels(struct vti_net *ipn, struct list_head *head)
+{
+	int prio;
+
+	for (prio = 1; prio < 4; prio++) {
+		int h;
+		for (h = 0; h < HASH_SIZE; h++) {
+			struct ip_tunnel *t;
+
+			t = rtnl_dereference(ipn->tunnels[prio][h]);
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = rtnl_dereference(t->next);
+			}
+		}
+	}
+}
+
+static int __net_init vti_init_net(struct net *net)
+{
+	int err;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	ipn->tunnels[0] = ipn->tunnels_wc;
+	ipn->tunnels[1] = ipn->tunnels_l;
+	ipn->tunnels[2] = ipn->tunnels_r;
+	ipn->tunnels[3] = ipn->tunnels_r_l;
+
+	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+					   "ip_vti0",
+					   vti_tunnel_setup);
+	if (!ipn->fb_tunnel_dev) {
+		err = -ENOMEM;
+		goto err_alloc_dev;
+	}
+	dev_net_set(ipn->fb_tunnel_dev, net);
+
+	err = vti_fb_tunnel_init(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	ipn->fb_tunnel_dev->rtnl_link_ops = &vti_link_ops;
+
+	err = register_netdev(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	return 0;
+
+err_reg_dev:
+	vti_dev_free(ipn->fb_tunnel_dev);
+err_alloc_dev:
+	/* nothing */
+	return err;
+}
+
+static void __net_exit vti_exit_net(struct net *net)
+{
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	LIST_HEAD(list);
+
+	rtnl_lock();
+	vti_destroy_tunnels(ipn, &list);
+	unregister_netdevice_many(&list);
+	rtnl_unlock();
+}
+
+static struct pernet_operations vti_net_ops = {
+	.init = vti_init_net,
+	.exit = vti_exit_net,
+	.id   = &vti_net_id,
+	.size = sizeof(struct vti_net),
+};
+
+static int vti_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	return 0;
+}
+
+static void vti_netlink_parms(struct nlattr *data[],
+			      struct ip_tunnel_parm *parms)
+{
+	memset(parms, 0, sizeof(*parms));
+
+	parms->iph.protocol = IPPROTO_IPIP;
+
+	if (!data)
+		return;
+
+	if (data[IFLA_VTI_LINK])
+		parms->link = nla_get_u32(data[IFLA_VTI_LINK]);
+
+	if (data[IFLA_VTI_IKEY])
+		parms->i_key = nla_get_be32(data[IFLA_VTI_IKEY]);
+
+	if (data[IFLA_VTI_OKEY])
+		parms->o_key = nla_get_be32(data[IFLA_VTI_OKEY]);
+
+	if (data[IFLA_VTI_LOCAL])
+		parms->iph.saddr = nla_get_be32(data[IFLA_VTI_LOCAL]);
+
+	if (data[IFLA_VTI_REMOTE])
+		parms->iph.daddr = nla_get_be32(data[IFLA_VTI_REMOTE]);
+
+}
+
+static int vti_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[],
+		       struct nlattr *data[])
+{
+	struct ip_tunnel *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	int mtu;
+	int err;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &nt->parms);
+
+	if (vti_tunnel_locate(net, &nt->parms, 0))
+		return -EEXIST;
+
+	mtu = vti_tunnel_bind_dev(dev);
+	if (!tb[IFLA_MTU])
+		dev->mtu = mtu;
+
+	err = register_netdevice(dev);
+	if (err)
+		goto out;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+
+out:
+	return err;
+	return 0;
+}
+
+static int vti_changelink(struct net_device *dev, struct nlattr *tb[],
+			  struct nlattr *data[])
+{
+	struct ip_tunnel *t, *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	struct ip_tunnel_parm p;
+	int mtu;
+
+	if (dev == ipn->fb_tunnel_dev)
+		return -EINVAL;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &p);
+
+	t = vti_tunnel_locate(net, &p, 0);
+
+	if (t) {
+		if (t->dev != dev)
+			return -EEXIST;
+	} else {
+		t = nt;
+
+		vti_tunnel_unlink(ipn, t);
+		t->parms.iph.saddr = p.iph.saddr;
+		t->parms.iph.daddr = p.iph.daddr;
+		t->parms.i_key = p.i_key;
+		t->parms.o_key = p.o_key;
+		if (dev->type != ARPHRD_ETHER) {
+			memcpy(dev->dev_addr, &p.iph.saddr, 4);
+			memcpy(dev->broadcast, &p.iph.daddr, 4);
+		}
+		vti_tunnel_link(ipn, t);
+		netdev_state_change(dev);
+	}
+
+	if (t->parms.link != p.link) {
+		t->parms.link = p.link;
+		mtu = vti_tunnel_bind_dev(dev);
+		if (!tb[IFLA_MTU])
+			dev->mtu = mtu;
+		netdev_state_change(dev);
+	}
+
+	return 0;
+}
+
+static size_t vti_get_size(const struct net_device *dev)
+{
+	return
+		/* IFLA_VTI_LINK */
+		nla_total_size(4) +
+		/* IFLA_VTI_IKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_OKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_LOCAL */
+		nla_total_size(4) +
+		/* IFLA_VTI_REMOTE */
+		nla_total_size(4) +
+		0;
+}
+
+static int vti_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+	struct ip_tunnel_parm *p = &t->parms;
+
+	nla_put_u32(skb, IFLA_VTI_LINK, p->link);
+	nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key);
+	nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key);
+	nla_put_be32(skb, IFLA_VTI_LOCAL, p->iph.saddr);
+	nla_put_be32(skb, IFLA_VTI_REMOTE, p->iph.daddr);
+
+	return 0;
+}
+
+static const struct nla_policy vti_policy[IFLA_VTI_MAX + 1] = {
+	[IFLA_VTI_LINK]		= { .type = NLA_U32 },
+	[IFLA_VTI_IKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_OKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VTI_REMOTE]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+};
+
+static struct rtnl_link_ops vti_link_ops __read_mostly = {
+	.kind		= "vti",
+	.maxtype	= IFLA_VTI_MAX,
+	.policy		= vti_policy,
+	.priv_size	= sizeof(struct ip_tunnel),
+	.setup		= vti_tunnel_setup,
+	.validate	= vti_tunnel_validate,
+	.newlink	= vti_newlink,
+	.changelink	= vti_changelink,
+	.get_size	= vti_get_size,
+	.fill_info	= vti_fill_info,
+};
+
+static int __init vti_init(void)
+{
+	int err;
+
+	pr_info("IPv4 over IPSec tunneling driver\n");
+
+	err = register_pernet_device(&vti_net_ops);
+	if (err < 0)
+		return err;
+	err = xfrm4_mode_tunnel_input_register(&vti_handler);
+	if (err < 0) {
+		unregister_pernet_device(&vti_net_ops);
+		pr_info(KERN_INFO "vti init: can't register tunnel\n");
+	}
+
+	err = rtnl_link_register(&vti_link_ops);
+	if (err < 0)
+		goto rtnl_link_failed;
+
+	return err;
+
+rtnl_link_failed:
+	xfrm4_mode_tunnel_input_deregister(&vti_handler);
+	unregister_pernet_device(&vti_net_ops);
+	return err;
+}
+
+static void __exit vti_fini(void)
+{
+	rtnl_link_unregister(&vti_link_ops);
+	if (xfrm4_mode_tunnel_input_deregister(&vti_handler))
+		pr_info("vti close: can't deregister tunnel\n");
+
+	unregister_pernet_device(&vti_net_ops);
+}
+
+module_init(vti_init);
+module_exit(vti_fini);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK("vti");
+MODULE_ALIAS_NETDEV("ip_vti0");

^ permalink raw reply related

* Re: [PATCH] ipv4: Create and use fib_compute_spec_dst() helper.
From: David Miller @ 2012-06-29  1:45 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1206282309340.1722@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Fri, 29 Jun 2012 02:16:25 +0300 (EEST)

> 	This is bad for looped m/b-cast: ip_mc_output calls
> ip_dev_loopback_xmit where pkt_type is set to PACKET_LOOPBACK.
> May be we have to check skb_dst as below.

Agreed, for many reasons the pkt_type check is bogus.

> 	It is not clear to me why ip_route_input_mc and
> ip_route_input_slow (brd_input) call fib_validate_source()
> with arg oif=0. How would fib_rule_match match flowi_iif
> for such traffic then?
> 
> 	May be iif should be always set just like in
> ip_route_output_slow because now we do output lookup to
> unicast target?:
> 
> 	fl4.flowi4_iif = net->loopback_dev->ifindex;

Ok.  I am not brave enough to adjust what broadcast and
multicast code do in route.c :-)

> 	Here only 0 is allowed for m/b-cast daddr, we are not
> going to use ip_hdr(skb)->daddr, so no need to provide it:
> 
> 	fl4.saddr = 0;

Agreed.

> 	What about providing ip_hdr(skb)->saddr as
> 2nd argument here (it is validated by input routing
> to be unicast or 0.0.0.0):
 ...
> 	By this way we will prefer local address from the
> same subnet as iph->saddr. Also, we should not call
> fib_lookup if ipv4_is_zeronet(ip_hdr(skb)->saddr), we
> should use RT_SCOPE_LINK for inet_select_addr in this case.
> By this way we will prefer addresses with scope link for
> target 0.0.0.0. There is such logic that uses RT_SCOPE_LINK in
> ip_route_input_mc() and ip_route_input_slow().

Also agreed.

> 	If we use rt_flags above I'm not sure whether
> ip_options_compile is called with valid rt_flags from the
> bridging code.

It will not be the first time we need to fix bridging from
sending up garbage.

Here's what I committed based upon your feedback, thanks!

====================
ipv4: Fix bugs in fib_compute_spec_dst().

Based upon feedback from Julian Anastasov.

1) Use route flags to determine multicast/broadcast, not the
   packet flags.

2) Leave saddr unspecified in flow key.

3) Adjust how we invoke inet_select_addr().  Pass ip_hdr(skb)->saddr as
   second arg, and if it was zeronet use link scope.

4) Use loopback as input interface in flow key.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/fib_frontend.c |   34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 63b11ca..1d13217 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -185,28 +185,36 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	struct in_device *in_dev;
 	struct fib_result res;
+	struct rtable *rt;
 	struct flowi4 fl4;
 	struct net *net;
+	int scope;
 
-	if (skb->pkt_type != PACKET_BROADCAST &&
-	    skb->pkt_type != PACKET_MULTICAST)
+	rt = skb_rtable(skb);
+	if (!(rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)))
 		return ip_hdr(skb)->daddr;
 
 	in_dev = __in_dev_get_rcu(dev);
 	BUG_ON(!in_dev);
-	fl4.flowi4_oif = 0;
-	fl4.flowi4_iif = 0;
-	fl4.daddr = ip_hdr(skb)->saddr;
-	fl4.saddr = ip_hdr(skb)->daddr;
-	fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
-	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
-	fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
 
 	net = dev_net(dev);
-	if (!fib_lookup(net, &fl4, &res))
-		return FIB_RES_PREFSRC(net, res);
-	else
-		return inet_select_addr(dev, 0, RT_SCOPE_UNIVERSE);
+
+	scope = RT_SCOPE_UNIVERSE;
+	if (!ipv4_is_zeronet(ip_hdr(skb)->saddr)) {
+		fl4.flowi4_oif = 0;
+		fl4.flowi4_iif = net->loopback_dev->ifindex;
+		fl4.daddr = ip_hdr(skb)->saddr;
+		fl4.saddr = 0;
+		fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
+		fl4.flowi4_scope = scope;
+		fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
+		if (!fib_lookup(net, &fl4, &res))
+			return FIB_RES_PREFSRC(net, res);
+	} else {
+		scope = RT_SCOPE_LINK;
+	}
+
+	return inet_select_addr(dev, ip_hdr(skb)->saddr, scope);
 }
 
 /* Given (packet source, input interface) and optional (dst, oif, tos):
-- 
1.7.10

^ permalink raw reply related

* Re: [RFC PATCH net-next] ipvs: add missing lock in ip_vs_ftp_init_conn()
From: Xiaotian Feng @ 2012-06-29  1:50 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, lvs-devel, netfilter-devel, netfilter, coreteam,
	linux-kernel, Xiaotian Feng, Wensong Zhang, Simon Horman,
	Pablo Neira Ayuso, Patrick McHardy, David S. Miller
In-Reply-To: <alpine.LFD.2.00.1206290225490.1722@ja.ssi.bg>

On Fri, Jun 29, 2012 at 8:17 AM, Julian Anastasov <ja@ssi.bg> wrote:
>
>        Hello,
>
> On Thu, 28 Jun 2012, Xiaotian Feng wrote:
>
>> We met a kernel panic in 2.6.32.43 kernel:
>>
>> [2680191.848044] IPVS: ip_vs_conn_hash(): request for already hashed, called from run_timer_softirq+0x175/0x1d0
>> <snip>
>> [2680311.849009] general protection fault: 0000 [#1] SMP
>> [2680311.853001] RIP: 0010:[<ffffffff815f155c>]  [<ffffffff815f155c>] ip_vs_conn_expire+0xdc/0x2f0
>> [2680311.853001] RSP: 0018:ffff880028303e70  EFLAGS: 00010202
>> [2680311.853001] RAX: dead000000200200 RBX: ffff8801aad00b80 RCX: 0000000000001d90
>> [2680311.853001] RDX: dead000000100100 RSI: 000000004fd59800 RDI: ffff8801aad00c08
>> <snip>
>> [2680311.853001] Call Trace:
>> [2680311.853001]  <IRQ>
>> [2680311.853001]  [<ffffffff815f1480>] ? ip_vs_conn_expire+0x0/0x2f0
>> [2680311.853001]  [<ffffffff8104e2a5>] run_timer_softirq+0x175/0x1d0
>> [2680311.853001]  [<ffffffff81021a48>] ? lapic_next_event+0x18/0x20
>> [2680311.853001]  [<ffffffff81049a13>] __do_softirq+0xb3/0x150
>> [2680311.853001]  [<ffffffff8100cc5c>] call_softirq+0x1c/0x30
>> [2680311.853001]  [<ffffffff8100ea9a>] do_softirq+0x4a/0x80
>> [2680311.853001]  [<ffffffff81049957>] irq_exit+0x77/0x80
>> [2680311.853001]  [<ffffffff81021f2c>] smp_apic_timer_interrupt+0x6c/0xa0
>> [2680311.853001]  [<ffffffff8100c633>] apic_timer_interrupt+0x13/0x20
>> [2680311.853001]  <EOI>
>> [2680311.853001]  [<ffffffff81013b52>] ? mwait_idle+0x52/0x70
>> [2680311.853001]  [<ffffffff8100a7b0>] ? enter_idle+0x20/0x30
>> [2680311.853001]  [<ffffffff8100ac62>] ? cpu_idle+0x52/0x80
>> [2680311.853001]  [<ffffffff816d504d>] ? start_secondary+0x19d/0x280
>>
>> rax and rdx is LIST_POISON1 and LIST_POISON2, so kernel is list_del() on an already deleted
>> connection and result the general protect fault.
>>
>> The "request for already hashed" warning, told us someone might change the connection flags
>> incorrectly, like described in commit aea9d711, it changes the connection flags, but doesn't
>> put the connection back to the list. So ip_vs_conn_hash() throw a warning and return.
>> Later, when ip_vs_conn_expire fire again, ip_vs_conn_unhash() will find the HASHED connection
>> and list_del() it, then kernel panic happened.
>>
>> After code review, the only chance that kernel change connection flag without protection is
>> in ip_vs_ftp_init_conn().
>
>        Hm, ip_vs_ftp_init_conn is called before 1st hashing,
> from ip_vs_bind_app() in ip_vs_conn_new() before
> ip_vs_conn_hash(). It should be another problem with
> the flags. How different is IPVS in 2.6.32.43 compared to
> recent kernels? If commit aea9d711 is present, I'm not
> aware of other similar problems.

ip_vs_bind_app() is also called by ip_vs_try_bind_dest(), which can be
traced to ip_vs_proc_conn().
I've checked the changes in upstream, but nothing helps since aea9d711
has been taken into 2.6.32.28 kernel.

>
>        May be you can post some details for your setup on
> lvs-devel@vger.kernel.org. I assume FTP is used,
> what about master-backup sync? Can you confirm that
> this fix solves the problem or it happens in rare cases?

Yes, ftp and WRR scheduler is used. Unfortunately, the oops
is unreproducible :(

>
>> Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
>> Cc: Wensong Zhang <wensong@linux-vs.org>
>> Cc: Simon Horman <horms@verge.net.au>
>> Cc: Julian Anastasov <ja@ssi.bg>
>> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
>> Cc: Patrick McHardy <kaber@trash.net>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> ---
>>  net/netfilter/ipvs/ip_vs_ftp.c |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
>> index b20b29c..c2bc264 100644
>> --- a/net/netfilter/ipvs/ip_vs_ftp.c
>> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
>> @@ -65,8 +65,10 @@ static int ip_vs_ftp_pasv;
>>  static int
>>  ip_vs_ftp_init_conn(struct ip_vs_app *app, struct ip_vs_conn *cp)
>>  {
>> +     spin_lock(&cp->lock);
>>       /* We use connection tracking for the command connection */
>>       cp->flags |= IP_VS_CONN_F_NFCT;
>> +     spin_unlock(&cp->lock);
>>       return 0;
>>  }
>>
>> --
>> 1.7.1
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* [PATCH] ipv4: Adjust in_dev handling in fib_validate_source()
From: David Miller @ 2012-06-29  2:05 UTC (permalink / raw)
  To: netdev


Checking for in_dev being NULL is pointless.

In fact, all of our callers have in_dev precomputed already,
so just pass it in and remove the NULL checking.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip_fib.h    |    2 +-
 net/ipv4/fib_frontend.c |   27 ++++++++++-----------------
 net/ipv4/route.c        |   10 ++++++----
 3 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 9e6c26d..619f68a 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -233,7 +233,7 @@ extern void		ip_fib_init(void);
 extern __be32 fib_compute_spec_dst(struct sk_buff *skb);
 extern int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 			       u8 tos, int oif, struct net_device *dev,
-			       u32 *itag);
+			       struct in_device *idev, u32 *itag);
 extern void fib_select_default(struct fib_result *res);
 
 /* Exported by fib_semantics.c */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 1d13217..c84cff5 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -226,15 +226,14 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
  * called with rcu_read_lock()
  */
 int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
-			int oif, struct net_device *dev, u32 *itag)
+			int oif, struct net_device *dev, struct in_device *idev,
+			u32 *itag)
 {
-	struct in_device *in_dev;
-	struct flowi4 fl4;
+	int ret, no_addr, rpf, accept_local;
 	struct fib_result res;
-	int no_addr, rpf, accept_local;
-	bool dev_match;
-	int ret;
+	struct flowi4 fl4;
 	struct net *net;
+	bool dev_match;
 
 	fl4.flowi4_oif = 0;
 	fl4.flowi4_iif = oif;
@@ -244,19 +243,13 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
 	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
 
 	no_addr = rpf = accept_local = 0;
-	in_dev = __in_dev_get_rcu(dev);
-	if (in_dev) {
-		no_addr = in_dev->ifa_list == NULL;
-
-		/* Ignore rp_filter for packets protected by IPsec. */
-		rpf = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(in_dev);
+	no_addr = idev->ifa_list == NULL;
 
-		accept_local = IN_DEV_ACCEPT_LOCAL(in_dev);
-		fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
-	}
+	/* Ignore rp_filter for packets protected by IPsec. */
+	rpf = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);
 
-	if (in_dev == NULL)
-		goto e_inval;
+	accept_local = IN_DEV_ACCEPT_LOCAL(idev);
+	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
 
 	net = dev_net(dev);
 	if (fib_lookup(net, &fl4, &res))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 83d56a0..919d69e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1999,7 +1999,8 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		if (!ipv4_is_local_multicast(daddr))
 			goto e_inval;
 	} else {
-		err = fib_validate_source(skb, saddr, 0, tos, 0, dev, &itag);
+		err = fib_validate_source(skb, saddr, 0, tos, 0, dev,
+					  in_dev, &itag);
 		if (err < 0)
 			goto e_err;
 	}
@@ -2100,7 +2101,7 @@ static int __mkroute_input(struct sk_buff *skb,
 
 
 	err = fib_validate_source(skb, saddr, daddr, tos, FIB_RES_OIF(*res),
-				  in_dev->dev, &itag);
+				  in_dev->dev, in_dev, &itag);
 	if (err < 0) {
 		ip_handle_martian_source(in_dev->dev, in_dev, skb, daddr,
 					 saddr);
@@ -2274,7 +2275,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	if (res.type == RTN_LOCAL) {
 		err = fib_validate_source(skb, saddr, daddr, tos,
 					  net->loopback_dev->ifindex,
-					  dev, &itag);
+					  dev, in_dev, &itag);
 		if (err < 0)
 			goto martian_source_keep_err;
 		if (err)
@@ -2295,7 +2296,8 @@ brd_input:
 		goto e_inval;
 
 	if (!ipv4_is_zeronet(saddr)) {
-		err = fib_validate_source(skb, saddr, 0, tos, 0, dev, &itag);
+		err = fib_validate_source(skb, saddr, 0, tos, 0, dev,
+					  in_dev, &itag);
 		if (err < 0)
 			goto martian_source_keep_err;
 		if (err)
-- 
1.7.10

^ permalink raw reply related

* "ADDRCONF(NETDEV_UP): eth0: link is not ready" with IPv6
From: Arvid Brodin @ 2012-06-29  2:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: Alexey Kuznetsov, Stephen Hemminger

Hi,

After 'ip link set eth0 up' on an avr32 board (network driver macb), the device ends up in
operational mode "UNKNOWN":

# ip link
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:24:74:00:17:9d brd ff:ff:ff:ff:ff:ff

Unplugging and plugging in the network cable gets the device to mode "UP".

This is a problem for me because I'm trying to use this device as a "slave" device (for a
virtual HSR device*) and I need to be able to decide if the slave device is operational or
not.

Following Stephen's advice here:
http://kerneltrap.org/mailarchive/linux-netdev/2008/9/24/3398834 I checked the macb.c code
and noticed they do not call netif_carrier_off() neither before register_netdev() nor in
dev_open().

I added the call before register_netdev(), which fixed the problem. However, if I then
enable IPv6:

# ip link set eth0 up
ADDRCONF(NETDEV_UP): eth0: link is not ready
eth0: link up (100/Full)
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Any idea what is happening / what I'm doing wrong? (This is not just cosmetic; is some
situations this seems to kill the interface - e.g. ping does not work, down/up does not
help...) Things work fine without IPv6 configured.

*N.B. I'm writing a driver for a network protocol called "High-availability Seamless
Redundancy".

-- 
Arvid Brodin | Consultant (Linux)
XDIN AB | Jan Stenbecks Torg 17 | SE-164 40 Kista | Sweden | xdin.com

^ permalink raw reply

* [PATCH] ipv6_tunnel: Allow receiving packets on the fallback tunnel if they pass sanity checks
From: Phil Dibowitz @ 2012-06-29  4:15 UTC (permalink / raw)
  To: davem; +Cc: netdev, phild, ville.nuorvala, phil

From: Ville Nuorvala <ville.nuorvala@gmail.com>

At Facebook, we do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap
an extra IP header on incoming packets so they can be routed to the backend.
In the v4 tunnel driver, when these packets fall on the default tunl0 device,
the behavior is to decapsulate them and drop them back on the stack. So our
setup is that tunl0 has the VIP and eth0 has (obviously) the backend's real
address.

In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same
behavior - if you didn't have an explicit tunnel setup, it would drop the
packet.

This patch brings that v4 feature to the v6 driver.

The same IPv6 address checks are performed as with any normal tunnel,
but as the fallback tunnel endpoint addresses are unspecified, the checks
must be performed on a per-packet basis, rather than at tunnel
configuration time.

[Patch description modified by phil@ipom.com]

Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com>
Tested-by: Phil Dibowitz <phil@ipom.com>

---
 include/net/ip6_tunnel.h |    2 +
 net/ipv6/ip6_tunnel.c    |   65 +++++++++++++++++++++++++++-------------------
 2 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index fc73e66..358fb86 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -9,6 +9,8 @@
 #define IP6_TNL_F_CAP_XMIT 0x10000
 /* capable of receiving packets */
 #define IP6_TNL_F_CAP_RCV 0x20000
+/* determine capability on a per-packet basis */
+#define IP6_TNL_F_CAP_PER_PACKET 0x40000
 
 /* IPv6 tunnel */
 
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index c9015fa..04a3cba 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -684,24 +684,50 @@ static void ip6ip6_dscp_ecn_decapsulate(const struct ip6_tnl *t,
 		IP6_ECN_set_ce(ipv6_hdr(skb));
 }
 
+static __u32 ip6_tnl_get_cap(struct ip6_tnl *t,
+			     const struct in6_addr *laddr,
+			     const struct in6_addr *raddr)
+{
+	struct ip6_tnl_parm *p = &t->parms;
+	int ltype = ipv6_addr_type(laddr);
+	int rtype = ipv6_addr_type(raddr);
+	__u32 flags = 0;
+
+	if (ltype == IPV6_ADDR_ANY || rtype == IPV6_ADDR_ANY) {
+		flags = IP6_TNL_F_CAP_PER_PACKET;
+	} else if (ltype & (IPV6_ADDR_UNICAST|IPV6_ADDR_MULTICAST) &&
+		   rtype & (IPV6_ADDR_UNICAST|IPV6_ADDR_MULTICAST) &&
+		   !((ltype|rtype) & IPV6_ADDR_LOOPBACK) &&
+		   (!((ltype|rtype) & IPV6_ADDR_LINKLOCAL) || p->link)) {
+		if (ltype&IPV6_ADDR_UNICAST)
+			flags |= IP6_TNL_F_CAP_XMIT;
+		if (rtype&IPV6_ADDR_UNICAST)
+			flags |= IP6_TNL_F_CAP_RCV;
+	}
+	return flags;
+}
+
 /* called with rcu_read_lock() */
-static inline int ip6_tnl_rcv_ctl(struct ip6_tnl *t)
+static inline int ip6_tnl_rcv_ctl(struct ip6_tnl *t,
+				  const struct in6_addr *laddr,
+				  const struct in6_addr *raddr)
 {
 	struct ip6_tnl_parm *p = &t->parms;
 	int ret = 0;
 	struct net *net = dev_net(t->dev);
 
-	if (p->flags & IP6_TNL_F_CAP_RCV) {
+	if ((p->flags & IP6_TNL_F_CAP_RCV) ||
+	    ((p->flags & IP6_TNL_F_CAP_PER_PACKET) &&
+	     (ip6_tnl_get_cap(t, laddr, raddr) & IP6_TNL_F_CAP_RCV))) {
 		struct net_device *ldev = NULL;
 
 		if (p->link)
 			ldev = dev_get_by_index_rcu(net, p->link);
 
-		if ((ipv6_addr_is_multicast(&p->laddr) ||
-		     likely(ipv6_chk_addr(net, &p->laddr, ldev, 0))) &&
-		    likely(!ipv6_chk_addr(net, &p->raddr, NULL, 0)))
+		if ((ipv6_addr_is_multicast(laddr) ||
+		     likely(ipv6_chk_addr(net, laddr, ldev, 0))) &&
+		    likely(!ipv6_chk_addr(net, raddr, NULL, 0)))
 			ret = 1;
-
 	}
 	return ret;
 }
@@ -740,7 +766,7 @@ static int ip6_tnl_rcv(struct sk_buff *skb, __u16 protocol,
 			goto discard;
 		}
 
-		if (!ip6_tnl_rcv_ctl(t)) {
+		if (!ip6_tnl_rcv_ctl(t, &ipv6h->daddr, &ipv6h->saddr)) {
 			t->dev->stats.rx_dropped++;
 			rcu_read_unlock();
 			goto discard;
@@ -1114,25 +1140,6 @@ tx_err:
 	return NETDEV_TX_OK;
 }
 
-static void ip6_tnl_set_cap(struct ip6_tnl *t)
-{
-	struct ip6_tnl_parm *p = &t->parms;
-	int ltype = ipv6_addr_type(&p->laddr);
-	int rtype = ipv6_addr_type(&p->raddr);
-
-	p->flags &= ~(IP6_TNL_F_CAP_XMIT|IP6_TNL_F_CAP_RCV);
-
-	if (ltype & (IPV6_ADDR_UNICAST|IPV6_ADDR_MULTICAST) &&
-	    rtype & (IPV6_ADDR_UNICAST|IPV6_ADDR_MULTICAST) &&
-	    !((ltype|rtype) & IPV6_ADDR_LOOPBACK) &&
-	    (!((ltype|rtype) & IPV6_ADDR_LINKLOCAL) || p->link)) {
-		if (ltype&IPV6_ADDR_UNICAST)
-			p->flags |= IP6_TNL_F_CAP_XMIT;
-		if (rtype&IPV6_ADDR_UNICAST)
-			p->flags |= IP6_TNL_F_CAP_RCV;
-	}
-}
-
 static void ip6_tnl_link_config(struct ip6_tnl *t)
 {
 	struct net_device *dev = t->dev;
@@ -1153,7 +1160,8 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
 	if (!(p->flags&IP6_TNL_F_USE_ORIG_FLOWLABEL))
 		fl6->flowlabel |= IPV6_FLOWLABEL_MASK & p->flowinfo;
 
-	ip6_tnl_set_cap(t);
+	p->flags &= ~(IP6_TNL_F_CAP_XMIT|IP6_TNL_F_CAP_RCV|IP6_TNL_F_CAP_PER_PACKET);
+	p->flags |= ip6_tnl_get_cap(t, &p->laddr, &p->raddr);
 
 	if (p->flags&IP6_TNL_F_CAP_XMIT && p->flags&IP6_TNL_F_CAP_RCV)
 		dev->flags |= IFF_POINTOPOINT;
@@ -1438,6 +1446,9 @@ static int __net_init ip6_fb_tnl_dev_init(struct net_device *dev)
 
 	t->parms.proto = IPPROTO_IPV6;
 	dev_hold(dev);
+
+	ip6_tnl_link_config(t);
+
 	rcu_assign_pointer(ip6n->tnls_wc[0], t);
 	return 0;
 }

^ permalink raw reply related

* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: Eric Dumazet @ 2012-06-29  4:18 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: netdev, Yuchung Cheng, codel, Matt Mathis, Neal Cardwell,
	David Miller
In-Reply-To: <CAB_+Fg7x2pV72Wa45QLuokHTVPuzdmrMfwiuLjS_vi02vtasAA@mail.gmail.com>

On Thu, 2012-06-28 at 16:52 -0700, Nandita Dukkipati wrote:
> As you know I really like this idea. My main concern is that the same
> packet could cause TCP to reduce cwnd twice within an RTT - first on
> enqueue and then if this packet is ECN marked on dequeue. I don't
> think this is the desired behavior. Can we avoid it?

I'll work on this.

In my experiences, I found that no drops (or ECN marks) were done at
dequeue time once one NET_XMIT_CN was returned, but its certainly
possible if other flows compete with this one.

Strangely, SFQ has the same behavior and nobody complained yet ;)

Thanks

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: Eric Dumazet @ 2012-06-29  4:50 UTC (permalink / raw)
  To: Dave Taht
  Cc: Nandita Dukkipati, netdev, Yuchung Cheng, codel, Matt Mathis,
	Neal Cardwell, David Miller
In-Reply-To: <CAA93jw7ywd0auqN=k2J4fLOGYMLRH+agWr4P+_CisgvRtvgJUw@mail.gmail.com>

On Thu, 2012-06-28 at 19:47 -0400, Dave Taht wrote: 
> On Thu, Jun 28, 2012 at 6:56 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > On Thu, Jun 28, 2012 at 11:12 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> On Thu, 2012-06-28 at 10:51 -0700, Dave Taht wrote:
> >>
> >>> clever idea. A problem is there are other forms of network traffic on
> >>> a link, and this is punishing a single tcp
> > Dave: it won't just punish a single TCP, all protocols that react to
> > XMIT_CN will share similar fate.
> 
> What protocols in the kernel do and don't? was the crux of this question.
> 

AFAIK that only tcp cares a bit, or seems to.

But not that much, since it continues to send packets.

Thats because tcp_transmit_skb() changes the NET_XMIT_CN status to plain
NET_XMIT_SUCCESS.

My long term plan is to reduce number of skbs queued in Qdisc for TCP
stack, to reduce RTT (removing the artificial RTT bias because of local
queues)

> I'm not objecting to the idea, it's clever, as I said. I'm thinking I'll
> apply it to cerowrt's next build and see what happens, if this
> will apply against 3.3. Or maybe the ns3 model. Or both.

A router will have no use of this feature, not sure you need to spend
time trying this ;)

^ permalink raw reply

* Re: [PATCH net-next] fq_codel: report congestion notification at enqueue time
From: Eric Dumazet @ 2012-06-29  4:53 UTC (permalink / raw)
  To: David Miller
  Cc: Nandita Dukkipati, netdev, codel, Yuchung Cheng, Neal Cardwell,
	Matt Mathis
In-Reply-To: <1340903237.13187.151.camel@edumazet-glaptop>

On Thu, 2012-06-28 at 19:07 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> At enqueue time, check sojourn time of packet at head of the queue,
> and return NET_XMIT_CN instead of NET_XMIT_SUCCESS if this sejourn
> time is above codel @target.
> 
> This permits local TCP stack to call tcp_enter_cwr() and reduce its cwnd
> without drops (for example if ECN is not enabled for the flow)
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Dave Taht <dave.taht@bufferbloat.net>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Matt Mathis <mattmathis@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---

Please dont apply this patch, I'll submit an updated version later.

Thanks

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox