Netdev List


* [PATCH] ipv4: Fix nexthop exception hash computation.
From: David Miller @ 2012-07-17 20:23 UTC
  To: netdev


Need to mask it with (FNHE_HASH_SIZE - 1).

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |   16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a5bd0b4..812e444 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1347,6 +1347,16 @@ static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be3
 	return oldest;
 }
 
+static inline u32 fnhe_hashfun(__be32 daddr)
+{
+	u32 hval;
+
+	hval = (__force u32) daddr;
+	hval ^= (hval >> 11) ^ (hval >> 22);
+
+	return hval & (FNHE_HASH_SIZE - 1);
+}
+
 static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 daddr)
 {
 	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
@@ -1361,8 +1371,7 @@ static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 da
 			return NULL;
 	}
 
-	hval = (__force u32) daddr;
-	hval ^= (hval >> 11) ^ (hval >> 22);
+	hval = fnhe_hashfun(daddr);
 	hash += hval;
 
 	depth = 0;
@@ -1890,8 +1899,7 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 	struct fib_nh_exception *fnhe;
 	u32 hval;
 
-	hval = (__force u32) daddr;
-	hval ^= (hval >> 11) ^ (hval >> 22);
+	hval = fnhe_hashfun(daddr);
 
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
-- 
1.7.10.4
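
For readers following the fix: a minimal userspace sketch of why the mask
matters. The names fnhe_hashfun_sketch() and fnhe_hash_unmasked() are
hypothetical, and the table size of 2048 is an assumption made for
illustration; the mask only requires a power-of-two size. Without the
mask, the folded value can be far larger than the table, so using it as an
index walks off the end of the bucket array.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed table size for illustration only; must be a power of two. */
#define FNHE_HASH_SIZE 2048u

static_assert((FNHE_HASH_SIZE & (FNHE_HASH_SIZE - 1)) == 0,
	      "mask trick needs a power-of-two table size");

/* Pre-patch computation: fold the address, but never reduce it. */
static inline uint32_t fnhe_hash_unmasked(uint32_t daddr)
{
	uint32_t hval = daddr;

	hval ^= (hval >> 11) ^ (hval >> 22);
	return hval;
}

/* Post-patch computation: same fold, then mask into [0, FNHE_HASH_SIZE). */
static inline uint32_t fnhe_hashfun_sketch(uint32_t daddr)
{
	return fnhe_hash_unmasked(daddr) & (FNHE_HASH_SIZE - 1);
}
```

With an input such as 0xC0A80001, the unmasked value keeps its high bits
set, while the masked value always lands inside the table.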


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: David Miller @ 2012-07-17 20:20 UTC
  To: brking
  Cc: rick.jones2, cascardo, netdev, yevgenyp, ogerlitz, amirv, leitao,
	klebers
In-Reply-To: <5005C6A0.50002@linux.vnet.ibm.com>

From: Brian King <brking@linux.vnet.ibm.com>
Date: Tue, 17 Jul 2012 15:10:08 -0500

> On 07/17/2012 01:17 PM, Rick Jones wrote:
>> On 07/16/2012 10:29 PM, David Miller wrote:
>>> From: Rick Jones <rick.jones2@hp.com> Date: Mon, 16 Jul 2012
>>> 10:27:57 -0700
>>> 
>>>> That seems rather extraordinarily low - Power7 is supposed to be
>>>> a rather high performance CPU.  The last time I noticed
>>>> O(3Gbit/s) on 10G for bulk transfer was before the advent of
>>>> LRO/GRO - that was in the x86 space though.  Is mapping really
>>>> that expensive with Power7?
>>> 
>>> Unfortunately, IOMMU mappings are incredibly expensive.  I see
>>> effects like this on Sparc too.
>> 
>> OK, so that has caused some dimm memory to get a small refresh - it
>> ends up being akin to if not actually a PIO yes?  I recall schemes in
>> drivers in other stacks whereby "small" packets were copied because
>> it was cheaper to allocate/copy than it was to remap.
> 
> On Power it ends up being an hcall to the hypervisor.

This is true on sparc64 niagara systems as well.
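
The copy-instead-of-remap scheme Rick recalls can be sketched roughly as
below. This is only an illustration, not code from mlx4_en or any other
driver: RX_COPYBREAK, rx_should_copy() and rx_copy_small() are hypothetical
names, and the 256-byte threshold is an arbitrary assumption. The idea is
that a small packet is copied out of the receive buffer so the buffer keeps
its existing IOMMU mapping, and only large packets pay for an unmap plus a
fresh mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold; a real driver would tune this per platform. */
#define RX_COPYBREAK 256u

/* Decide whether a received packet is small enough to copy. */
static inline bool rx_should_copy(size_t pkt_len)
{
	return pkt_len <= RX_COPYBREAK;
}

/*
 * Copy a small packet out of the still-mapped receive buffer into dst.
 * The receive buffer is left untouched, so it can be handed straight back
 * to the device without a new mapping. Returns the number of bytes copied.
 */
static inline size_t rx_copy_small(unsigned char *dst,
				   const unsigned char *rx_buf,
				   size_t pkt_len)
{
	assert(rx_should_copy(pkt_len));
	memcpy(dst, rx_buf, pkt_len);
	return pkt_len;
}
```

Where each mapping costs a hypervisor call, the break-even packet size is
presumably much higher than on platforms with cheap mappings.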


* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Brian King @ 2012-07-17 20:10 UTC
  To: Rick Jones
  Cc: David Miller, cascardo@linux.vnet.ibm.com, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, leitao@linux.vnet.ibm.com,
	klebers@linux.vnet.ibm.com
In-Reply-To: <5005AC4A.9030208@hp.com>

On 07/17/2012 01:17 PM, Rick Jones wrote:
> On 07/16/2012 10:29 PM, David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com> Date: Mon, 16 Jul 2012
>> 10:27:57 -0700
>> 
>>> That seems rather extraordinarily low - Power7 is supposed to be
>>> a rather high performance CPU.  The last time I noticed
>>> O(3Gbit/s) on 10G for bulk transfer was before the advent of
>>> LRO/GRO - that was in the x86 space though.  Is mapping really
>>> that expensive with Power7?
>> 
>> Unfortunately, IOMMU mappings are incredibly expensive.  I see
>> effects like this on Sparc too.
> 
> OK, so that has caused some dimm memory to get a small refresh - it
> ends up being akin to if not actually a PIO yes?  I recall schemes in
> drivers in other stacks whereby "small" packets were copied because
> it was cheaper to allocate/copy than it was to remap.

On Power it ends up being an hcall to the hypervisor.

-Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center


* [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: Saurabh @ 2012-07-17 19:44 UTC
  To: netdev



New VTI tunnel kernel module, Kconfig and Makefile changes.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 16b92d0..5efff60 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -80,4 +80,18 @@ enum {
 
 #define IFLA_GRE_MAX	(__IFLA_GRE_MAX - 1)
 
+/* VTI-mode i_flags */
+#define VTI_ISVTI 0x0001
+
+enum {
+	IFLA_VTI_UNSPEC,
+	IFLA_VTI_LINK,
+	IFLA_VTI_IKEY,
+	IFLA_VTI_OKEY,
+	IFLA_VTI_LOCAL,
+	IFLA_VTI_REMOTE,
+	__IFLA_VTI_MAX,
+};
+
+#define IFLA_VTI_MAX	(__IFLA_VTI_MAX - 1)
 #endif /* _IF_TUNNEL_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 20f1cb5..5a19aeb 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -310,6 +310,17 @@ config SYN_COOKIES
 
 	  If unsure, say N.
 
+config NET_IPVTI
+	tristate "Virtual (secure) IP: tunneling"
+	select INET_TUNNEL
+	depends on INET_XFRM_MODE_TUNNEL
+	---help---
+	  Tunneling means encapsulating data of one protocol type within
+	  another protocol and sending it over a channel that understands the
+	  encapsulating protocol. This can be used with xfrm tunnel mode to give
+	  the notion of a secure tunnel for IPsec, with a routing protocol run
+	  on top.
+
 config INET_AH
 	tristate "IP: AH transformation"
 	select XFRM_ALGO
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ff75d3b..3999ce9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IP_MROUTE) += ipmr.o
 obj-$(CONFIG_NET_IPIP) += ipip.o
 obj-$(CONFIG_NET_IPGRE_DEMUX) += gre.o
 obj-$(CONFIG_NET_IPGRE) += ip_gre.o
+obj-$(CONFIG_NET_IPVTI) += ip_vti.o
 obj-$(CONFIG_SYN_COOKIES) += syncookies.o
 obj-$(CONFIG_INET_AH) += ah4.o
 obj-$(CONFIG_INET_ESP) += esp4.o
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
new file mode 100644
index 0000000..c41b5c3
--- /dev/null
+++ b/net/ipv4/ip_vti.c
@@ -0,0 +1,956 @@
+/*
+ *	Linux NET3: IP/IP protocol decoder modified to support
+ *		    virtual tunnel interface
+ *
+ *	Authors:
+ *		Saurabh Mohan (saurabh.mohan@vyatta.com) 05/07/2012
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ */
+
+/*
+   This version of net/ipv4/ip_vti.c is a clone of net/ipv4/ipip.c
+
+   For comments look at net/ipv4/ip_gre.c --ANK
+ */
+
+
+#include <linux/capability.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_arp.h>
+#include <linux/mroute.h>
+#include <linux/init.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/if_ether.h>
+
+#include <net/sock.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/ipip.h>
+#include <net/inet_ecn.h>
+#include <net/xfrm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define HASH_SIZE  16
+#define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&(HASH_SIZE-1))
+
+static struct rtnl_link_ops vti_link_ops __read_mostly;
+
+static int vti_net_id __read_mostly;
+struct vti_net {
+	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_wc[1];
+	struct ip_tunnel **tunnels[4];
+
+	struct net_device *fb_tunnel_dev;
+};
+
+static int vti_fb_tunnel_init(struct net_device *dev);
+static int vti_tunnel_init(struct net_device *dev);
+static void vti_tunnel_setup(struct net_device *dev);
+static void vti_dev_free(struct net_device *dev);
+static int vti_tunnel_bind_dev(struct net_device *dev);
+
+/* Locking : hash tables are protected by RCU and RTNL */
+
+#define for_each_ip_tunnel_rcu(start) \
+	for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
+
+/* often modified stats are per cpu, other are shared (netdev->stats) */
+struct pcpu_tstats {
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
+	struct	u64_stats_sync	syncp;
+};
+
+#define VTI_XMIT(stats1, stats2) do {				\
+	int err;						\
+	int pkt_len = skb->len;					\
+	err = dst_output(skb);					\
+	if (net_xmit_eval(err) == 0) {				\
+		u64_stats_update_begin(&(stats1)->syncp);	\
+		(stats1)->tx_bytes += pkt_len;			\
+		(stats1)->tx_packets++;				\
+		u64_stats_update_end(&(stats1)->syncp);		\
+	} else {						\
+		(stats2)->tx_errors++;				\
+		(stats2)->tx_aborted_errors++;			\
+	}							\
+} while (0)
+
+
+static struct rtnl_link_stats64 *vti_get_stats64(struct net_device *dev,
+						 struct rtnl_link_stats64 *tot)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_tstats *tstats = per_cpu_ptr(dev->tstats, i);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_bh(&tstats->syncp);
+			rx_packets = tstats->rx_packets;
+			tx_packets = tstats->tx_packets;
+			rx_bytes = tstats->rx_bytes;
+			tx_bytes = tstats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&tstats->syncp, start));
+
+		tot->rx_packets += rx_packets;
+		tot->tx_packets += tx_packets;
+		tot->rx_bytes   += rx_bytes;
+		tot->tx_bytes   += tx_bytes;
+	}
+
+	tot->multicast = dev->stats.multicast;
+	tot->rx_crc_errors = dev->stats.rx_crc_errors;
+	tot->rx_fifo_errors = dev->stats.rx_fifo_errors;
+	tot->rx_length_errors = dev->stats.rx_length_errors;
+	tot->rx_errors = dev->stats.rx_errors;
+	tot->tx_fifo_errors = dev->stats.tx_fifo_errors;
+	tot->tx_carrier_errors = dev->stats.tx_carrier_errors;
+	tot->tx_dropped = dev->stats.tx_dropped;
+	tot->tx_aborted_errors = dev->stats.tx_aborted_errors;
+	tot->tx_errors = dev->stats.tx_errors;
+
+	return tot;
+}
+
+static struct ip_tunnel *vti_tunnel_lookup(struct net *net,
+					   __be32 remote, __be32 local)
+{
+	unsigned h0 = HASH(remote);
+	unsigned h1 = HASH(local);
+	struct ip_tunnel *t;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_r_l[h0 ^ h1])
+		if (local == t->parms.iph.saddr &&
+		    remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+	for_each_ip_tunnel_rcu(ipn->tunnels_r[h0])
+		if (remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_l[h1])
+		if (local == t->parms.iph.saddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_wc[0])
+		if (t && (t->dev->flags&IFF_UP))
+			return t;
+	return NULL;
+}
+
+static struct ip_tunnel **__vti_bucket(struct vti_net *ipn,
+				       struct ip_tunnel_parm *parms)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	unsigned h = 0;
+	int prio = 0;
+
+	if (remote) {
+		prio |= 2;
+		h ^= HASH(remote);
+	}
+	if (local) {
+		prio |= 1;
+		h ^= HASH(local);
+	}
+	return &ipn->tunnels[prio][h];
+}
+
+static inline struct ip_tunnel **vti_bucket(struct vti_net *ipn,
+					    struct ip_tunnel *t)
+{
+	return __vti_bucket(ipn, &t->parms);
+}
+
+static void vti_tunnel_unlink(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp;
+	struct ip_tunnel *iter;
+
+	for (tp = vti_bucket(ipn, t);
+	     (iter = rtnl_dereference(*tp)) != NULL;
+	     tp = &iter->next) {
+		if (t == iter) {
+			rcu_assign_pointer(*tp, t->next);
+			break;
+		}
+	}
+}
+
+static void vti_tunnel_link(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp = vti_bucket(ipn, t);
+
+	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
+	rcu_assign_pointer(*tp, t);
+}
+
+static struct ip_tunnel *vti_tunnel_locate(struct net *net,
+					   struct ip_tunnel_parm *parms,
+					   int create)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	struct ip_tunnel *t, *nt;
+	struct ip_tunnel __rcu **tp;
+	struct net_device *dev;
+	char name[IFNAMSIZ];
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for (tp = __vti_bucket(ipn, parms);
+	     (t = rtnl_dereference(*tp)) != NULL;
+	     tp = &t->next) {
+		if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr)
+			return t;
+	}
+	if (!create)
+		return NULL;
+
+	if (parms->name[0])
+		strlcpy(name, parms->name, IFNAMSIZ);
+	else
+		strcpy(name, "vti%d");
+
+	dev = alloc_netdev(sizeof(*t), name, vti_tunnel_setup);
+	if (dev == NULL)
+		return NULL;
+
+	dev_net_set(dev, net);
+
+	nt = netdev_priv(dev);
+	nt->parms = *parms;
+	dev->rtnl_link_ops = &vti_link_ops;
+
+	vti_tunnel_bind_dev(dev);
+
+	if (register_netdevice(dev) < 0)
+		goto failed_free;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+	return nt;
+
+failed_free:
+	free_netdev(dev);
+	return NULL;
+}
+
+static void vti_tunnel_uninit(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	vti_tunnel_unlink(ipn, netdev_priv(dev));
+	dev_put(dev);
+}
+
+static int vti_err(struct sk_buff *skb, u32 info)
+{
+
+	/* All the routers (except for Linux) return only
+	 * 8 bytes of packet payload. It means, that precise relaying of
+	 * ICMP in the real Internet is absolutely infeasible.
+	 */
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	const int type = icmp_hdr(skb)->type;
+	const int code = icmp_hdr(skb)->code;
+	struct ip_tunnel *t;
+	int err;
+
+	switch (type) {
+	default:
+	case ICMP_PARAMETERPROB:
+		return 0;
+
+	case ICMP_DEST_UNREACH:
+		switch (code) {
+		case ICMP_SR_FAILED:
+		case ICMP_PORT_UNREACH:
+			/* Impossible event. */
+			return 0;
+		default:
+			/* All others are translated to HOST_UNREACH. */
+			break;
+		}
+		break;
+	case ICMP_TIME_EXCEEDED:
+		if (code != ICMP_EXC_TTL)
+			return 0;
+		break;
+	}
+
+	err = -ENOENT;
+
+	rcu_read_lock();
+	t = vti_tunnel_lookup(dev_net(skb->dev), iph->daddr, iph->saddr);
+	if (t == NULL)
+		goto out;
+
+	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
+		ipv4_update_pmtu(skb, dev_net(skb->dev), info,
+				 t->parms.link, 0, IPPROTO_IPIP, 0);
+		err = 0;
+		goto out;
+	}
+
+	err = 0;
+	if (t->parms.iph.ttl == 0 && type == ICMP_TIME_EXCEEDED)
+		goto out;
+
+	if (time_before(jiffies, t->err_time + IPTUNNEL_ERR_TIMEO))
+		t->err_count++;
+	else
+		t->err_count = 1;
+	t->err_time = jiffies;
+out:
+	rcu_read_unlock();
+	return err;
+}
+
+/* We don't digest the packet, so let it pass. */
+static int vti_rcv(struct sk_buff *skb)
+{
+	struct ip_tunnel *tunnel;
+	const struct iphdr *iph = ip_hdr(skb);
+
+	rcu_read_lock();
+	tunnel = vti_tunnel_lookup(dev_net(skb->dev), iph->saddr, iph->daddr);
+	if (tunnel != NULL) {
+		struct pcpu_tstats *tstats;
+
+		tstats = this_cpu_ptr(tunnel->dev->tstats);
+		u64_stats_update_begin(&tstats->syncp);
+		tstats->rx_packets++;
+		tstats->rx_bytes += skb->len;
+		u64_stats_update_end(&tstats->syncp);
+
+		skb->dev = tunnel->dev;
+		rcu_read_unlock();
+		return 1;
+	}
+	rcu_read_unlock();
+
+	return -1;
+}
+
+/* This function assumes it is being called from dev_queue_xmit()
+ * and that skb is filled properly by that function.
+ */
+
+static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct pcpu_tstats *tstats;
+	struct iphdr  *tiph = &tunnel->parms.iph;
+	u8     tos;
+	struct rtable *rt;		/* Route to the other host */
+	struct net_device *tdev;	/* Device to other host */
+	struct iphdr  *old_iph = ip_hdr(skb);
+	__be32 dst = tiph->daddr;
+	struct flowi4 fl4;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		goto tx_error;
+
+	tos = old_iph->tos;
+
+	memset(&fl4, 0, sizeof(fl4));
+	flowi4_init_output(&fl4, tunnel->parms.link,
+			   htonl(tunnel->parms.i_key), RT_TOS(tos),
+			   RT_SCOPE_UNIVERSE,
+			   IPPROTO_IPIP, 0,
+			   dst, tiph->saddr, 0, 0);
+	rt = ip_route_output_key(dev_net(dev), &fl4);
+	if (IS_ERR(rt)) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+	/* The tunnel is not functional if there is no transform,
+	 * or if the xfrm is not in tunnel mode.
+	 */
+	if (!rt->dst.xfrm ||
+	    rt->dst.xfrm->props.mode != XFRM_MODE_TUNNEL) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+	tdev = rt->dst.dev;
+
+	if (tdev == dev) {
+		ip_rt_put(rt);
+		dev->stats.collisions++;
+		goto tx_error;
+	}
+
+	if (tunnel->err_count > 0) {
+		if (time_before(jiffies,
+				tunnel->err_time + IPTUNNEL_ERR_TIMEO)) {
+			tunnel->err_count--;
+			dst_link_failure(skb);
+		} else
+			tunnel->err_count = 0;
+	}
+
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+	nf_reset(skb);
+	skb->dev = skb_dst(skb)->dev;
+
+	tstats = this_cpu_ptr(dev->tstats);
+	VTI_XMIT(tstats, &dev->stats);
+	return NETDEV_TX_OK;
+
+tx_error_icmp:
+	dst_link_failure(skb);
+tx_error:
+	dev->stats.tx_errors++;
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int vti_tunnel_bind_dev(struct net_device *dev)
+{
+	struct net_device *tdev = NULL;
+	struct ip_tunnel *tunnel;
+	struct iphdr *iph;
+
+	tunnel = netdev_priv(dev);
+	iph = &tunnel->parms.iph;
+
+	if (iph->daddr) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+		memset(&fl4, 0, sizeof(fl4));
+		flowi4_init_output(&fl4, tunnel->parms.link,
+				   htonl(tunnel->parms.i_key),
+				   RT_TOS(iph->tos), RT_SCOPE_UNIVERSE,
+				   IPPROTO_IPIP, 0,
+				   iph->daddr, iph->saddr, 0, 0);
+		rt = ip_route_output_key(dev_net(dev), &fl4);
+		if (!IS_ERR(rt)) {
+			tdev = rt->dst.dev;
+			ip_rt_put(rt);
+		}
+		dev->flags |= IFF_POINTOPOINT;
+	}
+
+	if (!tdev && tunnel->parms.link)
+		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+
+	if (tdev) {
+		dev->hard_header_len = tdev->hard_header_len +
+				       sizeof(struct iphdr);
+		dev->mtu = tdev->mtu;
+	}
+	dev->iflink = tunnel->parms.link;
+	return dev->mtu;
+}
+
+static int
+vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+	int err = 0;
+	struct ip_tunnel_parm p;
+	struct ip_tunnel *t;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	switch (cmd) {
+	case SIOCGETTUNNEL:
+		t = NULL;
+		if (dev == ipn->fb_tunnel_dev) {
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data,
+					   sizeof(p))) {
+				err = -EFAULT;
+				break;
+			}
+			t = vti_tunnel_locate(net, &p, 0);
+		}
+		if (t == NULL)
+			t = netdev_priv(dev);
+		memcpy(&p, &t->parms, sizeof(p));
+		p.i_flags |= GRE_KEY | VTI_ISVTI;
+		p.o_flags |= GRE_KEY;
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
+			err = -EFAULT;
+		break;
+
+	case SIOCADDTUNNEL:
+	case SIOCCHGTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		err = -EFAULT;
+		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+			goto done;
+
+		err = -EINVAL;
+		if (p.iph.version != 4 || p.iph.protocol != IPPROTO_IPIP ||
+		    p.iph.ihl != 5)
+			goto done;
+
+		t = vti_tunnel_locate(net, &p, cmd == SIOCADDTUNNEL);
+
+		if (dev != ipn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+			if (t != NULL) {
+				if (t->dev != dev) {
+					err = -EEXIST;
+					break;
+				}
+			} else {
+				if (((dev->flags&IFF_POINTOPOINT) &&
+				    !p.iph.daddr) ||
+				    (!(dev->flags&IFF_POINTOPOINT) &&
+				    p.iph.daddr)) {
+					err = -EINVAL;
+					break;
+				}
+				t = netdev_priv(dev);
+				vti_tunnel_unlink(ipn, t);
+				synchronize_net();
+				t->parms.iph.saddr = p.iph.saddr;
+				t->parms.iph.daddr = p.iph.daddr;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				t->parms.iph.protocol = IPPROTO_IPIP;
+				memcpy(dev->dev_addr, &p.iph.saddr, 4);
+				memcpy(dev->broadcast, &p.iph.daddr, 4);
+				vti_tunnel_link(ipn, t);
+				netdev_state_change(dev);
+			}
+		}
+
+		if (t) {
+			err = 0;
+			if (cmd == SIOCCHGTUNNEL) {
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				if (t->parms.link != p.link) {
+					t->parms.link = p.link;
+					vti_tunnel_bind_dev(dev);
+					netdev_state_change(dev);
+				}
+			}
+			p.i_flags |= GRE_KEY | VTI_ISVTI;
+			p.o_flags |= GRE_KEY;
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &t->parms,
+					 sizeof(p)))
+				err = -EFAULT;
+		} else
+			err = (cmd == SIOCADDTUNNEL ? -ENOBUFS : -ENOENT);
+		break;
+
+	case SIOCDELTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		if (dev == ipn->fb_tunnel_dev) {
+			err = -EFAULT;
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data,
+					   sizeof(p)))
+				goto done;
+			err = -ENOENT;
+
+			t = vti_tunnel_locate(net, &p, 0);
+			if (t == NULL)
+				goto done;
+			err = -EPERM;
+			if (t->dev == ipn->fb_tunnel_dev)
+				goto done;
+			dev = t->dev;
+		}
+		unregister_netdevice(dev);
+		err = 0;
+		break;
+
+	default:
+		err = -EINVAL;
+	}
+
+done:
+	return err;
+}
+
+static int vti_tunnel_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < 68 || new_mtu > 0xFFF8)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops vti_netdev_ops = {
+	.ndo_init	= vti_tunnel_init,
+	.ndo_uninit	= vti_tunnel_uninit,
+	.ndo_start_xmit	= vti_tunnel_xmit,
+	.ndo_do_ioctl	= vti_tunnel_ioctl,
+	.ndo_change_mtu	= vti_tunnel_change_mtu,
+	.ndo_get_stats64 = vti_get_stats64,
+};
+
+static void vti_dev_free(struct net_device *dev)
+{
+	free_percpu(dev->tstats);
+	free_netdev(dev);
+}
+
+static void vti_tunnel_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &vti_netdev_ops;
+	dev->destructor		= vti_dev_free;
+
+	dev->type		= ARPHRD_TUNNEL;
+	dev->hard_header_len	= LL_MAX_HEADER + sizeof(struct iphdr);
+	dev->mtu		= ETH_DATA_LEN;
+	dev->flags		= IFF_NOARP;
+	dev->iflink		= 0;
+	dev->addr_len		= 4;
+	dev->features		|= NETIF_F_NETNS_LOCAL;
+	dev->features		|= NETIF_F_LLTX;
+	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
+}
+
+static int vti_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
+	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __net_init vti_fb_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct iphdr *iph = &tunnel->parms.iph;
+	struct vti_net *ipn = net_generic(dev_net(dev), vti_net_id);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	iph->version		= 4;
+	iph->protocol		= IPPROTO_IPIP;
+	iph->ihl		= 5;
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	dev_hold(dev);
+	rcu_assign_pointer(ipn->tunnels_wc[0], tunnel);
+	return 0;
+}
+
+static struct xfrm_tunnel vti_handler __read_mostly = {
+	.handler	=	vti_rcv,
+	.err_handler	=	vti_err,
+	.priority	=	1,
+};
+
+static void vti_destroy_tunnels(struct vti_net *ipn, struct list_head *head)
+{
+	int prio;
+
+	for (prio = 1; prio < 4; prio++) {
+		int h;
+		for (h = 0; h < HASH_SIZE; h++) {
+			struct ip_tunnel *t;
+
+			t = rtnl_dereference(ipn->tunnels[prio][h]);
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = rtnl_dereference(t->next);
+			}
+		}
+	}
+}
+
+static int __net_init vti_init_net(struct net *net)
+{
+	int err;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	ipn->tunnels[0] = ipn->tunnels_wc;
+	ipn->tunnels[1] = ipn->tunnels_l;
+	ipn->tunnels[2] = ipn->tunnels_r;
+	ipn->tunnels[3] = ipn->tunnels_r_l;
+
+	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+					  "ip_vti0",
+					  vti_tunnel_setup);
+	if (!ipn->fb_tunnel_dev) {
+		err = -ENOMEM;
+		goto err_alloc_dev;
+	}
+	dev_net_set(ipn->fb_tunnel_dev, net);
+
+	err = vti_fb_tunnel_init(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	ipn->fb_tunnel_dev->rtnl_link_ops = &vti_link_ops;
+
+	err = register_netdev(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	return 0;
+
+err_reg_dev:
+	vti_dev_free(ipn->fb_tunnel_dev);
+err_alloc_dev:
+	/* nothing */
+	return err;
+}
+
+static void __net_exit vti_exit_net(struct net *net)
+{
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	LIST_HEAD(list);
+
+	rtnl_lock();
+	vti_destroy_tunnels(ipn, &list);
+	unregister_netdevice_many(&list);
+	rtnl_unlock();
+}
+
+static struct pernet_operations vti_net_ops = {
+	.init = vti_init_net,
+	.exit = vti_exit_net,
+	.id   = &vti_net_id,
+	.size = sizeof(struct vti_net),
+};
+
+static int vti_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	return 0;
+}
+
+static void vti_netlink_parms(struct nlattr *data[],
+			      struct ip_tunnel_parm *parms)
+{
+	memset(parms, 0, sizeof(*parms));
+
+	parms->iph.protocol = IPPROTO_IPIP;
+
+	if (!data)
+		return;
+
+	if (data[IFLA_VTI_LINK])
+		parms->link = nla_get_u32(data[IFLA_VTI_LINK]);
+
+	if (data[IFLA_VTI_IKEY])
+		parms->i_key = nla_get_be32(data[IFLA_VTI_IKEY]);
+
+	if (data[IFLA_VTI_OKEY])
+		parms->o_key = nla_get_be32(data[IFLA_VTI_OKEY]);
+
+	if (data[IFLA_VTI_LOCAL])
+		parms->iph.saddr = nla_get_be32(data[IFLA_VTI_LOCAL]);
+
+	if (data[IFLA_VTI_REMOTE])
+		parms->iph.daddr = nla_get_be32(data[IFLA_VTI_REMOTE]);
+
+}
+
+static int vti_newlink(struct net *src_net, struct net_device *dev,
+		       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct ip_tunnel *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	int mtu;
+	int err;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &nt->parms);
+
+	if (vti_tunnel_locate(net, &nt->parms, 0))
+		return -EEXIST;
+
+	mtu = vti_tunnel_bind_dev(dev);
+	if (!tb[IFLA_MTU])
+		dev->mtu = mtu;
+
+	err = register_netdevice(dev);
+	if (err)
+		goto out;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+
+out:
+	return err;
+}
+
+static int vti_changelink(struct net_device *dev, struct nlattr *tb[],
+			  struct nlattr *data[])
+{
+	struct ip_tunnel *t, *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	struct ip_tunnel_parm p;
+	int mtu;
+
+	if (dev == ipn->fb_tunnel_dev)
+		return -EINVAL;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &p);
+
+	t = vti_tunnel_locate(net, &p, 0);
+
+	if (t) {
+		if (t->dev != dev)
+			return -EEXIST;
+	} else {
+		t = nt;
+
+		vti_tunnel_unlink(ipn, t);
+		t->parms.iph.saddr = p.iph.saddr;
+		t->parms.iph.daddr = p.iph.daddr;
+		t->parms.i_key = p.i_key;
+		t->parms.o_key = p.o_key;
+		if (dev->type != ARPHRD_ETHER) {
+			memcpy(dev->dev_addr, &p.iph.saddr, 4);
+			memcpy(dev->broadcast, &p.iph.daddr, 4);
+		}
+		vti_tunnel_link(ipn, t);
+		netdev_state_change(dev);
+	}
+
+	if (t->parms.link != p.link) {
+		t->parms.link = p.link;
+		mtu = vti_tunnel_bind_dev(dev);
+		if (!tb[IFLA_MTU])
+			dev->mtu = mtu;
+		netdev_state_change(dev);
+	}
+
+	return 0;
+}
+
+static size_t vti_get_size(const struct net_device *dev)
+{
+	return
+		/* IFLA_VTI_LINK */
+		nla_total_size(4) +
+		/* IFLA_VTI_IKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_OKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_LOCAL */
+		nla_total_size(4) +
+		/* IFLA_VTI_REMOTE */
+		nla_total_size(4) +
+		0;
+}
+
+static int vti_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+	struct ip_tunnel_parm *p = &t->parms;
+
+	nla_put_u32(skb, IFLA_VTI_LINK, p->link);
+	nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key);
+	nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key);
+	nla_put_be32(skb, IFLA_VTI_LOCAL, p->iph.saddr);
+	nla_put_be32(skb, IFLA_VTI_REMOTE, p->iph.daddr);
+
+	return 0;
+}
+
+static const struct nla_policy vti_policy[IFLA_VTI_MAX + 1] = {
+	[IFLA_VTI_LINK]		= { .type = NLA_U32 },
+	[IFLA_VTI_IKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_OKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VTI_REMOTE]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+};
+
+static struct rtnl_link_ops vti_link_ops __read_mostly = {
+	.kind		= "vti",
+	.maxtype	= IFLA_VTI_MAX,
+	.policy		= vti_policy,
+	.priv_size	= sizeof(struct ip_tunnel),
+	.setup		= vti_tunnel_setup,
+	.validate	= vti_tunnel_validate,
+	.newlink	= vti_newlink,
+	.changelink	= vti_changelink,
+	.get_size	= vti_get_size,
+	.fill_info	= vti_fill_info,
+};
+
+static int __init vti_init(void)
+{
+	int err;
+
+	pr_info("IPv4 over IPSec tunneling driver\n");
+
+	err = register_pernet_device(&vti_net_ops);
+	if (err < 0)
+		return err;
+	err = xfrm4_mode_tunnel_input_register(&vti_handler);
+	if (err < 0) {
+		unregister_pernet_device(&vti_net_ops);
+		pr_info("vti init: can't register tunnel\n");
+	}
+
+	err = rtnl_link_register(&vti_link_ops);
+	if (err < 0)
+		goto rtnl_link_failed;
+
+	return err;
+
+rtnl_link_failed:
+	xfrm4_mode_tunnel_input_deregister(&vti_handler);
+	unregister_pernet_device(&vti_net_ops);
+	return err;
+}
+
+static void __exit vti_fini(void)
+{
+	rtnl_link_unregister(&vti_link_ops);
+	if (xfrm4_mode_tunnel_input_deregister(&vti_handler))
+		pr_info("vti close: can't deregister tunnel\n");
+
+	unregister_pernet_device(&vti_net_ops);
+}
+
+module_init(vti_init);
+module_exit(vti_fini);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK("vti");
+MODULE_ALIAS_NETDEV("ip_vti0");
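
As a reading aid for the bucket selection in __vti_bucket() above, here is
a minimal userspace sketch. struct bucket_ref and vti_bucket_sketch() are
hypothetical names, and addresses are treated as plain uint32_t values,
whereas the kernel code hashes the network-order __be32 directly. The
priority bits pick one of the four tables (0 = wildcard, 1 = local only,
2 = remote only, 3 = remote and local), and the XOR of the per-address
hashes picks the slot within it.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SIZE 16
#define HASH(addr) \
	(((uint32_t)(addr) ^ ((uint32_t)(addr) >> 4)) & (HASH_SIZE - 1))

static_assert((HASH_SIZE & (HASH_SIZE - 1)) == 0,
	      "HASH() masks with HASH_SIZE - 1, so it must be a power of two");

/* Hypothetical stand-in for &ipn->tunnels[prio][h]. */
struct bucket_ref {
	int prio;	/* which table: 0 wildcard, 1 local, 2 remote, 3 both */
	unsigned int h;	/* slot inside that table */
};

/* Mirror of the selection logic: set a bit and mix in each nonzero address. */
static inline struct bucket_ref vti_bucket_sketch(uint32_t remote,
						   uint32_t local)
{
	struct bucket_ref b = { 0, 0 };

	if (remote) {
		b.prio |= 2;
		b.h ^= HASH(remote);
	}
	if (local) {
		b.prio |= 1;
		b.h ^= HASH(local);
	}
	return b;
}
```

A fully unspecified tunnel yields prio 0 and slot 0, which is why the
wildcard table needs only a single entry. vti_tunnel_lookup() then probes
the tables from most to least specific.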


* [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Saurabh @ 2012-07-17 19:44 UTC
  To: netdev



Incorporated David's and Steffen's comments.
Add a hook in the rx path of xfrm4_mode_tunnel for the VTI tunnel module.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e0a55df..04214c0 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1475,6 +1475,8 @@ extern int xfrm4_output(struct sk_buff *skb);
 extern int xfrm4_output_finish(struct sk_buff *skb);
 extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family);
 extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler, unsigned short family);
+extern int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler);
+extern int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler);
 extern int xfrm6_extract_header(struct sk_buff *skb);
 extern int xfrm6_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi);
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index ed4bf11..ddee0a0 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -15,6 +15,65 @@
 #include <net/ip.h>
 #include <net/xfrm.h>
 
+/* Informational hook. The decap is still done here. */
+static struct xfrm_tunnel __rcu *rcv_notify_handlers __read_mostly;
+static DEFINE_MUTEX(xfrm4_mode_tunnel_input_mutex);
+
+int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -EEXIST;
+	int priority = handler->priority;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+
+	for (pprev = &rcv_notify_handlers;
+	     (t = rcu_dereference_protected(*pprev,
+	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+	     pprev = &t->next) {
+		if (t->priority > priority)
+			break;
+		if (t->priority == priority)
+			goto err;
+
+	}
+
+	handler->next = *pprev;
+	rcu_assign_pointer(*pprev, handler);
+
+	ret = 0;
+
+err:
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);
+
+int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -ENOENT;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+	for (pprev = &rcv_notify_handlers;
+	     (t = rcu_dereference_protected(*pprev,
+	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+	     pprev = &t->next) {
+		if (t == handler) {
+			*pprev = handler->next;
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	synchronize_net();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_deregister);
+
 static inline void ipip_ecn_decapsulate(struct sk_buff *skb)
 {
 	struct iphdr *inner_iph = ipip_hdr(skb);
@@ -64,8 +123,14 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb)
 	return 0;
 }
 
+#define for_each_input_rcu(head, handler)	\
+	for (handler = rcu_dereference(head);	\
+	     handler != NULL;			\
+	     handler = rcu_dereference(handler->next))
+
 static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 {
+	struct xfrm_tunnel *handler;
 	int err = -EINVAL;
 
 	if (XFRM_MODE_SKB_CB(skb)->protocol != IPPROTO_IPIP)
@@ -74,6 +139,9 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 		goto out;
 
+	for_each_input_rcu(rcv_notify_handlers, handler)
+		handler->handler(skb);
+
 	if (skb_cloned(skb) &&
 	    (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
 		goto out;
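
The register/deregister pair in the patch above walks a singly linked list through a pointer-to-pointer (pprev), keeping handlers sorted by priority and rejecting duplicates. A minimal userspace sketch of that idiom follows. It is an illustration only: struct notify_handler and both function names are hypothetical, and the RCU and mutex protection used in the patch is deliberately left out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct xfrm_tunnel; only the fields used here. */
struct notify_handler {
	struct notify_handler *next;
	int priority;
};

/* Insert in ascending priority order; a duplicate priority is rejected. */
int handler_register(struct notify_handler **head, struct notify_handler *h)
{
	struct notify_handler **pprev;
	struct notify_handler *t;

	for (pprev = head; (t = *pprev) != NULL; pprev = &t->next) {
		if (t->priority > h->priority)
			break;
		if (t->priority == h->priority)
			return -EEXIST;
	}

	/* pprev points at the link that should now refer to h. */
	h->next = *pprev;
	*pprev = h;
	return 0;
}

/* Unlink h if it is on the list; -ENOENT otherwise. */
int handler_deregister(struct notify_handler **head, struct notify_handler *h)
{
	struct notify_handler **pprev;
	struct notify_handler *t;

	for (pprev = head; (t = *pprev) != NULL; pprev = &t->next) {
		if (t == h) {
			*pprev = h->next;
			return 0;
		}
	}
	return -ENOENT;
}
```

Because pprev always addresses the link being rewritten, the head of the list needs no special case on insert or removal.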

^ permalink raw reply related

* [net-next PATCH 00/02] net/ipv4: Add support for new tunnel type VTI.
From: Saurabh @ 2012-07-17 19:44 UTC (permalink / raw)
  To: netdev

I have accommodated all the style comments so far. If there are any more
style comments then send all your feedback in one email rather than in bits
and pieces.

IPv6 support has not yet been developed. Once I have it developed and tested
I'll submit it as well.  If this feature will not be accepted without IPv6
then let me know and I'll stop wasting my time. 

Incorporated David and Steffen's comments.
Resubmitting after taking into account review comments:
The VTI tunnel is applicable to esp, ah and ipcomp.

Introduction:
A virtual tunnel interface is a way to represent policy-based IPsec tunnels as
 virtual interfaces in Linux. This is similar to Cisco's VTI (virtual tunnel
 interface) and Juniper's representation of a secure tunnel (st.xx).
 The advantage of representing an IPsec tunnel as an interface is that it is
 possible to plug IPsec tunnels into the routing protocol infrastructure of a
 router. It therefore becomes possible to influence the packet path by toggling
 the link state of the tunnel or by adjusting routing metrics.

Overview:
The Linux kernel does not natively support IPsec as an interface. Also, secure
 interfaces assume an IPsec policy 4-tuple of {dst-ip-any, src-ip-any,
 dst-port-any, src-port-any}. Applying this 4-tuple in Linux would result in
 all traffic matching the IPsec policy. What is needed is a tunnel
 distinguisher. The Linux kernel skbuff has fwmark, which is used for policy
 based routing (PBR). Linux kernel version 2.6.35 enhanced SPD/SADB to use
 fwmark as part of the IPsec policy. Strongswan also introduced support for
 this kernel feature in version 4.5.0. We can therefore use the fwmark as the
 distinguisher for the tunnel interface. We can also create a lightweight tunnel
 kernel module (vti) to give the notion of an interface to the rest of the kernel
 routing system. The tunnel module does not do any encapsulation/decapsulation.
 The kernel's xfrm modules still do the esp encryption/decryption.

Usage:
ip tunnel add sti15 mode vti remote 12.0.0.1 local 12.0.0.3 ikey 15
or
ip link add sti15 type vti key 15 remote 12.0.0.1 local 12.0.0.3

Sample strongswan config would be:
conn peer-12.0.0.1-tunnel-1
   left=12.0.0.3
   right=12.0.0.1
   leftsubnet=0.0.0.0/0
   rightsubnet=0.0.0.0/0
   ike=aes128-sha1-modp1024!
   ikelifetime=28800s
   keyingtries=%forever
   esp=aes128-sha1!
   keylife=3600s
   rekeymargin=540s
   type=tunnel
   pfs=yes
   compress=no
   authby=secret
   auto=start
   mark_in=0xf
   mark_out=0xf
   keyexchange=ikev1

Also you need the iptables rule for ingress esp and udp-4500 packets:
-A PREROUTING -s 12.0.0.1/32 -d 12.0.0.3/32 -p esp -j MARK --set-xmark 0xf/0xffffffff

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---

^ permalink raw reply

* [PATCH] MAINTAINERS: Changes in qlcnic and qlge maintainers list
From: Anirban Chakraborty @ 2012-07-17 19:22 UTC (permalink / raw)
  To: davem; +Cc: netdev, Dept_NX_Linux_NIC_Driver, Anirban Chakraborty

From: Anirban Chakraborty <anirban.chakraborty@qlogic.com>

Please apply.

Thanks.

Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
---
 MAINTAINERS |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b4321fb..7fda50f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5554,7 +5554,7 @@ F:	Documentation/networking/LICENSE.qla3xxx
 F:	drivers/net/ethernet/qlogic/qla3xxx.*
 
 QLOGIC QLCNIC (1/10)Gb ETHERNET DRIVER
-M:	Anirban Chakraborty <anirban.chakraborty@qlogic.com>
+M:	Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
 M:	Sony Chacko <sony.chacko@qlogic.com>
 M:	linux-driver@qlogic.com
 L:	netdev@vger.kernel.org
@@ -5562,7 +5562,6 @@ S:	Supported
 F:	drivers/net/ethernet/qlogic/qlcnic/
 
 QLOGIC QLGE 10Gb ETHERNET DRIVER
-M:	Anirban Chakraborty <anirban.chakraborty@qlogic.com>
 M:	Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
 M:	Ron Mercer <ron.mercer@qlogic.com>
 M:	linux-driver@qlogic.com
-- 
1.7.1

^ permalink raw reply related

* wireless.git frozen -- Re: That's pretty much it for 3.5.0
From: John W. Linville @ 2012-07-17 19:30 UTC (permalink / raw)
  To: linux-wireless-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20120717.090142.125145009944045241.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Tue, Jul 17, 2012 at 09:01:42AM -0700, David Miller wrote:
> 
> Linus was _extremely_ generous and took in all the stuff that was
> pending in the net tree just now.
> 
> Besides very serious issues, I'm not willing to consider any more bug
> fixes for the 'net' tree at this time.
> 
> Only one pending known bug qualifies, and that's the CIPSO ip option
> processing OOPS'er.  And I'll work on that myself if Paul Moore
> doesn't show a sign of life in the next day.
> 
> Thanks.

Now only fixes for truly "show stopper" bugs will be accepted for
the 3.5 stream.  I don't believe that any of the handful of fixes
currently in wireless.git (but not yet in net.git) are sufficiently
important to make the cut.

I will pull the current wireless.git tree into the wireless-next.git
tree, and then wireless.git will remain frozen until 3.6-rc1 is
released.  If you have a wireless fix that you believe is sufficiently
important to merit being in 3.5, then please post it to the netdev
list (and Cc: linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org when you do so).

Thanks,

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: New commands to configure IOV features
From: Don Dutile @ 2012-07-17 19:29 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: davem@davemloft.net, Chris Friesen, Ben Hutchings, Greg Rose,
	netdev@vger.kernel.org, linux-pci
In-Reply-To: <5003DC9B.8000706@broadcom.com>

On 07/16/2012 05:19 AM, Yuval Mintz wrote:
>
>>>>> If I want to pick the RFCs and add support for configuring the number
>>>>> of VFs - do you think ethtool's the right place for such added
>>>>> support?
>>>>>
>>>> I think a PCI utility tool would be better, SR-IOV is not limited to
>>>> network devices.  That's one of the reasons I dropped the RFC.  I
>>>> haven't gotten back to the idea since then due to my day job keeping me
>>>> pretty busy.
>>>
>>> For what it's worth, I agree with this.
>>
>>  From my perspective it would be ideal if this could be exported via /sys or something
>>
>
>
> Well, obviously unless there was a sudden change in our stance regarding
> sysfs we will not head that way.
>
> This thread got no replies from the pci community, and I'm unfamiliar
> with such a tool.
>
> Dave, What's your stance in the matter - do you wish us to continue pursuing
> some pci tool (which might or might not exist), or instead work on
> a networking solution to this issue?
>
> Do you happen to know such a tool?
>
> Thanks,
> Yuval
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuval (et al.),

I did not see the original thread on netdev, but
I recently had a discussion w/Greg Rose about providing
sysfs-based VF enable/disable methods.
I was told that historically, VF enablement started as a sysfs-based
function, was debated, and was pushed toward a device/driver-specific method,
as it is implemented today.   Now, with some experience with SRIOV and
its use in the virtualization space, the discussion has renewed as to whether
a sysfs-based enable/disable method should be resurrected, so that it
provides a more generic method for virtualization tools/APIs to
manage SRIOV/VF devices.

I was hoping to discuss this topic with a number of folks at
LinuxCon/Plumbers/KS when the PCI mini-summit is held, to gain
further insight, or be brought up to speed on past history,
and review current uses/status of VFs.

WRT SRIOV-nic devices, the thinking goes that protocol-level
parameters associated with VFs should use protocol-specific interfaces,
e.g., ethtool, ip link set, etc. for Ethernet VFs.
Thus, the various protocol control functions/tools should
be used to control VF parameters, as one would for a physical device
of that protocol/class.

- Don

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:26 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller, <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <5005BA4C.2000602@intel.com>

On 7/17/2012 12:17 PM, John Fastabend wrote:
> On 7/17/2012 12:09 PM, John Fastabend wrote:
>> On 7/17/2012 12:00 PM, John Fastabend wrote:
>>> On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
>>>> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>>>>
>>>>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>>>>
>>>>>> Linus was _extremely_ generous and took in all the stuff that was
>>>>>> pending in the net tree just now.
>>>>>
>>>>> Maybe *too* generous. :-) I just updated and when I boot I get an
>>>>> early crash in update_netdev_tables which is in netprio_cgroup.c.
>>>>>
>>>>>> Besides very serious issues, I'm not willing to consider any more bug
>>>>>> fixes for the 'net' tree at this time.
>>>>>
>>>>> I think the above issue will have to be fixed, as it completely
>>>>> prevents booting for any kernel that includes the netprio_cgroup
>>>>> option.
>>>>>
>>>>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>>>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>>>>> doesn't show a sign of life in the next day.
>>>>>>
>>>>>> Thanks.
>>>>>
>>>>>
>>>>> I can start taking a look at this if you like, but I see that Gao
>>>>> feng has two patches in the last set of patches that may be related.
>>>>>
>>>>> To give you an idea how early the crash is, here are a few log
>>>>> messages leading up to it:
>>>>>
>>>>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
>>>>> 2097152 bytes)
>>>>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
>>>>> 1048576 bytes)
>>>>> [    0.007165] Mount-cache hash table entries: 256
>>>>> [    0.010289] Initializing cgroup subsys net_cls
>>>>> [    0.010947] Initializing cgroup subsys net_prio
>>>>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
>>>>> at 0000000000000828
>>>>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>>>>
>>>>
>>>> I found that I can avoid the crash by configuring the netprio_cgroup
>>>> as a module. I don't need to have it built in, I just happened to.
>>>> This finding may lower the temperature of this issue a lot from what I
>>>> had been feeling.
>>>>
>>>
>>> hmm looks like we access init_net here,
>>>
>>> static void update_netdev_tables(void)
>>> {
>>>          struct net_device *dev;
>>>          u32 max_len = atomic_read(&max_prioidx) + 1;
>>>          struct netprio_map *map;
>>>
>>>          rtnl_lock();
>>>          for_each_netdev(&init_net, dev) {
>>>                  map = rtnl_dereference(dev->priomap);
>>>                  if ((!map) ||
>>>                      (map->priomap_len < max_len))
>>>                          extend_netdev_table(dev, max_len);
>>>          }
>>>          rtnl_unlock();
>>> }
>>>
>>> but init_net is initialized by pure_initcall(net_ns_init) and I
>>> gather pure_initcall's should not have any dependencies but it
>>> looks like we created one here with cgroup_init_early() in
>>> start_kernel().
>>>
>>> I'll poke around some more. Also had some off list help from
>>> Mark.
>>>
>>> .John
>>>
>>
>> although we don't have an early_init hook for netprio_cgroup so this
>> is probably not correct.
>
> Hey Mark,
>
> you have better timing than me (I can't make this fail). Can you try
> cgroup_init below rest_init() in start_kernel(). That's in init/main.c
>
> .John
>

ugh never mind, that was stupid... I'm going to stop hitting the lists
with useless noise and be back with a fix in a while.

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: David Miller @ 2012-07-17 19:24 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <5005B881.8010505@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 12:09:53 -0700

> although we don't have an early_init hook for netprio_cgroup so this
> is probably not correct.

The dependency is actually on net_dev_init (a subsys_initcall) rather
than a pure_initcall.

net_dev_init is what registers the netdev_net_ops, which in turn
initializes the netdev list in namespaces such as &init_net.

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:17 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller,
	<netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	<linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	<netfilter-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
In-Reply-To: <5005B881.8010505-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On 7/17/2012 12:09 PM, John Fastabend wrote:
> On 7/17/2012 12:00 PM, John Fastabend wrote:
>> On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
>>> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>>>
>>>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>>>
>>>>> Linus was _extremely_ generous and took in all the stuff that was
>>>>> pending in the net tree just now.
>>>>
>>>> Maybe *too* generous. :-) I just updated and when I boot I get an
>>>> early crash in update_netdev_tables which is in netprio_cgroup.c.
>>>>
>>>>> Besides very serious issues, I'm not willing to consider any more bug
>>>>> fixes for the 'net' tree at this time.
>>>>
>>>> I think the above issue will have to be fixed, as it completely
>>>> prevents booting for any kernel that includes the netprio_cgroup
>>>> option.
>>>>
>>>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>>>> doesn't show a sign of life in the next day.
>>>>>
>>>>> Thanks.
>>>>
>>>>
>>>> I can start taking a look at this if you like, but I see that Gao
>>>> feng has two patches in the last set of patches that may be related.
>>>>
>>>> To give you an idea how early the crash is, here are a few log
>>>> messages leading up to it:
>>>>
>>>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
>>>> 2097152 bytes)
>>>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
>>>> 1048576 bytes)
>>>> [    0.007165] Mount-cache hash table entries: 256
>>>> [    0.010289] Initializing cgroup subsys net_cls
>>>> [    0.010947] Initializing cgroup subsys net_prio
>>>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
>>>> at 0000000000000828
>>>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>>>
>>>
>>> I found that I can avoid the crash by configuring the netprio_cgroup
>>> as a module. I don't need to have it built in, I just happened to.
>>> This finding may lower the temperature of this issue a lot from what I
>>> had been feeling.
>>>
>>
>> hmm looks like we access init_net here,
>>
>> static void update_netdev_tables(void)
>> {
>>          struct net_device *dev;
>>          u32 max_len = atomic_read(&max_prioidx) + 1;
>>          struct netprio_map *map;
>>
>>          rtnl_lock();
>>          for_each_netdev(&init_net, dev) {
>>                  map = rtnl_dereference(dev->priomap);
>>                  if ((!map) ||
>>                      (map->priomap_len < max_len))
>>                          extend_netdev_table(dev, max_len);
>>          }
>>          rtnl_unlock();
>> }
>>
>> but init_net is initialized by pure_initcall(net_ns_init) and I
>> gather pure_initcall's should not have any dependencies but it
>> looks like we created one here with cgroup_init_early() in
>> start_kernel().
>>
>> I'll poke around some more. Also had some off list help from
>> Mark.
>>
>> .John
>>
>
> although we don't have an early_init hook for netprio_cgroup so this
> is probably not correct.

Hey Mark,

you have better timing than me (I can't make this fail). Can you try
moving cgroup_init below rest_init() in start_kernel()? That's in init/main.c

.John

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:09 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller, <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <5005B643.2080009@intel.com>

On 7/17/2012 12:00 PM, John Fastabend wrote:
> On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
>> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>>
>>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>>
>>>> Linus was _extremely_ generous and took in all the stuff that was
>>>> pending in the net tree just now.
>>>
>>> Maybe *too* generous. :-) I just updated and when I boot I get an
>>> early crash in update_netdev_tables which is in netprio_cgroup.c.
>>>
>>>> Besides very serious issues, I'm not willing to consider any more bug
>>>> fixes for the 'net' tree at this time.
>>>
>>> I think the above issue will have to be fixed, as it completely
>>> prevents booting for any kernel that includes the netprio_cgroup option.
>>>
>>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>>> doesn't show a sign of life in the next day.
>>>>
>>>> Thanks.
>>>
>>>
>>> I can start taking a look at this if you like, but I see that Gao
>>> feng has two patches in the last set of patches that may be related.
>>>
>>> To give you an idea how early the crash is, here are a few log
>>> messages leading up to it:
>>>
>>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
>>> 2097152 bytes)
>>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
>>> 1048576 bytes)
>>> [    0.007165] Mount-cache hash table entries: 256
>>> [    0.010289] Initializing cgroup subsys net_cls
>>> [    0.010947] Initializing cgroup subsys net_prio
>>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
>>> at 0000000000000828
>>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>>
>>
>> I found that I can avoid the crash by configuring the netprio_cgroup
>> as a module. I don't need to have it built in, I just happened to.
>> This finding may lower the temperature of this issue a lot from what I
>> had been feeling.
>>
>
> hmm looks like we access init_net here,
>
> static void update_netdev_tables(void)
> {
>          struct net_device *dev;
>          u32 max_len = atomic_read(&max_prioidx) + 1;
>          struct netprio_map *map;
>
>          rtnl_lock();
>          for_each_netdev(&init_net, dev) {
>                  map = rtnl_dereference(dev->priomap);
>                  if ((!map) ||
>                      (map->priomap_len < max_len))
>                          extend_netdev_table(dev, max_len);
>          }
>          rtnl_unlock();
> }
>
> but init_net is initialized by pure_initcall(net_ns_init) and I
> gather pure_initcall's should not have any dependencies but it
> looks like we created one here with cgroup_init_early() in
> start_kernel().
>
> I'll poke around some more. Also had some off list help from
> Mark.
>
> .John
>

although we don't have an early_init hook for netprio_cgroup so this
is probably not correct.

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:00 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller, <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <DD369258-0958-4965-8E75-F6939892072D@intel.com>

On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>
>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>
>>> Linus was _extremely_ generous and took in all the stuff that was
>>> pending in the net tree just now.
>>
>> Maybe *too* generous. :-) I just updated and when I boot I get an early crash in update_netdev_tables which is in netprio_cgroup.c.
>>
>>> Besides very serious issues, I'm not willing to consider any more bug
>>> fixes for the 'net' tree at this time.
>>
>> I think the above issue will have to be fixed, as it completely prevents booting for any kernel that includes the netprio_cgroup option.
>>
>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>> doesn't show a sign of life in the next day.
>>>
>>> Thanks.
>>
>>
>> I can start taking a look at this if you like, but I see that Gao feng has two patches in the last set of patches that may be related.
>>
>> To give you an idea how early the crash is, here are a few log messages leading up to it:
>>
>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
>> [    0.007165] Mount-cache hash table entries: 256
>> [    0.010289] Initializing cgroup subsys net_cls
>> [    0.010947] Initializing cgroup subsys net_prio
>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference at 0000000000000828
>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>
>
> I found that I can avoid the crash by configuring the netprio_cgroup as a module. I don't need to have it built in, I just happened to. This finding may lower the temperature of this issue a lot from what I had been feeling.
>

hmm looks like we access init_net here,

static void update_netdev_tables(void)
{
         struct net_device *dev;
         u32 max_len = atomic_read(&max_prioidx) + 1;
         struct netprio_map *map;

         rtnl_lock();
         for_each_netdev(&init_net, dev) {
                 map = rtnl_dereference(dev->priomap);
                 if ((!map) ||
                     (map->priomap_len < max_len))
                         extend_netdev_table(dev, max_len);
         }
         rtnl_unlock();
}

but init_net is initialized by pure_initcall(net_ns_init), and I
gather pure_initcalls should not have any dependencies, but it
looks like we created one here with cgroup_init_early() in
start_kernel().

I'll poke around some more. Also had some off list help from
Mark.

.John


^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: Rustad, Mark D @ 2012-07-17 18:48 UTC (permalink / raw)
  To: David Miller
  Cc: <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <997C449C-D599-4F46-A0A3-A2B869DEE36E@intel.com>

On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:

> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
> 
>> Linus was _extremely_ generous and took in all the stuff that was
>> pending in the net tree just now.
> 
> Maybe *too* generous. :-) I just updated and when I boot I get an early crash in update_netdev_tables which is in netprio_cgroup.c.
> 
>> Besides very serious issues, I'm not willing to consider any more bug
>> fixes for the 'net' tree at this time.
> 
> I think the above issue will have to be fixed, as it completely prevents booting for any kernel that includes the netprio_cgroup option.
> 
>> Only one pending known bug qualifies, and that's the CIPSO ip option
>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>> doesn't show a sign of life in the next day.
>> 
>> Thanks.
> 
> 
> I can start taking a look at this if you like, but I see that Gao feng has two patches in the last set of patches that may be related.
> 
> To give you an idea how early the crash is, here are a few log messages leading up to it:
> 
> [    0.003455] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
> [    0.005550] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
> [    0.007165] Mount-cache hash table entries: 256
> [    0.010289] Initializing cgroup subsys net_cls
> [    0.010947] Initializing cgroup subsys net_prio
> [    0.011039] BUG: unable to handle kernel NULL pointer dereference at 0000000000000828
> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0


I found that I can avoid the crash by configuring the netprio_cgroup as a module. I don't need to have it built in, I just happened to. This finding may lower the temperature of this issue a lot from what I had been feeling.

-- 
Mark Rustad, LAN Access Division, Intel Corporation

^ permalink raw reply

* RE: [PATCH 0/2] runtime PM support for cpsw and davinci mdio drivers
From: N, Mugunthan V @ 2012-07-17 18:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20120717.111331.1669632931375975184.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, July 17, 2012 11:44 PM
> To: N, Mugunthan V
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH 0/2] runtime PM support for cpsw and davinci mdio
> drivers
> 
> 
> How many times do you plan on posting this patch set?

Sorry, the first patch set was sent by mistake.

Regards,
Mugunthan V N.

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Rick Jones @ 2012-07-17 18:17 UTC (permalink / raw)
  To: David Miller
  Cc: cascardo@linux.vnet.ibm.com, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, brking@linux.vnet.ibm.com,
	leitao@linux.vnet.ibm.com, klebers@linux.vnet.ibm.com
In-Reply-To: <20120716.222903.367603216293954363.davem@davemloft.net>

On 07/16/2012 10:29 PM, David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Mon, 16 Jul 2012 10:27:57 -0700
>
>> That seems rather extraordinarily low - Power7 is supposed to be a
>> rather high performance CPU.  The last time I noticed O(3Gbit/s) on
>> 10G for bulk transfer was before the advent of LRO/GRO - that was in
>> the x86 space though.  Is mapping really that expensive with Power7?
>
> Unfortunately, IOMMU mappings are incredibly expensive.  I see effects
> like this on Sparc too.

OK, so that has caused some dimm memory to get a small refresh - it ends
up being akin to, if not actually, a PIO, yes?  I recall schemes in drivers
in other stacks whereby "small" packets were copied because it was
cheaper to allocate/copy than it was to remap.

rick jones
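
The small-packet scheme recalled above is often called a copybreak: below some size threshold the driver copies the payload into a fresh buffer and keeps the mapped receive buffer in place, so no remap is needed. A minimal userspace sketch of the decision follows. RX_COPYBREAK, struct rx_result and rx_copybreak() are all hypothetical names, the threshold is arbitrary, and no real DMA or IOMMU call is modeled.

```c
#include <stdlib.h>
#include <string.h>

#define RX_COPYBREAK 256 /* hypothetical threshold in bytes */

struct rx_result {
	unsigned char *data; /* buffer handed up the stack */
	int copied;          /* 1 if data is a fresh copy, 0 if it is rx_buf */
};

/*
 * Short packets: copy into a small fresh buffer so the original (still
 * mapped) receive buffer can be reused as-is.
 * Long packets, or allocation failure: hand the original buffer up; the
 * caller would then have to unmap it and map a replacement.
 */
struct rx_result rx_copybreak(unsigned char *rx_buf, size_t len)
{
	struct rx_result r = { rx_buf, 0 };

	if (len < RX_COPYBREAK) {
		unsigned char *copy = malloc(len ? len : 1);

		if (copy != NULL) {
			memcpy(copy, rx_buf, len);
			r.data = copy;
			r.copied = 1;
		}
	}
	return r;
}
```

Where the threshold belongs depends on how expensive a mapping is relative to a copy on the platform in question, which is the trade-off being discussed in this thread.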

^ permalink raw reply

* Re: [PATCH 0/2] runtime PM support for cpsw and davinci mdio drivers
From: David Miller @ 2012-07-17 18:13 UTC (permalink / raw)
  To: mugunthanvnm; +Cc: netdev
In-Reply-To: <1342548590-12502-1-git-send-email-mugunthanvnm@ti.com>


How many times do you plan on posting this patch set?

^ permalink raw reply

* Re: [PATCH 12/16] ipv4: Maintain redirect and PMTU info in struct rtable again.
From: Benjamin Poirier @ 2012-07-17 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120710.080746.694016763983176902.davem@davemloft.net>

On 2012/07/10 08:07, David Miller wrote:
> 
> Maintaining this in the inetpeer entries was not the right way to do
> this at all.
> 

This patch makes it possible to have the same address assigned to a tunnel
interface and its lower device, whereas previously that would lead to mtu
problems because both routes shared the same pmtu info in the inet_peer.

ex: gre1 192.168.1.2/32 over eth0 192.168.1.2/24

Is such a wicked configuration supported?

^ permalink raw reply

* [PATCH 1/2] driver: net: ethernet: davinci_mdio: runtime PM support
From: Mugunthan V N @ 2012-07-17 18:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N
In-Reply-To: <1342548590-12502-1-git-send-email-mugunthanvnm@ti.com>

Enabling runtime PM support for davinci mdio driver

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/davinci_mdio.c |   25 ++++++++++++-------------
 1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_mdio.c b/drivers/net/ethernet/ti/davinci_mdio.c
index e4e4708..cd7ee20 100644
--- a/drivers/net/ethernet/ti/davinci_mdio.c
+++ b/drivers/net/ethernet/ti/davinci_mdio.c
@@ -34,6 +34,7 @@
 #include <linux/clk.h>
 #include <linux/err.h>
 #include <linux/io.h>
+#include <linux/pm_runtime.h>
 #include <linux/davinci_emac.h>
 
 /*
@@ -321,7 +322,9 @@ static int __devinit davinci_mdio_probe(struct platform_device *pdev)
 	snprintf(data->bus->id, MII_BUS_ID_SIZE, "%s-%x",
 		pdev->name, pdev->id);
 
-	data->clk = clk_get(dev, NULL);
+	pm_runtime_enable(&pdev->dev);
+	pm_runtime_get_sync(&pdev->dev);
+	data->clk = clk_get(&pdev->dev, "fck");
 	if (IS_ERR(data->clk)) {
 		dev_err(dev, "failed to get device clock\n");
 		ret = PTR_ERR(data->clk);
@@ -329,8 +332,6 @@ static int __devinit davinci_mdio_probe(struct platform_device *pdev)
 		goto bail_out;
 	}
 
-	clk_enable(data->clk);
-
 	dev_set_drvdata(dev, data);
 	data->dev = dev;
 	spin_lock_init(&data->lock);
@@ -378,10 +379,10 @@ bail_out:
 	if (data->bus)
 		mdiobus_free(data->bus);
 
-	if (data->clk) {
-		clk_disable(data->clk);
+	if (data->clk)
 		clk_put(data->clk);
-	}
+	pm_runtime_put_sync(&pdev->dev);
+	pm_runtime_disable(&pdev->dev);
 
 	kfree(data);
 
@@ -396,10 +397,10 @@ static int __devexit davinci_mdio_remove(struct platform_device *pdev)
 	if (data->bus)
 		mdiobus_free(data->bus);
 
-	if (data->clk) {
-		clk_disable(data->clk);
+	if (data->clk)
 		clk_put(data->clk);
-	}
+	pm_runtime_put_sync(&pdev->dev);
+	pm_runtime_disable(&pdev->dev);
 
 	dev_set_drvdata(dev, NULL);
 
@@ -421,8 +422,7 @@ static int davinci_mdio_suspend(struct device *dev)
 	__raw_writel(ctrl, &data->regs->control);
 	wait_for_idle(data);
 
-	if (data->clk)
-		clk_disable(data->clk);
+	pm_runtime_put_sync(data->dev);
 
 	data->suspended = true;
 	spin_unlock(&data->lock);
@@ -436,8 +436,7 @@ static int davinci_mdio_resume(struct device *dev)
 	u32 ctrl;
 
 	spin_lock(&data->lock);
-	if (data->clk)
-		clk_enable(data->clk);
+	pm_runtime_get_sync(data->dev);
 
 	/* restart the scan state machine */
 	ctrl = __raw_readl(&data->regs->control);
-- 
1.7.0.4

* [PATCH 2/2] driver: net: ethernet: cpsw: runtime PM support
From: Mugunthan V N @ 2012-07-17 18:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N
In-Reply-To: <1342548590-12502-1-git-send-email-mugunthanvnm@ti.com>

Enable runtime PM support for the cpsw driver.

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index ca381d3..1e5d85b 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -27,6 +27,7 @@
 #include <linux/phy.h>
 #include <linux/workqueue.h>
 #include <linux/delay.h>
+#include <linux/pm_runtime.h>
 
 #include <linux/platform_data/cpsw.h>
 
@@ -494,11 +495,7 @@ static int cpsw_ndo_open(struct net_device *ndev)
 	cpsw_intr_disable(priv);
 	netif_carrier_off(ndev);
 
-	ret = clk_enable(priv->clk);
-	if (ret < 0) {
-		dev_err(priv->dev, "unable to turn on device clock\n");
-		return ret;
-	}
+	pm_runtime_get_sync(&priv->pdev->dev);
 
 	reg = __raw_readl(&priv->regs->id_ver);
 
@@ -569,7 +566,7 @@ static int cpsw_ndo_stop(struct net_device *ndev)
 	netif_carrier_off(priv->ndev);
 	cpsw_ale_stop(priv->ale);
 	for_each_slave(priv, cpsw_slave_stop, priv);
-	clk_disable(priv->clk);
+	pm_runtime_put_sync(&priv->pdev->dev);
 	return 0;
 }
 
@@ -763,10 +760,12 @@ static int __devinit cpsw_probe(struct platform_device *pdev)
 	for (i = 0; i < data->slaves; i++)
 		priv->slaves[i].slave_num = i;
 
-	priv->clk = clk_get(&pdev->dev, NULL);
+	pm_runtime_enable(&pdev->dev);
+	priv->clk = clk_get(&pdev->dev, "fck");
 	if (IS_ERR(priv->clk)) {
-		dev_err(priv->dev, "failed to get device clock)\n");
-		ret = -EBUSY;
+		dev_err(&pdev->dev, "fck is not found\n");
+		ret = -ENODEV;
+		goto clean_slave_ret;
 	}
 
 	priv->cpsw_res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -935,6 +934,8 @@ clean_cpsw_iores_ret:
 			   resource_size(priv->cpsw_res));
 clean_clk_ret:
 	clk_put(priv->clk);
+clean_slave_ret:
+	pm_runtime_disable(&pdev->dev);
 	kfree(priv->slaves);
 clean_ndev_ret:
 	free_netdev(ndev);
@@ -959,6 +960,7 @@ static int __devexit cpsw_remove(struct platform_device *pdev)
 			   resource_size(priv->cpsw_res));
 	release_mem_region(priv->cpsw_ss_res->start,
 			   resource_size(priv->cpsw_ss_res));
+	pm_runtime_disable(&pdev->dev);
 	clk_put(priv->clk);
 	kfree(priv->slaves);
 	free_netdev(ndev);
@@ -973,6 +975,8 @@ static int cpsw_suspend(struct device *dev)
 
 	if (netif_running(ndev))
 		cpsw_ndo_stop(ndev);
+	pm_runtime_put_sync(&pdev->dev);
+
 	return 0;
 }
 
@@ -981,6 +985,7 @@ static int cpsw_resume(struct device *dev)
 	struct platform_device	*pdev = to_platform_device(dev);
 	struct net_device	*ndev = platform_get_drvdata(pdev);
 
+	pm_runtime_get_sync(&pdev->dev);
 	if (netif_running(ndev))
 		cpsw_ndo_open(ndev);
 	return 0;
-- 
1.7.0.4

* [PATCH 0/2] runtime PM support for cpsw and davinci mdio drivers
From: Mugunthan V N @ 2012-07-17 18:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, Mugunthan V N

This patch set adds runtime PM support for the CPSW and Davinci MDIO
drivers.

Mugunthan V N (2):
  driver: net: ethernet: davinci_mdio: runtime PM support
  driver: net: ethernet: cpsw: runtime PM support

 drivers/net/ethernet/ti/cpsw.c         |   23 ++++++++++++++---------
 drivers/net/ethernet/ti/davinci_mdio.c |   25 ++++++++++++-------------
 2 files changed, 26 insertions(+), 22 deletions(-)

* [PATCH net-next 10/10] sfc: Correct some comments on enum reset_type
From: Ben Hutchings @ 2012-07-17 18:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342544740.2698.13.camel@bwh-desktop.uk.solarflarecom.com>

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/enum.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/enum.h b/drivers/net/ethernet/sfc/enum.h
index d725a8f..182dbe2 100644
--- a/drivers/net/ethernet/sfc/enum.h
+++ b/drivers/net/ethernet/sfc/enum.h
@@ -136,10 +136,10 @@ enum efx_loopback_mode {
  *
  * Reset methods are numbered in order of increasing scope.
  *
- * @RESET_TYPE_INVISIBLE: don't reset the PHYs or interrupts
- * @RESET_TYPE_ALL: reset everything but PCI core blocks
- * @RESET_TYPE_WORLD: reset everything, save & restore PCI config
- * @RESET_TYPE_DISABLE: disable NIC
+ * @RESET_TYPE_INVISIBLE: Reset datapath and MAC (Falcon only)
+ * @RESET_TYPE_ALL: Reset datapath, MAC and PHY
+ * @RESET_TYPE_WORLD: Reset as much as possible
+ * @RESET_TYPE_DISABLE: Reset datapath, MAC and PHY; leave NIC disabled
  * @RESET_TYPE_TX_WATCHDOG: reset due to TX watchdog
  * @RESET_TYPE_INT_ERROR: reset due to internal error
  * @RESET_TYPE_RX_RECOVERY: reset to recover from RX datapath errors
-- 
1.7.7.6


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* [PATCH net-next 09/10] sfc: Fix interface statistics running backward
From: Ben Hutchings @ 2012-07-17 18:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1342544740.2698.13.camel@bwh-desktop.uk.solarflarecom.com>

Some interface statistics are computed in such a way that they can
sometimes decrease (and even underflow).  Since the computed value
will never be greater than the true value, we fix this by only storing
the computed value when it increases.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/falcon_xmac.c |   12 ++++++------
 drivers/net/ethernet/sfc/nic.h         |   18 ++++++++++++++++++
 drivers/net/ethernet/sfc/siena.c       |    8 ++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/sfc/falcon_xmac.c b/drivers/net/ethernet/sfc/falcon_xmac.c
index 6106ef1..8333865 100644
--- a/drivers/net/ethernet/sfc/falcon_xmac.c
+++ b/drivers/net/ethernet/sfc/falcon_xmac.c
@@ -341,12 +341,12 @@ void falcon_update_stats_xmac(struct efx_nic *efx)
 	FALCON_STAT(efx, XgTxIpSrcErrPkt, tx_ip_src_error);
 
 	/* Update derived statistics */
-	mac_stats->tx_good_bytes =
-		(mac_stats->tx_bytes - mac_stats->tx_bad_bytes -
-		 mac_stats->tx_control * 64);
-	mac_stats->rx_bad_bytes =
-		(mac_stats->rx_bytes - mac_stats->rx_good_bytes -
-		 mac_stats->rx_control * 64);
+	efx_update_diff_stat(&mac_stats->tx_good_bytes,
+			     mac_stats->tx_bytes - mac_stats->tx_bad_bytes -
+			     mac_stats->tx_control * 64);
+	efx_update_diff_stat(&mac_stats->rx_bad_bytes,
+			     mac_stats->rx_bytes - mac_stats->rx_good_bytes -
+			     mac_stats->rx_control * 64);
 }
 
 void falcon_poll_xmac(struct efx_nic *efx)
diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
index f48ccf6..bab5cd9 100644
--- a/drivers/net/ethernet/sfc/nic.h
+++ b/drivers/net/ethernet/sfc/nic.h
@@ -294,6 +294,24 @@ extern bool falcon_xmac_check_fault(struct efx_nic *efx);
 extern int falcon_reconfigure_xmac(struct efx_nic *efx);
 extern void falcon_update_stats_xmac(struct efx_nic *efx);
 
+/* Some statistics are computed as A - B where A and B each increase
+ * linearly with some hardware counter(s) and the counters are read
+ * asynchronously.  If the counters contributing to B are always read
+ * after those contributing to A, the computed value may be lower than
+ * the true value by some variable amount, and may decrease between
+ * subsequent computations.
+ *
+ * We should never allow statistics to decrease or to exceed the true
+ * value.  Since the computed value will never be greater than the
+ * true value, we can achieve this by only storing the computed value
+ * when it increases.
+ */
+static inline void efx_update_diff_stat(u64 *stat, u64 diff)
+{
+	if ((s64)(diff - *stat) > 0)
+		*stat = diff;
+}
+
 /* Interrupts and test events */
 extern int efx_nic_init_interrupt(struct efx_nic *efx);
 extern void efx_nic_enable_interrupts(struct efx_nic *efx);
diff --git a/drivers/net/ethernet/sfc/siena.c b/drivers/net/ethernet/sfc/siena.c
index 2354886..6bafd21 100644
--- a/drivers/net/ethernet/sfc/siena.c
+++ b/drivers/net/ethernet/sfc/siena.c
@@ -458,8 +458,8 @@ static int siena_try_update_nic_stats(struct efx_nic *efx)
 
 	MAC_STAT(tx_bytes, TX_BYTES);
 	MAC_STAT(tx_bad_bytes, TX_BAD_BYTES);
-	mac_stats->tx_good_bytes = (mac_stats->tx_bytes -
-				    mac_stats->tx_bad_bytes);
+	efx_update_diff_stat(&mac_stats->tx_good_bytes,
+			     mac_stats->tx_bytes - mac_stats->tx_bad_bytes);
 	MAC_STAT(tx_packets, TX_PKTS);
 	MAC_STAT(tx_bad, TX_BAD_FCS_PKTS);
 	MAC_STAT(tx_pause, TX_PAUSE_PKTS);
@@ -492,8 +492,8 @@ static int siena_try_update_nic_stats(struct efx_nic *efx)
 	MAC_STAT(tx_ip_src_error, TX_IP_SRC_ERR_PKTS);
 	MAC_STAT(rx_bytes, RX_BYTES);
 	MAC_STAT(rx_bad_bytes, RX_BAD_BYTES);
-	mac_stats->rx_good_bytes = (mac_stats->rx_bytes -
-				    mac_stats->rx_bad_bytes);
+	efx_update_diff_stat(&mac_stats->rx_good_bytes,
+			     mac_stats->rx_bytes - mac_stats->rx_bad_bytes);
 	MAC_STAT(rx_packets, RX_PKTS);
 	MAC_STAT(rx_good, RX_GOOD_PKTS);
 	MAC_STAT(rx_bad, RX_BAD_FCS_PKTS);
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
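
The monotonic-update rule described in the patch above can be tried outside
the kernel.  What follows is a minimal userspace sketch, not the driver code:
it uses <stdint.h> types in place of the kernel's u64/s64, and the names
update_diff_stat() and replay_samples() are hypothetical, chosen for
illustration only.  It assumes the usual two's-complement behaviour when a
uint64_t is converted to int64_t.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store the newly computed difference only when it is ahead of the
 * stored value.  The subtraction is done in unsigned (wrapping)
 * arithmetic and then viewed as signed, so the comparison still works
 * if the counter wraps around.
 */
static inline void update_diff_stat(uint64_t *stat, uint64_t diff)
{
	if ((int64_t)(diff - *stat) > 0)
		*stat = diff;
}

/*
 * Hypothetical helper: replay a series of computed values, starting
 * from zero, and return what would end up stored.
 */
static inline uint64_t replay_samples(const uint64_t *samples, size_t n)
{
	uint64_t stat = 0;
	size_t i;

	for (i = 0; i < n; i++)
		update_diff_stat(&stat, samples[i]);
	return stat;
}
```

Replaying the computed values 100, 90, 120, 110 leaves 120 stored: the dips
to 90 and 110 are ignored, so a reader of the statistic never sees it go
backward.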

* RE: [Upstream PATCH 0/2] runtime PM support for cpsw and davinci mdio
From: N, Mugunthan V @ 2012-07-17 18:07 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: davem@davemloft.net
In-Reply-To: <1342548353-12153-1-git-send-email-mugunthanvnm@ti.com>

> -----Original Message-----
> From: N, Mugunthan V
> Sent: Tuesday, July 17, 2012 11:36 PM
> To: netdev@vger.kernel.org
> Cc: davem@davemloft.net; N, Mugunthan V
> Subject: [Upstream PATCH 0/2] runtime PM support for cpsw and davinci mdio
> 
> This patch set adds runtime PM support for the CPSW and Davinci MDIO
> drivers
> 
> Mugunthan V N (2):
>   driver: net: ethernet: davinci_mdio: runtime PM support
>   driver: net: ethernet: cpsw: runtime PM support
> 
>  drivers/net/ethernet/ti/cpsw.c         |   23 ++++++++++++++---------
>  drivers/net/ethernet/ti/davinci_mdio.c |   25 ++++++++++++-------------
>  2 files changed, 26 insertions(+), 22 deletions(-)

Please ignore this patch series as it was sent by mistake.
