Netdev List


* Re: New commands to configure IOV features
From: Chris Friesen @ 2012-07-17 21:08 UTC (permalink / raw)
  To: Don Dutile
  Cc: Yuval Mintz, davem@davemloft.net, Ben Hutchings, Greg Rose,
	netdev@vger.kernel.org, linux-pci
In-Reply-To: <5005BD00.4090106@redhat.com>

On 07/17/2012 01:29 PM, Don Dutile wrote:

> WRT SRIOV-nic devices, the thinking goes that protocol-level
> parameters associated with VFs should use protocol-specific interfaces,
> e.g., ethtool, ip link set, etc. for Ethernet VFs.
> Thus, the various protocol control functions/tools should
> be used to control VF parameters, as one would for a physical device
> of that protocol/class.

It seems to me that the mere act of creating one or more VFs is 
something generic, applicable to all devices that are capable of it. 
The details of configuring those VFs can then be handled by 
protocol-specific interfaces.

I'm not too worried about the exact mechanism of doing it, as long as 
it's ultimately scriptable--that is, if it's a C API then it would be 
appreciated if there was a standard tool to call that implements it. 
From that perspective a sysfs-based interface is ideal since it is 
directly scriptable.

Chris

^ permalink raw reply

* [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: Paul Moore @ 2012-07-17 21:07 UTC (permalink / raw)
  To: netdev

As reported by Alan Cox, and verified by Lin Ming, when a user
attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
tag the kernel dies a terrible death when it attempts to follow a NULL
pointer (the skb argument to cipso_v4_validate() is NULL when called via
the setsockopt() syscall).

This patch fixes this by first checking to ensure that the skb is
non-NULL before using it to find the incoming network interface.  In
the unlikely case where the skb is NULL and the user attempts to add
a CIPSO option with the _TAG_LOCAL tag we return an error as this is
not something we want to allow.
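
The check described above relies on C's short-circuit evaluation: the
NULL test runs first, so the dereference is never reached for a NULL skb.
A minimal userspace sketch follows. The demo_* types, the flag value and
the -EINVAL return are stand-ins invented for illustration; the kernel
function reports an error offset instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for IFF_LOOPBACK; the value is illustrative only. */
#define DEMO_IFF_LOOPBACK 0x8u

/* Hypothetical, stripped-down stand-ins for net_device and sk_buff. */
struct demo_dev {
	unsigned int flags;
};

struct demo_skb {
	const struct demo_dev *dev;
};

/*
 * Return 0 if a local tag may be accepted, -EINVAL otherwise.
 * A NULL skb models the setsockopt() path, where no packet exists.
 */
int demo_check_local_tag(const struct demo_skb *skb)
{
	/* Precondition of this sketch: a non-NULL skb always has a device. */
	assert(skb == NULL || skb->dev != NULL);

	/* "||" short-circuits, so skb->dev is only read when skb != NULL. */
	if (skb == NULL || !(skb->dev->flags & DEMO_IFF_LOOPBACK))
		return -EINVAL;

	return 0;
}
```

Calling demo_check_local_tag(NULL) returns -EINVAL without touching any
pointer, which is the behaviour the patch aims for on the userspace path.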

A simple reproducer, kindly supplied by Lin Ming, follows.  Note that
you must have CIPSO DOI #3 configured on the system first, or you will
be caught early in cipso_v4_validate():

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <linux/ip.h>
	#include <linux/in.h>
	#include <string.h>

	struct local_tag {
		char type;
		char length;
		char info[4];
	};

	struct cipso {
		char type;
		char length;
		char doi[4];
		struct local_tag local;
	};

	int main(int argc, char **argv)
	{
		int sockfd;
		struct cipso cipso = {
			.type = IPOPT_CIPSO,
			.length = sizeof(struct cipso),
			.local = {
				.type = 128,
				.length = sizeof(struct local_tag),
			},
		};

		memset(cipso.doi, 0, 4);
		cipso.doi[3] = 3;

		sockfd = socket(AF_INET, SOCK_DGRAM, 0);
		#define SOL_IP 0
		setsockopt(sockfd, SOL_IP, IP_OPTIONS,
			&cipso, sizeof(struct cipso));

		return 0;
	}

CC: Lin Ming <mlin@ss.pku.edu.cn>
Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 net/ipv4/cipso_ipv4.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index c48adc5..667c1d4 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1725,8 +1725,10 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
 		case CIPSO_V4_TAG_LOCAL:
 			/* This is a non-standard tag that we only allow for
 			 * local connections, so if the incoming interface is
-			 * not the loopback device drop the packet. */
-			if (!(skb->dev->flags & IFF_LOOPBACK)) {
+			 * not the loopback device drop the packet. Further,
+			 * there is no legitimate reason for setting this from
+			 * userspace so reject it if skb is NULL. */
+			if (skb == NULL || !(skb->dev->flags & IFF_LOOPBACK)) {
 				err_offset = opt_iter;
 				goto validate_return_locked;
 			}

^ permalink raw reply related

* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Vijay Subramanian @ 2012-07-17 21:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <1342525290.2626.459.camel@edumazet-glaptop>

On 17 July 2012 04:41, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Implement the RFC 5961 mitigation against Blind
> Reset attack using SYN bit.
>
> Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
> incoming packet, instead of resetting the session.

Eric,
Section 4.2 has this to say:
"If the SYN bit is set, irrespective of the sequence number, TCP
      MUST send an ACK (also referred to as challenge ACK) to the remote
      peer:"

I believe your patch only sends challenge acks for in-window SYN packets.
After this patch, the code for out of window packets is like this:

        /* Step 1: check sequence number */
        if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
                /* RFC793, page 37: "In all states except SYN-SENT, all reset
                 * (RST) segments are validated by checking their SEQ-fields."
                 * And page 69: "If an incoming segment is not acceptable,
                 * an acknowledgment should be sent in reply (unless the RST
                 * bit is set, if so drop the segment and return)".
                 */
                if (!th->rst)
                        tcp_send_dupack(sk, skb);
                goto discard;
        }


For SYN packets that are not in window, we do end up calling
tcp_send_dupack() but not tcp_send_challenge_ack().  Would it be more
appropriate to call the latter, so that we rate-limit challenge ACKs
properly and update the SNMP counters correctly?
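
To illustrate what "rate limiting plus counters" could mean here, this is
a rough userspace sketch of a per-second budget for challenge ACKs. It is
not the kernel's tcp_send_challenge_ack(); the struct, the field names and
the limit are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limiter state; none of these names come from the kernel. */
struct challenge_limiter {
	uint32_t window_start; /* second in which the current window began */
	uint32_t sent;         /* challenge ACKs sent in that window */
	uint32_t limit;        /* maximum ACKs per second (assumed tunable) */
	uint64_t total_sent;   /* stand-in for an SNMP-style counter */
};

/*
 * Decide whether one more challenge ACK may be sent at time now_sec.
 * Returns true and updates the counters when the budget allows it.
 */
bool challenge_ack_allowed(struct challenge_limiter *l, uint32_t now_sec)
{
	assert(l != NULL);

	/* A new second starts a fresh budget. */
	if (now_sec != l->window_start) {
		l->window_start = now_sec;
		l->sent = 0;
	}

	if (l->sent >= l->limit)
		return false;

	l->sent++;
	l->total_sent++;
	return true;
}
```

Routing both in-window and out-of-window SYNs through one such helper
would keep the budget and the counter consistent, which is the point of
the question above.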

Thanks,
Vijay

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: David Miller @ 2012-07-17 21:02 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <5005D008.6060103@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 13:50:16 -0700

> On 7/17/2012 12:24 PM, David Miller wrote:
>> From: John Fastabend <john.r.fastabend@intel.com>
>> Date: Tue, 17 Jul 2012 12:09:53 -0700
>>
>>> although we don't have an early_init hook for netprio_cgroup so this
>>> is probably not correct.
>>
>> The dependency is actually on net_dev_init (a subsys_initcall) rather
>> than a pure_initcall.
>>
>> net_dev_init is what registers the netdev_net_ops, which in turn
>> initializes the netdev list in namespaces such as &init_net
>>
> 
> Ah right thanks sorry for the thrash. I guess we need to check if the
> netdev list in the init_net namespace is initialized.

It's a hack, but we could export and then test dev_boot_phase == 0,
and if that test is true then skip the init_net device walk in the
cgroup code.

But I don't like that very much.

The things this code cares about can't even be an issue until
net_dev_init() runs.

There is a comment warning not to do this in linux/init.h, but we
could change the module_init() in netprio_cgroup.c to some level which
runs after subsys_initcall().  When built as a module, linux/init.h
will translate this into module_init() which is basically the behavior
we want.

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 20:50 UTC (permalink / raw)
  To: David Miller; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <20120717.122459.2240133900020140698.davem@davemloft.net>

On 7/17/2012 12:24 PM, David Miller wrote:
> From: John Fastabend <john.r.fastabend@intel.com>
> Date: Tue, 17 Jul 2012 12:09:53 -0700
>
>> although we don't have an early_init hook for netprio_cgroup so this
>> is probably not correct.
>
> The dependency is actually on net_dev_init (a subsys_initcall) rather
> than a pure_initcall.
>
> net_dev_init is what registers the netdev_net_ops, which in turn
> initializes the netdev list in namespaces such as &init_net
>

Ah right thanks sorry for the thrash. I guess we need to check if the
netdev list in the init_net namespace is initialized.

^ permalink raw reply

* Re: [PATCH net-next] ipv4: fix rcu splat
From: David Miller @ 2012-07-17 20:49 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1342557733.2626.1103.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 Jul 2012 22:42:13 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> free_nh_exceptions() should use rcu_dereference_protected(..., 1)
> since it's called after one RCU grace period.
> 
> Also add some const-ification in recent code.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: John Fastabend @ 2012-07-17 20:47 UTC (permalink / raw)
  To: Gao feng
  Cc: nhorman, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
	Rustad, Mark D
In-Reply-To: <1342079415-9631-1-git-send-email-gaofeng@cn.fujitsu.com>

On 7/12/2012 12:50 AM, Gao feng wrote:
> There are some out-of-bounds accesses in netprio cgroup.
>
> Currently, before accessing the dev->priomap.priomap array, we only
> check whether dev->priomap exists. Because we don't want additional
> bounds checks in the fast path, we should make sure that
> dev->priomap is either NULL or that the size of the
> dev->priomap.priomap array is equal to max_prioidx + 1.
>
> So in the write_priomap logic, we should call extend_netdev_table when
> dev->priomap is NULL or dev->priomap.priomap_len < max_len, and in the
> cgrp_create->update_netdev_tables logic, we should call
> extend_netdev_table only when dev->priomap exists and
> dev->priomap.priomap_len < max_len.
>
> It is not necessary to call update_netdev_tables in write_priomap;
> we can allocate only the priomap of the net device that we change
> through net_prio.ifpriomap.
>
> This patch also adds a return value to update_netdev_tables and
> extend_netdev_table, so that when allocating new_priomap fails,
> write_priomap stops accessing the priomap and returns -ENOMEM
> to userspace to tell the user what happened.
>
> Change From v3:
> 1. add rtnl protect when reading max_prioidx in write_priomap.
>
> 2. only call extend_netdev_table when map->priomap_len < max_len,
>     this will make sure array size of dev->map->priomap always
>     bigger than any prioidx.
>
> 3. add a function write_update_netdev_table to make codes clear.
>
> Change From v2:
> 1. protect extend_netdev_table by RTNL.
> 2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
>
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Eric Dumazet <edumazet@google.com>
> ---
>   net/core/netprio_cgroup.c |   71 ++++++++++++++++++++++++++++++++++-----------
>   1 files changed, 54 insertions(+), 17 deletions(-)
>

[...]

> +
> +static int update_netdev_tables(void)
> +{
> +	int ret = 0;
>   	struct net_device *dev;
> -	u32 max_len = atomic_read(&max_prioidx) + 1;
> +	u32 max_len;
>   	struct netprio_map *map;


need to check if net subsystem is initialized before we try
to use it here...

	if (some_check)     -> need to lookup what this check is
		return ret;

>
>   	rtnl_lock();
> +	max_len = atomic_read(&max_prioidx) + 1;
>   	for_each_netdev(&init_net, dev) {
>   		map = rtnl_dereference(dev->priomap);
> -		if ((!map) ||
> -		    (map->priomap_len < max_len))
> -			extend_netdev_table(dev, max_len);
> +		/*
> +		 * don't allocate priomap if we didn't
> +		 * change net_prio.ifpriomap (map == NULL),
> +		 * this will speed up skb_update_prio.
> +		 */
> +		if (map && map->priomap_len < max_len) {
> +			ret = extend_netdev_table(dev, max_len);
> +			if (ret < 0)
> +				break;
> +		}
>   	}
>   	rtnl_unlock();
> +	return ret;
>   }
>
>   static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
>   {
>   	struct cgroup_netprio_state *cs;
> -	int ret;
> +	int ret = -EINVAL;
>
>   	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
>   	if (!cs)
>   		return ERR_PTR(-ENOMEM);
>
> -	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> -		kfree(cs);
> -		return ERR_PTR(-EINVAL);
> -	}
> +	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> +		goto out;
>
>   	ret = get_prioidx(&cs->prioidx);
> -	if (ret != 0) {
> +	if (ret < 0) {
>   		pr_warn("No space in priority index array\n");
> -		kfree(cs);
> -		return ERR_PTR(ret);
> +		goto out;
> +	}
> +
> +	ret = update_netdev_tables();
> +	if (ret < 0) {
> +		put_prioidx(cs->prioidx);
> +		goto out;
>   	}

Gao,

This introduces a null ptr dereference when netprio_cgroup is built
into the kernel because update_netdev_tables() depends on init_net.
However cgrp_create is being called by cgroup_init before
do_initcalls() is called and before net_dev_init().

.John

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: David Miller @ 2012-07-17 20:46 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1207172249270.1831@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Tue, 17 Jul 2012 23:41:16 +0300 (EEST)

> 	IIRC, struct fib_info was shared by different
> prefixes. It saves a lot of memory when thousands of
> routes are created to the same GW. Now if we end up with 1 or
> 2 fib_info structures for default routes, the nh_exceptions list
> can become very long. Maybe fib_info is not a good place
> to hide such data.

Your analysis of what fib_info is and how it's intended to
work is accurate.

But we don't use a linked list for the exceptions in the final
version, we use a reclaiming RCU'd hash table like we use for TCP
metrics.

See the updated version of patch #5 and what I actually committed to
net-next.

^ permalink raw reply

* [PATCH net-next] ipv4: fix rcu splat
From: Eric Dumazet @ 2012-07-17 20:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

free_nh_exceptions() should use rcu_dereference_protected(..., 1)
since it's called after one RCU grace period.

Also add some const-ification in recent code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/fib_semantics.c        |    4 ++--
 net/ipv4/inet_connection_sock.c |    4 ++--
 net/ipv4/route.c                |   13 +++++++------
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1e09852..2b57d76 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -148,11 +148,11 @@ static void free_nh_exceptions(struct fib_nh *nh)
 	for (i = 0; i < FNHE_HASH_SIZE; i++) {
 		struct fib_nh_exception *fnhe;
 
-		fnhe = rcu_dereference(hash[i].chain);
+		fnhe = rcu_dereference_protected(hash[i].chain, 1);
 		while (fnhe) {
 			struct fib_nh_exception *next;
 			
-			next = rcu_dereference(fnhe->fnhe_next);
+			next = rcu_dereference_protected(fnhe->fnhe_next, 1);
 			kfree(fnhe);
 
 			fnhe = next;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 89e74a3..68bb5a6 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -806,8 +806,8 @@ EXPORT_SYMBOL_GPL(inet_csk_compat_setsockopt);
 
 static struct dst_entry *inet_csk_rebuild_route(struct sock *sk, struct flowi *fl)
 {
-	struct inet_sock *inet = inet_sk(sk);
-	struct ip_options_rcu *inet_opt;
+	const struct inet_sock *inet = inet_sk(sk);
+	const struct ip_options_rcu *inet_opt;
 	__be32 daddr = inet->inet_daddr;
 	struct flowi4 *fl4;
 	struct rtable *rt;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 812e444..f67e702 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1275,7 +1275,7 @@ static void rt_del(unsigned int hash, struct rtable *rt)
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 }
 
-static void __build_flow_key(struct flowi4 *fl4, struct sock *sk,
+static void __build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 			     const struct iphdr *iph,
 			     int oif, u8 tos,
 			     u8 prot, u32 mark, int flow_flags)
@@ -1294,7 +1294,8 @@ static void __build_flow_key(struct flowi4 *fl4, struct sock *sk,
 			   iph->daddr, iph->saddr, 0, 0);
 }
 
-static void build_skb_flow_key(struct flowi4 *fl4, struct sk_buff *skb, struct sock *sk)
+static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
+			       const struct sock *sk)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	int oif = skb->dev->ifindex;
@@ -1305,10 +1306,10 @@ static void build_skb_flow_key(struct flowi4 *fl4, struct sk_buff *skb, struct s
 	__build_flow_key(fl4, sk, iph, oif, tos, prot, mark, 0);
 }
 
-static void build_sk_flow_key(struct flowi4 *fl4, struct sock *sk)
+static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
 {
 	const struct inet_sock *inet = inet_sk(sk);
-	struct ip_options_rcu *inet_opt;
+	const struct ip_options_rcu *inet_opt;
 	__be32 daddr = inet->inet_daddr;
 
 	rcu_read_lock();
@@ -1323,8 +1324,8 @@ static void build_sk_flow_key(struct flowi4 *fl4, struct sock *sk)
 	rcu_read_unlock();
 }
 
-static void ip_rt_build_flow_key(struct flowi4 *fl4, struct sock *sk,
-				 struct sk_buff *skb)
+static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
+				 const struct sk_buff *skb)
 {
 	if (skb)
 		build_skb_flow_key(fl4, skb, sk);

^ permalink raw reply related

* Re: [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Joe Perches @ 2012-07-17 20:36 UTC (permalink / raw)
  To: Saurabh; +Cc: netdev
In-Reply-To: <20120717194449.GA3350@debian-saurabh-64.vyatta.com>

On Tue, 2012-07-17 at 12:44 -0700, Saurabh wrote:
> Incorporated David and Steffen's comments.
> Add hook for rx-path xfrm4_mode_tunnel for VTI tunnel module.
[]
> diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
[]
> +int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
> +{
> +	struct xfrm_tunnel __rcu **pprev;
> +	struct xfrm_tunnel *t;
> +	int ret = -EEXIST;
> +	int priority = handler->priority;
> +
> +	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
> +
> +	for (pprev = &rcv_notify_handlers;
> +	     (t = rcu_dereference_protected(*pprev,
> +	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
> +	     pprev = &t->next) {
> +		if (t->priority > priority)
> +			break;
> +		if (t->priority == priority)
> +			goto err;
> +
> +	}
> +
> +	handler->next = *pprev;
> +	rcu_assign_pointer(*pprev, handler);
> +
> +	ret = 0;
> +
> +err:
> +	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);

Isn't the multiple indirection of **pprev unnecessary?
Perhaps something like this is simpler and easier to read?

int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
{
	struct xfrm_tunnel __rcu *prev;
	struct xfrm_tunnel *t;
	int ret = -EEXIST;
	int priority = handler->priority;

	mutex_lock(&xfrm4_mode_tunnel_input_mutex);

	prev = rcv_notify_handlers;
	while ((t = rcu_dereference_protected(prev,
			lockdep_is_held(&xfrm4_mode_tunnel_input_mutex)))) {
		if (t->priority > priority)
			break;
		if (t->priority == priority)
			goto err;
		prev = t->next;
	}

	handler->next = prev;
	rcu_assign_pointer(prev, handler);

	ret = 0;

err:
	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
	return ret;
}
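
For comparison, here is a userspace sketch of the pointer-to-pointer walk
used in the quoted patch, with RCU and the mutex left out. The
demo_handler type and demo_register() are hypothetical names. The write
through *pprev lands either in the list head or in the predecessor's next
field, so the new node becomes reachable from the list itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct xfrm_tunnel: a priority-sorted list. */
struct demo_handler {
	struct demo_handler *next;
	int priority;
};

/*
 * Insert handler into the list at *head, keeping ascending priority.
 * Returns 0 on success, -EEXIST if the priority is already registered.
 */
int demo_register(struct demo_handler **head, struct demo_handler *handler)
{
	struct demo_handler **pprev;
	struct demo_handler *t;

	assert(head != NULL && handler != NULL);

	/* pprev always points at the link that would have to be rewritten. */
	for (pprev = head; (t = *pprev) != NULL; pprev = &t->next) {
		if (t->priority > handler->priority)
			break;
		if (t->priority == handler->priority)
			return -EEXIST;
	}

	handler->next = *pprev;
	*pprev = handler; /* updates the head or the previous node's next */
	return 0;
}
```

Registering priorities 10, 5 and 7 in that order yields the list
5 -> 7 -> 10, and a second handler with priority 7 is rejected with
-EEXIST.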

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: Julian Anastasov @ 2012-07-17 20:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120717.061418.1893307699868826531.davem@davemloft.net>


	Hello,

On Tue, 17 Jul 2012, David Miller wrote:

> These patches implement the final mechanism necessary to really allow
> us to go without the route cache in ipv4.
> 
> We need a place to have long-term storage of PMTU/redirect information
> which is independent of the routes themselves, yet does not get us
> back into a situation where we have to write to metrics or anything
> like that.
> 
> For this we use a "next-hop exception" table in the FIB nexthops.
> 
> Currently it is a simple linked list and uses a single global lock
> for synchronization, but that can be easily adjusted as-needed.
> 
> The one thing I desperately want to avoid is having to create clone
> routes in the FIB trie for this purpose, because that is very
> expensive.  However, I'm willing to entertain such an idea later
> if this current scheme proves to have downsides that the FIB trie
> variant would not have.

	IIRC, struct fib_info was shared by different
prefixes. It saves a lot of memory when thousands of
routes are created to the same GW. Now if we end up with 1 or
2 fib_info structures for default routes, the nh_exceptions list
can become very long. Maybe fib_info is not a good place
to hide such data.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH] ipv4: Fix nexthop exception hash computation.
From: Eric Dumazet @ 2012-07-17 20:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120717.132350.1202093690532763592.davem@davemloft.net>

On Tue, 2012-07-17 at 13:23 -0700, David Miller wrote:
> Need to mask it with (FNHE_HASH_SIZE - 1).
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---

OK, I have a small patch too, sending in a minute.

^ permalink raw reply

* [PATCH] ipv4: Fix nexthop exception hash computation.
From: David Miller @ 2012-07-17 20:23 UTC (permalink / raw)
  To: netdev


Need to mask it with (FNHE_HASH_SIZE - 1).

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |   16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a5bd0b4..812e444 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1347,6 +1347,16 @@ static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be3
 	return oldest;
 }
 
+static inline u32 fnhe_hashfun(__be32 daddr)
+{
+	u32 hval;
+
+	hval = (__force u32) daddr;
+	hval ^= (hval >> 11) ^ (hval >> 22);
+
+	return hval & (FNHE_HASH_SIZE - 1);
+}
+
 static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 daddr)
 {
 	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
@@ -1361,8 +1371,7 @@ static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 da
 			return NULL;
 	}
 
-	hval = (__force u32) daddr;
-	hval ^= (hval >> 11) ^ (hval >> 22);
+	hval = fnhe_hashfun(daddr);
 	hash += hval;
 
 	depth = 0;
@@ -1890,8 +1899,7 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 	struct fib_nh_exception *fnhe;
 	u32 hval;
 
-	hval = (__force u32) daddr;
-	hval ^= (hval >> 11) ^ (hval >> 22);
+	hval = fnhe_hashfun(daddr);
 
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
-- 
1.7.10.4
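
As a quick illustration of why the mask matters, here is a userspace
sketch of the same mixing step. The demo_ names and the table size of
2048 are assumptions for this example; the only requirement is that the
size is a power of two, so that size - 1 works as a bit mask.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed table size for this sketch; must be a power of two. */
#define DEMO_FNHE_HASH_SIZE 2048u

/* Fold the address bits together, then clamp to a valid bucket index. */
uint32_t demo_fnhe_hashfun(uint32_t daddr)
{
	uint32_t hval = daddr;

	/* The mask below is only a valid modulo for power-of-two sizes. */
	assert((DEMO_FNHE_HASH_SIZE & (DEMO_FNHE_HASH_SIZE - 1)) == 0);

	hval ^= (hval >> 11) ^ (hval >> 22);

	/* Without this mask hval can be any 32-bit value. */
	return hval & (DEMO_FNHE_HASH_SIZE - 1);
}
```

Without the final mask, an address such as 0xc0a80001 produces a value far
beyond the table size, and indexing hash[hval] would run off the end of
the array.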

^ permalink raw reply related

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: David Miller @ 2012-07-17 20:20 UTC (permalink / raw)
  To: brking
  Cc: rick.jones2, cascardo, netdev, yevgenyp, ogerlitz, amirv, leitao,
	klebers
In-Reply-To: <5005C6A0.50002@linux.vnet.ibm.com>

From: Brian King <brking@linux.vnet.ibm.com>
Date: Tue, 17 Jul 2012 15:10:08 -0500

> On 07/17/2012 01:17 PM, Rick Jones wrote:
>> On 07/16/2012 10:29 PM, David Miller wrote:
>>> From: Rick Jones <rick.jones2@hp.com> Date: Mon, 16 Jul 2012
>>> 10:27:57 -0700
>>> 
>>>> That seems rather extraordinarily low - Power7 is supposed to be
>>>> a rather high performance CPU.  The last time I noticed
>>>> O(3Gbit/s) on 10G for bulk transfer was before the advent of
>>>> LRO/GRO - that was in the x86 space though.  Is mapping really
>>>> that expensive with Power7?
>>> 
>>> Unfortunately, IOMMU mappings are incredibly expensive.  I see
>>> effects like this on Sparc too.
>> 
>> OK, so that has caused some dimm memory to get a small refresh - it
>> ends up being akin to if not actually a PIO yes?  I recall schemes in
>> drivers in other stacks whereby "small" packets were copied because
>> it was cheaper to allocate/copy than it was to remap.
> 
> On Power it ends up being an hcall to the hypervisor

This is true on sparc64 niagara systems as well.

^ permalink raw reply

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Brian King @ 2012-07-17 20:10 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, cascardo@linux.vnet.ibm.com, netdev@vger.kernel.org,
	yevgenyp@mellanox.co.il, ogerlitz@mellanox.com,
	amirv@mellanox.com, leitao@linux.vnet.ibm.com,
	klebers@linux.vnet.ibm.com
In-Reply-To: <5005AC4A.9030208@hp.com>

On 07/17/2012 01:17 PM, Rick Jones wrote:
> On 07/16/2012 10:29 PM, David Miller wrote:
>> From: Rick Jones <rick.jones2@hp.com> Date: Mon, 16 Jul 2012
>> 10:27:57 -0700
>> 
>>> That seems rather extraordinarily low - Power7 is supposed to be
>>> a rather high performance CPU.  The last time I noticed
>>> O(3Gbit/s) on 10G for bulk transfer was before the advent of
>>> LRO/GRO - that was in the x86 space though.  Is mapping really
>>> that expensive with Power7?
>> 
>> Unfortunately, IOMMU mappings are incredibly expensive.  I see
>> effects like this on Sparc too.
> 
> OK, so that has caused some dimm memory to get a small refresh - it
> ends up being akin to if not actually a PIO yes?  I recall schemes in
> drivers in other stacks whereby "small" packets were copied because
> it was cheaper to allocate/copy than it was to remap.

On Power it ends up being an hcall to the hypervisor

-Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center

^ permalink raw reply

* [net-next PATCH 02/02] net/ipv4: VTI support new module for ip_vti.
From: Saurabh @ 2012-07-17 19:44 UTC (permalink / raw)
  To: netdev



New VTI tunnel kernel module, Kconfig and Makefile changes.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 16b92d0..5efff60 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -80,4 +80,18 @@ enum {
 
 #define IFLA_GRE_MAX	(__IFLA_GRE_MAX - 1)
 
+/* VTI-mode i_flags */
+#define VTI_ISVTI 0x0001
+
+enum {
+	IFLA_VTI_UNSPEC,
+	IFLA_VTI_LINK,
+	IFLA_VTI_IKEY,
+	IFLA_VTI_OKEY,
+	IFLA_VTI_LOCAL,
+	IFLA_VTI_REMOTE,
+	__IFLA_VTI_MAX,
+};
+
+#define IFLA_VTI_MAX	(__IFLA_VTI_MAX - 1)
 #endif /* _IF_TUNNEL_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 20f1cb5..5a19aeb 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -310,6 +310,17 @@ config SYN_COOKIES
 
 	  If unsure, say N.
 
+config NET_IPVTI
+	tristate "Virtual (secure) IP: tunneling"
+	select INET_TUNNEL
+	depends on INET_XFRM_MODE_TUNNEL
+	---help---
+	  Tunneling means encapsulating data of one protocol type within
+	  another protocol and sending it over a channel that understands the
+	  encapsulating protocol. This can be used with xfrm mode tunnel to give
+	  the notion of a secure tunnel for IPSEC and then use routing protocol
+	  on top.
+
 config INET_AH
 	tristate "IP: AH transformation"
 	select XFRM_ALGO
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ff75d3b..3999ce9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IP_MROUTE) += ipmr.o
 obj-$(CONFIG_NET_IPIP) += ipip.o
 obj-$(CONFIG_NET_IPGRE_DEMUX) += gre.o
 obj-$(CONFIG_NET_IPGRE) += ip_gre.o
+obj-$(CONFIG_NET_IPVTI) += ip_vti.o
 obj-$(CONFIG_SYN_COOKIES) += syncookies.o
 obj-$(CONFIG_INET_AH) += ah4.o
 obj-$(CONFIG_INET_ESP) += esp4.o
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
new file mode 100644
index 0000000..c41b5c3
--- /dev/null
+++ b/net/ipv4/ip_vti.c
@@ -0,0 +1,956 @@
+/*
+ *	Linux NET3: IP/IP protocol decoder modified to support
+ *		    virtual tunnel interface
+ *
+ *	Authors:
+ *		Saurabh Mohan (saurabh.mohan@vyatta.com) 05/07/2012
+ *
+ *	This program is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU General Public License
+ *	as published by the Free Software Foundation; either version
+ *	2 of the License, or (at your option) any later version.
+ *
+ */
+
+/*
+   This version of net/ipv4/ip_vti.c is cloned of net/ipv4/ipip.c
+
+   For comments look at net/ipv4/ip_gre.c --ANK
+ */
+
+
+#include <linux/capability.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/uaccess.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_arp.h>
+#include <linux/mroute.h>
+#include <linux/init.h>
+#include <linux/netfilter_ipv4.h>
+#include <linux/if_ether.h>
+
+#include <net/sock.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/ipip.h>
+#include <net/inet_ecn.h>
+#include <net/xfrm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define HASH_SIZE  16
+#define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&(HASH_SIZE-1))
+
+static struct rtnl_link_ops vti_link_ops __read_mostly;
+
+static int vti_net_id __read_mostly;
+struct vti_net {
+	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_wc[1];
+	struct ip_tunnel **tunnels[4];
+
+	struct net_device *fb_tunnel_dev;
+};
+
+static int vti_fb_tunnel_init(struct net_device *dev);
+static int vti_tunnel_init(struct net_device *dev);
+static void vti_tunnel_setup(struct net_device *dev);
+static void vti_dev_free(struct net_device *dev);
+static int vti_tunnel_bind_dev(struct net_device *dev);
+
+/* Locking : hash tables are protected by RCU and RTNL */
+
+#define for_each_ip_tunnel_rcu(start) \
+	for (t = rcu_dereference(start); t; t = rcu_dereference(t->next))
+
+/* often modified stats are per cpu, other are shared (netdev->stats) */
+struct pcpu_tstats {
+	u64	rx_packets;
+	u64	rx_bytes;
+	u64	tx_packets;
+	u64	tx_bytes;
+	struct	u64_stats_sync	syncp;
+};
+
+#define VTI_XMIT(stats1, stats2) do {				\
+	int err;						\
+	int pkt_len = skb->len;					\
+	err = dst_output(skb);					\
+	if (net_xmit_eval(err) == 0) {				\
+		u64_stats_update_begin(&(stats1)->syncp);	\
+		(stats1)->tx_bytes += pkt_len;			\
+		(stats1)->tx_packets++;				\
+		u64_stats_update_end(&(stats1)->syncp);		\
+	} else {						\
+		(stats2)->tx_errors++;				\
+		(stats2)->tx_aborted_errors++;			\
+	}							\
+} while (0)
+
+
+static struct rtnl_link_stats64 *vti_get_stats64(struct net_device *dev,
+						 struct rtnl_link_stats64 *tot)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_tstats *tstats = per_cpu_ptr(dev->tstats, i);
+		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_bh(&tstats->syncp);
+			rx_packets = tstats->rx_packets;
+			tx_packets = tstats->tx_packets;
+			rx_bytes = tstats->rx_bytes;
+			tx_bytes = tstats->tx_bytes;
+		} while (u64_stats_fetch_retry_bh(&tstats->syncp, start));
+
+		tot->rx_packets += rx_packets;
+		tot->tx_packets += tx_packets;
+		tot->rx_bytes   += rx_bytes;
+		tot->tx_bytes   += tx_bytes;
+	}
+
+	tot->multicast = dev->stats.multicast;
+	tot->rx_crc_errors = dev->stats.rx_crc_errors;
+	tot->rx_fifo_errors = dev->stats.rx_fifo_errors;
+	tot->rx_length_errors = dev->stats.rx_length_errors;
+	tot->rx_errors = dev->stats.rx_errors;
+	tot->tx_fifo_errors = dev->stats.tx_fifo_errors;
+	tot->tx_carrier_errors = dev->stats.tx_carrier_errors;
+	tot->tx_dropped = dev->stats.tx_dropped;
+	tot->tx_aborted_errors = dev->stats.tx_aborted_errors;
+	tot->tx_errors = dev->stats.tx_errors;
+
+	return tot;
+}
+
+static struct ip_tunnel *vti_tunnel_lookup(struct net *net,
+					   __be32 remote, __be32 local)
+{
+	unsigned h0 = HASH(remote);
+	unsigned h1 = HASH(local);
+	struct ip_tunnel *t;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_r_l[h0 ^ h1])
+		if (local == t->parms.iph.saddr &&
+		    remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+	for_each_ip_tunnel_rcu(ipn->tunnels_r[h0])
+		if (remote == t->parms.iph.daddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_l[h1])
+		if (local == t->parms.iph.saddr && (t->dev->flags&IFF_UP))
+			return t;
+
+	for_each_ip_tunnel_rcu(ipn->tunnels_wc[0])
+		if (t && (t->dev->flags&IFF_UP))
+			return t;
+	return NULL;
+}
+
+static struct ip_tunnel **__vti_bucket(struct vti_net *ipn,
+				       struct ip_tunnel_parm *parms)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	unsigned h = 0;
+	int prio = 0;
+
+	if (remote) {
+		prio |= 2;
+		h ^= HASH(remote);
+	}
+	if (local) {
+		prio |= 1;
+		h ^= HASH(local);
+	}
+	return &ipn->tunnels[prio][h];
+}
+
+static inline struct ip_tunnel **vti_bucket(struct vti_net *ipn,
+					    struct ip_tunnel *t)
+{
+	return __vti_bucket(ipn, &t->parms);
+}
+
+static void vti_tunnel_unlink(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp;
+	struct ip_tunnel *iter;
+
+	for (tp = vti_bucket(ipn, t);
+	     (iter = rtnl_dereference(*tp)) != NULL;
+	     tp = &iter->next) {
+		if (t == iter) {
+			rcu_assign_pointer(*tp, t->next);
+			break;
+		}
+	}
+}
+
+static void vti_tunnel_link(struct vti_net *ipn, struct ip_tunnel *t)
+{
+	struct ip_tunnel __rcu **tp = vti_bucket(ipn, t);
+
+	rcu_assign_pointer(t->next, rtnl_dereference(*tp));
+	rcu_assign_pointer(*tp, t);
+}
+
+static struct ip_tunnel *vti_tunnel_locate(struct net *net,
+					   struct ip_tunnel_parm *parms,
+					   int create)
+{
+	__be32 remote = parms->iph.daddr;
+	__be32 local = parms->iph.saddr;
+	struct ip_tunnel *t, *nt;
+	struct ip_tunnel __rcu **tp;
+	struct net_device *dev;
+	char name[IFNAMSIZ];
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	for (tp = __vti_bucket(ipn, parms);
+	     (t = rtnl_dereference(*tp)) != NULL;
+	     tp = &t->next) {
+		if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr)
+			return t;
+	}
+	if (!create)
+		return NULL;
+
+	if (parms->name[0])
+		strlcpy(name, parms->name, IFNAMSIZ);
+	else
+		strcpy(name, "vti%d");
+
+	dev = alloc_netdev(sizeof(*t), name, vti_tunnel_setup);
+	if (dev == NULL)
+		return NULL;
+
+	dev_net_set(dev, net);
+
+	nt = netdev_priv(dev);
+	nt->parms = *parms;
+	dev->rtnl_link_ops = &vti_link_ops;
+
+	vti_tunnel_bind_dev(dev);
+
+	if (register_netdevice(dev) < 0)
+		goto failed_free;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+	return nt;
+
+failed_free:
+	free_netdev(dev);
+	return NULL;
+}
+
+static void vti_tunnel_uninit(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	vti_tunnel_unlink(ipn, netdev_priv(dev));
+	dev_put(dev);
+}
+
+static int vti_err(struct sk_buff *skb, u32 info)
+{
+
+	/* All the routers (except for Linux) return only
+	 * 8 bytes of packet payload. It means, that precise relaying of
+	 * ICMP in the real Internet is absolutely infeasible.
+	 */
+	struct iphdr *iph = (struct iphdr *)skb->data;
+	const int type = icmp_hdr(skb)->type;
+	const int code = icmp_hdr(skb)->code;
+	struct ip_tunnel *t;
+	int err;
+
+	switch (type) {
+	default:
+	case ICMP_PARAMETERPROB:
+		return 0;
+
+	case ICMP_DEST_UNREACH:
+		switch (code) {
+		case ICMP_SR_FAILED:
+		case ICMP_PORT_UNREACH:
+			/* Impossible event. */
+			return 0;
+		default:
+			/* All others are translated to HOST_UNREACH. */
+			break;
+		}
+		break;
+	case ICMP_TIME_EXCEEDED:
+		if (code != ICMP_EXC_TTL)
+			return 0;
+		break;
+	}
+
+	err = -ENOENT;
+
+	rcu_read_lock();
+	t = vti_tunnel_lookup(dev_net(skb->dev), iph->daddr, iph->saddr);
+	if (t == NULL)
+		goto out;
+
+	if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
+		ipv4_update_pmtu(skb, dev_net(skb->dev), info,
+				 t->parms.link, 0, IPPROTO_IPIP, 0);
+		err = 0;
+		goto out;
+	}
+
+	err = 0;
+	if (t->parms.iph.ttl == 0 && type == ICMP_TIME_EXCEEDED)
+		goto out;
+
+	if (time_before(jiffies, t->err_time + IPTUNNEL_ERR_TIMEO))
+		t->err_count++;
+	else
+		t->err_count = 1;
+	t->err_time = jiffies;
+out:
+	rcu_read_unlock();
+	return err;
+}
+
+/* We don't digest the packet, therefore let the packet pass */
+static int vti_rcv(struct sk_buff *skb)
+{
+	struct ip_tunnel *tunnel;
+	const struct iphdr *iph = ip_hdr(skb);
+
+	rcu_read_lock();
+	tunnel = vti_tunnel_lookup(dev_net(skb->dev), iph->saddr, iph->daddr);
+	if (tunnel != NULL) {
+		struct pcpu_tstats *tstats;
+
+		tstats = this_cpu_ptr(tunnel->dev->tstats);
+		u64_stats_update_begin(&tstats->syncp);
+		tstats->rx_packets++;
+		tstats->rx_bytes += skb->len;
+		u64_stats_update_end(&tstats->syncp);
+
+		skb->dev = tunnel->dev;
+		rcu_read_unlock();
+		return 1;
+	}
+	rcu_read_unlock();
+
+	return -1;
+}
+
+/* This function assumes it is being called from dev_queue_xmit()
+ * and that skb is filled properly by that function.
+ */
+
+static netdev_tx_t vti_tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct pcpu_tstats *tstats;
+	struct iphdr  *tiph = &tunnel->parms.iph;
+	u8     tos;
+	struct rtable *rt;		/* Route to the other host */
+	struct net_device *tdev;	/* Device to other host */
+	struct iphdr  *old_iph = ip_hdr(skb);
+	__be32 dst = tiph->daddr;
+	struct flowi4 fl4;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		goto tx_error;
+
+	tos = old_iph->tos;
+
+	memset(&fl4, 0, sizeof(fl4));
+	flowi4_init_output(&fl4, tunnel->parms.link,
+			   htonl(tunnel->parms.i_key), RT_TOS(tos),
+			   RT_SCOPE_UNIVERSE,
+			   IPPROTO_IPIP, 0,
+			   dst, tiph->saddr, 0, 0);
+	rt = ip_route_output_key(dev_net(dev), &fl4);
+	if (IS_ERR(rt)) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+	/* The tunnel is not functional if there is no transform,
+	 * or if the xfrm is not in tunnel mode.
+	 */
+	if (!rt->dst.xfrm ||
+	    rt->dst.xfrm->props.mode != XFRM_MODE_TUNNEL) {
+		dev->stats.tx_carrier_errors++;
+		goto tx_error_icmp;
+	}
+	tdev = rt->dst.dev;
+
+	if (tdev == dev) {
+		ip_rt_put(rt);
+		dev->stats.collisions++;
+		goto tx_error;
+	}
+
+	if (tunnel->err_count > 0) {
+		if (time_before(jiffies,
+				tunnel->err_time + IPTUNNEL_ERR_TIMEO)) {
+			tunnel->err_count--;
+			dst_link_failure(skb);
+		} else
+			tunnel->err_count = 0;
+	}
+
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+	nf_reset(skb);
+	skb->dev = skb_dst(skb)->dev;
+
+	tstats = this_cpu_ptr(dev->tstats);
+	VTI_XMIT(tstats, &dev->stats);
+	return NETDEV_TX_OK;
+
+tx_error_icmp:
+	dst_link_failure(skb);
+tx_error:
+	dev->stats.tx_errors++;
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static int vti_tunnel_bind_dev(struct net_device *dev)
+{
+	struct net_device *tdev = NULL;
+	struct ip_tunnel *tunnel;
+	struct iphdr *iph;
+
+	tunnel = netdev_priv(dev);
+	iph = &tunnel->parms.iph;
+
+	if (iph->daddr) {
+		struct rtable *rt;
+		struct flowi4 fl4;
+		memset(&fl4, 0, sizeof(fl4));
+		flowi4_init_output(&fl4, tunnel->parms.link,
+				   htonl(tunnel->parms.i_key),
+				   RT_TOS(iph->tos), RT_SCOPE_UNIVERSE,
+				   IPPROTO_IPIP, 0,
+				   iph->daddr, iph->saddr, 0, 0);
+		rt = ip_route_output_key(dev_net(dev), &fl4);
+		if (!IS_ERR(rt)) {
+			tdev = rt->dst.dev;
+			ip_rt_put(rt);
+		}
+		dev->flags |= IFF_POINTOPOINT;
+	}
+
+	if (!tdev && tunnel->parms.link)
+		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+
+	if (tdev) {
+		dev->hard_header_len = tdev->hard_header_len +
+				       sizeof(struct iphdr);
+		dev->mtu = tdev->mtu;
+	}
+	dev->iflink = tunnel->parms.link;
+	return dev->mtu;
+}
+
+static int
+vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+	int err = 0;
+	struct ip_tunnel_parm p;
+	struct ip_tunnel *t;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	switch (cmd) {
+	case SIOCGETTUNNEL:
+		t = NULL;
+		if (dev == ipn->fb_tunnel_dev) {
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data,
+					   sizeof(p))) {
+				err = -EFAULT;
+				break;
+			}
+			t = vti_tunnel_locate(net, &p, 0);
+		}
+		if (t == NULL)
+			t = netdev_priv(dev);
+		memcpy(&p, &t->parms, sizeof(p));
+		p.i_flags |= GRE_KEY | VTI_ISVTI;
+		p.o_flags |= GRE_KEY;
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &p, sizeof(p)))
+			err = -EFAULT;
+		break;
+
+	case SIOCADDTUNNEL:
+	case SIOCCHGTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		err = -EFAULT;
+		if (copy_from_user(&p, ifr->ifr_ifru.ifru_data, sizeof(p)))
+			goto done;
+
+		err = -EINVAL;
+		if (p.iph.version != 4 || p.iph.protocol != IPPROTO_IPIP ||
+		    p.iph.ihl != 5)
+			goto done;
+
+		t = vti_tunnel_locate(net, &p, cmd == SIOCADDTUNNEL);
+
+		if (dev != ipn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+			if (t != NULL) {
+				if (t->dev != dev) {
+					err = -EEXIST;
+					break;
+				}
+			} else {
+				if (((dev->flags&IFF_POINTOPOINT) &&
+				    !p.iph.daddr) ||
+				    (!(dev->flags&IFF_POINTOPOINT) &&
+				    p.iph.daddr)) {
+					err = -EINVAL;
+					break;
+				}
+				t = netdev_priv(dev);
+				vti_tunnel_unlink(ipn, t);
+				synchronize_net();
+				t->parms.iph.saddr = p.iph.saddr;
+				t->parms.iph.daddr = p.iph.daddr;
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				t->parms.iph.protocol = IPPROTO_IPIP;
+				memcpy(dev->dev_addr, &p.iph.saddr, 4);
+				memcpy(dev->broadcast, &p.iph.daddr, 4);
+				vti_tunnel_link(ipn, t);
+				netdev_state_change(dev);
+			}
+		}
+
+		if (t) {
+			err = 0;
+			if (cmd == SIOCCHGTUNNEL) {
+				t->parms.i_key = p.i_key;
+				t->parms.o_key = p.o_key;
+				if (t->parms.link != p.link) {
+					t->parms.link = p.link;
+					vti_tunnel_bind_dev(dev);
+					netdev_state_change(dev);
+				}
+			}
+			p.i_flags |= GRE_KEY | VTI_ISVTI;
+			p.o_flags |= GRE_KEY;
+			if (copy_to_user(ifr->ifr_ifru.ifru_data, &t->parms,
+					 sizeof(p)))
+				err = -EFAULT;
+		} else
+			err = (cmd == SIOCADDTUNNEL ? -ENOBUFS : -ENOENT);
+		break;
+
+	case SIOCDELTUNNEL:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+
+		if (dev == ipn->fb_tunnel_dev) {
+			err = -EFAULT;
+			if (copy_from_user(&p, ifr->ifr_ifru.ifru_data,
+					   sizeof(p)))
+				goto done;
+			err = -ENOENT;
+
+			t = vti_tunnel_locate(net, &p, 0);
+			if (t == NULL)
+				goto done;
+			err = -EPERM;
+			if (t->dev == ipn->fb_tunnel_dev)
+				goto done;
+			dev = t->dev;
+		}
+		unregister_netdevice(dev);
+		err = 0;
+		break;
+
+	default:
+		err = -EINVAL;
+	}
+
+done:
+	return err;
+}
+
+static int vti_tunnel_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < 68 || new_mtu > 0xFFF8)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
+static const struct net_device_ops vti_netdev_ops = {
+	.ndo_init	= vti_tunnel_init,
+	.ndo_uninit	= vti_tunnel_uninit,
+	.ndo_start_xmit	= vti_tunnel_xmit,
+	.ndo_do_ioctl	= vti_tunnel_ioctl,
+	.ndo_change_mtu	= vti_tunnel_change_mtu,
+	.ndo_get_stats64 = vti_get_stats64,
+};
+
+static void vti_dev_free(struct net_device *dev)
+{
+	free_percpu(dev->tstats);
+	free_netdev(dev);
+}
+
+static void vti_tunnel_setup(struct net_device *dev)
+{
+	dev->netdev_ops		= &vti_netdev_ops;
+	dev->destructor		= vti_dev_free;
+
+	dev->type		= ARPHRD_TUNNEL;
+	dev->hard_header_len	= LL_MAX_HEADER + sizeof(struct iphdr);
+	dev->mtu		= ETH_DATA_LEN;
+	dev->flags		= IFF_NOARP;
+	dev->iflink		= 0;
+	dev->addr_len		= 4;
+	dev->features		|= NETIF_F_NETNS_LOCAL;
+	dev->features		|= NETIF_F_LLTX;
+	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
+}
+
+static int vti_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
+	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int __net_init vti_fb_tunnel_init(struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct iphdr *iph = &tunnel->parms.iph;
+	struct vti_net *ipn = net_generic(dev_net(dev), vti_net_id);
+
+	tunnel->dev = dev;
+	strcpy(tunnel->parms.name, dev->name);
+
+	iph->version		= 4;
+	iph->protocol		= IPPROTO_IPIP;
+	iph->ihl		= 5;
+
+	dev->tstats = alloc_percpu(struct pcpu_tstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	dev_hold(dev);
+	rcu_assign_pointer(ipn->tunnels_wc[0], tunnel);
+	return 0;
+}
+
+static struct xfrm_tunnel vti_handler __read_mostly = {
+	.handler	=	vti_rcv,
+	.err_handler	=	vti_err,
+	.priority	=	1,
+};
+
+static void vti_destroy_tunnels(struct vti_net *ipn, struct list_head *head)
+{
+	int prio;
+
+	for (prio = 1; prio < 4; prio++) {
+		int h;
+		for (h = 0; h < HASH_SIZE; h++) {
+			struct ip_tunnel *t;
+
+			t = rtnl_dereference(ipn->tunnels[prio][h]);
+			while (t != NULL) {
+				unregister_netdevice_queue(t->dev, head);
+				t = rtnl_dereference(t->next);
+			}
+		}
+	}
+}
+
+static int __net_init vti_init_net(struct net *net)
+{
+	int err;
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+
+	ipn->tunnels[0] = ipn->tunnels_wc;
+	ipn->tunnels[1] = ipn->tunnels_l;
+	ipn->tunnels[2] = ipn->tunnels_r;
+	ipn->tunnels[3] = ipn->tunnels_r_l;
+
+	ipn->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel),
+					  "ip_vti0",
+					  vti_tunnel_setup);
+	if (!ipn->fb_tunnel_dev) {
+		err = -ENOMEM;
+		goto err_alloc_dev;
+	}
+	dev_net_set(ipn->fb_tunnel_dev, net);
+
+	err = vti_fb_tunnel_init(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	ipn->fb_tunnel_dev->rtnl_link_ops = &vti_link_ops;
+
+	err = register_netdev(ipn->fb_tunnel_dev);
+	if (err)
+		goto err_reg_dev;
+	return 0;
+
+err_reg_dev:
+	vti_dev_free(ipn->fb_tunnel_dev);
+err_alloc_dev:
+	/* nothing */
+	return err;
+}
+
+static void __net_exit vti_exit_net(struct net *net)
+{
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	LIST_HEAD(list);
+
+	rtnl_lock();
+	vti_destroy_tunnels(ipn, &list);
+	unregister_netdevice_many(&list);
+	rtnl_unlock();
+}
+
+static struct pernet_operations vti_net_ops = {
+	.init = vti_init_net,
+	.exit = vti_exit_net,
+	.id   = &vti_net_id,
+	.size = sizeof(struct vti_net),
+};
+
+static int vti_tunnel_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	return 0;
+}
+
+static void vti_netlink_parms(struct nlattr *data[],
+			      struct ip_tunnel_parm *parms)
+{
+	memset(parms, 0, sizeof(*parms));
+
+	parms->iph.protocol = IPPROTO_IPIP;
+
+	if (!data)
+		return;
+
+	if (data[IFLA_VTI_LINK])
+		parms->link = nla_get_u32(data[IFLA_VTI_LINK]);
+
+	if (data[IFLA_VTI_IKEY])
+		parms->i_key = nla_get_be32(data[IFLA_VTI_IKEY]);
+
+	if (data[IFLA_VTI_OKEY])
+		parms->o_key = nla_get_be32(data[IFLA_VTI_OKEY]);
+
+	if (data[IFLA_VTI_LOCAL])
+		parms->iph.saddr = nla_get_be32(data[IFLA_VTI_LOCAL]);
+
+	if (data[IFLA_VTI_REMOTE])
+		parms->iph.daddr = nla_get_be32(data[IFLA_VTI_REMOTE]);
+
+}
+
+static int vti_newlink(struct net *src_net, struct net_device *dev,
+		       struct nlattr *tb[], struct nlattr *data[])
+{
+	struct ip_tunnel *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	int mtu;
+	int err;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &nt->parms);
+
+	if (vti_tunnel_locate(net, &nt->parms, 0))
+		return -EEXIST;
+
+	mtu = vti_tunnel_bind_dev(dev);
+	if (!tb[IFLA_MTU])
+		dev->mtu = mtu;
+
+	err = register_netdevice(dev);
+	if (err)
+		goto out;
+
+	dev_hold(dev);
+	vti_tunnel_link(ipn, nt);
+
+out:
+	return err;
+}
+
+static int vti_changelink(struct net_device *dev, struct nlattr *tb[],
+			  struct nlattr *data[])
+{
+	struct ip_tunnel *t, *nt;
+	struct net *net = dev_net(dev);
+	struct vti_net *ipn = net_generic(net, vti_net_id);
+	struct ip_tunnel_parm p;
+	int mtu;
+
+	if (dev == ipn->fb_tunnel_dev)
+		return -EINVAL;
+
+	nt = netdev_priv(dev);
+	vti_netlink_parms(data, &p);
+
+	t = vti_tunnel_locate(net, &p, 0);
+
+	if (t) {
+		if (t->dev != dev)
+			return -EEXIST;
+	} else {
+		t = nt;
+
+		vti_tunnel_unlink(ipn, t);
+		t->parms.iph.saddr = p.iph.saddr;
+		t->parms.iph.daddr = p.iph.daddr;
+		t->parms.i_key = p.i_key;
+		t->parms.o_key = p.o_key;
+		if (dev->type != ARPHRD_ETHER) {
+			memcpy(dev->dev_addr, &p.iph.saddr, 4);
+			memcpy(dev->broadcast, &p.iph.daddr, 4);
+		}
+		vti_tunnel_link(ipn, t);
+		netdev_state_change(dev);
+	}
+
+	if (t->parms.link != p.link) {
+		t->parms.link = p.link;
+		mtu = vti_tunnel_bind_dev(dev);
+		if (!tb[IFLA_MTU])
+			dev->mtu = mtu;
+		netdev_state_change(dev);
+	}
+
+	return 0;
+}
+
+static size_t vti_get_size(const struct net_device *dev)
+{
+	return
+		/* IFLA_VTI_LINK */
+		nla_total_size(4) +
+		/* IFLA_VTI_IKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_OKEY */
+		nla_total_size(4) +
+		/* IFLA_VTI_LOCAL */
+		nla_total_size(4) +
+		/* IFLA_VTI_REMOTE */
+		nla_total_size(4) +
+		0;
+}
+
+static int vti_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct ip_tunnel *t = netdev_priv(dev);
+	struct ip_tunnel_parm *p = &t->parms;
+
+	nla_put_u32(skb, IFLA_VTI_LINK, p->link);
+	nla_put_be32(skb, IFLA_VTI_IKEY, p->i_key);
+	nla_put_be32(skb, IFLA_VTI_OKEY, p->o_key);
+	nla_put_be32(skb, IFLA_VTI_LOCAL, p->iph.saddr);
+	nla_put_be32(skb, IFLA_VTI_REMOTE, p->iph.daddr);
+
+	return 0;
+}
+
+static const struct nla_policy vti_policy[IFLA_VTI_MAX + 1] = {
+	[IFLA_VTI_LINK]		= { .type = NLA_U32 },
+	[IFLA_VTI_IKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_OKEY]		= { .type = NLA_U32 },
+	[IFLA_VTI_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VTI_REMOTE]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+};
+
+static struct rtnl_link_ops vti_link_ops __read_mostly = {
+	.kind		= "vti",
+	.maxtype	= IFLA_VTI_MAX,
+	.policy		= vti_policy,
+	.priv_size	= sizeof(struct ip_tunnel),
+	.setup		= vti_tunnel_setup,
+	.validate	= vti_tunnel_validate,
+	.newlink	= vti_newlink,
+	.changelink	= vti_changelink,
+	.get_size	= vti_get_size,
+	.fill_info	= vti_fill_info,
+};
+
+static int __init vti_init(void)
+{
+	int err;
+
+	pr_info("IPv4 over IPSec tunneling driver\n");
+
+	err = register_pernet_device(&vti_net_ops);
+	if (err < 0)
+		return err;
+	err = xfrm4_mode_tunnel_input_register(&vti_handler);
+	if (err < 0) {
+		unregister_pernet_device(&vti_net_ops);
+		pr_info("vti init: can't register tunnel\n");
+		return err;
+	}
+
+	err = rtnl_link_register(&vti_link_ops);
+	if (err < 0)
+		goto rtnl_link_failed;
+
+	return err;
+
+rtnl_link_failed:
+	xfrm4_mode_tunnel_input_deregister(&vti_handler);
+	unregister_pernet_device(&vti_net_ops);
+	return err;
+}
+
+static void __exit vti_fini(void)
+{
+	rtnl_link_unregister(&vti_link_ops);
+	if (xfrm4_mode_tunnel_input_deregister(&vti_handler))
+		pr_info("vti close: can't deregister tunnel\n");
+
+	unregister_pernet_device(&vti_net_ops);
+}
+
+module_init(vti_init);
+module_exit(vti_fini);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK("vti");
+MODULE_ALIAS_NETDEV("ip_vti0");

^ permalink raw reply related

* [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Saurabh @ 2012-07-17 19:44 UTC (permalink / raw)
  To: netdev



Incorporated David's and Steffen's comments.
Add a hook in the rx path of xfrm4_mode_tunnel for the VTI tunnel module.
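
The hook keeps its handlers in a singly linked list ordered by priority, and
a second handler with the same priority is refused. A minimal userspace
sketch of that insert/remove logic, with the mutex and RCU handling of the
real patch left out, might look like this. The demo_* names are hypothetical
and only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct xfrm_tunnel: only the fields the
 * ordering logic needs. */
struct demo_handler {
	int priority;
	struct demo_handler *next;
};

/* Insert h so the list stays sorted by ascending priority.
 * Returns 0 on success, -EEXIST if the priority is already taken. */
static int demo_register(struct demo_handler **head, struct demo_handler *h)
{
	struct demo_handler **pprev;
	struct demo_handler *t;

	assert(head != NULL && h != NULL);

	for (pprev = head; (t = *pprev) != NULL; pprev = &t->next) {
		if (t->priority > h->priority)
			break;		/* insert in front of t */
		if (t->priority == h->priority)
			return -EEXIST;
	}

	h->next = *pprev;
	*pprev = h;
	return 0;
}

/* Unlink h from the list. Returns 0 on success, -ENOENT if not found. */
static int demo_deregister(struct demo_handler **head, struct demo_handler *h)
{
	struct demo_handler **pprev;
	struct demo_handler *t;

	assert(head != NULL && h != NULL);

	for (pprev = head; (t = *pprev) != NULL; pprev = &t->next) {
		if (t == h) {
			*pprev = h->next;
			return 0;
		}
	}
	return -ENOENT;
}
```

Walking with a pointer to the previous link (pprev) means the head and the
middle of the list are handled by the same code path, which is why neither
function needs a special case for the first element.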

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e0a55df..04214c0 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1475,6 +1475,8 @@ extern int xfrm4_output(struct sk_buff *skb);
 extern int xfrm4_output_finish(struct sk_buff *skb);
 extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family);
 extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler, unsigned short family);
+extern int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler);
+extern int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler);
 extern int xfrm6_extract_header(struct sk_buff *skb);
 extern int xfrm6_extract_input(struct xfrm_state *x, struct sk_buff *skb);
 extern int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 spi);
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index ed4bf11..ddee0a0 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -15,6 +15,65 @@
 #include <net/ip.h>
 #include <net/xfrm.h>
 
+/* Informational hook. The decap is still done here. */
+static struct xfrm_tunnel __rcu *rcv_notify_handlers __read_mostly;
+static DEFINE_MUTEX(xfrm4_mode_tunnel_input_mutex);
+
+int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -EEXIST;
+	int priority = handler->priority;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+
+	for (pprev = &rcv_notify_handlers;
+	     (t = rcu_dereference_protected(*pprev,
+	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+	     pprev = &t->next) {
+		if (t->priority > priority)
+			break;
+		if (t->priority == priority)
+			goto err;
+
+	}
+
+	handler->next = *pprev;
+	rcu_assign_pointer(*pprev, handler);
+
+	ret = 0;
+
+err:
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);
+
+int xfrm4_mode_tunnel_input_deregister(struct xfrm_tunnel *handler)
+{
+	struct xfrm_tunnel __rcu **pprev;
+	struct xfrm_tunnel *t;
+	int ret = -ENOENT;
+
+	mutex_lock(&xfrm4_mode_tunnel_input_mutex);
+	for (pprev = &rcv_notify_handlers;
+	     (t = rcu_dereference_protected(*pprev,
+	     lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
+	     pprev = &t->next) {
+		if (t == handler) {
+			*pprev = handler->next;
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
+	synchronize_net();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_deregister);
+
 static inline void ipip_ecn_decapsulate(struct sk_buff *skb)
 {
 	struct iphdr *inner_iph = ipip_hdr(skb);
@@ -64,8 +123,14 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb)
 	return 0;
 }
 
+#define for_each_input_rcu(head, handler)	\
+	for (handler = rcu_dereference(head);	\
+	     handler != NULL;			\
+	     handler = rcu_dereference(handler->next))
+
 static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 {
+	struct xfrm_tunnel *handler;
 	int err = -EINVAL;
 
 	if (XFRM_MODE_SKB_CB(skb)->protocol != IPPROTO_IPIP)
@@ -74,6 +139,9 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, struct sk_buff *skb)
 	if (!pskb_may_pull(skb, sizeof(struct iphdr)))
 		goto out;
 
+	for_each_input_rcu(rcv_notify_handlers, handler)
+		handler->handler(skb);
+
 	if (skb_cloned(skb) &&
 	    (err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
 		goto out;

^ permalink raw reply related

* [net-next PATCH 00/02] net/ipv4: Add support for new tunnel type VTI.
From: Saurabh @ 2012-07-17 19:44 UTC (permalink / raw)
  To: netdev

I have accommodated all the style comments so far. If there are any more
style comments then send all your feedback in one email rather than in bits
and pieces.

IPv6 support has not yet been developed. Once I have it developed and tested
I'll submit it as well.  If this feature will not be accepted without IPv6
then let me know and I'll stop wasting my time. 

Incorporated David and Steffen's comments.
Resubmitting after taking into account review comments:
The VTI tunnel is applicable to esp, ah and ipcomp.

Introduction:
A virtual tunnel interface is a way to represent policy-based IPsec tunnels as
 virtual interfaces in Linux. This is similar to Cisco's VTI (virtual tunnel
 interface) and Juniper's representation of a secure tunnel (st.xx).
 The advantage of representing an IPsec tunnel as an interface is that it is
 possible to plug IPsec tunnels into the routing protocol infrastructure of a
 router. It therefore becomes possible to influence the packet path by toggling
 the link state of the tunnel or by adjusting routing metrics.

Overview:
The Linux kernel does not natively support IPsec as an interface. Also, a
 secure interface assumes an IPsec policy 4-tuple of {dst-ip-any, src-ip-any,
 dst-port-any, src-port-any}. Applying this 4-tuple in Linux would result in
 all traffic matching the IPsec policy. What is needed is a tunnel
 distinguisher. The Linux kernel skbuff has fwmark, which is used for policy
 based routing (PBR). Linux kernel version 2.6.35 enhanced SPD/SADB to use
 fwmark as part of the IPsec policy. Strongswan has also introduced support for
 this kernel feature with version 4.5.0. We can therefore use the fwmark as the
 distinguisher for the tunnel interface. We can also create a lightweight
 tunnel kernel module (vti) to give the notion of an interface to the rest of
 the kernel routing system. The tunnel module does not do any
 encapsulation/decapsulation. The kernel's xfrm modules still do the esp
 encryption/decryption.
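
To make the role of the fwmark concrete, here is a minimal userspace sketch,
not the kernel's xfrm code. It assumes that every policy uses any/any
selectors, so only the mark can tell tunnels apart, and that a packet matches
a policy when the masked packet mark equals the masked policy mark. The
demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified policy: the address and port selectors are
 * all "any", so they are omitted and only the mark is kept. */
struct demo_policy {
	uint32_t mark_value;	/* mark the packet must carry */
	uint32_t mark_mask;	/* which mark bits are compared */
};

static bool demo_mark_matches(const struct demo_policy *pol, uint32_t pkt_mark)
{
	return (pkt_mark & pol->mark_mask) ==
	       (pol->mark_value & pol->mark_mask);
}

/* Return the index of the first policy whose mark matches, or -1. */
static int demo_select_tunnel(const struct demo_policy *pols, size_t n,
			      uint32_t pkt_mark)
{
	size_t i;

	assert(pols != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (demo_mark_matches(&pols[i], pkt_mark))
			return (int)i;
	return -1;
}
```

With two policies marked 0xf and 0x10, a packet marked 0xf selects the first
one, a packet marked 0x10 selects the second, and an unmarked packet matches
neither, which is the behaviour the distinguisher is meant to provide.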

Usage:
ip tunnel add sti15 mode vti remote 12.0.0.1 local 12.0.0.3 ikey 15
or
ip link add sti15 type vti key 15 remote 12.0.0.1 local 12.0.0.3

A sample strongswan config would be:
conn peer-12.0.0.1-tunnel-1
   left=12.0.0.3
   right=12.0.0.1
   leftsubnet=0.0.0.0/0
   rightsubnet=0.0.0.0/0
   ike=aes128-sha1-modp1024!
   ikelifetime=28800s
   keyingtries=%forever
   esp=aes128-sha1!
   keylife=3600s
   rekeymargin=540s
   type=tunnel
   pfs=yes
   compress=no
   authby=secret
   auto=start
   mark_in=0xf
   mark_out=0xf
   keyexchange=ikev1

You also need an iptables rule that marks ingress esp and udp-4500 packets:
-A PREROUTING -s 12.0.0.1/32 -d 12.0.0.3/32 -p esp -j MARK --set-xmark 0xf/0xffffffff
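
As a small worked example of what the rule above does to the mark:
--set-xmark value/mask zeroes the bits given by mask and then XORs value
into the mark. With a mask of 0xffffffff the old mark is discarded and the
result is 0xf, i.e. 15, the same number used for ikey and for
mark_in/mark_out. The helper name demo_set_xmark is hypothetical.

```c
#include <stdint.h>

/* --set-xmark value/mask: clear the masked bits, then XOR in the value. */
static uint32_t demo_set_xmark(uint32_t old_mark, uint32_t value,
			       uint32_t mask)
{
	return (old_mark & ~mask) ^ value;
}
```

A narrower mask such as 0xff would leave the upper bits of an existing mark
untouched, which matters if other rules on the box also use the fwmark.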

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>

---

^ permalink raw reply

* [PATCH] MAINTAINERS: Changes in qlcnic and qlge maintainers list
From: Anirban Chakraborty @ 2012-07-17 19:22 UTC (permalink / raw)
  To: davem; +Cc: netdev, Dept_NX_Linux_NIC_Driver, Anirban Chakraborty

From: Anirban Chakraborty <anirban.chakraborty@qlogic.com>

Please apply.

Thanks.

Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com>
---
 MAINTAINERS |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b4321fb..7fda50f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5554,7 +5554,7 @@ F:	Documentation/networking/LICENSE.qla3xxx
 F:	drivers/net/ethernet/qlogic/qla3xxx.*
 
 QLOGIC QLCNIC (1/10)Gb ETHERNET DRIVER
-M:	Anirban Chakraborty <anirban.chakraborty@qlogic.com>
+M:	Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
 M:	Sony Chacko <sony.chacko@qlogic.com>
 M:	linux-driver@qlogic.com
 L:	netdev@vger.kernel.org
@@ -5562,7 +5562,6 @@ S:	Supported
 F:	drivers/net/ethernet/qlogic/qlcnic/
 
 QLOGIC QLGE 10Gb ETHERNET DRIVER
-M:	Anirban Chakraborty <anirban.chakraborty@qlogic.com>
 M:	Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
 M:	Ron Mercer <ron.mercer@qlogic.com>
 M:	linux-driver@qlogic.com
-- 
1.7.1

^ permalink raw reply related

* wireless.git frozen -- Re: That's pretty much it for 3.5.0
From: John W. Linville @ 2012-07-17 19:30 UTC (permalink / raw)
  To: linux-wireless-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20120717.090142.125145009944045241.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Tue, Jul 17, 2012 at 09:01:42AM -0700, David Miller wrote:
> 
> Linus was _extremely_ generous and took in all the stuff that was
> pending in the net tree just now.
> 
> Besides very serious issues, I'm not willing to consider any more bug
> fixes for the 'net' tree at this time.
> 
> Only one pending known bug qualifies, and that's the CIPSO ip option
> processing OOPS'er.  And I'll work on that myself if Paul Moore
> doesn't show a sign of life in the next day.
> 
> Thanks.

Now only fixes for truly "show stopper" bugs will be accepted for
the 3.5 stream.  I don't believe that any of the handful of fixes
currently in wireless.git (but not yet in net.git) are sufficiently
important to make the cut.

I will pull the current wireless.git tree into the wireless-next.git
tree, and then wireless.git will remain frozen until 3.6-rc1 is
released.  If you have a wireless fix that you believe is sufficiently
important to merit being in 3.5, then please post it to the netdev
list (and Cc: linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org when you do so).

Thanks,

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: New commands to configure IOV features
From: Don Dutile @ 2012-07-17 19:29 UTC (permalink / raw)
  To: Yuval Mintz
  Cc: davem@davemloft.net, Chris Friesen, Ben Hutchings, Greg Rose,
	netdev@vger.kernel.org, linux-pci
In-Reply-To: <5003DC9B.8000706@broadcom.com>

On 07/16/2012 05:19 AM, Yuval Mintz wrote:
>
>>>>> If I want to pick the RFCs and add support for configuring the number
>>>>> of VFs - do you think ethtool's the right place for such added
>>>>> support?
>>>>>
>>>> I think a PCI utility tool would be better, SR-IOV is not limited to
>>>> network devices.  That's one of the reasons I dropped the RFC.  I
>>>> haven't gotten back to the idea since then due to my day job keeping me
>>>> pretty busy.
>>>
>>> For what it's worth, I agree with this.
>>
>> From my perspective it would be ideal if this could be exported via /sys or something
>>
>
>
> Well, obviously unless there was a sudden change in our stance regarding
> sysfs we will not head that way.
>
> This thread got no replies from the pci community, and I'm unfamiliar
> with such a tool.
>
> Dave, What's your stance in the matter - do you wish us to continue pursuing
> some pci tool (which might or might not exist), or instead work on
> a networking solution to this issue?
>
> Do you happen to know such a tool?
>
> Thanks,
> Yuval
>
Yuval (et al.),

Not seeing the original thread on netdev,
I just had a recent discussion w/Greg Rose about providing
sysfs-based, VF enable/disable methods.
I was told that historically, VF enablement started as a sysfs-based
function, was debated and pushed toward a device/driver-specific method,
as it is implemented today.   Now, with some experience with SRIOV and
its use in the virtualization space, the discussion has renewed as to whether
a sysfs-based enable/disable method should be resurrected, so it
provides a more generic method for virtualization tools/api's to
manage SRIOV/VF devices.

I was hoping to discuss this topic with a number of folks at
LinuxCon/Plumbers/KS when the PCI mini-summit is held, to gain
further insight, or be brought up to speed on past history,
and review current uses/status of VFs.

WRT SRIOV-nic devices, the thinking goes that protocol-level
parameters associated with VFs should use protocol-specific interfaces,
e.g., ethtool, ip link set, etc. for Ethernet VFs.
Thus, the various protocol control functions/tools should
be used to control VF parameters, as one would for a physical device
of that protocol/class.

- Don

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:26 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller, <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <5005BA4C.2000602@intel.com>

On 7/17/2012 12:17 PM, John Fastabend wrote:
> On 7/17/2012 12:09 PM, John Fastabend wrote:
>> On 7/17/2012 12:00 PM, John Fastabend wrote:
>>> On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
>>>> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>>>>
>>>>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>>>>
>>>>>> Linus was _extremely_ generous and took in all the stuff that was
>>>>>> pending in the net tree just now.
>>>>>
>>>>> Maybe *too* generous. :-) I just updated and when I boot I get an
>>>>> early crash in update_netdev_tables which is in netprio_cgroup.c.
>>>>>
>>>>>> Besides very serious issues, I'm not willing to consider any more bug
>>>>>> fixes for the 'net' tree at this time.
>>>>>
>>>>> I think the above issue will have to be fixed, as it completely
>>>>> prevents booting for any kernel that includes the netprio_cgroup
>>>>> option.
>>>>>
>>>>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>>>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>>>>> doesn't show a sign of life in the next day.
>>>>>>
>>>>>> Thanks.
>>>>>
>>>>>
>>>>> I can start taking a look at this if you like, but I see that Gao
>>>>> feng has two patches in the last set of patches that may be related.
>>>>>
>>>>> To give you an idea how early the crash is, here are a few log
>>>>> messages leading up to it:
>>>>>
>>>>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
>>>>> 2097152 bytes)
>>>>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
>>>>> 1048576 bytes)
>>>>> [    0.007165] Mount-cache hash table entries: 256
>>>>> [    0.010289] Initializing cgroup subsys net_cls
>>>>> [    0.010947] Initializing cgroup subsys net_prio
>>>>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
>>>>> at 0000000000000828
>>>>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>>>>
>>>>
>>>> I found that I can avoid the crash by configuring the netprio_cgroup
>>>> as a module. I don't need to have it built in, I just happened to.
>>>> This finding may lower the temperature of this issue a lot from what I
>>>> had been feeling.
>>>>
>>>
>>> hmm looks like we access init_net here,
>>>
>>> static void update_netdev_tables(void)
>>> {
>>>          struct net_device *dev;
>>>          u32 max_len = atomic_read(&max_prioidx) + 1;
>>>          struct netprio_map *map;
>>>
>>>          rtnl_lock();
>>>          for_each_netdev(&init_net, dev) {
>>>                  map = rtnl_dereference(dev->priomap);
>>>                  if ((!map) ||
>>>                      (map->priomap_len < max_len))
>>>                          extend_netdev_table(dev, max_len);
>>>          }
>>>          rtnl_unlock();
>>> }
>>>
>>> but init_net is initialized by pure_initcall(net_ns_init) and I
>>> gather pure_initcall's should not have any dependencies but it
>>> looks like we created one here with cgroup_init_early() in
>>> start_kernel().
>>>
>>> I'll poke around some more. Also had some off list help from
>>> Mark.
>>>
>>> .John
>>>
>>
>> although we don't have an early_init hook for netprio_cgroup so this
>> is probably not correct.
>
> Hey Mark,
>
> you have better timing than me (I can't make this fail). Can you try
> cgroup_init below rest_init() in start_kernel(). That's in init/main.c
>
> .John
>

ugh never mind, that was stupid... I'm going to stop hitting the lists
with useless noise and be back with a fix in a while.

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: David Miller @ 2012-07-17 19:24 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <5005B881.8010505@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 12:09:53 -0700

> although we don't have an early_init hook for netprio_cgroup so this
> is probably not correct.

The dependency is actually on net_dev_init (a subsys_initcall) rather
than a pure_initcall.

net_dev_init is what registers the netdev_net_ops, which in turn
initializes the netdev list in namespaces such as &init_net.
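
To make the ordering problem concrete, here is a small user-space model
of it.  This is a sketch, not kernel code and not the fix that was
applied: every model_* name is hypothetical, and it assumes the list
head is still zeroed when the walk runs too early.  The early-return
guard only marks the point where an unguarded walk would follow a NULL
pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's circular doubly linked list head. */
struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Stand-in for a network namespace; only the device list matters here. */
struct model_net {
	struct model_list_head dev_base_head;
};

/* Stand-in for a network device with a priority map length. */
struct model_dev {
	struct model_list_head dev_list;
	int priomap_len;
};

/* Models the later init step: make the namespace's list head valid. */
static void model_netdev_init(struct model_net *net)
{
	net->dev_base_head.next = &net->dev_base_head;
	net->dev_base_head.prev = &net->dev_base_head;
}

/* Append a device to the tail of an already initialized list. */
static void model_add_dev(struct model_net *net, struct model_dev *dev)
{
	struct model_list_head *head = &net->dev_base_head;

	dev->dev_list.next = head;
	dev->dev_list.prev = head->prev;
	head->prev->next = &dev->dev_list;
	head->prev = &dev->dev_list;
}

/*
 * Walk the device list and grow any map shorter than max_len.
 * Returns the number of devices visited, or -1 if the list head was
 * never initialized, which is where an unguarded walk would crash.
 */
static int model_update_tables(struct model_net *net, int max_len)
{
	struct model_list_head *pos;
	int visited = 0;

	assert(net != NULL);

	if (net->dev_base_head.next == NULL)
		return -1;	/* called before model_netdev_init() */

	for (pos = net->dev_base_head.next; pos != &net->dev_base_head;
	     pos = pos->next) {
		/* Recover the containing device from its list member. */
		struct model_dev *dev = (struct model_dev *)
			((char *)pos - offsetof(struct model_dev, dev_list));

		if (dev->priomap_len < max_len)
			dev->priomap_len = max_len;
		visited++;
	}
	return visited;
}
```

In this model, calling model_update_tables() on a zeroed model_net
before model_netdev_init() reports -1 instead of walking the list; once
the init step has run, the same call visits each device normally.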

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:17 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller, <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <5005B881.8010505@intel.com>

On 7/17/2012 12:09 PM, John Fastabend wrote:
> On 7/17/2012 12:00 PM, John Fastabend wrote:
>> On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
>>> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>>>
>>>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>>>
>>>>> Linus was _extremely_ generous and took in all the stuff that was
>>>>> pending in the net tree just now.
>>>>
>>>> Maybe *too* generous. :-) I just updated and when I boot I get an
>>>> early crash in update_netdev_tables which is in netprio_cgroup.c.
>>>>
>>>>> Besides very serious issues, I'm not willing to consider any more bug
>>>>> fixes for the 'net' tree at this time.
>>>>
>>>> I think the above issue will have to be fixed, as it completely
>>>> prevents booting for any kernel that includes the netprio_cgroup
>>>> option.
>>>>
>>>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>>>> doesn't show a sign of life in the next day.
>>>>>
>>>>> Thanks.
>>>>
>>>>
>>>> I can start taking a look at this if you like, but I see that Gao
>>>> feng has two patches in the last set of patches that may be related.
>>>>
>>>> To give you an idea how early the crash is, here are a few log
>>>> messages leading up to it:
>>>>
>>>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
>>>> 2097152 bytes)
>>>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
>>>> 1048576 bytes)
>>>> [    0.007165] Mount-cache hash table entries: 256
>>>> [    0.010289] Initializing cgroup subsys net_cls
>>>> [    0.010947] Initializing cgroup subsys net_prio
>>>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
>>>> at 0000000000000828
>>>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>>>
>>>
>>> I found that I can avoid the crash by configuring the netprio_cgroup
>>> as a module. I don't need to have it built in, I just happened to.
>>> This finding may lower the temperature of this issue a lot from what I
>>> had been feeling.
>>>
>>
>> hmm looks like we access init_net here,
>>
>> static void update_netdev_tables(void)
>> {
>>          struct net_device *dev;
>>          u32 max_len = atomic_read(&max_prioidx) + 1;
>>          struct netprio_map *map;
>>
>>          rtnl_lock();
>>          for_each_netdev(&init_net, dev) {
>>                  map = rtnl_dereference(dev->priomap);
>>                  if ((!map) ||
>>                      (map->priomap_len < max_len))
>>                          extend_netdev_table(dev, max_len);
>>          }
>>          rtnl_unlock();
>> }
>>
>> but init_net is initialized by pure_initcall(net_ns_init) and I
>> gather pure_initcall's should not have any dependencies but it
>> looks like we created one here with cgroup_init_early() in
>> start_kernel().
>>
>> I'll poke around some more. Also had some off list help from
>> Mark.
>>
>> .John
>>
>
> although we don't have an early_init hook for netprio_cgroup so this
> is probably not correct.

Hey Mark,

you have better timing than me (I can't make this fail). Can you try
cgroup_init below rest_init() in start_kernel(). That's in init/main.c

.John

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 19:09 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: David Miller, <netdev@vger.kernel.org>,
	<linux-wireless@vger.kernel.org>,
	<netfilter-devel@vger.kernel.org>
In-Reply-To: <5005B643.2080009@intel.com>

On 7/17/2012 12:00 PM, John Fastabend wrote:
> On 7/17/2012 11:48 AM, Rustad, Mark D wrote:
>> On Jul 17, 2012, at 10:41 AM, Rustad, Mark D wrote:
>>
>>> On Jul 17, 2012, at 9:01 AM, David Miller wrote:
>>>
>>>> Linus was _extremely_ generous and took in all the stuff that was
>>>> pending in the net tree just now.
>>>
>>> Maybe *too* generous. :-) I just updated and when I boot I get an
>>> early crash in update_netdev_tables which is in netprio_cgroup.c.
>>>
>>>> Besides very serious issues, I'm not willing to consider any more bug
>>>> fixes for the 'net' tree at this time.
>>>
>>> I think the above issue will have to be fixed, as it completely
>>> prevents booting for any kernel that includes the netprio_cgroup option.
>>>
>>>> Only one pending known bug qualifies, and that's the CIPSO ip option
>>>> processing OOPS'er.  And I'll work on that myself if Paul Moore
>>>> doesn't show a sign of life in the next day.
>>>>
>>>> Thanks.
>>>
>>>
>>> I can start taking a look at this if you like, but I see that Gao
>>> feng has two patches in the last set of patches that may be related.
>>>
>>> To give you an idea how early the crash is, here are a few log
>>> messages leading up to it:
>>>
>>> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
>>> 2097152 bytes)
>>> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
>>> 1048576 bytes)
>>> [    0.007165] Mount-cache hash table entries: 256
>>> [    0.010289] Initializing cgroup subsys net_cls
>>> [    0.010947] Initializing cgroup subsys net_prio
>>> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
>>> at 0000000000000828
>>> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>>
>>
>> I found that I can avoid the crash by configuring the netprio_cgroup
>> as a module. I don't need to have it built in, I just happened to.
>> This finding may lower the temperature of this issue a lot from what I
>> had been feeling.
>>
>
> hmm looks like we access init_net here,
>
> static void update_netdev_tables(void)
> {
>          struct net_device *dev;
>          u32 max_len = atomic_read(&max_prioidx) + 1;
>          struct netprio_map *map;
>
>          rtnl_lock();
>          for_each_netdev(&init_net, dev) {
>                  map = rtnl_dereference(dev->priomap);
>                  if ((!map) ||
>                      (map->priomap_len < max_len))
>                          extend_netdev_table(dev, max_len);
>          }
>          rtnl_unlock();
> }
>
> but init_net is initialized by pure_initcall(net_ns_init) and I
> gather pure_initcall's should not have any dependencies but it
> looks like we created one here with cgroup_init_early() in
> start_kernel().
>
> I'll poke around some more. Also had some off list help from
> Mark.
>
> .John
>

although we don't have an early_init hook for netprio_cgroup so this
is probably not correct.

^ permalink raw reply
