switching network namespace midway

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* switching network namespace midway
@ 2012-10-23 17:49 rsa
  2012-10-24 21:11 ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: rsa @ 2012-10-23 17:49 UTC (permalink / raw)
  To: netdev

Assuming I have a tunnel interface where two route lookups are done --
one for innter
packet and the other for outer -- do you see any issues in switching
the network
namespace prior to second route lookup (and restore to the original namespace
after the second lookup is done)?

If so, are there any other calls other than sk_change_net() needed?

thank you
rsa

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-23 17:49 switching network namespace midway rsa
@ 2012-10-24 21:11 ` Eric W. Biederman
  2012-10-24 21:21   ` Benjamin LaHaise
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-10-24 21:11 UTC (permalink / raw)
  To: rsa; +Cc: netdev

rsa <ravi.mlists@gmail.com> writes:

> Assuming I have a tunnel interface where two route lookups are done --
> one for innter
> packet and the other for outer -- do you see any issues in switching
> the network
> namespace prior to second route lookup (and restore to the original namespace
> after the second lookup is done)?
>
> If so, are there any other calls other than sk_change_net() needed?

In general sk_change_net is a bad idea.

Most likely what you want to do is simply memorize both struct net's
that you care about and perform the routing lookup as appropriate.

Certainly you don't want to be calling sk_change_net for every packet
that goes through your tunnel.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-24 21:11 ` Eric W. Biederman
@ 2012-10-24 21:21   ` Benjamin LaHaise
  2012-10-25  1:37     ` Eric W. Biederman
                       ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Benjamin LaHaise @ 2012-10-24 21:21 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: rsa, netdev

On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote:
> rsa <ravi.mlists@gmail.com> writes:
> 
> > Assuming I have a tunnel interface where two route lookups are done --
> > one for innter
> > packet and the other for outer -- do you see any issues in switching
> > the network
> > namespace prior to second route lookup (and restore to the original namespace
> > after the second lookup is done)?
> >
> > If so, are there any other calls other than sk_change_net() needed?
> 
> In general sk_change_net is a bad idea.
> 
> Most likely what you want to do is simply memorize both struct net's
> that you care about and perform the routing lookup as appropriate.
> 
> Certainly you don't want to be calling sk_change_net for every packet
> that goes through your tunnel.

I've actually done this with L2TP.  The packets coming into the system from 
the tunnel are received on one UDP socket in one "struct net", but the 
decapsulated packets are received on a "struct net_device" that is in a 
different "struct net".  No special coding is required -- just move the 
tunnel's net_device into another namespace after creation and it works as 
expected.  Using sk_change_net() would be full of races and is really not 
required for the vast majority of use cases.

		-ben

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-24 21:21   ` Benjamin LaHaise
@ 2012-10-25  1:37     ` Eric W. Biederman
  2012-10-25 14:38       ` Benjamin LaHaise
  2012-10-25 15:12     ` rsa
  2012-10-25 15:29     ` rsa
  2 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-10-25  1:37 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: rsa, netdev

Benjamin LaHaise <bcrl@kvack.org> writes:

> On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote:
>> rsa <ravi.mlists@gmail.com> writes:
>> 
>> > Assuming I have a tunnel interface where two route lookups are done --
>> > one for innter
>> > packet and the other for outer -- do you see any issues in switching
>> > the network
>> > namespace prior to second route lookup (and restore to the original namespace
>> > after the second lookup is done)?
>> >
>> > If so, are there any other calls other than sk_change_net() needed?
>> 
>> In general sk_change_net is a bad idea.
>> 
>> Most likely what you want to do is simply memorize both struct net's
>> that you care about and perform the routing lookup as appropriate.
>> 
>> Certainly you don't want to be calling sk_change_net for every packet
>> that goes through your tunnel.
>
> I've actually done this with L2TP.  The packets coming into the system from 
> the tunnel are received on one UDP socket in one "struct net", but the 
> decapsulated packets are received on a "struct net_device" that is in a 
> different "struct net".  No special coding is required -- just move the 
> tunnel's net_device into another namespace after creation and it works as 
> expected.  Using sk_change_net() would be full of races and is really not 
> required for the vast majority of use cases.

Yes.  Although L2TP is not an example of code I would copy.  Any other
tunnel would be better.  I haven't looked closely at L2TP but it keeps
popping up as a poster child for small little network namespace bugs
that I don't want to think about.

Last I looked to use L2TP it required a magic userspace that I couldn't
find and I haven't cared enough to write.  Ben would you be interested
in helping flush out the network namespace bugs out of L2TP?

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-25  1:37     ` Eric W. Biederman
@ 2012-10-25 14:38       ` Benjamin LaHaise
  2012-10-25 16:21         ` Stephen Hemminger
  0 siblings, 1 reply; 52+ messages in thread
From: Benjamin LaHaise @ 2012-10-25 14:38 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: rsa, netdev

Hello Eric,

On Wed, Oct 24, 2012 at 06:37:16PM -0700, Eric W. Biederman wrote:
> Yes.  Although L2TP is not an example of code I would copy.  Any other
> tunnel would be better.  I haven't looked closely at L2TP but it keeps
> popping up as a poster child for small little network namespace bugs
> that I don't want to think about.

Agreed.

> Last I looked to use L2TP it required a magic userspace that I couldn't
> find and I haven't cared enough to write.  Ben would you be interested
> in helping flush out the network namespace bugs out of L2TP?

Sure, that I can do.  To be entirely honest, I have not yet tried using 
network namespaces with the in kernel L2TP stack, but rather with the 
Babylon code.  I have, however, put together changes to make the Babylon 
userland code work with the in kernel L2TP over the past couple of months.  
Since the network namespace support is already present in the userland 
code, it shouldn't be too hard to adapt.

>From a quick read of the L2TP over UDP code paths, it looks like things 
should work, as the ingress and egress lookups use the transport socket's 
namespace.  All the reference counting looks a bit heavy handed, though.  
I also wrote a couple of test programs for setting up L2TP sockets and 
devices which may be of use -- see http://www.kvack.org/~bcrl/pppol2tp/ .

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-25 14:38       ` Benjamin LaHaise
@ 2012-10-25 16:21         ` Stephen Hemminger
  2012-10-28  5:43           ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Stephen Hemminger @ 2012-10-25 16:21 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric W. Biederman, rsa, netdev

I noticed that the L2TP sockets are not being moved to the correct name
space.

Something like this is probably needed.

--- a/net/l2tp/l2tp_core.c	2012-10-25 09:11:15.691271882 -0700
+++ b/net/l2tp/l2tp_core.c	2012-10-25 09:18:58.746621418 -0700
@@ -1357,7 +1357,10 @@ static void l2tp_tunnel_free(struct l2tp
  * userspace. This is used for static tunnels where there is no
  * managing L2TP daemon.
  */
-static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg, struct socket **sockp)
+static int l2tp_tunnel_sock_create(struct net *net,
+				   u32 tunnel_id, u32 peer_tunnel_id,
+				   struct l2tp_tunnel_cfg *cfg,
+				   struct socket **sockp)
 {
 	int err = -EINVAL;
 	struct sockaddr_in udp_addr;
@@ -1372,11 +1375,12 @@ static int l2tp_tunnel_sock_create(u32 t
 	case L2TP_ENCAPTYPE_UDP:
 #if IS_ENABLED(CONFIG_IPV6)
 		if (cfg->local_ip6 && cfg->peer_ip6) {
-			err = sock_create(AF_INET6, SOCK_DGRAM, 0, sockp);
+			err = sock_create_kern(AF_INET6, SOCK_DGRAM, 0, sockp);
 			if (err < 0)
 				goto out;
 
 			sock = *sockp;
+			sk_change(sock->sk, net);
 
 			memset(&udp6_addr, 0, sizeof(udp6_addr));
 			udp6_addr.sin6_family = AF_INET6;
@@ -1400,11 +1404,12 @@ static int l2tp_tunnel_sock_create(u32 t
 		} else
 #endif
 		{
-			err = sock_create(AF_INET, SOCK_DGRAM, 0, sockp);
+			err = sock_create_kern(AF_INET, SOCK_DGRAM, 0, sockp);
 			if (err < 0)
 				goto out;
 
 			sock = *sockp;
+			sk_change(sock->sk, net);
 
 			memset(&udp_addr, 0, sizeof(udp_addr));
 			udp_addr.sin_family = AF_INET;
@@ -1433,7 +1438,7 @@ static int l2tp_tunnel_sock_create(u32 t
 	case L2TP_ENCAPTYPE_IP:
 #if IS_ENABLED(CONFIG_IPV6)
 		if (cfg->local_ip6 && cfg->peer_ip6) {
-			err = sock_create(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP,
+			err = sock_create_kern(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP,
 					  sockp);
 			if (err < 0)
 				goto out;
@@ -1462,12 +1467,13 @@ static int l2tp_tunnel_sock_create(u32 t
 		} else
 #endif
 		{
-			err = sock_create(AF_INET, SOCK_DGRAM, IPPROTO_L2TP,
+			err = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_L2TP,
 					  sockp);
 			if (err < 0)
 				goto out;
 
 			sock = *sockp;
+			sk_change(sock->sk, net);
 
 			memset(&ip_addr, 0, sizeof(ip_addr));
 			ip_addr.l2tp_family = AF_INET;
@@ -1517,7 +1523,8 @@ int l2tp_tunnel_create(struct net *net,
 	 * kernel socket.
 	 */
 	if (fd < 0) {
-		err = l2tp_tunnel_sock_create(tunnel_id, peer_tunnel_id, cfg, &sock);
+		err = l2tp_tunnel_sock_create(net, tunnel_id, peer_tunnel_id,
+					      cfg, &sock);
 		if (err < 0)
 			goto err;
 	} else {

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-25 16:21         ` Stephen Hemminger
@ 2012-10-28  5:43           ` Eric W. Biederman
  2012-10-29 14:23             ` Stephen Hemminger
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-10-28  5:43 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Benjamin LaHaise, rsa, netdev

Stephen Hemminger <shemminger@vyatta.com> writes:

> I noticed that the L2TP sockets are not being moved to the correct name
> space.
>
> Something like this is probably needed.

This is almost right.

There needs to be a line in l2tp_tunnel_create that verifies
the network namespace of the socket derived from a file descriptor
and the passed in network namespace match.

For the l2tp_tunnel_sock_create case where we have a socket that is not
exported to userspace using sk_change_net seems appropriate to avoid
reference counting problems.  And it may be worth moving that work into
sk_create_kern.  But we need a network namespace hook that will lookup
all l2tp tunnel sockets when a network namespace is being destroyed and
remove them.  I think we can hit this bug with rmmod as well.

Bleh.

Eric


> --- a/net/l2tp/l2tp_core.c	2012-10-25 09:11:15.691271882 -0700
> +++ b/net/l2tp/l2tp_core.c	2012-10-25 09:18:58.746621418 -0700
> @@ -1357,7 +1357,10 @@ static void l2tp_tunnel_free(struct l2tp
>   * userspace. This is used for static tunnels where there is no
>   * managing L2TP daemon.
>   */
> -static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg, struct socket **sockp)
> +static int l2tp_tunnel_sock_create(struct net *net,
> +				   u32 tunnel_id, u32 peer_tunnel_id,
> +				   struct l2tp_tunnel_cfg *cfg,
> +				   struct socket **sockp)
>  {
>  	int err = -EINVAL;
>  	struct sockaddr_in udp_addr;
> @@ -1372,11 +1375,12 @@ static int l2tp_tunnel_sock_create(u32 t
>  	case L2TP_ENCAPTYPE_UDP:
>  #if IS_ENABLED(CONFIG_IPV6)
>  		if (cfg->local_ip6 && cfg->peer_ip6) {
> -			err = sock_create(AF_INET6, SOCK_DGRAM, 0, sockp);
> +			err = sock_create_kern(AF_INET6, SOCK_DGRAM, 0, sockp);
>  			if (err < 0)
>  				goto out;
>  
>  			sock = *sockp;
> +			sk_change(sock->sk, net);
>  
>  			memset(&udp6_addr, 0, sizeof(udp6_addr));
>  			udp6_addr.sin6_family = AF_INET6;
> @@ -1400,11 +1404,12 @@ static int l2tp_tunnel_sock_create(u32 t
>  		} else
>  #endif
>  		{
> -			err = sock_create(AF_INET, SOCK_DGRAM, 0, sockp);
> +			err = sock_create_kern(AF_INET, SOCK_DGRAM, 0, sockp);
>  			if (err < 0)
>  				goto out;
>  
>  			sock = *sockp;
> +			sk_change(sock->sk, net);
>  
>  			memset(&udp_addr, 0, sizeof(udp_addr));
>  			udp_addr.sin_family = AF_INET;
> @@ -1433,7 +1438,7 @@ static int l2tp_tunnel_sock_create(u32 t
>  	case L2TP_ENCAPTYPE_IP:
>  #if IS_ENABLED(CONFIG_IPV6)
>  		if (cfg->local_ip6 && cfg->peer_ip6) {
> -			err = sock_create(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP,
> +			err = sock_create_kern(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP,
>  					  sockp);
>  			if (err < 0)
>  				goto out;
> @@ -1462,12 +1467,13 @@ static int l2tp_tunnel_sock_create(u32 t
>  		} else
>  #endif
>  		{
> -			err = sock_create(AF_INET, SOCK_DGRAM, IPPROTO_L2TP,
> +			err = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_L2TP,
>  					  sockp);
>  			if (err < 0)
>  				goto out;
>  
>  			sock = *sockp;
> +			sk_change(sock->sk, net);
>  
>  			memset(&ip_addr, 0, sizeof(ip_addr));
>  			ip_addr.l2tp_family = AF_INET;
> @@ -1517,7 +1523,8 @@ int l2tp_tunnel_create(struct net *net,
>  	 * kernel socket.
>  	 */
>  	if (fd < 0) {
> -		err = l2tp_tunnel_sock_create(tunnel_id, peer_tunnel_id, cfg, &sock);
> +		err = l2tp_tunnel_sock_create(net, tunnel_id, peer_tunnel_id,
> +					      cfg, &sock);
>  		if (err < 0)
>  			goto err;
>  	} else {
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-28  5:43           ` Eric W. Biederman
@ 2012-10-29 14:23             ` Stephen Hemminger
  2012-10-30  0:21               ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Stephen Hemminger @ 2012-10-29 14:23 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Benjamin LaHaise, rsa, netdev

On Sat, 27 Oct 2012 22:43:13 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Stephen Hemminger <shemminger@vyatta.com> writes:
> 
> > I noticed that the L2TP sockets are not being moved to the correct name
> > space.
> >
> > Something like this is probably needed.
> 
> This is almost right.
> 
> There needs to be a line in l2tp_tunnel_create that verifies
> the network namespace of the socket derived from a file descriptor
> and the passed in network namespace match.
> 
> For the l2tp_tunnel_sock_create case where we have a socket that is not
> exported to userspace using sk_change_net seems appropriate to avoid
> reference counting problems.  And it may be worth moving that work into
> sk_create_kern.  But we need a network namespace hook that will lookup
> all l2tp tunnel sockets when a network namespace is being destroyed and
> remove them.  I think we can hit this bug with rmmod as well.

Since I don't use netns or L2TP for real, someone else needs to take
up the crusade here.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-29 14:23             ` Stephen Hemminger
@ 2012-10-30  0:21               ` Eric W. Biederman
  2012-10-30  8:55                 ` James Chapman
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-10-30  0:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Benjamin LaHaise, rsa, netdev, James Chapman

Stephen Hemminger <shemminger@vyatta.com> writes:

> On Sat, 27 Oct 2012 22:43:13 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Stephen Hemminger <shemminger@vyatta.com> writes:
>> 
>> > I noticed that the L2TP sockets are not being moved to the correct name
>> > space.
>> >
>> > Something like this is probably needed.
>> 
>> This is almost right.
>> 
>> There needs to be a line in l2tp_tunnel_create that verifies
>> the network namespace of the socket derived from a file descriptor
>> and the passed in network namespace match.
>> 
>> For the l2tp_tunnel_sock_create case where we have a socket that is not
>> exported to userspace using sk_change_net seems appropriate to avoid
>> reference counting problems.  And it may be worth moving that work into
>> sk_create_kern.  But we need a network namespace hook that will lookup
>> all l2tp tunnel sockets when a network namespace is being destroyed and
>> remove them.  I think we can hit this bug with rmmod as well.
>
> Since I don't use netns or L2TP for real, someone else needs to take
> up the crusade here.

Let's see if James Chapman is interested.  I don't use L2TP for real either.

James are you at all interested in the network namespace bugs that have
been found in the l2tp code?

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-30  0:21               ` Eric W. Biederman
@ 2012-10-30  8:55                 ` James Chapman
  0 siblings, 0 replies; 52+ messages in thread
From: James Chapman @ 2012-10-30  8:55 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen Hemminger, Benjamin LaHaise, rsa, netdev

On 30/10/12 00:21, Eric W. Biederman wrote:
> Stephen Hemminger <shemminger@vyatta.com> writes:
> 
>> On Sat, 27 Oct 2012 22:43:13 -0700
>> ebiederm@xmission.com (Eric W. Biederman) wrote:
>>
>>> Stephen Hemminger <shemminger@vyatta.com> writes:
>>>
>>>> I noticed that the L2TP sockets are not being moved to the correct name
>>>> space.
>>>>
>>>> Something like this is probably needed.
>>>
>>> This is almost right.
>>>
>>> There needs to be a line in l2tp_tunnel_create that verifies
>>> the network namespace of the socket derived from a file descriptor
>>> and the passed in network namespace match.
>>>
>>> For the l2tp_tunnel_sock_create case where we have a socket that is not
>>> exported to userspace using sk_change_net seems appropriate to avoid
>>> reference counting problems.  And it may be worth moving that work into
>>> sk_create_kern.  But we need a network namespace hook that will lookup
>>> all l2tp tunnel sockets when a network namespace is being destroyed and
>>> remove them.  I think we can hit this bug with rmmod as well.
>>
>> Since I don't use netns or L2TP for real, someone else needs to take
>> up the crusade here.
> 
> Let's see if James Chapman is interested.  I don't use L2TP for real either.
> 
> James are you at all interested in the network namespace bugs that have
> been found in the l2tp code?

Very much so, Eric. Thanks for keeping me in the loop. Unfortunately, I
am busy on other things at the moment. It's in my queue. I'll get to it
as soon as I can.

> 
> Eric


-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-24 21:21   ` Benjamin LaHaise
  2012-10-25  1:37     ` Eric W. Biederman
@ 2012-10-25 15:12     ` rsa
  2012-10-25 15:29     ` rsa
  2 siblings, 0 replies; 52+ messages in thread
From: rsa @ 2012-10-25 15:12 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric W. Biederman, netdev

Thanks Eric.  That makes sense. I can remember both network namespaces
and switch to the other prior to route lookup if necessary.
Any gotchas on receiving tunneled packed on the tunnel interface?  Looking at
the code there is no change necessary.  Packet arrives on tunnel
interface which is
decapsulated and given to net_if_rx() for inner packet processing.  Am I mssing
something?

thank you
rsa

On Wed, Oct 24, 2012 at 5:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote:
>> rsa <ravi.mlists@gmail.com> writes:
>>
>> > Assuming I have a tunnel interface where two route lookups are done --
>> > one for innter
>> > packet and the other for outer -- do you see any issues in switching
>> > the network
>> > namespace prior to second route lookup (and restore to the original namespace
>> > after the second lookup is done)?
>> >
>> > If so, are there any other calls other than sk_change_net() needed?
>>
>> In general sk_change_net is a bad idea.
>>
>> Most likely what you want to do is simply memorize both struct net's
>> that you care about and perform the routing lookup as appropriate.
>>
>> Certainly you don't want to be calling sk_change_net for every packet
>> that goes through your tunnel.
>
> I've actually done this with L2TP.  The packets coming into the system from
> the tunnel are received on one UDP socket in one "struct net", but the
> decapsulated packets are received on a "struct net_device" that is in a
> different "struct net".  No special coding is required -- just move the
> tunnel's net_device into another namespace after creation and it works as
> expected.  Using sk_change_net() would be full of races and is really not
> required for the vast majority of use cases.
>
>                 -ben

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-24 21:21   ` Benjamin LaHaise
  2012-10-25  1:37     ` Eric W. Biederman
  2012-10-25 15:12     ` rsa
@ 2012-10-25 15:29     ` rsa
  2012-10-25 15:59       ` Benjamin LaHaise
  2 siblings, 1 reply; 52+ messages in thread
From: rsa @ 2012-10-25 15:29 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric W. Biederman, netdev

Hi Ben
For L2TP that should work very well. But, with other tunnel types like GRE etc.
route lookups have to be done first in the inner and then outer.  It
makes it easier
to keep keep the tunnel in the inner namespace in such cases. right?

thank you
rsa

On Wed, Oct 24, 2012 at 5:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote:
>> rsa <ravi.mlists@gmail.com> writes:
>>
>> > Assuming I have a tunnel interface where two route lookups are done --
>> > one for innter
>> > packet and the other for outer -- do you see any issues in switching
>> > the network
>> > namespace prior to second route lookup (and restore to the original namespace
>> > after the second lookup is done)?
>> >
>> > If so, are there any other calls other than sk_change_net() needed?
>>
>> In general sk_change_net is a bad idea.
>>
>> Most likely what you want to do is simply memorize both struct net's
>> that you care about and perform the routing lookup as appropriate.
>>
>> Certainly you don't want to be calling sk_change_net for every packet
>> that goes through your tunnel.
>
> I've actually done this with L2TP.  The packets coming into the system from
> the tunnel are received on one UDP socket in one "struct net", but the
> decapsulated packets are received on a "struct net_device" that is in a
> different "struct net".  No special coding is required -- just move the
> tunnel's net_device into another namespace after creation and it works as
> expected.  Using sk_change_net() would be full of races and is really not
> required for the vast majority of use cases.
>
>                 -ben

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-25 15:29     ` rsa
@ 2012-10-25 15:59       ` Benjamin LaHaise
  2012-10-25 16:15         ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Benjamin LaHaise @ 2012-10-25 15:59 UTC (permalink / raw)
  To: rsa; +Cc: Eric W. Biederman, netdev

On Thu, Oct 25, 2012 at 11:29:34AM -0400, rsa wrote:
> Hi Ben
> For L2TP that should work very well. But, with other tunnel types like GRE etc.
> route lookups have to be done first in the inner and then outer.  It
> makes it easier
> to keep keep the tunnel in the inner namespace in such cases. right?

I've read IPv4 gre code, and it appears to do the right thing on rx, but it 
does *not* appear to handle namespaces correctly on transmit.  In general, 
I would expect pretty much all code to get namespace handling correct for 
the rx case.  I'll have a closer look at fixing this tomorrow if nobody else 
beats me to it.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-25 15:59       ` Benjamin LaHaise
@ 2012-10-25 16:15         ` Eric W. Biederman
  2012-11-02  2:25           ` Benjamin LaHaise
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-10-25 16:15 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: rsa, netdev

Benjamin LaHaise <bcrl@kvack.org> writes:

> On Thu, Oct 25, 2012 at 11:29:34AM -0400, rsa wrote:
>> Hi Ben
>> For L2TP that should work very well. But, with other tunnel types like GRE etc.
>> route lookups have to be done first in the inner and then outer.  It
>> makes it easier
>> to keep keep the tunnel in the inner namespace in such cases. right?
>
> I've read IPv4 gre code, and it appears to do the right thing on rx, but it 
> does *not* appear to handle namespaces correctly on transmit.  In general, 
> I would expect pretty much all code to get namespace handling correct for 
> the rx case.  I'll have a closer look at fixing this tomorrow if nobody else 
> beats me to it.

It will be interesting to see what you come up with. 

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-10-25 16:15         ` Eric W. Biederman
@ 2012-11-02  2:25           ` Benjamin LaHaise
  2012-11-02  6:18             ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Benjamin LaHaise @ 2012-11-02  2:25 UTC (permalink / raw)
  To: Eric W. Biederman, rsa; +Cc: netdev

On Thu, Oct 25, 2012 at 09:15:34AM -0700, Eric W. Biederman wrote:
> > I've read IPv4 gre code, and it appears to do the right thing on rx, but it 
> > does *not* appear to handle namespaces correctly on transmit.  In general, 
> > I would expect pretty much all code to get namespace handling correct for 
> > the rx case.  I'll have a closer look at fixing this tomorrow if nobody else 
> > beats me to it.
> 
> It will be interesting to see what you come up with. 

Well, I finally had some time to work on the ip_gre module a bit today, 
and here's what I came up with.  The basic idea is to store the network 
namespace in the ip_tunnel structure at creation time for use in sending 
and receiving packets, allowing the gre network device to be safely moved 
into another network namespace.  This works for me in moving a gre tunnel 
into an lxc container, and survives module unload and namespace 
destruction.  I'll try to spend a bit more time adding similar support to 
the other ip_tunnel devices over the next few days.  Comments/thoughts?

		-ben
-- 
"Thought is the essence of where you are now."


diff --git a/include/net/ipip.h b/include/net/ipip.h
index ddc077c..9cfba92 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -19,6 +19,7 @@ struct ip_tunnel_6rd_parm {
 struct ip_tunnel {
 	struct ip_tunnel __rcu	*next;
 	struct net_device	*dev;
+	struct net		*net;		/* Namespace for packet i/o */
 
 	int			err_count;	/* Number of arrived ICMP errors */
 	unsigned long		err_time;	/* Time when the last ICMP error arrived */
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 7240f8e..705dc66 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -461,6 +461,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net,
 	dev_net_set(dev, net);
 
 	nt = netdev_priv(dev);
+	nt->net = net;
 	nt->parms = *parms;
 	dev->rtnl_link_ops = &ipgre_link_ops;
 
@@ -484,8 +485,10 @@ failed_free:
 
 static void ipgre_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
-	struct ipgre_net *ign = net_generic(net, ipgre_net_id);
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct ipgre_net *ign;
+
+	ign = net_generic(tunnel->net, ipgre_net_id);
 
 	ipgre_tunnel_unlink(ign, netdev_priv(dev));
 	dev_put(dev);
@@ -837,7 +840,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 			tos = ipv6_get_dsfield((const struct ipv6hdr *)old_iph);
 	}
 
-	rt = ip_route_output_gre(dev_net(dev), &fl4, dst, tiph->saddr,
+	rt = ip_route_output_gre(tunnel->net, &fl4, dst, tiph->saddr,
 				 tunnel->parms.o_key, RT_TOS(tos),
 				 tunnel->parms.link);
 	if (IS_ERR(rt)) {
@@ -1010,7 +1013,7 @@ static int ipgre_tunnel_bind_dev(struct net_device *dev)
 		struct flowi4 fl4;
 		struct rtable *rt;
 
-		rt = ip_route_output_gre(dev_net(dev), &fl4,
+		rt = ip_route_output_gre(tunnel->net, &fl4,
 					 iph->daddr, iph->saddr,
 					 tunnel->parms.o_key,
 					 RT_TOS(iph->tos),
@@ -1341,7 +1344,6 @@ static void ipgre_tunnel_setup(struct net_device *dev)
 	dev->flags		= IFF_NOARP;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
 
 	dev->features		|= GRE_FEATURES;
@@ -1432,6 +1434,7 @@ static void ipgre_destroy_tunnels(struct ipgre_net *ign, struct list_head *head)
 static int __net_init ipgre_init_net(struct net *net)
 {
 	struct ipgre_net *ign = net_generic(net, ipgre_net_id);
+	struct ip_tunnel *tunnel;
 	int err;
 
 	ign->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "gre0",
@@ -1445,6 +1448,9 @@ static int __net_init ipgre_init_net(struct net *net)
 	ipgre_fb_tunnel_init(ign->fb_tunnel_dev);
 	ign->fb_tunnel_dev->rtnl_link_ops = &ipgre_link_ops;
 
+	tunnel = netdev_priv(ign->fb_tunnel_dev);
+	tunnel->net = net;
+
 	if ((err = register_netdev(ign->fb_tunnel_dev)))
 		goto err_reg_dev;
 

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-11-02  2:25           ` Benjamin LaHaise
@ 2012-11-02  6:18             ` Eric W. Biederman
  2012-11-02 14:03               ` Benjamin LaHaise
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-11-02  6:18 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: rsa, netdev

Benjamin LaHaise <bcrl@kvack.org> writes:

> On Thu, Oct 25, 2012 at 09:15:34AM -0700, Eric W. Biederman wrote:
>> > I've read IPv4 gre code, and it appears to do the right thing on rx, but it 
>> > does *not* appear to handle namespaces correctly on transmit.  In general, 
>> > I would expect pretty much all code to get namespace handling correct for 
>> > the rx case.  I'll have a closer look at fixing this tomorrow if nobody else 
>> > beats me to it.
>> 
>> It will be interesting to see what you come up with. 
>
> Well, I finally had some time to work on the ip_gre module a bit today, 
> and here's what I came up with.  The basic idea is to store the network 
> namespace in the ip_tunnel structure at creation time for use in sending 
> and receiving packets, allowing the gre network device to be safely moved 
> into another network namespace.  This works for me in moving a gre tunnel 
> into an lxc container, and survives module unload and namespace 
> destruction.  I'll try to spend a bit more time adding similar support to 
> the other ip_tunnel devices over the next few days.  Comments/thoughts?
>
> 		-ben

You need a per network namespace exit function to delete the tunnel when
the xmit direction goes away.  Otherwise we have a very nasty race if
the original network namespace exits.

NETNS_LOCAL may make sense on the reference device that is used to
support ioctls for creating devices.

ipgre_open ?  It looks like it needs to be handled.  Probably that
ip_route_output_gre needs to be moved.

ipv6?

Eric

> -- 
> "Thought is the essence of where you are now."
> 
> 
> diff --git a/include/net/ipip.h b/include/net/ipip.h
> index ddc077c..9cfba92 100644
> --- a/include/net/ipip.h
> +++ b/include/net/ipip.h
> @@ -19,6 +19,7 @@ struct ip_tunnel_6rd_parm {
>  struct ip_tunnel {
>  	struct ip_tunnel __rcu	*next;
>  	struct net_device	*dev;
> +	struct net		*net;		/* Namespace for packet i/o */
>  
>  	int			err_count;	/* Number of arrived ICMP errors */
>  	unsigned long		err_time;	/* Time when the last ICMP error arrived */
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 7240f8e..705dc66 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -461,6 +461,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net,
>  	dev_net_set(dev, net);
>  
>  	nt = netdev_priv(dev);
> +	nt->net = net;
>  	nt->parms = *parms;
>  	dev->rtnl_link_ops = &ipgre_link_ops;
>  
> @@ -484,8 +485,10 @@ failed_free:
>  
>  static void ipgre_tunnel_uninit(struct net_device *dev)
>  {
> -	struct net *net = dev_net(dev);
> -	struct ipgre_net *ign = net_generic(net, ipgre_net_id);
> +	struct ip_tunnel *tunnel = netdev_priv(dev);
> +	struct ipgre_net *ign;
> +
> +	ign = net_generic(tunnel->net, ipgre_net_id);
>  
>  	ipgre_tunnel_unlink(ign, netdev_priv(dev));
>  	dev_put(dev);
> @@ -837,7 +840,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
>  			tos = ipv6_get_dsfield((const struct ipv6hdr *)old_iph);
>  	}
>  
> -	rt = ip_route_output_gre(dev_net(dev), &fl4, dst, tiph->saddr,
> +	rt = ip_route_output_gre(tunnel->net, &fl4, dst, tiph->saddr,
>  				 tunnel->parms.o_key, RT_TOS(tos),
>  				 tunnel->parms.link);
>  	if (IS_ERR(rt)) {
> @@ -1010,7 +1013,7 @@ static int ipgre_tunnel_bind_dev(struct net_device *dev)
>  		struct flowi4 fl4;
>  		struct rtable *rt;
>  
> -		rt = ip_route_output_gre(dev_net(dev), &fl4,
> +		rt = ip_route_output_gre(tunnel->net, &fl4,
>  					 iph->daddr, iph->saddr,
>  					 tunnel->parms.o_key,
>  					 RT_TOS(iph->tos),
> @@ -1341,7 +1344,6 @@ static void ipgre_tunnel_setup(struct net_device *dev)
>  	dev->flags		= IFF_NOARP;
>  	dev->iflink		= 0;
>  	dev->addr_len		= 4;
> -	dev->features		|= NETIF_F_NETNS_LOCAL;
>  	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
>  
>  	dev->features		|= GRE_FEATURES;
> @@ -1432,6 +1434,7 @@ static void ipgre_destroy_tunnels(struct ipgre_net *ign, struct list_head *head)
>  static int __net_init ipgre_init_net(struct net *net)
>  {
>  	struct ipgre_net *ign = net_generic(net, ipgre_net_id);
> +	struct ip_tunnel *tunnel;
>  	int err;
>  
>  	ign->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "gre0",
> @@ -1445,6 +1448,9 @@ static int __net_init ipgre_init_net(struct net *net)
>  	ipgre_fb_tunnel_init(ign->fb_tunnel_dev);
>  	ign->fb_tunnel_dev->rtnl_link_ops = &ipgre_link_ops;
>  
> +	tunnel = netdev_priv(ign->fb_tunnel_dev);
> +	tunnel->net = net;
> +
>  	if ((err = register_netdev(ign->fb_tunnel_dev)))
>  		goto err_reg_dev;
>  

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-11-02  6:18             ` Eric W. Biederman
@ 2012-11-02 14:03               ` Benjamin LaHaise
  2012-11-02 20:45                 ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Benjamin LaHaise @ 2012-11-02 14:03 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: rsa, netdev

On Thu, Nov 01, 2012 at 11:18:58PM -0700, Eric W. Biederman wrote:
> You need a per network namespace exit function to delete the tunnel when
> the xmit direction goes away.  Otherwise we have a very nasty race if
> the original network namespace exits.

That already exists as ipgre_exit_net().  Since the ip_tunnel structure 
remains hashed in the network namespace that creation occurred in, this 
case should be covered.

> NETNS_LOCAL may make sense on the reference device that is used to
> support ioctls for creating devices.

*nod*  That makes sense.

> ipgre_open ?  It looks like it needs to be handled.  Probably that
> ip_route_output_gre needs to be moved.

Good catch.  Will respin with that changed.

> ipv6?

That's next on the list.  There are also issues with ipip, ipmr and 
ipvti, as well as their ipv6 versions.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: switching network namespace midway
  2012-11-02 14:03               ` Benjamin LaHaise
@ 2012-11-02 20:45                 ` Eric W. Biederman
  2013-06-24 14:13                   ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2012-11-02 20:45 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: rsa, netdev

Benjamin LaHaise <bcrl@kvack.org> writes:

> On Thu, Nov 01, 2012 at 11:18:58PM -0700, Eric W. Biederman wrote:
>> You need a per network namespace exit function to delete the tunnel when
>> the xmit direction goes away.  Otherwise we have a very nasty race if
>> the original network namespace exits.
>
> That already exists as ipgre_exit_net().  Since the ip_tunnel structure 
> remains hashed in the network namespace that creation occurred in, this 
> case should be covered.

*blink*  I had looked for that, but I definitely missed that one.

The consequence of the design where we can run ioctls on network devices
we can't even see but whose ipaddrs are in our network namespace is
interesting.  Correct but interesting.

>> NETNS_LOCAL may make sense on the reference device that is used to
>> support ioctls for creating devices.
>
> *nod*  That makes sense.

After a second look.  fb_tunnel_dev is very special in the code so
allowing fb_tunnel_dev to change network namespace is almost certain
to create problems, so it should get a NETNS_LOCAL.

>> ipgre_open ?  It looks like it needs to be handled.  Probably that
>> ip_route_output_gre needs to be moved.
>
> Good catch.  Will respin with that changed.

I'm not seeing anything else but I will look again after you respin.

>> ipv6?
>
> That's next on the list.  There are also issues with ipip, ipmr and 
> ipvti, as well as their ipv6 versions.

I don't see sense in ip multicast routing tunnels (ipmr) changing
network namespaces as they are essentially aliases for other network
devices, and aren't really proper tunnels.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap
  2012-11-02 20:45                 ` Eric W. Biederman
@ 2013-06-24 14:13                   ` Nicolas Dichtel
  2013-06-24 14:13                     ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel
  2013-06-24 14:13                     ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  0 siblings, 2 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-24 14:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists


This patch is a follow up of the thread "switching network namespace midway":
http://marc.info/?t=135101459500004&r=1&w=2

The goal of this serie is to add x-netns support for the module sit, ie the
encapsulation addresses and the network device are not owned by the same
namespace.

Example to configure a tunnel:

modprobe sit
ip netns add netns1
ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249
ip l s sit1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s sit1 up
ip netns exec netns1 ip a a dev sit1 192.168.2.123 remote 192.168.2.121
ip netns exec netns1 ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121

I sent this serie as an RFC to get feedback from people. Once this serie is
approved, I will add the same feature for the module ipip and ip6_tunnel.

 include/linux/netdevice.h |  1 +
 include/net/ip_tunnels.h  |  1 +
 net/core/dev.c            | 34 ++++++++++++++++++++++----------
 net/ipv4/ip_tunnel.c      |  7 ++++++-
 net/ipv6/sit.c            | 49 ++++++++++++++++++++++++-----------------------
 5 files changed, 57 insertions(+), 35 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb()
  2013-06-24 14:13                   ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
@ 2013-06-24 14:13                     ` Nicolas Dichtel
  2013-06-24 18:13                       ` Ben Hutchings
  2013-06-24 14:13                     ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  1 sibling, 1 reply; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-24 14:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, Nicolas Dichtel

The goal of this new function is to perform all needed cleanup before sending
an skb into another netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 34 ++++++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 09b4188..9b72d87 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2321,6 +2321,7 @@ extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct netdev_queue *txq);
 extern int		dev_forward_skb(struct net_device *dev,
 					struct sk_buff *skb);
+extern void		dev_cleanup_skb(struct sk_buff *skb);
 
 extern int		netdev_budget;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 722f633..d30bc22 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1625,6 +1625,29 @@ static inline bool is_skb_forwardable(struct net_device *dev,
 }
 
 /**
+ * dev_cleanup_skb - cleanup an skb before sending it to another netns
+ *
+ * @skb: buffer to clean
+ *
+ * dev_cleanup_skb can be used to clean an skb before injecting it in
+ * another namespace. We have to clear all information in the skb that
+ * could impact namespace isolation.
+ */
+void dev_cleanup_skb(struct sk_buff *skb)
+{
+	skb_orphan(skb);
+	skb->tstamp.tv64 = 0;
+	skb->pkt_type = PACKET_HOST;
+	skb->skb_iif = 0;
+	skb_dst_drop(skb);
+	skb->mark = 0;
+	secpath_reset(skb);
+	nf_reset(skb);
+	nf_reset_trace(skb);
+}
+EXPORT_SYMBOL_GPL(dev_cleanup_skb);
+
+/**
  * dev_forward_skb - loopback an skb to another netif
  *
  * @dev: destination network device
@@ -1652,22 +1675,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 		}
 	}
 
-	skb_orphan(skb);
-
 	if (unlikely(!is_skb_forwardable(dev, skb))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
-	skb->skb_iif = 0;
-	skb_dst_drop(skb);
-	skb->tstamp.tv64 = 0;
-	skb->pkt_type = PACKET_HOST;
+	dev_cleanup_skb(skb);
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
-	nf_reset_trace(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb()
  2013-06-24 14:13                     ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel
@ 2013-06-24 18:13                       ` Ben Hutchings
  2013-06-24 19:05                         ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Ben Hutchings @ 2013-06-24 18:13 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, davem, ebiederm, bcrl, ravi.mlists

On Mon, 2013-06-24 at 16:13 +0200, Nicolas Dichtel wrote:
> The goal of this new function is to perform all needed cleanup before sending
> an skb into another netns.
[...]

To 'cleanup' an object often means to destroy or free it.  So perhaps
you could find an alternate verb that doesn't have that association,
e.g. 'sanitise' or 'unmark'.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb()
  2013-06-24 18:13                       ` Ben Hutchings
@ 2013-06-24 19:05                         ` Eric W. Biederman
  0 siblings, 0 replies; 52+ messages in thread
From: Eric W. Biederman @ 2013-06-24 19:05 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Nicolas Dichtel, netdev, davem, bcrl, ravi.mlists

Ben Hutchings <bhutchings@solarflare.com> writes:

> On Mon, 2013-06-24 at 16:13 +0200, Nicolas Dichtel wrote:
>> The goal of this new function is to perform all needed cleanup before sending
>> an skb into another netns.
> [...]
>
> To 'cleanup' an object often means to destroy or free it.  So perhaps
> you could find an alternate verb that doesn't have that association,
> e.g. 'sanitise' or 'unmark'.

skb_scrub_packet sounds good to me.    It has the right connotation
and it is shorter. :)

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH net-next 2/2] sit: add support of x-netns
  2013-06-24 14:13                   ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  2013-06-24 14:13                     ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel
@ 2013-06-24 14:13                     ` Nicolas Dichtel
  2013-06-24 19:28                       ` Eric W. Biederman
  1 sibling, 1 reply; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-24 14:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip_tunnels.h |  1 +
 net/ipv4/ip_tunnel.c     |  7 ++++++-
 net/ipv6/sit.c           | 49 ++++++++++++++++++++++++------------------------
 3 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index b0d9824..781b3cf 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -42,6 +42,7 @@ struct ip_tunnel {
 	struct ip_tunnel __rcu	*next;
 	struct hlist_node hash_node;
 	struct net_device	*dev;
+	struct net		*net;	/* netns for packet i/o */
 
 	int		err_count;	/* Number of arrived ICMP errors */
 	unsigned long	err_time;	/* Time when the last ICMP error
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index bd227e5..b6a7704 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net,
 
 	tunnel = netdev_priv(dev);
 	tunnel->parms = *parms;
+	tunnel->net = net;
 
 	err = register_netdevice(dev);
 	if (err)
@@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
 	tstats->rx_bytes += skb->len;
 	u64_stats_update_end(&tstats->syncp);
 
+	if (tunnel->net != dev_net(tunnel->dev))
+		dev_cleanup_skb(skb);
+
 	if (tunnel->dev->type == ARPHRD_ETHER) {
 		skb->protocol = eth_type_trans(skb, tunnel->dev);
 		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
@@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
 	}
 
-	rt = ip_route_output_tunnel(dev_net(dev), &fl4,
+	rt = ip_route_output_tunnel(tunnel->net, &fl4,
 				    tunnel->parms.iph.protocol,
 				    dst, tnl_params->saddr,
 				    tunnel->parms.o_key,
@@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
 	if (ip_tunnel_find(itn, p, dev->type))
 		return -EEXIST;
 
+	nt->net = net;
 	nt->parms = *p;
 	err = register_netdevice(dev);
 	if (err)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index f639866..4e03904 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t)
 
 static void ipip6_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
-	struct sit_net *sitn = net_generic(net, sit_net_id);
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct sit_net *sitn = net_generic(tunnel->net, sit_net_id);
 
 	if (dev == sitn->fb_tunnel_dev) {
 		RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL);
 	} else {
-		ipip6_tunnel_unlink(sitn, netdev_priv(dev));
-		ipip6_tunnel_del_prl(netdev_priv(dev), NULL);
+		ipip6_tunnel_unlink(sitn, tunnel);
+		ipip6_tunnel_del_prl(tunnel, NULL);
 	}
 	dev_put(dev);
 }
@@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb)
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 
+		if (tunnel->net != dev_net(tunnel->dev))
+			dev_cleanup_skb(skb);
 		netif_rx(skb);
 
 		return 0;
@@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			goto tx_error;
 	}
 
-	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+	rt = ip_route_output_ports(tunnel->net, &fl4, NULL,
 				   dst, tiph->saddr,
 				   0, 0,
 				   IPPROTO_IPV6, RT_TOS(tos),
@@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			tunnel->err_count = 0;
 	}
 
+	if (tunnel->net != dev_net(dev))
+		dev_cleanup_skb(skb);
+
 	/*
 	 * Okay, now see if we can stuff it in the buffer as-is.
 	 */
@@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	iph = &tunnel->parms.iph;
 
 	if (iph->daddr) {
-		struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+		struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4,
+							  NULL,
 							  iph->daddr, iph->saddr,
 							  0, 0,
 							  IPPROTO_IPV6,
@@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	}
 
 	if (!tdev && tunnel->parms.link)
-		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
 
 	if (tdev) {
 		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
@@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 
 static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
 {
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	ipip6_tunnel_unlink(sitn, t);
@@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev)
 	dev->priv_flags	       &= ~IFF_XMIT_DST_RELEASE;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 }
 
@@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev)
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 
 	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
 	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
@@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev)
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	iph->version		= 4;
@@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __read_mostly = {
 	.priority	=	2,
 };
 
-static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
+static void __net_exit sit_destroy_tunnels(struct net *net,
+					   struct list_head *head)
 {
-	int prio;
+	struct net_device *dev, *aux;
 
-	for (prio = 1; prio < 4; prio++) {
-		int h;
-		for (h = 0; h < HASH_SIZE; h++) {
-			struct ip_tunnel *t;
-
-			t = rtnl_dereference(sitn->tunnels[prio][h]);
-			while (t != NULL) {
-				unregister_netdevice_queue(t->dev, head);
-				t = rtnl_dereference(t->next);
-			}
-		}
-	}
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops &&
+		    !strcmp(dev->rtnl_link_ops->kind, "sit"))
+			unregister_netdevice_queue(dev, head);
 }
 
 static int __net_init sit_init_net(struct net *net)
@@ -1598,6 +1598,7 @@ static int __net_init sit_init_net(struct net *net)
 		goto err_alloc_dev;
 	}
 	dev_net_set(sitn->fb_tunnel_dev, net);
+	sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 
 	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
 	if (err)
@@ -1627,7 +1628,7 @@ static void __net_exit sit_exit_net(struct net *net)
 	LIST_HEAD(list);
 
 	rtnl_lock();
-	sit_destroy_tunnels(sitn, &list);
+	sit_destroy_tunnels(net, &list);
 	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns
  2013-06-24 14:13                     ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel
@ 2013-06-24 19:28                       ` Eric W. Biederman
  2013-06-24 21:11                         ` Nicolas Dichtel
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2013-06-24 19:28 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev, davem, bcrl, ravi.mlists

Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:

> This patch allows to switch the netns when packet is encapsulated or
> decapsulated. In other word, the encapsulated packet is received in a netns,
> where the lookup is done to find the tunnel. Once the tunnel is found, the
> packet is decapsulated and injecting into the corresponding interface which
> stands to another netns.
>
> When one of the two netns is removed, the tunnel is destroyed.

I don't see any fundamental problems with this code.  There are bugs
with the cleanup noted below.

The primary sit interface is marked as NETNS_LOCAL which is good.  A
comment might be nice explaining the reasonsing but for code
archeologists.

Conditionally calling dev_cleanup_skb bugs me.  The extra conditional
looks like a maintenance hazard.   Unless I have missed some subtle
detail either we don't need the cleanup at all or actually it is a bug
that we aren't scrubbing our packets as they progress through tunnels
even in the same network namespace.

Can we just make that function the skb scrubbing needed for packets to
traverse a tunnel?

My concern going into this was that we would get code that would break
because it would not be tested enough.  If we can remove the conditional
from dev_cleanup_skb we won't have any code that is conditionally run
and the logic looks simple enough not to bitrot in routine maintenance.

Eric

> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
>  include/net/ip_tunnels.h |  1 +
>  net/ipv4/ip_tunnel.c     |  7 ++++++-
>  net/ipv6/sit.c           | 49 ++++++++++++++++++++++++------------------------
>  3 files changed, 32 insertions(+), 25 deletions(-)
>
> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
> index b0d9824..781b3cf 100644
> --- a/include/net/ip_tunnels.h
> +++ b/include/net/ip_tunnels.h
> @@ -42,6 +42,7 @@ struct ip_tunnel {
>  	struct ip_tunnel __rcu	*next;
>  	struct hlist_node hash_node;
>  	struct net_device	*dev;
> +	struct net		*net;	/* netns for packet i/o */
>  
>  	int		err_count;	/* Number of arrived ICMP errors */
>  	unsigned long	err_time;	/* Time when the last ICMP error
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> index bd227e5..b6a7704 100644
> --- a/net/ipv4/ip_tunnel.c
> +++ b/net/ipv4/ip_tunnel.c
> @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net,
>  
>  	tunnel = netdev_priv(dev);
>  	tunnel->parms = *parms;
> +	tunnel->net = net;
>  
>  	err = register_netdevice(dev);
>  	if (err)
> @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>  	tstats->rx_bytes += skb->len;
>  	u64_stats_update_end(&tstats->syncp);
>  
> +	if (tunnel->net != dev_net(tunnel->dev))
> +		dev_cleanup_skb(skb);
> +
>  	if (tunnel->dev->type == ARPHRD_ETHER) {
>  		skb->protocol = eth_type_trans(skb, tunnel->dev);
>  		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
>  			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
>  	}
>  
> -	rt = ip_route_output_tunnel(dev_net(dev), &fl4,
> +	rt = ip_route_output_tunnel(tunnel->net, &fl4,
>  				    tunnel->parms.iph.protocol,
>  				    dst, tnl_params->saddr,
>  				    tunnel->parms.o_key,
> @@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
>  	if (ip_tunnel_find(itn, p, dev->type))
>  		return -EEXIST;
>  
> +	nt->net = net;
>  	nt->parms = *p;
>  	err = register_netdevice(dev);
>  	if (err)
> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
> index f639866..4e03904 100644
> --- a/net/ipv6/sit.c
> +++ b/net/ipv6/sit.c
> @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t)
>  
>  static void ipip6_tunnel_uninit(struct net_device *dev)
>  {
> -	struct net *net = dev_net(dev);
> -	struct sit_net *sitn = net_generic(net, sit_net_id);
> +	struct ip_tunnel *tunnel = netdev_priv(dev);
> +	struct sit_net *sitn = net_generic(tunnel->net, sit_net_id);
>  
>  	if (dev == sitn->fb_tunnel_dev) {
>  		RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL);
>  	} else {
> -		ipip6_tunnel_unlink(sitn, netdev_priv(dev));
> -		ipip6_tunnel_del_prl(netdev_priv(dev), NULL);
> +		ipip6_tunnel_unlink(sitn, tunnel);
> +		ipip6_tunnel_del_prl(tunnel, NULL);
>  	}
>  	dev_put(dev);
>  }
> @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb)
>  		tstats->rx_packets++;
>  		tstats->rx_bytes += skb->len;
>  
> +		if (tunnel->net != dev_net(tunnel->dev))
> +			dev_cleanup_skb(skb);
>  		netif_rx(skb);
>  
>  		return 0;
> @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
>  			goto tx_error;
>  	}
>  
> -	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
> +	rt = ip_route_output_ports(tunnel->net, &fl4, NULL,
>  				   dst, tiph->saddr,
>  				   0, 0,
>  				   IPPROTO_IPV6, RT_TOS(tos),
> @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
>  			tunnel->err_count = 0;
>  	}
>  
> +	if (tunnel->net != dev_net(dev))
> +		dev_cleanup_skb(skb);
> +
>  	/*
>  	 * Okay, now see if we can stuff it in the buffer as-is.
>  	 */
> @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
>  	iph = &tunnel->parms.iph;
>  
>  	if (iph->daddr) {
> -		struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
> +		struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4,
> +							  NULL,
>  							  iph->daddr, iph->saddr,
>  							  0, 0,
>  							  IPPROTO_IPV6,
> @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
>  	}
>  
>  	if (!tdev && tunnel->parms.link)
> -		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
> +		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
>  
>  	if (tdev) {
>  		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
> @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
>  
>  static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
>  {
> -	struct net *net = dev_net(t->dev);
> +	struct net *net = t->net;
>  	struct sit_net *sitn = net_generic(net, sit_net_id);
>  
>  	ipip6_tunnel_unlink(sitn, t);
> @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev)
>  	dev->priv_flags	       &= ~IFF_XMIT_DST_RELEASE;
>  	dev->iflink		= 0;
>  	dev->addr_len		= 4;
> -	dev->features		|= NETIF_F_NETNS_LOCAL;
>  	dev->features		|= NETIF_F_LLTX;
>  }
>  
> @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev)
>  	struct ip_tunnel *tunnel = netdev_priv(dev);
>  
>  	tunnel->dev = dev;
> +	tunnel->net = dev_net(dev);
>  
>  	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
>  	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
> @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev)
>  	struct sit_net *sitn = net_generic(net, sit_net_id);
>  
>  	tunnel->dev = dev;
> +	tunnel->net = dev_net(dev);
>  	strcpy(tunnel->parms.name, dev->name);
>  
>  	iph->version		= 4;
> @@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __read_mostly = {
>  	.priority	=	2,
>  };
>  
> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
> +static void __net_exit sit_destroy_tunnels(struct net *net,
> +					   struct list_head *head)
>  {
> -	int prio;
> +	struct net_device *dev, *aux;
>  
> -	for (prio = 1; prio < 4; prio++) {
> -		int h;
> -		for (h = 0; h < HASH_SIZE; h++) {
> -			struct ip_tunnel *t;
> -
> -			t = rtnl_dereference(sitn->tunnels[prio][h]);
> -			while (t != NULL) {
> -				unregister_netdevice_queue(t->dev, head);
> -				t = rtnl_dereference(t->next);
> -			}
> -		}
> -	}
> +	for_each_netdev_safe(net, dev, aux)
> +		if (dev->rtnl_link_ops &&
> +		    !strcmp(dev->rtnl_link_ops->kind, "sit"))
> +			unregister_netdevice_queue(dev, head);

This entire idiom change is a bit ugly, and it is wrong.

You need to look for two classes of tunnels to take down.  Tunnels that
originate in net and tunnels whose netdevice is in net.

For tunnles that reside in net you should be able to just compare the
rtnl_link_ops pointer, rather than an ascii name.

Tunnels that originate in this network namespace most definitely need to
be taken down as among other things you wisely do not keep a reference
count on the originating network namespace.

>  }
>  
>  static int __net_init sit_init_net(struct net *net)
> @@ -1598,6 +1598,7 @@ static int __net_init sit_init_net(struct net *net)
>  		goto err_alloc_dev;
>  	}
>  	dev_net_set(sitn->fb_tunnel_dev, net);
> +	sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
>  
>  	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
>  	if (err)
> @@ -1627,7 +1628,7 @@ static void __net_exit sit_exit_net(struct net *net)
>  	LIST_HEAD(list);
>  
>  	rtnl_lock();
> -	sit_destroy_tunnels(sitn, &list);
> +	sit_destroy_tunnels(net, &list);
>  	unregister_netdevice_queue(sitn->fb_tunnel_dev, &list);
>  	unregister_netdevice_many(&list);
>  	rtnl_unlock();

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns
  2013-06-24 19:28                       ` Eric W. Biederman
@ 2013-06-24 21:11                         ` Nicolas Dichtel
  2013-06-24 22:42                           ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-24 21:11 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, davem, bcrl, ravi.mlists

Le 24/06/2013 21:28, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>
>> This patch allows to switch the netns when packet is encapsulated or
>> decapsulated. In other word, the encapsulated packet is received in a netns,
>> where the lookup is done to find the tunnel. Once the tunnel is found, the
>> packet is decapsulated and injecting into the corresponding interface which
>> stands to another netns.
>>
>> When one of the two netns is removed, the tunnel is destroyed.
>
> I don't see any fundamental problems with this code.  There are bugs
> with the cleanup noted below.
>
> The primary sit interface is marked as NETNS_LOCAL which is good.  A
> comment might be nice explaining the reasonsing but for code
> archeologists.
Ok.

>
> Conditionally calling dev_cleanup_skb bugs me.  The extra conditional
> looks like a maintenance hazard.   Unless I have missed some subtle
> detail either we don't need the cleanup at all or actually it is a bug
> that we aren't scrubbing our packets as they progress through tunnels
> even in the same network namespace.
>
> Can we just make that function the skb scrubbing needed for packets to
> traverse a tunnel?
>
> My concern going into this was that we would get code that would break
> because it would not be tested enough.  If we can remove the conditional
> from dev_cleanup_skb we won't have any code that is conditionally run
> and the logic looks simple enough not to bitrot in routine maintenance.
My idea was to have the same level of cleanup/scrubbing that when a packet is 
sent from a netns to another netns through a veth. I cannot use 
dev_forward_skb() because this function expects to have an ethernet header, it's 
why I split it in the patch #1.

If we leave all information attached to the skb, we may have, for example, an 
skb with a conntrack from netns1 and a netdevice from netns2. It seems not safe, 
but maybe I'm wrong. And in fact, the conntrack will not be created in the 
second netns (nf_conntrack_in() => skb->nfct is not null and not a template => 
stats ignore++).
Another example is a socket from a netns and the netdevice or conntrack from 
another netns.
I was thinking that when a packet enter a namespace, it must not be associated 
to any object from the previous namespace, it should be like if we just receive 
it on the host.

Nicolas

>
> Eric
>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> ---
>>   include/net/ip_tunnels.h |  1 +
>>   net/ipv4/ip_tunnel.c     |  7 ++++++-
>>   net/ipv6/sit.c           | 49 ++++++++++++++++++++++++------------------------
>>   3 files changed, 32 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
>> index b0d9824..781b3cf 100644
>> --- a/include/net/ip_tunnels.h
>> +++ b/include/net/ip_tunnels.h
>> @@ -42,6 +42,7 @@ struct ip_tunnel {
>>   	struct ip_tunnel __rcu	*next;
>>   	struct hlist_node hash_node;
>>   	struct net_device	*dev;
>> +	struct net		*net;	/* netns for packet i/o */
>>
>>   	int		err_count;	/* Number of arrived ICMP errors */
>>   	unsigned long	err_time;	/* Time when the last ICMP error
>> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
>> index bd227e5..b6a7704 100644
>> --- a/net/ipv4/ip_tunnel.c
>> +++ b/net/ipv4/ip_tunnel.c
>> @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net,
>>
>>   	tunnel = netdev_priv(dev);
>>   	tunnel->parms = *parms;
>> +	tunnel->net = net;
>>
>>   	err = register_netdevice(dev);
>>   	if (err)
>> @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>>   	tstats->rx_bytes += skb->len;
>>   	u64_stats_update_end(&tstats->syncp);
>>
>> +	if (tunnel->net != dev_net(tunnel->dev))
>> +		dev_cleanup_skb(skb);
>> +
>>   	if (tunnel->dev->type == ARPHRD_ETHER) {
>>   		skb->protocol = eth_type_trans(skb, tunnel->dev);
>>   		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>> @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
>>   			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
>>   	}
>>
>> -	rt = ip_route_output_tunnel(dev_net(dev), &fl4,
>> +	rt = ip_route_output_tunnel(tunnel->net, &fl4,
>>   				    tunnel->parms.iph.protocol,
>>   				    dst, tnl_params->saddr,
>>   				    tunnel->parms.o_key,
>> @@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
>>   	if (ip_tunnel_find(itn, p, dev->type))
>>   		return -EEXIST;
>>
>> +	nt->net = net;
>>   	nt->parms = *p;
>>   	err = register_netdevice(dev);
>>   	if (err)
>> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
>> index f639866..4e03904 100644
>> --- a/net/ipv6/sit.c
>> +++ b/net/ipv6/sit.c
>> @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t)
>>
>>   static void ipip6_tunnel_uninit(struct net_device *dev)
>>   {
>> -	struct net *net = dev_net(dev);
>> -	struct sit_net *sitn = net_generic(net, sit_net_id);
>> +	struct ip_tunnel *tunnel = netdev_priv(dev);
>> +	struct sit_net *sitn = net_generic(tunnel->net, sit_net_id);
>>
>>   	if (dev == sitn->fb_tunnel_dev) {
>>   		RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL);
>>   	} else {
>> -		ipip6_tunnel_unlink(sitn, netdev_priv(dev));
>> -		ipip6_tunnel_del_prl(netdev_priv(dev), NULL);
>> +		ipip6_tunnel_unlink(sitn, tunnel);
>> +		ipip6_tunnel_del_prl(tunnel, NULL);
>>   	}
>>   	dev_put(dev);
>>   }
>> @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb)
>>   		tstats->rx_packets++;
>>   		tstats->rx_bytes += skb->len;
>>
>> +		if (tunnel->net != dev_net(tunnel->dev))
>> +			dev_cleanup_skb(skb);
>>   		netif_rx(skb);
>>
>>   		return 0;
>> @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
>>   			goto tx_error;
>>   	}
>>
>> -	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
>> +	rt = ip_route_output_ports(tunnel->net, &fl4, NULL,
>>   				   dst, tiph->saddr,
>>   				   0, 0,
>>   				   IPPROTO_IPV6, RT_TOS(tos),
>> @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
>>   			tunnel->err_count = 0;
>>   	}
>>
>> +	if (tunnel->net != dev_net(dev))
>> +		dev_cleanup_skb(skb);
>> +
>>   	/*
>>   	 * Okay, now see if we can stuff it in the buffer as-is.
>>   	 */
>> @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
>>   	iph = &tunnel->parms.iph;
>>
>>   	if (iph->daddr) {
>> -		struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
>> +		struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4,
>> +							  NULL,
>>   							  iph->daddr, iph->saddr,
>>   							  0, 0,
>>   							  IPPROTO_IPV6,
>> @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
>>   	}
>>
>>   	if (!tdev && tunnel->parms.link)
>> -		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
>> +		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
>>
>>   	if (tdev) {
>>   		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
>> @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
>>
>>   static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
>>   {
>> -	struct net *net = dev_net(t->dev);
>> +	struct net *net = t->net;
>>   	struct sit_net *sitn = net_generic(net, sit_net_id);
>>
>>   	ipip6_tunnel_unlink(sitn, t);
>> @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev)
>>   	dev->priv_flags	       &= ~IFF_XMIT_DST_RELEASE;
>>   	dev->iflink		= 0;
>>   	dev->addr_len		= 4;
>> -	dev->features		|= NETIF_F_NETNS_LOCAL;
>>   	dev->features		|= NETIF_F_LLTX;
>>   }
>>
>> @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev)
>>   	struct ip_tunnel *tunnel = netdev_priv(dev);
>>
>>   	tunnel->dev = dev;
>> +	tunnel->net = dev_net(dev);
>>
>>   	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
>>   	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
>> @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev)
>>   	struct sit_net *sitn = net_generic(net, sit_net_id);
>>
>>   	tunnel->dev = dev;
>> +	tunnel->net = dev_net(dev);
>>   	strcpy(tunnel->parms.name, dev->name);
>>
>>   	iph->version		= 4;
>> @@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __read_mostly = {
>>   	.priority	=	2,
>>   };
>>
>> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
>> +static void __net_exit sit_destroy_tunnels(struct net *net,
>> +					   struct list_head *head)
>>   {
>> -	int prio;
>> +	struct net_device *dev, *aux;
>>
>> -	for (prio = 1; prio < 4; prio++) {
>> -		int h;
>> -		for (h = 0; h < HASH_SIZE; h++) {
>> -			struct ip_tunnel *t;
>> -
>> -			t = rtnl_dereference(sitn->tunnels[prio][h]);
>> -			while (t != NULL) {
>> -				unregister_netdevice_queue(t->dev, head);
>> -				t = rtnl_dereference(t->next);
>> -			}
>> -		}
>> -	}
>> +	for_each_netdev_safe(net, dev, aux)
>> +		if (dev->rtnl_link_ops &&
>> +		    !strcmp(dev->rtnl_link_ops->kind, "sit"))
>> +			unregister_netdevice_queue(dev, head);
>
> This entire idiom change is a bit ugly, and it is wrong.
>
> You need to look for two classes of tunnels to take down.  Tunnels that
> originate in net and tunnels whose netdevice is in net.
>
> For tunnles that reside in net you should be able to just compare the
> rtnl_link_ops pointer, rather than an ascii name.
>
> Tunnels that originate in this network namespace most definitely need to
> be taken down as among other things you wisely do not keep a reference
> count on the originating network namespace.
Yes sure. My beta version was doing the right things, but I change this code 
before sending the patch :/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns
  2013-06-24 21:11                         ` Nicolas Dichtel
@ 2013-06-24 22:42                           ` Eric W. Biederman
  2013-06-25 14:10                             ` Nicolas Dichtel
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2013-06-24 22:42 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev, davem, bcrl, ravi.mlists

Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:

> Le 24/06/2013 21:28, Eric W. Biederman a écrit :
>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>
>>> This patch allows to switch the netns when packet is encapsulated or
>>> decapsulated. In other word, the encapsulated packet is received in a netns,
>>> where the lookup is done to find the tunnel. Once the tunnel is found, the
>>> packet is decapsulated and injecting into the corresponding interface which
>>> stands to another netns.
>>>
>>> When one of the two netns is removed, the tunnel is destroyed.
>>
>> I don't see any fundamental problems with this code.  There are bugs
>> with the cleanup noted below.
>>
>> The primary sit interface is marked as NETNS_LOCAL which is good.  A
>> comment might be nice explaining the reasonsing but for code
>> archeologists.
> Ok.
>
>>
>> Conditionally calling dev_cleanup_skb bugs me.  The extra conditional
>> looks like a maintenance hazard.   Unless I have missed some subtle
>> detail either we don't need the cleanup at all or actually it is a bug
>> that we aren't scrubbing our packets as they progress through tunnels
>> even in the same network namespace.
>>
>> Can we just make that function the skb scrubbing needed for packets to
>> traverse a tunnel?
>>
>> My concern going into this was that we would get code that would break
>> because it would not be tested enough.  If we can remove the conditional
>> from dev_cleanup_skb we won't have any code that is conditionally run
>> and the logic looks simple enough not to bitrot in routine maintenance.
> My idea was to have the same level of cleanup/scrubbing that when a packet is
> sent from a netns to another netns through a veth. I cannot use
> dev_forward_skb() because this function expects to have an ethernet header, it's
> why I split it in the patch #1.
>
> If we leave all information attached to the skb, we may have, for example, an
> skb with a conntrack from netns1 and a netdevice from netns2. It seems not safe,
> but maybe I'm wrong. And in fact, the conntrack will not be created in the
> second netns (nf_conntrack_in() => skb->nfct is not null and not a template =>
> stats ignore++).
> Another example is a socket from a netns and the netdevice or conntrack from
> another netns.

All of that I agree with.

I just don't see any need to make that scrubbing/cleaning of the packet
conditional.

Semantically going through a tunnel is the same as crossing between
network namespaces.  So you can change

>>> +	if (tunnel->net != dev_net(tunnel->dev))
>>> +		dev_cleanup_skb(skb);

to just:

	dev_cleanup_skb(skb);

> I was thinking that when a packet enter a namespace, it must not be associated
> to any object from the previous namespace, it should be like if we just receive
> it on the host.

Overall agree.  Tunnels have the same properties.

Which leads me to conclude either we are missing something or the
current tunnel code is mildly buggy because it does not do this level of
scrubbing.

Eric

>>> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
>>> +static void __net_exit sit_destroy_tunnels(struct net *net,
>>> +					   struct list_head *head)
>>>   {
>>> -	int prio;
>>> +	struct net_device *dev, *aux;
>>>
>>> -	for (prio = 1; prio < 4; prio++) {
>>> -		int h;
>>> -		for (h = 0; h < HASH_SIZE; h++) {
>>> -			struct ip_tunnel *t;
>>> -
>>> -			t = rtnl_dereference(sitn->tunnels[prio][h]);
>>> -			while (t != NULL) {
>>> -				unregister_netdevice_queue(t->dev, head);
>>> -				t = rtnl_dereference(t->next);
>>> -			}
>>> -		}
>>> -	}
>>> +	for_each_netdev_safe(net, dev, aux)
>>> +		if (dev->rtnl_link_ops &&
>>> +		    !strcmp(dev->rtnl_link_ops->kind, "sit"))
>>> +			unregister_netdevice_queue(dev, head);
>>
>> This entire idiom change is a bit ugly, and it is wrong.
>>
>> You need to look for two classes of tunnels to take down.  Tunnels that
>> originate in net and tunnels whose netdevice is in net.
>>
>> For tunnles that reside in net you should be able to just compare the
>> rtnl_link_ops pointer, rather than an ascii name.
>>
>> Tunnels that originate in this network namespace most definitely need to
>> be taken down as among other things you wisely do not keep a reference
>> count on the originating network namespace.
> Yes sure. My beta version was doing the right things, but I change this code
> before sending the patch :/

Bahahaha!  The dangers of the last minute cleanup.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns
  2013-06-24 22:42                           ` Eric W. Biederman
@ 2013-06-25 14:10                             ` Nicolas Dichtel
  2013-06-25 14:24                               ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  0 siblings, 1 reply; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-25 14:10 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, davem, bcrl, ravi.mlists

Le 25/06/2013 00:42, Eric W. Biederman a écrit :
> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>
>> Le 24/06/2013 21:28, Eric W. Biederman a écrit :
>>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes:
>>>
>>>> This patch allows to switch the netns when packet is encapsulated or
>>>> decapsulated. In other word, the encapsulated packet is received in a netns,
>>>> where the lookup is done to find the tunnel. Once the tunnel is found, the
>>>> packet is decapsulated and injecting into the corresponding interface which
>>>> stands to another netns.
>>>>
>>>> When one of the two netns is removed, the tunnel is destroyed.
>>>
>>> I don't see any fundamental problems with this code.  There are bugs
>>> with the cleanup noted below.
>>>
>>> The primary sit interface is marked as NETNS_LOCAL which is good.  A
>>> comment might be nice explaining the reasonsing but for code
>>> archeologists.
>> Ok.
>>
>>>
>>> Conditionally calling dev_cleanup_skb bugs me.  The extra conditional
>>> looks like a maintenance hazard.   Unless I have missed some subtle
>>> detail either we don't need the cleanup at all or actually it is a bug
>>> that we aren't scrubbing our packets as they progress through tunnels
>>> even in the same network namespace.
>>>
>>> Can we just make that function the skb scrubbing needed for packets to
>>> traverse a tunnel?
>>>
>>> My concern going into this was that we would get code that would break
>>> because it would not be tested enough.  If we can remove the conditional
>>> from dev_cleanup_skb we won't have any code that is conditionally run
>>> and the logic looks simple enough not to bitrot in routine maintenance.
>> My idea was to have the same level of cleanup/scrubbing that when a packet is
>> sent from a netns to another netns through a veth. I cannot use
>> dev_forward_skb() because this function expects to have an ethernet header, it's
>> why I split it in the patch #1.
>>
>> If we leave all information attached to the skb, we may have, for example, an
>> skb with a conntrack from netns1 and a netdevice from netns2. It seems not safe,
>> but maybe I'm wrong. And in fact, the conntrack will not be created in the
>> second netns (nf_conntrack_in() => skb->nfct is not null and not a template =>
>> stats ignore++).
>> Another example is a socket from a netns and the netdevice or conntrack from
>> another netns.
>
> All of that I agree with.
>
> I just don't see any need to make that scrubbing/cleaning of the packet
> conditional.
>
> Semantically going through a tunnel is the same as crossing between
> network namespaces.  So you can change
>
>>>> +	if (tunnel->net != dev_net(tunnel->dev))
>>>> +		dev_cleanup_skb(skb);
>
> to just:
>
> 	dev_cleanup_skb(skb);
>
>> I was thinking that when a packet enter a namespace, it must not be associated
>> to any object from the previous namespace, it should be like if we just receive
>> it on the host.
>
> Overall agree.  Tunnels have the same properties.
>
> Which leads me to conclude either we are missing something or the
> current tunnel code is mildly buggy because it does not do this level of
> scrubbing.
I'm afraid to break an existing scenario, but you're probably right. Let's 
remove this test.


Nicolas

>
> Eric
>
>>>> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
>>>> +static void __net_exit sit_destroy_tunnels(struct net *net,
>>>> +					   struct list_head *head)
>>>>    {
>>>> -	int prio;
>>>> +	struct net_device *dev, *aux;
>>>>
>>>> -	for (prio = 1; prio < 4; prio++) {
>>>> -		int h;
>>>> -		for (h = 0; h < HASH_SIZE; h++) {
>>>> -			struct ip_tunnel *t;
>>>> -
>>>> -			t = rtnl_dereference(sitn->tunnels[prio][h]);
>>>> -			while (t != NULL) {
>>>> -				unregister_netdevice_queue(t->dev, head);
>>>> -				t = rtnl_dereference(t->next);
>>>> -			}
>>>> -		}
>>>> -	}
>>>> +	for_each_netdev_safe(net, dev, aux)
>>>> +		if (dev->rtnl_link_ops &&
>>>> +		    !strcmp(dev->rtnl_link_ops->kind, "sit"))
>>>> +			unregister_netdevice_queue(dev, head);
>>>
>>> This entire idiom change is a bit ugly, and it is wrong.
>>>
>>> You need to look for two classes of tunnels to take down.  Tunnels that
>>> originate in net and tunnels whose netdevice is in net.
>>>
>>> For tunnles that reside in net you should be able to just compare the
>>> rtnl_link_ops pointer, rather than an ascii name.
>>>
>>> Tunnels that originate in this network namespace most definitely need to
>>> be taken down as among other things you wisely do not keep a reference
>>> count on the originating network namespace.
>> Yes sure. My beta version was doing the right things, but I change this code
>> before sending the patch :/
>
> Bahahaha!  The dangers of the last minute cleanup.
>
> Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap
  2013-06-25 14:10                             ` Nicolas Dichtel
@ 2013-06-25 14:24                               ` Nicolas Dichtel
  2013-06-25 14:24                                 ` [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
  2013-06-25 14:24                                 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  0 siblings, 2 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-25 14:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings


This patch is a follow up of the thread "switching network namespace midway":
http://marc.info/?t=135101459500004&r=1&w=2

The goal of this serie is to add x-netns support for the module sit, ie the
encapsulation addresses and the network device are not owned by the same
namespace.

Example to configure a tunnel:

modprobe sit
ip netns add netns1
ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249
ip l s sit1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s sit1 up
ip netns exec netns1 ip a a dev sit1 192.168.2.123 remote 192.168.2.121
ip netns exec netns1 ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121

Once this serie is approved, I will add the same feature for the module ipip and
ip6_tunnel.

v2:
  rename dev_cleanup_skb to skb_scrub_packet
  move skb_scrub_packet to skbuff.c
  fix netns cleanup
  remove string comparison in netns cleanup
  add a comment about FB device
  call skb_scrub_packet() unconditionnaly
  remove 'RFC'

 include/linux/skbuff.h   |  1 +
 include/net/ip_tunnels.h |  1 +
 net/core/dev.c           | 11 +----------
 net/core/skbuff.c        | 23 +++++++++++++++++++++++
 net/ipv4/ip_tunnel.c     |  6 +++++-
 net/ipv6/sit.c           | 40 ++++++++++++++++++++++++++++++----------
 6 files changed, 61 insertions(+), 21 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet()
  2013-06-25 14:24                               ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
@ 2013-06-25 14:24                                 ` Nicolas Dichtel
  2013-06-25 14:24                                 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  1 sibling, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-25 14:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, Nicolas Dichtel

The goal of this new function is to perform all needed cleanup before sending
an skb into another netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/linux/skbuff.h |  1 +
 net/core/dev.c         | 11 +----------
 net/core/skbuff.c      | 23 +++++++++++++++++++++++
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a7393ad..6b06023 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2384,6 +2384,7 @@ extern void	       skb_split(struct sk_buff *skb,
 				 struct sk_buff *skb1, const u32 len);
 extern int	       skb_shift(struct sk_buff *tgt, struct sk_buff *skb,
 				 int shiftlen);
+extern void	       skb_scrub_packet(struct sk_buff *skb);
 
 extern struct sk_buff *skb_segment(struct sk_buff *skb,
 				   netdev_features_t features);
diff --git a/net/core/dev.c b/net/core/dev.c
index 722f633..370354a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1652,22 +1652,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 		}
 	}
 
-	skb_orphan(skb);
-
 	if (unlikely(!is_skb_forwardable(dev, skb))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
-	skb->skb_iif = 0;
-	skb_dst_drop(skb);
-	skb->tstamp.tv64 = 0;
-	skb->pkt_type = PACKET_HOST;
+	skb_scrub_packet(skb);
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
-	nf_reset_trace(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9f73eca..b1fcb87 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3492,3 +3492,26 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	return true;
 }
 EXPORT_SYMBOL(skb_try_coalesce);
+
+/**
+ * skb_scrub_packet - scrub an skb before sending it to another netns
+ *
+ * @skb: buffer to clean
+ *
+ * skb_scrub_packet can be used to clean an skb before injecting it in
+ * another namespace. We have to clear all information in the skb that
+ * could impact namespace isolation.
+ */
+void skb_scrub_packet(struct sk_buff *skb)
+{
+	skb_orphan(skb);
+	skb->tstamp.tv64 = 0;
+	skb->pkt_type = PACKET_HOST;
+	skb->skb_iif = 0;
+	skb_dst_drop(skb);
+	skb->mark = 0;
+	secpath_reset(skb);
+	nf_reset(skb);
+	nf_reset_trace(skb);
+}
+EXPORT_SYMBOL_GPL(skb_scrub_packet);
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-25 14:24                               ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  2013-06-25 14:24                                 ` [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
@ 2013-06-25 14:24                                 ` Nicolas Dichtel
  2013-06-25 23:56                                   ` David Miller
  1 sibling, 1 reply; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-25 14:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip_tunnels.h |  1 +
 net/ipv4/ip_tunnel.c     |  6 +++++-
 net/ipv6/sit.c           | 40 ++++++++++++++++++++++++++++++----------
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index b0d9824..781b3cf 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -42,6 +42,7 @@ struct ip_tunnel {
 	struct ip_tunnel __rcu	*next;
 	struct hlist_node hash_node;
 	struct net_device	*dev;
+	struct net		*net;	/* netns for packet i/o */
 
 	int		err_count;	/* Number of arrived ICMP errors */
 	unsigned long	err_time;	/* Time when the last ICMP error
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index bd227e5..d375e4d 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net,
 
 	tunnel = netdev_priv(dev);
 	tunnel->parms = *parms;
+	tunnel->net = net;
 
 	err = register_netdevice(dev);
 	if (err)
@@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
 	tstats->rx_bytes += skb->len;
 	u64_stats_update_end(&tstats->syncp);
 
+	skb_scrub_packet(skb);
+
 	if (tunnel->dev->type == ARPHRD_ETHER) {
 		skb->protocol = eth_type_trans(skb, tunnel->dev);
 		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
@@ -541,7 +544,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
 	}
 
-	rt = ip_route_output_tunnel(dev_net(dev), &fl4,
+	rt = ip_route_output_tunnel(tunnel->net, &fl4,
 				    tunnel->parms.iph.protocol,
 				    dst, tnl_params->saddr,
 				    tunnel->parms.o_key,
@@ -888,6 +891,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
 	if (ip_tunnel_find(itn, p, dev->type))
 		return -EEXIST;
 
+	nt->net = net;
 	nt->parms = *p;
 	err = register_netdevice(dev);
 	if (err)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index f639866..8765f4e 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t)
 
 static void ipip6_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
-	struct sit_net *sitn = net_generic(net, sit_net_id);
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct sit_net *sitn = net_generic(tunnel->net, sit_net_id);
 
 	if (dev == sitn->fb_tunnel_dev) {
 		RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL);
 	} else {
-		ipip6_tunnel_unlink(sitn, netdev_priv(dev));
-		ipip6_tunnel_del_prl(netdev_priv(dev), NULL);
+		ipip6_tunnel_unlink(sitn, tunnel);
+		ipip6_tunnel_del_prl(tunnel, NULL);
 	}
 	dev_put(dev);
 }
@@ -621,6 +621,7 @@ static int ipip6_rcv(struct sk_buff *skb)
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 
+		skb_scrub_packet(skb);
 		netif_rx(skb);
 
 		return 0;
@@ -803,7 +804,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			goto tx_error;
 	}
 
-	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+	rt = ip_route_output_ports(tunnel->net, &fl4, NULL,
 				   dst, tiph->saddr,
 				   0, 0,
 				   IPPROTO_IPV6, RT_TOS(tos),
@@ -858,6 +859,8 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			tunnel->err_count = 0;
 	}
 
+	skb_scrub_packet(skb);
+
 	/*
 	 * Okay, now see if we can stuff it in the buffer as-is.
 	 */
@@ -944,7 +947,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	iph = &tunnel->parms.iph;
 
 	if (iph->daddr) {
-		struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+		struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4,
+							  NULL,
 							  iph->daddr, iph->saddr,
 							  0, 0,
 							  IPPROTO_IPV6,
@@ -959,7 +963,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	}
 
 	if (!tdev && tunnel->parms.link)
-		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
 
 	if (tdev) {
 		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
@@ -972,7 +976,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 
 static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
 {
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	ipip6_tunnel_unlink(sitn, t);
@@ -1248,7 +1252,6 @@ static void ipip6_tunnel_setup(struct net_device *dev)
 	dev->priv_flags	       &= ~IFF_XMIT_DST_RELEASE;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 }
 
@@ -1257,6 +1260,7 @@ static int ipip6_tunnel_init(struct net_device *dev)
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 
 	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
 	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
@@ -1277,6 +1281,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev)
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	iph->version		= 4;
@@ -1564,8 +1569,14 @@ static struct xfrm_tunnel ipip_handler __read_mostly = {
 
 static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
 {
+	struct net *net = dev_net(sitn->fb_tunnel_dev);
+	struct net_device *dev, *aux;
 	int prio;
 
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == &sit_link_ops)
+			unregister_netdevice_queue(dev, head);
+
 	for (prio = 1; prio < 4; prio++) {
 		int h;
 		for (h = 0; h < HASH_SIZE; h++) {
@@ -1573,7 +1584,12 @@ static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_hea
 
 			t = rtnl_dereference(sitn->tunnels[prio][h]);
 			while (t != NULL) {
-				unregister_netdevice_queue(t->dev, head);
+				/* If dev is in the same netns, it has already
+				 * been added to the list by the previous loop.
+				 */
+				if (dev_net(t->dev) != net)
+					unregister_netdevice_queue(t->dev,
+								   head);
 				t = rtnl_dereference(t->next);
 			}
 		}
@@ -1598,6 +1614,10 @@ static int __net_init sit_init_net(struct net *net)
 		goto err_alloc_dev;
 	}
 	dev_net_set(sitn->fb_tunnel_dev, net);
+	/* FB netdevice is special: we have one, and only one per netns.
+	 * Allowing to move it to another netns is clearly unsafe.
+	 */
+	sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 
 	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
 	if (err)
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-25 14:24                                 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
@ 2013-06-25 23:56                                   ` David Miller
  2013-06-26  1:35                                     ` Eric W. Biederman
  2013-06-26 13:49                                     ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  0 siblings, 2 replies; 52+ messages in thread
From: David Miller @ 2013-06-25 23:56 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 25 Jun 2013 16:24:55 +0200

> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>  	tstats->rx_bytes += skb->len;
>  	u64_stats_update_end(&tstats->syncp);
>  
> +	skb_scrub_packet(skb);
> +
>  	if (tunnel->dev->type == ARPHRD_ETHER) {
>  		skb->protocol = eth_type_trans(skb, tunnel->dev);
>  		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);

I can't see how this can be ok.

If something in netfilter depends upon the state you are clearing out
here, someone's packet filtering setup is going to break.

I'm not applying these patches, sorry.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-25 23:56                                   ` David Miller
@ 2013-06-26  1:35                                     ` Eric W. Biederman
  2013-06-26  5:48                                       ` David Miller
  2013-06-26 13:49                                     ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2013-06-26  1:35 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings

David Miller <davem@davemloft.net> writes:

> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Tue, 25 Jun 2013 16:24:55 +0200
>
>> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>>  	tstats->rx_bytes += skb->len;
>>  	u64_stats_update_end(&tstats->syncp);
>>  
>> +	skb_scrub_packet(skb);
>> +
>>  	if (tunnel->dev->type == ARPHRD_ETHER) {
>>  		skb->protocol = eth_type_trans(skb, tunnel->dev);
>>  		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>
> I can't see how this can be ok.
>
> If something in netfilter depends upon the state you are clearing out
> here, someone's packet filtering setup is going to break.
>
> I'm not applying these patches, sorry.

How can netfilter depend on the state of a packet inside of a tunnel?

How can it even make sense?

Or is your concern that we unintentionally allowed this in the past so
to avoid breaking binary compatibility we should continue in case
someone somewhere cares?

I really can't see how this could possibly be an intentional feature.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-26  1:35                                     ` Eric W. Biederman
@ 2013-06-26  5:48                                       ` David Miller
  2013-06-26 10:03                                         ` Eric W. Biederman
  0 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2013-06-26  5:48 UTC (permalink / raw)
  To: ebiederm; +Cc: nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Tue, 25 Jun 2013 18:35:30 -0700

> David Miller <davem@davemloft.net> writes:
> 
>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> Date: Tue, 25 Jun 2013 16:24:55 +0200
>>
>>> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>>>  	tstats->rx_bytes += skb->len;
>>>  	u64_stats_update_end(&tstats->syncp);
>>>  
>>> +	skb_scrub_packet(skb);
>>> +
>>>  	if (tunnel->dev->type == ARPHRD_ETHER) {
>>>  		skb->protocol = eth_type_trans(skb, tunnel->dev);
>>>  		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>>
>> I can't see how this can be ok.
>>
>> If something in netfilter depends upon the state you are clearing out
>> here, someone's packet filtering setup is going to break.
>>
>> I'm not applying these patches, sorry.
> 
> How can netfilter depend on the state of a packet inside of a tunnel?
> 
> How can it even make sense?
> 
> Or is your concern that we unintentionally allowed this in the past so
> to avoid breaking binary compatibility we should continue in case
> someone somewhere cares?
> 
> I really can't see how this could possibly be an intentional feature.

You can make all of these issues go away by only clearing the SKB
meta state when namespaces are actually changing as we go through
the tunnel.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-26  5:48                                       ` David Miller
@ 2013-06-26 10:03                                         ` Eric W. Biederman
  2013-06-26 10:22                                           ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2013-06-26 10:03 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Tue, 25 Jun 2013 18:35:30 -0700
>
>> David Miller <davem@davemloft.net> writes:
>> 
>>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>> Date: Tue, 25 Jun 2013 16:24:55 +0200
>>>
>>>> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>>>>  	tstats->rx_bytes += skb->len;
>>>>  	u64_stats_update_end(&tstats->syncp);
>>>>  
>>>> +	skb_scrub_packet(skb);
>>>> +
>>>>  	if (tunnel->dev->type == ARPHRD_ETHER) {
>>>>  		skb->protocol = eth_type_trans(skb, tunnel->dev);
>>>>  		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>>>
>>> I can't see how this can be ok.
>>>
>>> If something in netfilter depends upon the state you are clearing out
>>> here, someone's packet filtering setup is going to break.
>>>
>>> I'm not applying these patches, sorry.
>> 
>> How can netfilter depend on the state of a packet inside of a tunnel?
>> 
>> How can it even make sense?
>> 
>> Or is your concern that we unintentionally allowed this in the past so
>> to avoid breaking binary compatibility we should continue in case
>> someone somewhere cares?
>> 
>> I really can't see how this could possibly be an intentional feature.
>
> You can make all of these issues go away by only clearing the SKB
> meta state when namespaces are actually changing as we go through
> the tunnel.

I have spent some time thinking about the cases where I have had an
opportunity to use the marks on packets and it turns out that if I had
been using a tunnel with any of those configurations leaving the marks
on would have either broken my configuration or at the very least have
required me to make certain I changed those marks.

So I really think this is a bug fix, for a long standing bug in a rare
corner case of kernel behavior that people just haven't noticed.   Which
is why I suggested to Nicolas Ditchtel that he remove the test to see if
we were changing network namespaces before scrubbing the packet.

That said I won't object if Nocolas Ditchel resends his patches with
that test put back in.  I just think it is silly and when someone
finally gets bit by the bug and complains we will have to go through and
remove the test.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-26 10:03                                         ` Eric W. Biederman
@ 2013-06-26 10:22                                           ` Eric Dumazet
  2013-06-26 12:15                                             ` Nicolas Dichtel
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2013-06-26 10:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, nicolas.dichtel, netdev, bcrl, ravi.mlists,
	bhutchings

On Wed, 2013-06-26 at 03:03 -0700, Eric W. Biederman wrote:


> That said I won't object if Nocolas Ditchel resends his patches with
> that test put back in.  I just think it is silly and when someone
> finally gets bit by the bug and complains we will have to go through and
> remove the test.

Well, what is the reason skb_orphan() must be called in a tunnel xmit
path ?

This patch changes more things than what advertised in changelog :(

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-26 10:22                                           ` Eric Dumazet
@ 2013-06-26 12:15                                             ` Nicolas Dichtel
  2013-06-26 14:11                                               ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  0 siblings, 1 reply; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-26 12:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, David Miller, netdev, bcrl, ravi.mlists,
	bhutchings

Le 26/06/2013 12:22, Eric Dumazet a écrit :
> On Wed, 2013-06-26 at 03:03 -0700, Eric W. Biederman wrote:
>
>
>> That said I won't object if Nocolas Ditchel resends his patches with
>> that test put back in.  I just think it is silly and when someone
>> finally gets bit by the bug and complains we will have to go through and
>> remove the test.
>
> Well, what is the reason skb_orphan() must be called in a tunnel xmit
> path ?
>
> This patch changes more things than what advertised in changelog :(
In fact, this is true. If we finally found that skb_scrub_packet() is needed in 
all cases (not only when changing namespace), this will be another patch.

I will resend the serie with the test put back.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap
  2013-06-26 12:15                                             ` Nicolas Dichtel
@ 2013-06-26 14:11                                               ` Nicolas Dichtel
  2013-06-26 14:11                                                 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-26 14:11 UTC (permalink / raw)
  To: davem; +Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet


This patch is a follow up of the thread "switching network namespace midway":
http://marc.info/?t=135101459500004&r=1&w=2

The goal of this serie is to add x-netns support for the module sit, ie the
encapsulation addresses and the network device are not owned by the same
namespace.

Example to configure a tunnel:

modprobe sit
ip netns add netns1
ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249
ip l s sit1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s sit1 up
ip netns exec netns1 ip a a dev sit1 192.168.2.123 remote 192.168.2.121
ip netns exec netns1 ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121

Once this serie is approved, I will add the same feature for the module ipip and
ip6_tunnel.

v3:
  put again the test about netns before calling skb_scrub_packet()
  add a missing skb_scrub_packet() call in ip_tunnel_xmit()

v2:
  rename dev_cleanup_skb to skb_scrub_packet
  move skb_scrub_packet to skbuff.c
  fix netns cleanup
  remove string comparison in netns cleanup
  add a comment about FB device
  call skb_scrub_packet() unconditionnaly
  remove 'RFC'

 include/linux/skbuff.h   |  1 +
 include/net/ip_tunnels.h |  1 +
 net/core/dev.c           | 11 +----------
 net/core/skbuff.c        | 23 +++++++++++++++++++++++
 net/ipv4/ip_tunnel.c     | 10 +++++++++-
 net/ipv6/sit.c           | 42 ++++++++++++++++++++++++++++++++----------
 6 files changed, 67 insertions(+), 21 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet()
  2013-06-26 14:11                                               ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
@ 2013-06-26 14:11                                                 ` Nicolas Dichtel
  2013-06-26 14:11                                                 ` [PATCH v3 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
  2013-06-28  5:36                                                 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller
  2 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-26 14:11 UTC (permalink / raw)
  To: davem
  Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

The goal of this new function is to perform all needed cleanup before sending
an skb into another netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/linux/skbuff.h |  1 +
 net/core/dev.c         | 11 +----------
 net/core/skbuff.c      | 23 +++++++++++++++++++++++
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a7393ad..6b06023 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2384,6 +2384,7 @@ extern void	       skb_split(struct sk_buff *skb,
 				 struct sk_buff *skb1, const u32 len);
 extern int	       skb_shift(struct sk_buff *tgt, struct sk_buff *skb,
 				 int shiftlen);
+extern void	       skb_scrub_packet(struct sk_buff *skb);
 
 extern struct sk_buff *skb_segment(struct sk_buff *skb,
 				   netdev_features_t features);
diff --git a/net/core/dev.c b/net/core/dev.c
index 722f633..370354a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1652,22 +1652,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 		}
 	}
 
-	skb_orphan(skb);
-
 	if (unlikely(!is_skb_forwardable(dev, skb))) {
 		atomic_long_inc(&dev->rx_dropped);
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
-	skb->skb_iif = 0;
-	skb_dst_drop(skb);
-	skb->tstamp.tv64 = 0;
-	skb->pkt_type = PACKET_HOST;
+	skb_scrub_packet(skb);
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
-	nf_reset_trace(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9f73eca..b1fcb87 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3492,3 +3492,26 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	return true;
 }
 EXPORT_SYMBOL(skb_try_coalesce);
+
+/**
+ * skb_scrub_packet - scrub an skb before sending it to another netns
+ *
+ * @skb: buffer to clean
+ *
+ * skb_scrub_packet can be used to clean an skb before injecting it in
+ * another namespace. We have to clear all information in the skb that
+ * could impact namespace isolation.
+ */
+void skb_scrub_packet(struct sk_buff *skb)
+{
+	skb_orphan(skb);
+	skb->tstamp.tv64 = 0;
+	skb->pkt_type = PACKET_HOST;
+	skb->skb_iif = 0;
+	skb_dst_drop(skb);
+	skb->mark = 0;
+	secpath_reset(skb);
+	nf_reset(skb);
+	nf_reset_trace(skb);
+}
+EXPORT_SYMBOL_GPL(skb_scrub_packet);
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 net-next 2/2] sit: add support of x-netns
  2013-06-26 14:11                                               ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  2013-06-26 14:11                                                 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
@ 2013-06-26 14:11                                                 ` Nicolas Dichtel
  2013-06-28  5:36                                                 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller
  2 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-26 14:11 UTC (permalink / raw)
  To: davem
  Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip_tunnels.h |  1 +
 net/ipv4/ip_tunnel.c     | 10 +++++++++-
 net/ipv6/sit.c           | 42 ++++++++++++++++++++++++++++++++----------
 3 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index b0d9824..781b3cf 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -42,6 +42,7 @@ struct ip_tunnel {
 	struct ip_tunnel __rcu	*next;
 	struct hlist_node hash_node;
 	struct net_device	*dev;
+	struct net		*net;	/* netns for packet i/o */
 
 	int		err_count;	/* Number of arrived ICMP errors */
 	unsigned long	err_time;	/* Time when the last ICMP error
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index bd227e5..5ae01e3 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net,
 
 	tunnel = netdev_priv(dev);
 	tunnel->parms = *parms;
+	tunnel->net = net;
 
 	err = register_netdevice(dev);
 	if (err)
@@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
 	tstats->rx_bytes += skb->len;
 	u64_stats_update_end(&tstats->syncp);
 
+	if (tunnel->net != dev_net(tunnel->dev))
+		skb_scrub_packet(skb);
+
 	if (tunnel->dev->type == ARPHRD_ETHER) {
 		skb->protocol = eth_type_trans(skb, tunnel->dev);
 		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
@@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 			tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph);
 	}
 
-	rt = ip_route_output_tunnel(dev_net(dev), &fl4,
+	rt = ip_route_output_tunnel(tunnel->net, &fl4,
 				    tunnel->parms.iph.protocol,
 				    dst, tnl_params->saddr,
 				    tunnel->parms.o_key,
@@ -602,6 +606,9 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 	}
 #endif
 
+	if (tunnel->net != dev_net(dev))
+		skb_scrub_packet(skb);
+
 	if (tunnel->err_count > 0) {
 		if (time_before(jiffies,
 				tunnel->err_time + IPTUNNEL_ERR_TIMEO)) {
@@ -888,6 +895,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[],
 	if (ip_tunnel_find(itn, p, dev->type))
 		return -EEXIST;
 
+	nt->net = net;
 	nt->parms = *p;
 	err = register_netdevice(dev);
 	if (err)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index f639866..97a0bfe 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t)
 
 static void ipip6_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
-	struct sit_net *sitn = net_generic(net, sit_net_id);
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct sit_net *sitn = net_generic(tunnel->net, sit_net_id);
 
 	if (dev == sitn->fb_tunnel_dev) {
 		RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL);
 	} else {
-		ipip6_tunnel_unlink(sitn, netdev_priv(dev));
-		ipip6_tunnel_del_prl(netdev_priv(dev), NULL);
+		ipip6_tunnel_unlink(sitn, tunnel);
+		ipip6_tunnel_del_prl(tunnel, NULL);
 	}
 	dev_put(dev);
 }
@@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb)
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 
+		if (tunnel->net != dev_net(tunnel->dev))
+			skb_scrub_packet(skb);
 		netif_rx(skb);
 
 		return 0;
@@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			goto tx_error;
 	}
 
-	rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+	rt = ip_route_output_ports(tunnel->net, &fl4, NULL,
 				   dst, tiph->saddr,
 				   0, 0,
 				   IPPROTO_IPV6, RT_TOS(tos),
@@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			tunnel->err_count = 0;
 	}
 
+	if (tunnel->net != dev_net(dev))
+		skb_scrub_packet(skb);
+
 	/*
 	 * Okay, now see if we can stuff it in the buffer as-is.
 	 */
@@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	iph = &tunnel->parms.iph;
 
 	if (iph->daddr) {
-		struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL,
+		struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4,
+							  NULL,
 							  iph->daddr, iph->saddr,
 							  0, 0,
 							  IPPROTO_IPV6,
@@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 	}
 
 	if (!tdev && tunnel->parms.link)
-		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
 
 	if (tdev) {
 		dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr);
@@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev)
 
 static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p)
 {
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	ipip6_tunnel_unlink(sitn, t);
@@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev)
 	dev->priv_flags	       &= ~IFF_XMIT_DST_RELEASE;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 }
 
@@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev)
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 
 	memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4);
 	memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4);
@@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev)
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	iph->version		= 4;
@@ -1564,8 +1571,14 @@ static struct xfrm_tunnel ipip_handler __read_mostly = {
 
 static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head)
 {
+	struct net *net = dev_net(sitn->fb_tunnel_dev);
+	struct net_device *dev, *aux;
 	int prio;
 
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == &sit_link_ops)
+			unregister_netdevice_queue(dev, head);
+
 	for (prio = 1; prio < 4; prio++) {
 		int h;
 		for (h = 0; h < HASH_SIZE; h++) {
@@ -1573,7 +1586,12 @@ static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_hea
 
 			t = rtnl_dereference(sitn->tunnels[prio][h]);
 			while (t != NULL) {
-				unregister_netdevice_queue(t->dev, head);
+				/* If dev is in the same netns, it has already
+				 * been added to the list by the previous loop.
+				 */
+				if (dev_net(t->dev) != net)
+					unregister_netdevice_queue(t->dev,
+								   head);
 				t = rtnl_dereference(t->next);
 			}
 		}
@@ -1598,6 +1616,10 @@ static int __net_init sit_init_net(struct net *net)
 		goto err_alloc_dev;
 	}
 	dev_net_set(sitn->fb_tunnel_dev, net);
+	/* FB netdevice is special: we have one, and only one per netns.
+	 * Allowing to move it to another netns is clearly unsafe.
+	 */
+	sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 
 	err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev);
 	if (err)
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap
  2013-06-26 14:11                                               ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
  2013-06-26 14:11                                                 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
  2013-06-26 14:11                                                 ` [PATCH v3 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
@ 2013-06-28  5:36                                                 ` David Miller
  2013-07-03 15:00                                                   ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel
  2 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2013-06-28  5:36 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Wed, 26 Jun 2013 16:11:26 +0200

> This patch is a follow up of the thread "switching network namespace midway":
> http://marc.info/?t=135101459500004&r=1&w=2
> 
> The goal of this serie is to add x-netns support for the module sit, ie the
> encapsulation addresses and the network device are not owned by the same
> namespace.

Ok, applied.

And yes I agree we should look into making tunnel's behave consistently
wrt. SKB orphaning, cleaning netfilter state, etc.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap
  2013-06-28  5:36                                                 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller
@ 2013-07-03 15:00                                                   ` Nicolas Dichtel
  2013-07-03 15:00                                                     ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel
                                                                       ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet


This patch is a follow up of the previous serie witch add this functionality
for sit tunnels.

The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the
encapsulation addresses and the network device are not owned by the same
namespace.

Note that the first patch is a fix of the previous serie.

Example to configure an ipip tunnel:

modprobe ipip
ip netns add netns1
ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249
ip l s ipip1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s ipip1 up
ip netns exec netns1 ip a a dev ipip1 192.168.2.123 remote 192.168.2.121

or an ip6_tunnel:

modprobe ip6_tunnel
ip netns add netns1
ip link add ip6tnl1 type ip6tnl remote 2001:660:3008:c1c3::121 local 2001:660:3008:c1c3::123
ip l s ip6tnl1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s ip6tnl1 up
ip netns exec netns1 ip a a dev ip6tnl1 192.168.1.123 remote 192.168.1.121
ip netns exec netns1 ip -6 a a dev ip6tnl1 2001:1235::123 remote 2001:1235::121

 include/net/ip6_tunnel.h |  1 +
 include/net/ip_tunnels.h |  2 +-
 net/ipv4/ip_gre.c        |  4 ++--
 net/ipv4/ip_tunnel.c     | 42 +++++++++++++++++++++++++++---------------
 net/ipv4/ipip.c          |  3 +--
 net/ipv6/ip6_gre.c       |  5 +++++
 net/ipv6/ip6_tunnel.c    | 41 +++++++++++++++++++++++++++++++----------
 net/ipv6/sit.c           |  4 ++--
 8 files changed, 70 insertions(+), 32 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next 1/3] sit: fix tunnel update via netlink
  2013-07-03 15:00                                                   ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel
@ 2013-07-03 15:00                                                     ` Nicolas Dichtel
  2013-07-03 15:00                                                     ` [PATCH net-next 2/3] ipip: add x-netns support Nicolas Dichtel
                                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

The device can stand in another netns, hence we need to do the lookup in netns
tunnel->net.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv6/sit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 85ff37b..a3437a4 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1426,9 +1426,9 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev,
 static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[],
 			  struct nlattr *data[])
 {
-	struct ip_tunnel *t;
+	struct ip_tunnel *t = netdev_priv(dev);
 	struct ip_tunnel_parm p;
-	struct net *net = dev_net(dev);
+	struct net *net = t->net;
 	struct sit_net *sitn = net_generic(net, sit_net_id);
 #ifdef CONFIG_IPV6_SIT_6RD
 	struct ip_tunnel_6rd ip6rd;
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH net-next 2/3] ipip: add x-netns support
  2013-07-03 15:00                                                   ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel
  2013-07-03 15:00                                                     ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel
@ 2013-07-03 15:00                                                     ` Nicolas Dichtel
  2013-07-03 15:00                                                     ` [PATCH net-next 3/3] ip6tnl: " Nicolas Dichtel
  2013-07-04 21:56                                                     ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
  3 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip_tunnels.h |  2 +-
 net/ipv4/ip_gre.c        |  4 ++--
 net/ipv4/ip_tunnel.c     | 42 +++++++++++++++++++++++++++---------------
 net/ipv4/ipip.c          |  3 +--
 4 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 781b3cf..189fcac 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -102,7 +102,7 @@ void  ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 		       struct rtnl_link_ops *ops, char *devname);
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn);
+void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
 
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		    const struct iphdr *tnl_params, const u8 protocol);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 1f6eab6..bc3a765 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -534,7 +534,7 @@ static int __net_init ipgre_init_net(struct net *net)
 static void __net_exit ipgre_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, ipgre_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &ipgre_link_ops);
 }
 
 static struct pernet_operations ipgre_net_ops = {
@@ -767,7 +767,7 @@ static int __net_init ipgre_tap_init_net(struct net *net)
 static void __net_exit ipgre_tap_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, gre_tap_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &ipgre_tap_ops);
 }
 
 static struct pernet_operations ipgre_tap_net_ops = {
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 945734b..b470d78 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -350,7 +350,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
 		struct flowi4 fl4;
 		struct rtable *rt;
 
-		rt = ip_route_output_tunnel(dev_net(dev), &fl4,
+		rt = ip_route_output_tunnel(tunnel->net, &fl4,
 					    tunnel->parms.iph.protocol,
 					    iph->daddr, iph->saddr,
 					    tunnel->parms.o_key,
@@ -365,7 +365,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
 	}
 
 	if (!tdev && tunnel->parms.link)
-		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
 
 	if (tdev) {
 		hlen = tdev->hard_header_len + tdev->needed_headroom;
@@ -653,7 +653,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		}
 	}
 
-	err = iptunnel_xmit(dev_net(dev), rt, skb,
+	err = iptunnel_xmit(tunnel->net, rt, skb,
 			    fl4.saddr, fl4.daddr, protocol,
 			    ip_tunnel_ecn_encap(tos, inner_iph, skb), ttl, df);
 	iptunnel_xmit_stats(err, &dev->stats, dev->tstats);
@@ -820,11 +820,10 @@ static void ip_tunnel_dev_free(struct net_device *dev)
 
 void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
 {
-	struct net *net = dev_net(dev);
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 	struct ip_tunnel_net *itn;
 
-	itn = net_generic(net, tunnel->ip_tnl_net_id);
+	itn = net_generic(tunnel->net, tunnel->ip_tnl_net_id);
 
 	if (itn->fb_tunnel_dev != dev) {
 		ip_tunnel_del(netdev_priv(dev));
@@ -853,6 +852,10 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 
 	rtnl_lock();
 	itn->fb_tunnel_dev = __ip_tunnel_create(net, ops, &parms);
+	/* FB netdevice is special: we have one, and only one per netns.
+	 * Allowing to move it to another netns is clearly unsafe.
+	 */
+	itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 	rtnl_unlock();
 	if (IS_ERR(itn->fb_tunnel_dev)) {
 		kfree(itn->tunnels);
@@ -863,28 +866,39 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_init_net);
 
-static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head)
+static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head,
+			      struct rtnl_link_ops *ops)
 {
+	struct net *net = dev_net(itn->fb_tunnel_dev);
+	struct net_device *dev, *aux;
 	int h;
 
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == ops)
+			unregister_netdevice_queue(dev, head);
+
 	for (h = 0; h < IP_TNL_HASH_SIZE; h++) {
 		struct ip_tunnel *t;
 		struct hlist_node *n;
 		struct hlist_head *thead = &itn->tunnels[h];
 
 		hlist_for_each_entry_safe(t, n, thead, hash_node)
-			unregister_netdevice_queue(t->dev, head);
+			/* If dev is in the same netns, it has already
+			 * been added to the list by the previous loop.
+			 */
+			if (dev_net(t->dev) != net)
+				unregister_netdevice_queue(t->dev, head);
 	}
 	if (itn->fb_tunnel_dev)
 		unregister_netdevice_queue(itn->fb_tunnel_dev, head);
 }
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn)
+void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops)
 {
 	LIST_HEAD(list);
 
 	rtnl_lock();
-	ip_tunnel_destroy(itn, &list);
+	ip_tunnel_destroy(itn, &list, ops);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 	kfree(itn->tunnels);
@@ -929,23 +943,21 @@ EXPORT_SYMBOL_GPL(ip_tunnel_newlink);
 int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[],
 			 struct ip_tunnel_parm *p)
 {
-	struct ip_tunnel *t, *nt;
-	struct net *net = dev_net(dev);
+	struct ip_tunnel *t;
 	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct net *net = tunnel->net;
 	struct ip_tunnel_net *itn = net_generic(net, tunnel->ip_tnl_net_id);
 
 	if (dev == itn->fb_tunnel_dev)
 		return -EINVAL;
 
-	nt = netdev_priv(dev);
-
 	t = ip_tunnel_find(itn, p, dev->type);
 
 	if (t) {
 		if (t->dev != dev)
 			return -EEXIST;
 	} else {
-		t = nt;
+		t = tunnel;
 
 		if (dev->type != ARPHRD_ETHER) {
 			unsigned int nflags = 0;
@@ -994,8 +1006,8 @@ EXPORT_SYMBOL_GPL(ip_tunnel_init);
 
 void ip_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
 	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct net *net = tunnel->net;
 	struct ip_tunnel_net *itn;
 
 	itn = net_generic(net, tunnel->ip_tnl_net_id);
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 51fc2a1..87bd295 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -286,7 +286,6 @@ static void ipip_tunnel_setup(struct net_device *dev)
 	dev->flags		= IFF_NOARP;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
 
@@ -437,7 +436,7 @@ static int __net_init ipip_init_net(struct net *net)
 static void __net_exit ipip_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, ipip_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &ipip_link_ops);
 }
 
 static struct pernet_operations ipip_net_ops = {
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH net-next 3/3] ip6tnl: add x-netns support
  2013-07-03 15:00                                                   ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel
  2013-07-03 15:00                                                     ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel
  2013-07-03 15:00                                                     ` [PATCH net-next 2/3] ipip: add x-netns support Nicolas Dichtel
@ 2013-07-03 15:00                                                     ` Nicolas Dichtel
  2013-07-04 21:56                                                     ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
  3 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip6_tunnel.h |  1 +
 net/ipv6/ip6_gre.c       |  5 +++++
 net/ipv6/ip6_tunnel.c    | 41 +++++++++++++++++++++++++++++++----------
 3 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 4da5de1..2265b0b 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -36,6 +36,7 @@ struct __ip6_tnl_parm {
 struct ip6_tnl {
 	struct ip6_tnl __rcu *next;	/* next tunnel in list */
 	struct net_device *dev;	/* virtual device associated with tunnel */
+	struct net *net;	/* netns for packet i/o */
 	struct __ip6_tnl_parm parms;	/* tunnel configuration parameters */
 	struct flowi fl;	/* flowi template for xmit */
 	struct dst_entry *dst_cache;    /* cached dst */
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index ecd6073..f2d0a42 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -335,6 +335,7 @@ static struct ip6_tnl *ip6gre_tunnel_locate(struct net *net,
 	dev->rtnl_link_ops = &ip6gre_link_ops;
 
 	nt->dev = dev;
+	nt->net = dev_net(dev);
 	ip6gre_tnl_link_config(nt, 1);
 
 	if (register_netdevice(dev) < 0)
@@ -1255,6 +1256,7 @@ static int ip6gre_tunnel_init(struct net_device *dev)
 	tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr));
@@ -1275,6 +1277,7 @@ static void ip6gre_fb_tunnel_init(struct net_device *dev)
 	struct ip6_tnl *tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	tunnel->hlen		= sizeof(struct ipv6hdr) + 4;
@@ -1450,6 +1453,7 @@ static int ip6gre_tap_init(struct net_device *dev)
 	tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	ip6gre_tnl_link_config(tunnel, 1);
@@ -1501,6 +1505,7 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
 		eth_hw_addr_random(dev);
 
 	nt->dev = dev;
+	nt->net = dev_net(dev);
 	ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
 	/* Can use a lockless transmit, unless we generate output sequences */
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 1e55866..d773a3c 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -315,6 +315,7 @@ static struct ip6_tnl *ip6_tnl_create(struct net *net, struct __ip6_tnl_parm *p)
 
 	t = netdev_priv(dev);
 	t->parms = *p;
+	t->net = dev_net(dev);
 	err = ip6_tnl_create2(dev);
 	if (err < 0)
 		goto failed_free;
@@ -374,7 +375,7 @@ static void
 ip6_tnl_dev_uninit(struct net_device *dev)
 {
 	struct ip6_tnl *t = netdev_priv(dev);
-	struct net *net = dev_net(dev);
+	struct net *net = t->net;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
 	if (dev == ip6n->fb_tnl_dev)
@@ -741,7 +742,7 @@ int ip6_tnl_rcv_ctl(struct ip6_tnl *t,
 {
 	struct __ip6_tnl_parm *p = &t->parms;
 	int ret = 0;
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 
 	if ((p->flags & IP6_TNL_F_CAP_RCV) ||
 	    ((p->flags & IP6_TNL_F_CAP_PER_PACKET) &&
@@ -827,6 +828,9 @@ static int ip6_tnl_rcv(struct sk_buff *skb, __u16 protocol,
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 
+		if (t->net != dev_net(t->dev))
+			skb_scrub_packet(skb);
+
 		netif_rx(skb);
 
 		rcu_read_unlock();
@@ -895,7 +899,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t)
 {
 	struct __ip6_tnl_parm *p = &t->parms;
 	int ret = 0;
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 
 	if (p->flags & IP6_TNL_F_CAP_XMIT) {
 		struct net_device *ldev = NULL;
@@ -945,8 +949,8 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 			 int encap_limit,
 			 __u32 *pmtu)
 {
-	struct net *net = dev_net(dev);
 	struct ip6_tnl *t = netdev_priv(dev);
+	struct net *net = t->net;
 	struct net_device_stats *stats = &t->dev->stats;
 	struct ipv6hdr *ipv6h = ipv6_hdr(skb);
 	struct ipv6_tel_txoption opt;
@@ -996,6 +1000,9 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		goto tx_err_dst_release;
 	}
 
+	if (t->net != dev_net(dev))
+		skb_scrub_packet(skb);
+
 	/*
 	 * Okay, now see if we can stuff it in the buffer as-is.
 	 */
@@ -1202,7 +1209,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
 		int strict = (ipv6_addr_type(&p->raddr) &
 			      (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL));
 
-		struct rt6_info *rt = rt6_lookup(dev_net(dev),
+		struct rt6_info *rt = rt6_lookup(t->net,
 						 &p->raddr, &p->laddr,
 						 p->link, strict);
 
@@ -1251,7 +1258,7 @@ ip6_tnl_change(struct ip6_tnl *t, const struct __ip6_tnl_parm *p)
 
 static int ip6_tnl_update(struct ip6_tnl *t, struct __ip6_tnl_parm *p)
 {
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 	int err;
 
@@ -1463,7 +1470,6 @@ static void ip6_tnl_dev_setup(struct net_device *dev)
 		dev->mtu-=8;
 	dev->flags |= IFF_NOARP;
 	dev->addr_len = sizeof(struct in6_addr);
-	dev->features |= NETIF_F_NETNS_LOCAL;
 	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
 }
 
@@ -1479,6 +1485,7 @@ ip6_tnl_dev_init_gen(struct net_device *dev)
 	struct ip6_tnl *t = netdev_priv(dev);
 
 	t->dev = dev;
+	t->net = dev_net(dev);
 	dev->tstats = alloc_percpu(struct pcpu_tstats);
 	if (!dev->tstats)
 		return -ENOMEM;
@@ -1596,9 +1603,9 @@ static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev,
 static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[],
 			      struct nlattr *data[])
 {
-	struct ip6_tnl *t;
+	struct ip6_tnl *t = netdev_priv(dev);
 	struct __ip6_tnl_parm p;
-	struct net *net = dev_net(dev);
+	struct net *net = t->net;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
 	if (dev == ip6n->fb_tnl_dev)
@@ -1699,14 +1706,24 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly = {
 
 static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 {
+	struct net *net = dev_net(ip6n->fb_tnl_dev);
+	struct net_device *dev, *aux;
 	int h;
 	struct ip6_tnl *t;
 	LIST_HEAD(list);
 
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == &ip6_link_ops)
+			unregister_netdevice_queue(dev, &list);
+
 	for (h = 0; h < HASH_SIZE; h++) {
 		t = rtnl_dereference(ip6n->tnls_r_l[h]);
 		while (t != NULL) {
-			unregister_netdevice_queue(t->dev, &list);
+			/* If dev is in the same netns, it has already
+			 * been added to the list by the previous loop.
+			 */
+			if (dev_net(t->dev) != net)
+				unregister_netdevice_queue(t->dev, &list);
 			t = rtnl_dereference(t->next);
 		}
 	}
@@ -1732,6 +1749,10 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	if (!ip6n->fb_tnl_dev)
 		goto err_alloc_dev;
 	dev_net_set(ip6n->fb_tnl_dev, net);
+	/* FB netdevice is special: we have one, and only one per netns.
+	 * Allowing to move it to another netns is clearly unsafe.
+	 */
+	ip6n->fb_tnl_dev->features |= NETIF_F_NETNS_LOCAL;
 
 	err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
 	if (err < 0)
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap
  2013-07-03 15:00                                                   ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel
                                                                       ` (2 preceding siblings ...)
  2013-07-03 15:00                                                     ` [PATCH net-next 3/3] ip6tnl: " Nicolas Dichtel
@ 2013-07-04 21:56                                                     ` David Miller
  2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
  3 siblings, 1 reply; 52+ messages in thread
From: David Miller @ 2013-07-04 21:56 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Wed,  3 Jul 2013 17:00:33 +0200

> 
> This patch is a follow up of the previous serie witch add this functionality
> for sit tunnels.
> 
> The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the
> encapsulation addresses and the network device are not owned by the same
> namespace.
> 
> Note that the first patch is a fix of the previous serie.

The first patch, as it is a bug fix, is fine and is applied.

The rest will have to wait until the net-next tree opens again,
sorry.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap
  2013-07-04 21:56                                                     ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
@ 2013-08-13 15:51                                                       ` Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel
                                                                           ` (4 more replies)
  0 siblings, 5 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet


This serie is a follow up of the previous serie witch adds this functionality
for sit tunnels.

The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the
encapsulation addresses and the network device are not owned by the same
namespace.

Note that the two first patches are cleanup.

Example to configure an ipip tunnel:

modprobe ipip
ip netns add netns1
ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249
ip l s ipip1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s ipip1 up
ip netns exec netns1 ip a a dev ipip1 192.168.2.123 remote 192.168.2.121

or an ip6_tunnel:

modprobe ip6_tunnel
ip netns add netns1
ip link add ip6tnl1 type ip6tnl remote 2001:660:3008:c1c3::121 local 2001:660:3008:c1c3::123
ip l s ip6tnl1 netns netns1
ip netns exec netns1 ip l s lo up
ip netns exec netns1 ip l s ip6tnl1 up
ip netns exec netns1 ip a a dev ip6tnl1 192.168.1.123 remote 192.168.1.121
ip netns exec netns1 ip -6 a a dev ip6tnl1 2001:1235::123 remote 2001:1235::121

v2: remove the patch 1/3 of the v1 serie (already included)
    use net_eq()
    add patch 1/4 and 2/4

 include/net/ip6_tunnel.h |  1 +
 include/net/ip_tunnels.h |  2 +-
 net/core/dev.c           |  6 +++---
 net/ipv4/ip_gre.c        |  4 ++--
 net/ipv4/ip_tunnel.c     | 52 ++++++++++++++++++++++++++++++------------------
 net/ipv4/ip_vti.c        |  2 +-
 net/ipv4/ipip.c          |  3 +--
 net/ipv6/ip6_gre.c       |  5 +++++
 net/ipv6/ip6_tunnel.c    | 41 ++++++++++++++++++++++++++++----------
 net/ipv6/sit.c           |  6 +++---
 10 files changed, 81 insertions(+), 41 deletions(-)

Comments are welcome.

Regards,
Nicolas

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans()
  2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
@ 2013-08-13 15:51                                                         ` Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns Nicolas Dichtel
                                                                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans()
set pkt_type.

In fact, we should force pkt_type to PACKET_HOST, so move the call after
eth_type_trans().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/core/dev.c       | 6 +++---
 net/ipv4/ip_tunnel.c | 7 ++++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 58eb802584b9..1ed2b66a10a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1691,13 +1691,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
-	skb_scrub_packet(skb);
 	skb->protocol = eth_type_trans(skb, dev);
 
 	/* eth_type_trans() can set pkt_type.
-	 * clear pkt_type _after_ calling eth_type_trans()
+	 * call skb_scrub_packet() after it to clear pkt_type _after_ calling
+	 * eth_type_trans().
 	 */
-	skb->pkt_type = PACKET_HOST;
+	skb_scrub_packet(skb);
 
 	return netif_rx(skb);
 }
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 9fdf8a6d95f3..fbc1094964bf 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -454,15 +454,16 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
 	tstats->rx_bytes += skb->len;
 	u64_stats_update_end(&tstats->syncp);
 
-	if (tunnel->net != dev_net(tunnel->dev))
-		skb_scrub_packet(skb);
-
 	if (tunnel->dev->type == ARPHRD_ETHER) {
 		skb->protocol = eth_type_trans(skb, tunnel->dev);
 		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
 	} else {
 		skb->dev = tunnel->dev;
 	}
+
+	if (tunnel->net != dev_net(tunnel->dev))
+		skb_scrub_packet(skb);
+
 	gro_cells_receive(&tunnel->gro_cells, skb);
 	return 0;
 
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns
  2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel
@ 2013-08-13 15:51                                                         ` Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 3/4] ipip: add x-netns support Nicolas Dichtel
                                                                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

It's better to use available helpers for these tests.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 net/ipv4/ip_tunnel.c | 4 ++--
 net/ipv6/sit.c       | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index fbc1094964bf..a351a003ee6b 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -461,7 +461,7 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
 		skb->dev = tunnel->dev;
 	}
 
-	if (tunnel->net != dev_net(tunnel->dev))
+	if (!net_eq(tunnel->net, dev_net(tunnel->dev)))
 		skb_scrub_packet(skb);
 
 	gro_cells_receive(&tunnel->gro_cells, skb);
@@ -614,7 +614,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto tx_error;
 	}
 
-	if (tunnel->net != dev_net(dev))
+	if (!net_eq(tunnel->net, dev_net(dev)))
 		skb_scrub_packet(skb);
 
 	if (tunnel->err_count > 0) {
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index a3437a4cd07e..f18f842ac893 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -621,7 +621,7 @@ static int ipip6_rcv(struct sk_buff *skb)
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 
-		if (tunnel->net != dev_net(tunnel->dev))
+		if (!net_eq(tunnel->net, dev_net(tunnel->dev)))
 			skb_scrub_packet(skb);
 		netif_rx(skb);
 
@@ -860,7 +860,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			tunnel->err_count = 0;
 	}
 
-	if (tunnel->net != dev_net(dev))
+	if (!net_eq(tunnel->net, dev_net(dev)))
 		skb_scrub_packet(skb);
 
 	/*
@@ -1589,7 +1589,7 @@ static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_hea
 				/* If dev is in the same netns, it has already
 				 * been added to the list by the previous loop.
 				 */
-				if (dev_net(t->dev) != net)
+				if (!net_eq(dev_net(t->dev), net))
 					unregister_netdevice_queue(t->dev,
 								   head);
 				t = rtnl_dereference(t->next);
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH net-next v2 3/4] ipip: add x-netns support
  2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns Nicolas Dichtel
@ 2013-08-13 15:51                                                         ` Nicolas Dichtel
  2013-08-13 15:51                                                         ` [PATCH net-next v2 4/4] ip6tnl: " Nicolas Dichtel
  2013-08-15  8:01                                                         ` [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
  4 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip_tunnels.h |  2 +-
 net/ipv4/ip_gre.c        |  4 ++--
 net/ipv4/ip_tunnel.c     | 43 ++++++++++++++++++++++++++++---------------
 net/ipv4/ip_vti.c        |  2 +-
 net/ipv4/ipip.c          |  3 +--
 5 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index c6acd9f8f877..5a76f2bef822 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -102,7 +102,7 @@ void  ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 		       struct rtnl_link_ops *ops, char *devname);
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn);
+void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
 
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		    const struct iphdr *tnl_params, const u8 protocol);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 1f6eab66f7ce..bc3a76521deb 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -534,7 +534,7 @@ static int __net_init ipgre_init_net(struct net *net)
 static void __net_exit ipgre_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, ipgre_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &ipgre_link_ops);
 }
 
 static struct pernet_operations ipgre_net_ops = {
@@ -767,7 +767,7 @@ static int __net_init ipgre_tap_init_net(struct net *net)
 static void __net_exit ipgre_tap_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, gre_tap_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &ipgre_tap_ops);
 }
 
 static struct pernet_operations ipgre_tap_net_ops = {
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index a351a003ee6b..a4d9126c7b51 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -350,7 +350,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
 		struct flowi4 fl4;
 		struct rtable *rt;
 
-		rt = ip_route_output_tunnel(dev_net(dev), &fl4,
+		rt = ip_route_output_tunnel(tunnel->net, &fl4,
 					    tunnel->parms.iph.protocol,
 					    iph->daddr, iph->saddr,
 					    tunnel->parms.o_key,
@@ -365,7 +365,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
 	}
 
 	if (!tdev && tunnel->parms.link)
-		tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link);
+		tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link);
 
 	if (tdev) {
 		hlen = tdev->hard_header_len + tdev->needed_headroom;
@@ -654,7 +654,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		}
 	}
 
-	err = iptunnel_xmit(dev_net(dev), rt, skb,
+	err = iptunnel_xmit(tunnel->net, rt, skb,
 			    fl4.saddr, fl4.daddr, protocol,
 			    ip_tunnel_ecn_encap(tos, inner_iph, skb), ttl, df);
 	iptunnel_xmit_stats(err, &dev->stats, dev->tstats);
@@ -821,11 +821,10 @@ static void ip_tunnel_dev_free(struct net_device *dev)
 
 void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
 {
-	struct net *net = dev_net(dev);
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 	struct ip_tunnel_net *itn;
 
-	itn = net_generic(net, tunnel->ip_tnl_net_id);
+	itn = net_generic(tunnel->net, tunnel->ip_tnl_net_id);
 
 	if (itn->fb_tunnel_dev != dev) {
 		ip_tunnel_del(netdev_priv(dev));
@@ -855,6 +854,10 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 
 	rtnl_lock();
 	itn->fb_tunnel_dev = __ip_tunnel_create(net, ops, &parms);
+	/* FB netdevice is special: we have one, and only one per netns.
+	 * Allowing to move it to another netns is clearly unsafe.
+	 */
+	itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL;
 	rtnl_unlock();
 
 	if (IS_ERR(itn->fb_tunnel_dev))
@@ -864,28 +867,39 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_init_net);
 
-static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head)
+static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head,
+			      struct rtnl_link_ops *ops)
 {
+	struct net *net = dev_net(itn->fb_tunnel_dev);
+	struct net_device *dev, *aux;
 	int h;
 
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == ops)
+			unregister_netdevice_queue(dev, head);
+
 	for (h = 0; h < IP_TNL_HASH_SIZE; h++) {
 		struct ip_tunnel *t;
 		struct hlist_node *n;
 		struct hlist_head *thead = &itn->tunnels[h];
 
 		hlist_for_each_entry_safe(t, n, thead, hash_node)
-			unregister_netdevice_queue(t->dev, head);
+			/* If dev is in the same netns, it has already
+			 * been added to the list by the previous loop.
+			 */
+			if (!net_eq(dev_net(t->dev), net))
+				unregister_netdevice_queue(t->dev, head);
 	}
 	if (itn->fb_tunnel_dev)
 		unregister_netdevice_queue(itn->fb_tunnel_dev, head);
 }
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn)
+void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops)
 {
 	LIST_HEAD(list);
 
 	rtnl_lock();
-	ip_tunnel_destroy(itn, &list);
+	ip_tunnel_destroy(itn, &list, ops);
 	unregister_netdevice_many(&list);
 	rtnl_unlock();
 }
@@ -929,23 +943,21 @@ EXPORT_SYMBOL_GPL(ip_tunnel_newlink);
 int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[],
 			 struct ip_tunnel_parm *p)
 {
-	struct ip_tunnel *t, *nt;
-	struct net *net = dev_net(dev);
+	struct ip_tunnel *t;
 	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct net *net = tunnel->net;
 	struct ip_tunnel_net *itn = net_generic(net, tunnel->ip_tnl_net_id);
 
 	if (dev == itn->fb_tunnel_dev)
 		return -EINVAL;
 
-	nt = netdev_priv(dev);
-
 	t = ip_tunnel_find(itn, p, dev->type);
 
 	if (t) {
 		if (t->dev != dev)
 			return -EEXIST;
 	} else {
-		t = nt;
+		t = tunnel;
 
 		if (dev->type != ARPHRD_ETHER) {
 			unsigned int nflags = 0;
@@ -984,6 +996,7 @@ int ip_tunnel_init(struct net_device *dev)
 	}
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 	iph->version		= 4;
 	iph->ihl		= 5;
@@ -994,8 +1007,8 @@ EXPORT_SYMBOL_GPL(ip_tunnel_init);
 
 void ip_tunnel_uninit(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
 	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct net *net = tunnel->net;
 	struct ip_tunnel_net *itn;
 
 	itn = net_generic(net, tunnel->ip_tnl_net_id);
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 79b263da4168..e805e7b3030e 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -318,7 +318,7 @@ static int __net_init vti_init_net(struct net *net)
 static void __net_exit vti_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, vti_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &vti_link_ops);
 }
 
 static struct pernet_operations vti_net_ops = {
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 51fc2a1dcdd3..87bd2952c733 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -286,7 +286,6 @@ static void ipip_tunnel_setup(struct net_device *dev)
 	dev->flags		= IFF_NOARP;
 	dev->iflink		= 0;
 	dev->addr_len		= 4;
-	dev->features		|= NETIF_F_NETNS_LOCAL;
 	dev->features		|= NETIF_F_LLTX;
 	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
 
@@ -437,7 +436,7 @@ static int __net_init ipip_init_net(struct net *net)
 static void __net_exit ipip_exit_net(struct net *net)
 {
 	struct ip_tunnel_net *itn = net_generic(net, ipip_net_id);
-	ip_tunnel_delete_net(itn);
+	ip_tunnel_delete_net(itn, &ipip_link_ops);
 }
 
 static struct pernet_operations ipip_net_ops = {
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH net-next v2 4/4] ip6tnl: add x-netns support
  2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
                                                                           ` (2 preceding siblings ...)
  2013-08-13 15:51                                                         ` [PATCH net-next v2 3/4] ipip: add x-netns support Nicolas Dichtel
@ 2013-08-13 15:51                                                         ` Nicolas Dichtel
  2013-08-15  8:01                                                         ` [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
  4 siblings, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw)
  To: netdev
  Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet,
	Nicolas Dichtel

This patch allows to switch the netns when packet is encapsulated or
decapsulated. In other word, the encapsulated packet is received in a netns,
where the lookup is done to find the tunnel. Once the tunnel is found, the
packet is decapsulated and injecting into the corresponding interface which
stands to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/net/ip6_tunnel.h |  1 +
 net/ipv6/ip6_gre.c       |  5 +++++
 net/ipv6/ip6_tunnel.c    | 41 +++++++++++++++++++++++++++++++----------
 3 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 4da5de10d1d4..2265b0bf97e5 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -36,6 +36,7 @@ struct __ip6_tnl_parm {
 struct ip6_tnl {
 	struct ip6_tnl __rcu *next;	/* next tunnel in list */
 	struct net_device *dev;	/* virtual device associated with tunnel */
+	struct net *net;	/* netns for packet i/o */
 	struct __ip6_tnl_parm parms;	/* tunnel configuration parameters */
 	struct flowi fl;	/* flowi template for xmit */
 	struct dst_entry *dst_cache;    /* cached dst */
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index ecd60733e5e2..f2d0a42f8057 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -335,6 +335,7 @@ static struct ip6_tnl *ip6gre_tunnel_locate(struct net *net,
 	dev->rtnl_link_ops = &ip6gre_link_ops;
 
 	nt->dev = dev;
+	nt->net = dev_net(dev);
 	ip6gre_tnl_link_config(nt, 1);
 
 	if (register_netdevice(dev) < 0)
@@ -1255,6 +1256,7 @@ static int ip6gre_tunnel_init(struct net_device *dev)
 	tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr));
@@ -1275,6 +1277,7 @@ static void ip6gre_fb_tunnel_init(struct net_device *dev)
 	struct ip6_tnl *tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	tunnel->hlen		= sizeof(struct ipv6hdr) + 4;
@@ -1450,6 +1453,7 @@ static int ip6gre_tap_init(struct net_device *dev)
 	tunnel = netdev_priv(dev);
 
 	tunnel->dev = dev;
+	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
 	ip6gre_tnl_link_config(tunnel, 1);
@@ -1501,6 +1505,7 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev,
 		eth_hw_addr_random(dev);
 
 	nt->dev = dev;
+	nt->net = dev_net(dev);
 	ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
 	/* Can use a lockless transmit, unless we generate output sequences */
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 1e55866cead7..cc3bb201b8b0 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -315,6 +315,7 @@ static struct ip6_tnl *ip6_tnl_create(struct net *net, struct __ip6_tnl_parm *p)
 
 	t = netdev_priv(dev);
 	t->parms = *p;
+	t->net = dev_net(dev);
 	err = ip6_tnl_create2(dev);
 	if (err < 0)
 		goto failed_free;
@@ -374,7 +375,7 @@ static void
 ip6_tnl_dev_uninit(struct net_device *dev)
 {
 	struct ip6_tnl *t = netdev_priv(dev);
-	struct net *net = dev_net(dev);
+	struct net *net = t->net;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
 	if (dev == ip6n->fb_tnl_dev)
@@ -741,7 +742,7 @@ int ip6_tnl_rcv_ctl(struct ip6_tnl *t,
 {
 	struct __ip6_tnl_parm *p = &t->parms;
 	int ret = 0;
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 
 	if ((p->flags & IP6_TNL_F_CAP_RCV) ||
 	    ((p->flags & IP6_TNL_F_CAP_PER_PACKET) &&
@@ -827,6 +828,9 @@ static int ip6_tnl_rcv(struct sk_buff *skb, __u16 protocol,
 		tstats->rx_packets++;
 		tstats->rx_bytes += skb->len;
 
+		if (!net_eq(t->net, dev_net(t->dev)))
+			skb_scrub_packet(skb);
+
 		netif_rx(skb);
 
 		rcu_read_unlock();
@@ -895,7 +899,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t)
 {
 	struct __ip6_tnl_parm *p = &t->parms;
 	int ret = 0;
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 
 	if (p->flags & IP6_TNL_F_CAP_XMIT) {
 		struct net_device *ldev = NULL;
@@ -945,8 +949,8 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 			 int encap_limit,
 			 __u32 *pmtu)
 {
-	struct net *net = dev_net(dev);
 	struct ip6_tnl *t = netdev_priv(dev);
+	struct net *net = t->net;
 	struct net_device_stats *stats = &t->dev->stats;
 	struct ipv6hdr *ipv6h = ipv6_hdr(skb);
 	struct ipv6_tel_txoption opt;
@@ -996,6 +1000,9 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		goto tx_err_dst_release;
 	}
 
+	if (!net_eq(t->net, dev_net(dev)))
+		skb_scrub_packet(skb);
+
 	/*
 	 * Okay, now see if we can stuff it in the buffer as-is.
 	 */
@@ -1202,7 +1209,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t)
 		int strict = (ipv6_addr_type(&p->raddr) &
 			      (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL));
 
-		struct rt6_info *rt = rt6_lookup(dev_net(dev),
+		struct rt6_info *rt = rt6_lookup(t->net,
 						 &p->raddr, &p->laddr,
 						 p->link, strict);
 
@@ -1251,7 +1258,7 @@ ip6_tnl_change(struct ip6_tnl *t, const struct __ip6_tnl_parm *p)
 
 static int ip6_tnl_update(struct ip6_tnl *t, struct __ip6_tnl_parm *p)
 {
-	struct net *net = dev_net(t->dev);
+	struct net *net = t->net;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 	int err;
 
@@ -1463,7 +1470,6 @@ static void ip6_tnl_dev_setup(struct net_device *dev)
 		dev->mtu-=8;
 	dev->flags |= IFF_NOARP;
 	dev->addr_len = sizeof(struct in6_addr);
-	dev->features |= NETIF_F_NETNS_LOCAL;
 	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
 }
 
@@ -1479,6 +1485,7 @@ ip6_tnl_dev_init_gen(struct net_device *dev)
 	struct ip6_tnl *t = netdev_priv(dev);
 
 	t->dev = dev;
+	t->net = dev_net(dev);
 	dev->tstats = alloc_percpu(struct pcpu_tstats);
 	if (!dev->tstats)
 		return -ENOMEM;
@@ -1596,9 +1603,9 @@ static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev,
 static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[],
 			      struct nlattr *data[])
 {
-	struct ip6_tnl *t;
+	struct ip6_tnl *t = netdev_priv(dev);
 	struct __ip6_tnl_parm p;
-	struct net *net = dev_net(dev);
+	struct net *net = t->net;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
 	if (dev == ip6n->fb_tnl_dev)
@@ -1699,14 +1706,24 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly = {
 
 static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n)
 {
+	struct net *net = dev_net(ip6n->fb_tnl_dev);
+	struct net_device *dev, *aux;
 	int h;
 	struct ip6_tnl *t;
 	LIST_HEAD(list);
 
+	for_each_netdev_safe(net, dev, aux)
+		if (dev->rtnl_link_ops == &ip6_link_ops)
+			unregister_netdevice_queue(dev, &list);
+
 	for (h = 0; h < HASH_SIZE; h++) {
 		t = rtnl_dereference(ip6n->tnls_r_l[h]);
 		while (t != NULL) {
-			unregister_netdevice_queue(t->dev, &list);
+			/* If dev is in the same netns, it has already
+			 * been added to the list by the previous loop.
+			 */
+			if (!net_eq(dev_net(t->dev), net))
+				unregister_netdevice_queue(t->dev, &list);
 			t = rtnl_dereference(t->next);
 		}
 	}
@@ -1732,6 +1749,10 @@ static int __net_init ip6_tnl_init_net(struct net *net)
 	if (!ip6n->fb_tnl_dev)
 		goto err_alloc_dev;
 	dev_net_set(ip6n->fb_tnl_dev, net);
+	/* FB netdevice is special: we have one, and only one per netns.
+	 * Allowing to move it to another netns is clearly unsafe.
+	 */
+	ip6n->fb_tnl_dev->features |= NETIF_F_NETNS_LOCAL;
 
 	err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev);
 	if (err < 0)
-- 
1.8.2.1

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap
  2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
                                                                           ` (3 preceding siblings ...)
  2013-08-13 15:51                                                         ` [PATCH net-next v2 4/4] ip6tnl: " Nicolas Dichtel
@ 2013-08-15  8:01                                                         ` David Miller
  4 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2013-08-15  8:01 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 13 Aug 2013 17:51:08 +0200

> 
> This serie is a follow up of the previous serie witch adds this functionality
> for sit tunnels.
> 
> The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the
> encapsulation addresses and the network device are not owned by the same
> namespace.
> 
> Note that the two first patches are cleanup.

Looks good, series applied, thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns
  2013-06-25 23:56                                   ` David Miller
  2013-06-26  1:35                                     ` Eric W. Biederman
@ 2013-06-26 13:49                                     ` Nicolas Dichtel
  1 sibling, 0 replies; 52+ messages in thread
From: Nicolas Dichtel @ 2013-06-26 13:49 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, Eric Dumazet

Le 26/06/2013 01:56, David Miller a écrit :
> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Tue, 25 Jun 2013 16:24:55 +0200
>
>> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
>>   	tstats->rx_bytes += skb->len;
>>   	u64_stats_update_end(&tstats->syncp);
>>
>> +	skb_scrub_packet(skb);
>> +
>>   	if (tunnel->dev->type == ARPHRD_ETHER) {
>>   		skb->protocol = eth_type_trans(skb, tunnel->dev);
>>   		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>
> I can't see how this can be ok.
>
> If something in netfilter depends upon the state you are clearing out
> here, someone's packet filtering setup is going to break.
Just for the record, note that nf_reset() is already called in 
iptunnel_pull_header() and iptunnel_xmit().
Hence 4in4 (ipip and sit) and gre tunnels are already reseting netfilter state.
6in4 (sit) do it only in xmit path and Xin6 (ip6_tunnel) never.

We can also notice that nf_reset() was added by the commit 3d7b46cd20e3 
"ip_tunnel: push generic protocol handling to ip_tunnel module." (net-next only) 
in rx path.

The nf_reset() of xmit path of 4in4 (ipip) is here for years (at least 2.6.12).
For gre, it has been added by c54419321455 "GRE: Refactor GRE tunneling code." 
(v3.10-rc1).

It seems that the code is different depending of the type of the tunnel. If we 
omit skb_orphan() (and maybe another one?, to be done only when changing 
namespace), it can be good to have a common function to have the same behavior 
for each tunnel.

Maybe something like:
void skb_scrub_packet(bool netnschange)
{
	if (netnschange)
	        skb_orphan(skb);
         skb->tstamp.tv64 = 0;
         skb->pkt_type = PACKET_HOST;
         skb->skb_iif = 0;
         skb_dst_drop(skb);
         skb->mark = 0;
         secpath_reset(skb);
         nf_reset(skb);
         nf_reset_trace(skb);
}

What's your opinion?

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2013-08-15  8:01 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-10-23 17:49 switching network namespace midway rsa
2012-10-24 21:11 ` Eric W. Biederman
2012-10-24 21:21   ` Benjamin LaHaise
2012-10-25  1:37     ` Eric W. Biederman
2012-10-25 14:38       ` Benjamin LaHaise
2012-10-25 16:21         ` Stephen Hemminger
2012-10-28  5:43           ` Eric W. Biederman
2012-10-29 14:23             ` Stephen Hemminger
2012-10-30  0:21               ` Eric W. Biederman
2012-10-30  8:55                 ` James Chapman
2012-10-25 15:12     ` rsa
2012-10-25 15:29     ` rsa
2012-10-25 15:59       ` Benjamin LaHaise
2012-10-25 16:15         ` Eric W. Biederman
2012-11-02  2:25           ` Benjamin LaHaise
2012-11-02  6:18             ` Eric W. Biederman
2012-11-02 14:03               ` Benjamin LaHaise
2012-11-02 20:45                 ` Eric W. Biederman
2013-06-24 14:13                   ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
2013-06-24 14:13                     ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel
2013-06-24 18:13                       ` Ben Hutchings
2013-06-24 19:05                         ` Eric W. Biederman
2013-06-24 14:13                     ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel
2013-06-24 19:28                       ` Eric W. Biederman
2013-06-24 21:11                         ` Nicolas Dichtel
2013-06-24 22:42                           ` Eric W. Biederman
2013-06-25 14:10                             ` Nicolas Dichtel
2013-06-25 14:24                               ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
2013-06-25 14:24                                 ` [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
2013-06-25 14:24                                 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
2013-06-25 23:56                                   ` David Miller
2013-06-26  1:35                                     ` Eric W. Biederman
2013-06-26  5:48                                       ` David Miller
2013-06-26 10:03                                         ` Eric W. Biederman
2013-06-26 10:22                                           ` Eric Dumazet
2013-06-26 12:15                                             ` Nicolas Dichtel
2013-06-26 14:11                                               ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel
2013-06-26 14:11                                                 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel
2013-06-26 14:11                                                 ` [PATCH v3 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
2013-06-28  5:36                                                 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller
2013-07-03 15:00                                                   ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel
2013-07-03 15:00                                                     ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel
2013-07-03 15:00                                                     ` [PATCH net-next 2/3] ipip: add x-netns support Nicolas Dichtel
2013-07-03 15:00                                                     ` [PATCH net-next 3/3] ip6tnl: " Nicolas Dichtel
2013-07-04 21:56                                                     ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
2013-08-13 15:51                                                       ` [PATCH net-next v2 0/4] " Nicolas Dichtel
2013-08-13 15:51                                                         ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel
2013-08-13 15:51                                                         ` [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns Nicolas Dichtel
2013-08-13 15:51                                                         ` [PATCH net-next v2 3/4] ipip: add x-netns support Nicolas Dichtel
2013-08-13 15:51                                                         ` [PATCH net-next v2 4/4] ip6tnl: " Nicolas Dichtel
2013-08-15  8:01                                                         ` [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap David Miller
2013-06-26 13:49                                     ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).