* switching network namespace midway @ 2012-10-23 17:49 rsa 2012-10-24 21:11 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: rsa @ 2012-10-23 17:49 UTC (permalink / raw) To: netdev Assuming I have a tunnel interface where two route lookups are done -- one for innter packet and the other for outer -- do you see any issues in switching the network namespace prior to second route lookup (and restore to the original namespace after the second lookup is done)? If so, are there any other calls other than sk_change_net() needed? thank you rsa ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-23 17:49 switching network namespace midway rsa @ 2012-10-24 21:11 ` Eric W. Biederman 2012-10-24 21:21 ` Benjamin LaHaise 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-10-24 21:11 UTC (permalink / raw) To: rsa; +Cc: netdev rsa <ravi.mlists@gmail.com> writes: > Assuming I have a tunnel interface where two route lookups are done -- > one for innter > packet and the other for outer -- do you see any issues in switching > the network > namespace prior to second route lookup (and restore to the original namespace > after the second lookup is done)? > > If so, are there any other calls other than sk_change_net() needed? In general sk_change_net is a bad idea. Most likely what you want to do is simply memorize both struct net's that you care about and perform the routing lookup as appropriate. Certainly you don't want to be calling sk_change_net for every packet that goes through your tunnel. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-24 21:11 ` Eric W. Biederman @ 2012-10-24 21:21 ` Benjamin LaHaise 2012-10-25 1:37 ` Eric W. Biederman ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Benjamin LaHaise @ 2012-10-24 21:21 UTC (permalink / raw) To: Eric W. Biederman; +Cc: rsa, netdev On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote: > rsa <ravi.mlists@gmail.com> writes: > > > Assuming I have a tunnel interface where two route lookups are done -- > > one for innter > > packet and the other for outer -- do you see any issues in switching > > the network > > namespace prior to second route lookup (and restore to the original namespace > > after the second lookup is done)? > > > > If so, are there any other calls other than sk_change_net() needed? > > In general sk_change_net is a bad idea. > > Most likely what you want to do is simply memorize both struct net's > that you care about and perform the routing lookup as appropriate. > > Certainly you don't want to be calling sk_change_net for every packet > that goes through your tunnel. I've actually done this with L2TP. The packets coming into the system from the tunnel are received on one UDP socket in one "struct net", but the decapsulated packets are received on a "struct net_device" that is in a different "struct net". No special coding is required -- just move the tunnel's net_device into another namespace after creation and it works as expected. Using sk_change_net() would be full of races and is really not required for the vast majority of use cases. -ben ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-24 21:21 ` Benjamin LaHaise @ 2012-10-25 1:37 ` Eric W. Biederman 2012-10-25 14:38 ` Benjamin LaHaise 2012-10-25 15:12 ` rsa 2012-10-25 15:29 ` rsa 2 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-10-25 1:37 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: rsa, netdev Benjamin LaHaise <bcrl@kvack.org> writes: > On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote: >> rsa <ravi.mlists@gmail.com> writes: >> >> > Assuming I have a tunnel interface where two route lookups are done -- >> > one for innter >> > packet and the other for outer -- do you see any issues in switching >> > the network >> > namespace prior to second route lookup (and restore to the original namespace >> > after the second lookup is done)? >> > >> > If so, are there any other calls other than sk_change_net() needed? >> >> In general sk_change_net is a bad idea. >> >> Most likely what you want to do is simply memorize both struct net's >> that you care about and perform the routing lookup as appropriate. >> >> Certainly you don't want to be calling sk_change_net for every packet >> that goes through your tunnel. > > I've actually done this with L2TP. The packets coming into the system from > the tunnel are received on one UDP socket in one "struct net", but the > decapsulated packets are received on a "struct net_device" that is in a > different "struct net". No special coding is required -- just move the > tunnel's net_device into another namespace after creation and it works as > expected. Using sk_change_net() would be full of races and is really not > required for the vast majority of use cases. Yes. Although L2TP is not an example of code I would copy. Any other tunnel would be better. I haven't looked closely at L2TP but it keeps popping up as a poster child for small little network namespace bugs that I don't want to think about. Last I looked to use L2TP it required a magic userspace that I couldn't find and I haven't cared enough to write. Ben would you be interested in helping flush out the network namespace bugs out of L2TP? Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-25 1:37 ` Eric W. Biederman @ 2012-10-25 14:38 ` Benjamin LaHaise 2012-10-25 16:21 ` Stephen Hemminger 0 siblings, 1 reply; 52+ messages in thread From: Benjamin LaHaise @ 2012-10-25 14:38 UTC (permalink / raw) To: Eric W. Biederman; +Cc: rsa, netdev Hello Eric, On Wed, Oct 24, 2012 at 06:37:16PM -0700, Eric W. Biederman wrote: > Yes. Although L2TP is not an example of code I would copy. Any other > tunnel would be better. I haven't looked closely at L2TP but it keeps > popping up as a poster child for small little network namespace bugs > that I don't want to think about. Agreed. > Last I looked to use L2TP it required a magic userspace that I couldn't > find and I haven't cared enough to write. Ben would you be interested > in helping flush out the network namespace bugs out of L2TP? Sure, that I can do. To be entirely honest, I have not yet tried using network namespaces with the in kernel L2TP stack, but rather with the Babylon code. I have, however, put together changes to make the Babylon userland code work with the in kernel L2TP over the past couple of months. Since the network namespace support is already present in the userland code, it shouldn't be too hard to adapt. >From a quick read of the L2TP over UDP code paths, it looks like things should work, as the ingress and egress lookups use the transport socket's namespace. All the reference counting looks a bit heavy handed, though. I also wrote a couple of test programs for setting up L2TP sockets and devices which may be of use -- see http://www.kvack.org/~bcrl/pppol2tp/ . -ben -- "Thought is the essence of where you are now." ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-25 14:38 ` Benjamin LaHaise @ 2012-10-25 16:21 ` Stephen Hemminger 2012-10-28 5:43 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Stephen Hemminger @ 2012-10-25 16:21 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric W. Biederman, rsa, netdev I noticed that the L2TP sockets are not being moved to the correct name space. Something like this is probably needed. --- a/net/l2tp/l2tp_core.c 2012-10-25 09:11:15.691271882 -0700 +++ b/net/l2tp/l2tp_core.c 2012-10-25 09:18:58.746621418 -0700 @@ -1357,7 +1357,10 @@ static void l2tp_tunnel_free(struct l2tp * userspace. This is used for static tunnels where there is no * managing L2TP daemon. */ -static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg, struct socket **sockp) +static int l2tp_tunnel_sock_create(struct net *net, + u32 tunnel_id, u32 peer_tunnel_id, + struct l2tp_tunnel_cfg *cfg, + struct socket **sockp) { int err = -EINVAL; struct sockaddr_in udp_addr; @@ -1372,11 +1375,12 @@ static int l2tp_tunnel_sock_create(u32 t case L2TP_ENCAPTYPE_UDP: #if IS_ENABLED(CONFIG_IPV6) if (cfg->local_ip6 && cfg->peer_ip6) { - err = sock_create(AF_INET6, SOCK_DGRAM, 0, sockp); + err = sock_create_kern(AF_INET6, SOCK_DGRAM, 0, sockp); if (err < 0) goto out; sock = *sockp; + sk_change(sock->sk, net); memset(&udp6_addr, 0, sizeof(udp6_addr)); udp6_addr.sin6_family = AF_INET6; @@ -1400,11 +1404,12 @@ static int l2tp_tunnel_sock_create(u32 t } else #endif { - err = sock_create(AF_INET, SOCK_DGRAM, 0, sockp); + err = sock_create_kern(AF_INET, SOCK_DGRAM, 0, sockp); if (err < 0) goto out; sock = *sockp; + sk_change(sock->sk, net); memset(&udp_addr, 0, sizeof(udp_addr)); udp_addr.sin_family = AF_INET; @@ -1433,7 +1438,7 @@ static int l2tp_tunnel_sock_create(u32 t case L2TP_ENCAPTYPE_IP: #if IS_ENABLED(CONFIG_IPV6) if (cfg->local_ip6 && cfg->peer_ip6) { - err = sock_create(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP, + err = sock_create_kern(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP, sockp); if (err < 0) goto out; @@ -1462,12 +1467,13 @@ static int l2tp_tunnel_sock_create(u32 t } else #endif { - err = sock_create(AF_INET, SOCK_DGRAM, IPPROTO_L2TP, + err = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_L2TP, sockp); if (err < 0) goto out; sock = *sockp; + sk_change(sock->sk, net); memset(&ip_addr, 0, sizeof(ip_addr)); ip_addr.l2tp_family = AF_INET; @@ -1517,7 +1523,8 @@ int l2tp_tunnel_create(struct net *net, * kernel socket. */ if (fd < 0) { - err = l2tp_tunnel_sock_create(tunnel_id, peer_tunnel_id, cfg, &sock); + err = l2tp_tunnel_sock_create(net, tunnel_id, peer_tunnel_id, + cfg, &sock); if (err < 0) goto err; } else { ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-25 16:21 ` Stephen Hemminger @ 2012-10-28 5:43 ` Eric W. Biederman 2012-10-29 14:23 ` Stephen Hemminger 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-10-28 5:43 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Benjamin LaHaise, rsa, netdev Stephen Hemminger <shemminger@vyatta.com> writes: > I noticed that the L2TP sockets are not being moved to the correct name > space. > > Something like this is probably needed. This is almost right. There needs to be a line in l2tp_tunnel_create that verifies the network namespace of the socket derived from a file descriptor and the passed in network namespace match. For the l2tp_tunnel_sock_create case where we have a socket that is not exported to userspace using sk_change_net seems appropriate to avoid reference counting problems. And it may be worth moving that work into sk_create_kern. But we need a network namespace hook that will lookup all l2tp tunnel sockets when a network namespace is being destroyed and remove them. I think we can hit this bug with rmmod as well. Bleh. Eric > --- a/net/l2tp/l2tp_core.c 2012-10-25 09:11:15.691271882 -0700 > +++ b/net/l2tp/l2tp_core.c 2012-10-25 09:18:58.746621418 -0700 > @@ -1357,7 +1357,10 @@ static void l2tp_tunnel_free(struct l2tp > * userspace. This is used for static tunnels where there is no > * managing L2TP daemon. > */ > -static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg, struct socket **sockp) > +static int l2tp_tunnel_sock_create(struct net *net, > + u32 tunnel_id, u32 peer_tunnel_id, > + struct l2tp_tunnel_cfg *cfg, > + struct socket **sockp) > { > int err = -EINVAL; > struct sockaddr_in udp_addr; > @@ -1372,11 +1375,12 @@ static int l2tp_tunnel_sock_create(u32 t > case L2TP_ENCAPTYPE_UDP: > #if IS_ENABLED(CONFIG_IPV6) > if (cfg->local_ip6 && cfg->peer_ip6) { > - err = sock_create(AF_INET6, SOCK_DGRAM, 0, sockp); > + err = sock_create_kern(AF_INET6, SOCK_DGRAM, 0, sockp); > if (err < 0) > goto out; > > sock = *sockp; > + sk_change(sock->sk, net); > > memset(&udp6_addr, 0, sizeof(udp6_addr)); > udp6_addr.sin6_family = AF_INET6; > @@ -1400,11 +1404,12 @@ static int l2tp_tunnel_sock_create(u32 t > } else > #endif > { > - err = sock_create(AF_INET, SOCK_DGRAM, 0, sockp); > + err = sock_create_kern(AF_INET, SOCK_DGRAM, 0, sockp); > if (err < 0) > goto out; > > sock = *sockp; > + sk_change(sock->sk, net); > > memset(&udp_addr, 0, sizeof(udp_addr)); > udp_addr.sin_family = AF_INET; > @@ -1433,7 +1438,7 @@ static int l2tp_tunnel_sock_create(u32 t > case L2TP_ENCAPTYPE_IP: > #if IS_ENABLED(CONFIG_IPV6) > if (cfg->local_ip6 && cfg->peer_ip6) { > - err = sock_create(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP, > + err = sock_create_kern(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP, > sockp); > if (err < 0) > goto out; > @@ -1462,12 +1467,13 @@ static int l2tp_tunnel_sock_create(u32 t > } else > #endif > { > - err = sock_create(AF_INET, SOCK_DGRAM, IPPROTO_L2TP, > + err = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_L2TP, > sockp); > if (err < 0) > goto out; > > sock = *sockp; > + sk_change(sock->sk, net); > > memset(&ip_addr, 0, sizeof(ip_addr)); > ip_addr.l2tp_family = AF_INET; > @@ -1517,7 +1523,8 @@ int l2tp_tunnel_create(struct net *net, > * kernel socket. > */ > if (fd < 0) { > - err = l2tp_tunnel_sock_create(tunnel_id, peer_tunnel_id, cfg, &sock); > + err = l2tp_tunnel_sock_create(net, tunnel_id, peer_tunnel_id, > + cfg, &sock); > if (err < 0) > goto err; > } else { > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-28 5:43 ` Eric W. Biederman @ 2012-10-29 14:23 ` Stephen Hemminger 2012-10-30 0:21 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Stephen Hemminger @ 2012-10-29 14:23 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Benjamin LaHaise, rsa, netdev On Sat, 27 Oct 2012 22:43:13 -0700 ebiederm@xmission.com (Eric W. Biederman) wrote: > Stephen Hemminger <shemminger@vyatta.com> writes: > > > I noticed that the L2TP sockets are not being moved to the correct name > > space. > > > > Something like this is probably needed. > > This is almost right. > > There needs to be a line in l2tp_tunnel_create that verifies > the network namespace of the socket derived from a file descriptor > and the passed in network namespace match. > > For the l2tp_tunnel_sock_create case where we have a socket that is not > exported to userspace using sk_change_net seems appropriate to avoid > reference counting problems. And it may be worth moving that work into > sk_create_kern. But we need a network namespace hook that will lookup > all l2tp tunnel sockets when a network namespace is being destroyed and > remove them. I think we can hit this bug with rmmod as well. Since I don't use netns or L2TP for real, someone else needs to take up the crusade here. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-29 14:23 ` Stephen Hemminger @ 2012-10-30 0:21 ` Eric W. Biederman 2012-10-30 8:55 ` James Chapman 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-10-30 0:21 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Benjamin LaHaise, rsa, netdev, James Chapman Stephen Hemminger <shemminger@vyatta.com> writes: > On Sat, 27 Oct 2012 22:43:13 -0700 > ebiederm@xmission.com (Eric W. Biederman) wrote: > >> Stephen Hemminger <shemminger@vyatta.com> writes: >> >> > I noticed that the L2TP sockets are not being moved to the correct name >> > space. >> > >> > Something like this is probably needed. >> >> This is almost right. >> >> There needs to be a line in l2tp_tunnel_create that verifies >> the network namespace of the socket derived from a file descriptor >> and the passed in network namespace match. >> >> For the l2tp_tunnel_sock_create case where we have a socket that is not >> exported to userspace using sk_change_net seems appropriate to avoid >> reference counting problems. And it may be worth moving that work into >> sk_create_kern. But we need a network namespace hook that will lookup >> all l2tp tunnel sockets when a network namespace is being destroyed and >> remove them. I think we can hit this bug with rmmod as well. > > Since I don't use netns or L2TP for real, someone else needs to take > up the crusade here. Let's see if James Chapman is interested. I don't use L2TP for real either. James are you at all interested in the network namespace bugs that have been found in the l2tp code? Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-30 0:21 ` Eric W. Biederman @ 2012-10-30 8:55 ` James Chapman 0 siblings, 0 replies; 52+ messages in thread From: James Chapman @ 2012-10-30 8:55 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Stephen Hemminger, Benjamin LaHaise, rsa, netdev On 30/10/12 00:21, Eric W. Biederman wrote: > Stephen Hemminger <shemminger@vyatta.com> writes: > >> On Sat, 27 Oct 2012 22:43:13 -0700 >> ebiederm@xmission.com (Eric W. Biederman) wrote: >> >>> Stephen Hemminger <shemminger@vyatta.com> writes: >>> >>>> I noticed that the L2TP sockets are not being moved to the correct name >>>> space. >>>> >>>> Something like this is probably needed. >>> >>> This is almost right. >>> >>> There needs to be a line in l2tp_tunnel_create that verifies >>> the network namespace of the socket derived from a file descriptor >>> and the passed in network namespace match. >>> >>> For the l2tp_tunnel_sock_create case where we have a socket that is not >>> exported to userspace using sk_change_net seems appropriate to avoid >>> reference counting problems. And it may be worth moving that work into >>> sk_create_kern. But we need a network namespace hook that will lookup >>> all l2tp tunnel sockets when a network namespace is being destroyed and >>> remove them. I think we can hit this bug with rmmod as well. >> >> Since I don't use netns or L2TP for real, someone else needs to take >> up the crusade here. > > Let's see if James Chapman is interested. I don't use L2TP for real either. > > James are you at all interested in the network namespace bugs that have > been found in the l2tp code? Very much so, Eric. Thanks for keeping me in the loop. Unfortunately, I am busy on other things at the moment. It's in my queue. I'll get to it as soon as I can. > > Eric -- James Chapman Katalix Systems Ltd http://www.katalix.com Catalysts for your Embedded Linux software development ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-24 21:21 ` Benjamin LaHaise 2012-10-25 1:37 ` Eric W. Biederman @ 2012-10-25 15:12 ` rsa 2012-10-25 15:29 ` rsa 2 siblings, 0 replies; 52+ messages in thread From: rsa @ 2012-10-25 15:12 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric W. Biederman, netdev Thanks Eric. That makes sense. I can remember both network namespaces and switch to the other prior to route lookup if necessary. Any gotchas on receiving tunneled packed on the tunnel interface? Looking at the code there is no change necessary. Packet arrives on tunnel interface which is decapsulated and given to net_if_rx() for inner packet processing. Am I mssing something? thank you rsa On Wed, Oct 24, 2012 at 5:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote: >> rsa <ravi.mlists@gmail.com> writes: >> >> > Assuming I have a tunnel interface where two route lookups are done -- >> > one for innter >> > packet and the other for outer -- do you see any issues in switching >> > the network >> > namespace prior to second route lookup (and restore to the original namespace >> > after the second lookup is done)? >> > >> > If so, are there any other calls other than sk_change_net() needed? >> >> In general sk_change_net is a bad idea. >> >> Most likely what you want to do is simply memorize both struct net's >> that you care about and perform the routing lookup as appropriate. >> >> Certainly you don't want to be calling sk_change_net for every packet >> that goes through your tunnel. > > I've actually done this with L2TP. The packets coming into the system from > the tunnel are received on one UDP socket in one "struct net", but the > decapsulated packets are received on a "struct net_device" that is in a > different "struct net". No special coding is required -- just move the > tunnel's net_device into another namespace after creation and it works as > expected. Using sk_change_net() would be full of races and is really not > required for the vast majority of use cases. > > -ben ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-24 21:21 ` Benjamin LaHaise 2012-10-25 1:37 ` Eric W. Biederman 2012-10-25 15:12 ` rsa @ 2012-10-25 15:29 ` rsa 2012-10-25 15:59 ` Benjamin LaHaise 2 siblings, 1 reply; 52+ messages in thread From: rsa @ 2012-10-25 15:29 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric W. Biederman, netdev Hi Ben For L2TP that should work very well. But, with other tunnel types like GRE etc. route lookups have to be done first in the inner and then outer. It makes it easier to keep keep the tunnel in the inner namespace in such cases. right? thank you rsa On Wed, Oct 24, 2012 at 5:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > On Wed, Oct 24, 2012 at 02:11:14PM -0700, Eric W. Biederman wrote: >> rsa <ravi.mlists@gmail.com> writes: >> >> > Assuming I have a tunnel interface where two route lookups are done -- >> > one for innter >> > packet and the other for outer -- do you see any issues in switching >> > the network >> > namespace prior to second route lookup (and restore to the original namespace >> > after the second lookup is done)? >> > >> > If so, are there any other calls other than sk_change_net() needed? >> >> In general sk_change_net is a bad idea. >> >> Most likely what you want to do is simply memorize both struct net's >> that you care about and perform the routing lookup as appropriate. >> >> Certainly you don't want to be calling sk_change_net for every packet >> that goes through your tunnel. > > I've actually done this with L2TP. The packets coming into the system from > the tunnel are received on one UDP socket in one "struct net", but the > decapsulated packets are received on a "struct net_device" that is in a > different "struct net". No special coding is required -- just move the > tunnel's net_device into another namespace after creation and it works as > expected. Using sk_change_net() would be full of races and is really not > required for the vast majority of use cases. > > -ben ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-25 15:29 ` rsa @ 2012-10-25 15:59 ` Benjamin LaHaise 2012-10-25 16:15 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Benjamin LaHaise @ 2012-10-25 15:59 UTC (permalink / raw) To: rsa; +Cc: Eric W. Biederman, netdev On Thu, Oct 25, 2012 at 11:29:34AM -0400, rsa wrote: > Hi Ben > For L2TP that should work very well. But, with other tunnel types like GRE etc. > route lookups have to be done first in the inner and then outer. It > makes it easier > to keep keep the tunnel in the inner namespace in such cases. right? I've read IPv4 gre code, and it appears to do the right thing on rx, but it does *not* appear to handle namespaces correctly on transmit. In general, I would expect pretty much all code to get namespace handling correct for the rx case. I'll have a closer look at fixing this tomorrow if nobody else beats me to it. -ben -- "Thought is the essence of where you are now." ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-25 15:59 ` Benjamin LaHaise @ 2012-10-25 16:15 ` Eric W. Biederman 2012-11-02 2:25 ` Benjamin LaHaise 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-10-25 16:15 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: rsa, netdev Benjamin LaHaise <bcrl@kvack.org> writes: > On Thu, Oct 25, 2012 at 11:29:34AM -0400, rsa wrote: >> Hi Ben >> For L2TP that should work very well. But, with other tunnel types like GRE etc. >> route lookups have to be done first in the inner and then outer. It >> makes it easier >> to keep keep the tunnel in the inner namespace in such cases. right? > > I've read IPv4 gre code, and it appears to do the right thing on rx, but it > does *not* appear to handle namespaces correctly on transmit. In general, > I would expect pretty much all code to get namespace handling correct for > the rx case. I'll have a closer look at fixing this tomorrow if nobody else > beats me to it. It will be interesting to see what you come up with. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-10-25 16:15 ` Eric W. Biederman @ 2012-11-02 2:25 ` Benjamin LaHaise 2012-11-02 6:18 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Benjamin LaHaise @ 2012-11-02 2:25 UTC (permalink / raw) To: Eric W. Biederman, rsa; +Cc: netdev On Thu, Oct 25, 2012 at 09:15:34AM -0700, Eric W. Biederman wrote: > > I've read IPv4 gre code, and it appears to do the right thing on rx, but it > > does *not* appear to handle namespaces correctly on transmit. In general, > > I would expect pretty much all code to get namespace handling correct for > > the rx case. I'll have a closer look at fixing this tomorrow if nobody else > > beats me to it. > > It will be interesting to see what you come up with. Well, I finally had some time to work on the ip_gre module a bit today, and here's what I came up with. The basic idea is to store the network namespace in the ip_tunnel structure at creation time for use in sending and receiving packets, allowing the gre network device to be safely moved into another network namespace. This works for me in moving a gre tunnel into an lxc container, and survives module unload and namespace destruction. I'll try to spend a bit more time adding similar support to the other ip_tunnel devices over the next few days. Comments/thoughts? -ben -- "Thought is the essence of where you are now." diff --git a/include/net/ipip.h b/include/net/ipip.h index ddc077c..9cfba92 100644 --- a/include/net/ipip.h +++ b/include/net/ipip.h @@ -19,6 +19,7 @@ struct ip_tunnel_6rd_parm { struct ip_tunnel { struct ip_tunnel __rcu *next; struct net_device *dev; + struct net *net; /* Namespace for packet i/o */ int err_count; /* Number of arrived ICMP errors */ unsigned long err_time; /* Time when the last ICMP error arrived */ diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 7240f8e..705dc66 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -461,6 +461,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net, dev_net_set(dev, net); nt = netdev_priv(dev); + nt->net = net; nt->parms = *parms; dev->rtnl_link_ops = &ipgre_link_ops; @@ -484,8 +485,10 @@ failed_free: static void ipgre_tunnel_uninit(struct net_device *dev) { - struct net *net = dev_net(dev); - struct ipgre_net *ign = net_generic(net, ipgre_net_id); + struct ip_tunnel *tunnel = netdev_priv(dev); + struct ipgre_net *ign; + + ign = net_generic(tunnel->net, ipgre_net_id); ipgre_tunnel_unlink(ign, netdev_priv(dev)); dev_put(dev); @@ -837,7 +840,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev tos = ipv6_get_dsfield((const struct ipv6hdr *)old_iph); } - rt = ip_route_output_gre(dev_net(dev), &fl4, dst, tiph->saddr, + rt = ip_route_output_gre(tunnel->net, &fl4, dst, tiph->saddr, tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link); if (IS_ERR(rt)) { @@ -1010,7 +1013,7 @@ static int ipgre_tunnel_bind_dev(struct net_device *dev) struct flowi4 fl4; struct rtable *rt; - rt = ip_route_output_gre(dev_net(dev), &fl4, + rt = ip_route_output_gre(tunnel->net, &fl4, iph->daddr, iph->saddr, tunnel->parms.o_key, RT_TOS(iph->tos), @@ -1341,7 +1344,6 @@ static void ipgre_tunnel_setup(struct net_device *dev) dev->flags = IFF_NOARP; dev->iflink = 0; dev->addr_len = 4; - dev->features |= NETIF_F_NETNS_LOCAL; dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->features |= GRE_FEATURES; @@ -1432,6 +1434,7 @@ static void ipgre_destroy_tunnels(struct ipgre_net *ign, struct list_head *head) static int __net_init ipgre_init_net(struct net *net) { struct ipgre_net *ign = net_generic(net, ipgre_net_id); + struct ip_tunnel *tunnel; int err; ign->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "gre0", @@ -1445,6 +1448,9 @@ static int __net_init ipgre_init_net(struct net *net) ipgre_fb_tunnel_init(ign->fb_tunnel_dev); ign->fb_tunnel_dev->rtnl_link_ops = &ipgre_link_ops; + tunnel = netdev_priv(ign->fb_tunnel_dev); + tunnel->net = net; + if ((err = register_netdev(ign->fb_tunnel_dev))) goto err_reg_dev; ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-11-02 2:25 ` Benjamin LaHaise @ 2012-11-02 6:18 ` Eric W. Biederman 2012-11-02 14:03 ` Benjamin LaHaise 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-11-02 6:18 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: rsa, netdev Benjamin LaHaise <bcrl@kvack.org> writes: > On Thu, Oct 25, 2012 at 09:15:34AM -0700, Eric W. Biederman wrote: >> > I've read IPv4 gre code, and it appears to do the right thing on rx, but it >> > does *not* appear to handle namespaces correctly on transmit. In general, >> > I would expect pretty much all code to get namespace handling correct for >> > the rx case. I'll have a closer look at fixing this tomorrow if nobody else >> > beats me to it. >> >> It will be interesting to see what you come up with. > > Well, I finally had some time to work on the ip_gre module a bit today, > and here's what I came up with. The basic idea is to store the network > namespace in the ip_tunnel structure at creation time for use in sending > and receiving packets, allowing the gre network device to be safely moved > into another network namespace. This works for me in moving a gre tunnel > into an lxc container, and survives module unload and namespace > destruction. I'll try to spend a bit more time adding similar support to > the other ip_tunnel devices over the next few days. Comments/thoughts? > > -ben You need a per network namespace exit function to delete the tunnel when the xmit direction goes away. Otherwise we have a very nasty race if the original network namespace exits. NETNS_LOCAL may make sense on the reference device that is used to support ioctls for creating devices. ipgre_open ? It looks like it needs to be handled. Probably that ip_route_output_gre needs to be moved. ipv6? Eric > -- > "Thought is the essence of where you are now." > > > diff --git a/include/net/ipip.h b/include/net/ipip.h > index ddc077c..9cfba92 100644 > --- a/include/net/ipip.h > +++ b/include/net/ipip.h > @@ -19,6 +19,7 @@ struct ip_tunnel_6rd_parm { > struct ip_tunnel { > struct ip_tunnel __rcu *next; > struct net_device *dev; > + struct net *net; /* Namespace for packet i/o */ > > int err_count; /* Number of arrived ICMP errors */ > unsigned long err_time; /* Time when the last ICMP error arrived */ > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c > index 7240f8e..705dc66 100644 > --- a/net/ipv4/ip_gre.c > +++ b/net/ipv4/ip_gre.c > @@ -461,6 +461,7 @@ static struct ip_tunnel *ipgre_tunnel_locate(struct net *net, > dev_net_set(dev, net); > > nt = netdev_priv(dev); > + nt->net = net; > nt->parms = *parms; > dev->rtnl_link_ops = &ipgre_link_ops; > > @@ -484,8 +485,10 @@ failed_free: > > static void ipgre_tunnel_uninit(struct net_device *dev) > { > - struct net *net = dev_net(dev); > - struct ipgre_net *ign = net_generic(net, ipgre_net_id); > + struct ip_tunnel *tunnel = netdev_priv(dev); > + struct ipgre_net *ign; > + > + ign = net_generic(tunnel->net, ipgre_net_id); > > ipgre_tunnel_unlink(ign, netdev_priv(dev)); > dev_put(dev); > @@ -837,7 +840,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev > tos = ipv6_get_dsfield((const struct ipv6hdr *)old_iph); > } > > - rt = ip_route_output_gre(dev_net(dev), &fl4, dst, tiph->saddr, > + rt = ip_route_output_gre(tunnel->net, &fl4, dst, tiph->saddr, > tunnel->parms.o_key, RT_TOS(tos), > tunnel->parms.link); > if (IS_ERR(rt)) { > @@ -1010,7 +1013,7 @@ static int ipgre_tunnel_bind_dev(struct net_device *dev) > struct flowi4 fl4; > struct rtable *rt; > > - rt = ip_route_output_gre(dev_net(dev), &fl4, > + rt = ip_route_output_gre(tunnel->net, &fl4, > iph->daddr, iph->saddr, > tunnel->parms.o_key, > RT_TOS(iph->tos), > @@ -1341,7 +1344,6 @@ static void ipgre_tunnel_setup(struct net_device *dev) > dev->flags = IFF_NOARP; > dev->iflink = 0; > dev->addr_len = 4; > - dev->features |= NETIF_F_NETNS_LOCAL; > dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > > dev->features |= GRE_FEATURES; > @@ -1432,6 +1434,7 @@ static void ipgre_destroy_tunnels(struct ipgre_net *ign, struct list_head *head) > static int __net_init ipgre_init_net(struct net *net) > { > struct ipgre_net *ign = net_generic(net, ipgre_net_id); > + struct ip_tunnel *tunnel; > int err; > > ign->fb_tunnel_dev = alloc_netdev(sizeof(struct ip_tunnel), "gre0", > @@ -1445,6 +1448,9 @@ static int __net_init ipgre_init_net(struct net *net) > ipgre_fb_tunnel_init(ign->fb_tunnel_dev); > ign->fb_tunnel_dev->rtnl_link_ops = &ipgre_link_ops; > > + tunnel = netdev_priv(ign->fb_tunnel_dev); > + tunnel->net = net; > + > if ((err = register_netdev(ign->fb_tunnel_dev))) > goto err_reg_dev; > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-11-02 6:18 ` Eric W. Biederman @ 2012-11-02 14:03 ` Benjamin LaHaise 2012-11-02 20:45 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Benjamin LaHaise @ 2012-11-02 14:03 UTC (permalink / raw) To: Eric W. Biederman; +Cc: rsa, netdev On Thu, Nov 01, 2012 at 11:18:58PM -0700, Eric W. Biederman wrote: > You need a per network namespace exit function to delete the tunnel when > the xmit direction goes away. Otherwise we have a very nasty race if > the original network namespace exits. That already exists as ipgre_exit_net(). Since the ip_tunnel structure remains hashed in the network namespace that creation occurred in, this case should be covered. > NETNS_LOCAL may make sense on the reference device that is used to > support ioctls for creating devices. *nod* That makes sense. > ipgre_open ? It looks like it needs to be handled. Probably that > ip_route_output_gre needs to be moved. Good catch. Will respin with that changed. > ipv6? That's next on the list. There are also issues with ipip, ipmr and ipvti, as well as their ipv6 versions. -ben -- "Thought is the essence of where you are now." ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: switching network namespace midway 2012-11-02 14:03 ` Benjamin LaHaise @ 2012-11-02 20:45 ` Eric W. Biederman 2013-06-24 14:13 ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2012-11-02 20:45 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: rsa, netdev Benjamin LaHaise <bcrl@kvack.org> writes: > On Thu, Nov 01, 2012 at 11:18:58PM -0700, Eric W. Biederman wrote: >> You need a per network namespace exit function to delete the tunnel when >> the xmit direction goes away. Otherwise we have a very nasty race if >> the original network namespace exits. > > That already exists as ipgre_exit_net(). Since the ip_tunnel structure > remains hashed in the network namespace that creation occurred in, this > case should be covered. *blink* I had looked for that, but I definitely missed that one. The consequence of the design where we can run ioctls on network devices we can't even see but whose ipaddrs are in our network namespace is interesting. Correct but interesting. >> NETNS_LOCAL may make sense on the reference device that is used to >> support ioctls for creating devices. > > *nod* That makes sense. After a second look. fb_tunnel_dev is very special in the code so allowing fb_tunnel_dev to change network namespace is almost certain to create problems, so it should get a NETNS_LOCAL. >> ipgre_open ? It looks like it needs to be handled. Probably that >> ip_route_output_gre needs to be moved. > > Good catch. Will respin with that changed. I'm not seeing anything else but I will look again after you respin. >> ipv6? > > That's next on the list. There are also issues with ipip, ipmr and > ipvti, as well as their ipv6 versions. I don't see sense in ip multicast routing tunnels (ipmr) changing network namespaces as they are essentially aliases for other network devices, and aren't really proper tunnels. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap 2012-11-02 20:45 ` Eric W. Biederman @ 2013-06-24 14:13 ` Nicolas Dichtel 2013-06-24 14:13 ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel 2013-06-24 14:13 ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel 0 siblings, 2 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-24 14:13 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists This patch is a follow up of the thread "switching network namespace midway": http://marc.info/?t=135101459500004&r=1&w=2 The goal of this serie is to add x-netns support for the module sit, ie the encapsulation addresses and the network device are not owned by the same namespace. Example to configure a tunnel: modprobe sit ip netns add netns1 ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249 ip l s sit1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s sit1 up ip netns exec netns1 ip a a dev sit1 192.168.2.123 remote 192.168.2.121 ip netns exec netns1 ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121 I sent this serie as an RFC to get feedback from people. Once this serie is approved, I will add the same feature for the module ipip and ip6_tunnel. include/linux/netdevice.h | 1 + include/net/ip_tunnels.h | 1 + net/core/dev.c | 34 ++++++++++++++++++++++---------- net/ipv4/ip_tunnel.c | 7 ++++++- net/ipv6/sit.c | 49 ++++++++++++++++++++++++----------------------- 5 files changed, 57 insertions(+), 35 deletions(-) Comments are welcome. Regards, Nicolas ^ permalink raw reply [flat|nested] 52+ messages in thread
* [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() 2013-06-24 14:13 ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel @ 2013-06-24 14:13 ` Nicolas Dichtel 2013-06-24 18:13 ` Ben Hutchings 2013-06-24 14:13 ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel 1 sibling, 1 reply; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-24 14:13 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, Nicolas Dichtel The goal of this new function is to perform all needed cleanup before sending an skb into another netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/linux/netdevice.h | 1 + net/core/dev.c | 34 ++++++++++++++++++++++++---------- 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 09b4188..9b72d87 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2321,6 +2321,7 @@ extern int dev_hard_start_xmit(struct sk_buff *skb, struct netdev_queue *txq); extern int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); +extern void dev_cleanup_skb(struct sk_buff *skb); extern int netdev_budget; diff --git a/net/core/dev.c b/net/core/dev.c index 722f633..d30bc22 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1625,6 +1625,29 @@ static inline bool is_skb_forwardable(struct net_device *dev, } /** + * dev_cleanup_skb - cleanup an skb before sending it to another netns + * + * @skb: buffer to clean + * + * dev_cleanup_skb can be used to clean an skb before injecting it in + * another namespace. We have to clear all information in the skb that + * could impact namespace isolation. + */ +void dev_cleanup_skb(struct sk_buff *skb) +{ + skb_orphan(skb); + skb->tstamp.tv64 = 0; + skb->pkt_type = PACKET_HOST; + skb->skb_iif = 0; + skb_dst_drop(skb); + skb->mark = 0; + secpath_reset(skb); + nf_reset(skb); + nf_reset_trace(skb); +} +EXPORT_SYMBOL_GPL(dev_cleanup_skb); + +/** * dev_forward_skb - loopback an skb to another netif * * @dev: destination network device @@ -1652,22 +1675,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } } - skb_orphan(skb); - if (unlikely(!is_skb_forwardable(dev, skb))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; } - skb->skb_iif = 0; - skb_dst_drop(skb); - skb->tstamp.tv64 = 0; - skb->pkt_type = PACKET_HOST; + dev_cleanup_skb(skb); skb->protocol = eth_type_trans(skb, dev); - skb->mark = 0; - secpath_reset(skb); - nf_reset(skb); - nf_reset_trace(skb); return netif_rx(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() 2013-06-24 14:13 ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel @ 2013-06-24 18:13 ` Ben Hutchings 2013-06-24 19:05 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Ben Hutchings @ 2013-06-24 18:13 UTC (permalink / raw) To: Nicolas Dichtel; +Cc: netdev, davem, ebiederm, bcrl, ravi.mlists On Mon, 2013-06-24 at 16:13 +0200, Nicolas Dichtel wrote: > The goal of this new function is to perform all needed cleanup before sending > an skb into another netns. [...] To 'cleanup' an object often means to destroy or free it. So perhaps you could find an alternate verb that doesn't have that association, e.g. 'sanitise' or 'unmark'. Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() 2013-06-24 18:13 ` Ben Hutchings @ 2013-06-24 19:05 ` Eric W. Biederman 0 siblings, 0 replies; 52+ messages in thread From: Eric W. Biederman @ 2013-06-24 19:05 UTC (permalink / raw) To: Ben Hutchings; +Cc: Nicolas Dichtel, netdev, davem, bcrl, ravi.mlists Ben Hutchings <bhutchings@solarflare.com> writes: > On Mon, 2013-06-24 at 16:13 +0200, Nicolas Dichtel wrote: >> The goal of this new function is to perform all needed cleanup before sending >> an skb into another netns. > [...] > > To 'cleanup' an object often means to destroy or free it. So perhaps > you could find an alternate verb that doesn't have that association, > e.g. 'sanitise' or 'unmark'. skb_scrub_packet sounds good to me. It has the right connotation and it is shorter. :) Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* [RFC PATCH net-next 2/2] sit: add support of x-netns 2013-06-24 14:13 ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-24 14:13 ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel @ 2013-06-24 14:13 ` Nicolas Dichtel 2013-06-24 19:28 ` Eric W. Biederman 1 sibling, 1 reply; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-24 14:13 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip_tunnels.h | 1 + net/ipv4/ip_tunnel.c | 7 ++++++- net/ipv6/sit.c | 49 ++++++++++++++++++++++++------------------------ 3 files changed, 32 insertions(+), 25 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index b0d9824..781b3cf 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -42,6 +42,7 @@ struct ip_tunnel { struct ip_tunnel __rcu *next; struct hlist_node hash_node; struct net_device *dev; + struct net *net; /* netns for packet i/o */ int err_count; /* Number of arrived ICMP errors */ unsigned long err_time; /* Time when the last ICMP error diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index bd227e5..b6a7704 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net, tunnel = netdev_priv(dev); tunnel->parms = *parms; + tunnel->net = net; err = register_netdevice(dev); if (err) @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, tstats->rx_bytes += skb->len; u64_stats_update_end(&tstats->syncp); + if (tunnel->net != dev_net(tunnel->dev)) + dev_cleanup_skb(skb); + if (tunnel->dev->type == ARPHRD_ETHER) { skb->protocol = eth_type_trans(skb, tunnel->dev); skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph); } - rt = ip_route_output_tunnel(dev_net(dev), &fl4, + rt = ip_route_output_tunnel(tunnel->net, &fl4, tunnel->parms.iph.protocol, dst, tnl_params->saddr, tunnel->parms.o_key, @@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[], if (ip_tunnel_find(itn, p, dev->type)) return -EEXIST; + nt->net = net; nt->parms = *p; err = register_netdevice(dev); if (err) diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index f639866..4e03904 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t) static void ipip6_tunnel_uninit(struct net_device *dev) { - struct net *net = dev_net(dev); - struct sit_net *sitn = net_generic(net, sit_net_id); + struct ip_tunnel *tunnel = netdev_priv(dev); + struct sit_net *sitn = net_generic(tunnel->net, sit_net_id); if (dev == sitn->fb_tunnel_dev) { RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL); } else { - ipip6_tunnel_unlink(sitn, netdev_priv(dev)); - ipip6_tunnel_del_prl(netdev_priv(dev), NULL); + ipip6_tunnel_unlink(sitn, tunnel); + ipip6_tunnel_del_prl(tunnel, NULL); } dev_put(dev); } @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb) tstats->rx_packets++; tstats->rx_bytes += skb->len; + if (tunnel->net != dev_net(tunnel->dev)) + dev_cleanup_skb(skb); netif_rx(skb); return 0; @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, goto tx_error; } - rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, + rt = ip_route_output_ports(tunnel->net, &fl4, NULL, dst, tiph->saddr, 0, 0, IPPROTO_IPV6, RT_TOS(tos), @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, tunnel->err_count = 0; } + if (tunnel->net != dev_net(dev)) + dev_cleanup_skb(skb); + /* * Okay, now see if we can stuff it in the buffer as-is. */ @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) iph = &tunnel->parms.iph; if (iph->daddr) { - struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, + struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4, + NULL, iph->daddr, iph->saddr, 0, 0, IPPROTO_IPV6, @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) } if (!tdev && tunnel->parms.link) - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); if (tdev) { dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr); @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p) { - struct net *net = dev_net(t->dev); + struct net *net = t->net; struct sit_net *sitn = net_generic(net, sit_net_id); ipip6_tunnel_unlink(sitn, t); @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev) dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->iflink = 0; dev->addr_len = 4; - dev->features |= NETIF_F_NETNS_LOCAL; dev->features |= NETIF_F_LLTX; } @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev) struct ip_tunnel *tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4); memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4); @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev) struct sit_net *sitn = net_generic(net, sit_net_id); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); iph->version = 4; @@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __read_mostly = { .priority = 2, }; -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) +static void __net_exit sit_destroy_tunnels(struct net *net, + struct list_head *head) { - int prio; + struct net_device *dev, *aux; - for (prio = 1; prio < 4; prio++) { - int h; - for (h = 0; h < HASH_SIZE; h++) { - struct ip_tunnel *t; - - t = rtnl_dereference(sitn->tunnels[prio][h]); - while (t != NULL) { - unregister_netdevice_queue(t->dev, head); - t = rtnl_dereference(t->next); - } - } - } + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops && + !strcmp(dev->rtnl_link_ops->kind, "sit")) + unregister_netdevice_queue(dev, head); } static int __net_init sit_init_net(struct net *net) @@ -1598,6 +1598,7 @@ static int __net_init sit_init_net(struct net *net) goto err_alloc_dev; } dev_net_set(sitn->fb_tunnel_dev, net); + sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev); if (err) @@ -1627,7 +1628,7 @@ static void __net_exit sit_exit_net(struct net *net) LIST_HEAD(list); rtnl_lock(); - sit_destroy_tunnels(sitn, &list); + sit_destroy_tunnels(net, &list); unregister_netdevice_queue(sitn->fb_tunnel_dev, &list); unregister_netdevice_many(&list); rtnl_unlock(); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns 2013-06-24 14:13 ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel @ 2013-06-24 19:28 ` Eric W. Biederman 2013-06-24 21:11 ` Nicolas Dichtel 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2013-06-24 19:28 UTC (permalink / raw) To: Nicolas Dichtel; +Cc: netdev, davem, bcrl, ravi.mlists Nicolas Dichtel <nicolas.dichtel@6wind.com> writes: > This patch allows to switch the netns when packet is encapsulated or > decapsulated. In other word, the encapsulated packet is received in a netns, > where the lookup is done to find the tunnel. Once the tunnel is found, the > packet is decapsulated and injecting into the corresponding interface which > stands to another netns. > > When one of the two netns is removed, the tunnel is destroyed. I don't see any fundamental problems with this code. There are bugs with the cleanup noted below. The primary sit interface is marked as NETNS_LOCAL which is good. A comment might be nice explaining the reasonsing but for code archeologists. Conditionally calling dev_cleanup_skb bugs me. The extra conditional looks like a maintenance hazard. Unless I have missed some subtle detail either we don't need the cleanup at all or actually it is a bug that we aren't scrubbing our packets as they progress through tunnels even in the same network namespace. Can we just make that function the skb scrubbing needed for packets to traverse a tunnel? My concern going into this was that we would get code that would break because it would not be tested enough. If we can remove the conditional from dev_cleanup_skb we won't have any code that is conditionally run and the logic looks simple enough not to bitrot in routine maintenance. Eric > Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> > --- > include/net/ip_tunnels.h | 1 + > net/ipv4/ip_tunnel.c | 7 ++++++- > net/ipv6/sit.c | 49 ++++++++++++++++++++++++------------------------ > 3 files changed, 32 insertions(+), 25 deletions(-) > > diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h > index b0d9824..781b3cf 100644 > --- a/include/net/ip_tunnels.h > +++ b/include/net/ip_tunnels.h > @@ -42,6 +42,7 @@ struct ip_tunnel { > struct ip_tunnel __rcu *next; > struct hlist_node hash_node; > struct net_device *dev; > + struct net *net; /* netns for packet i/o */ > > int err_count; /* Number of arrived ICMP errors */ > unsigned long err_time; /* Time when the last ICMP error > diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c > index bd227e5..b6a7704 100644 > --- a/net/ipv4/ip_tunnel.c > +++ b/net/ipv4/ip_tunnel.c > @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net, > > tunnel = netdev_priv(dev); > tunnel->parms = *parms; > + tunnel->net = net; > > err = register_netdevice(dev); > if (err) > @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, > tstats->rx_bytes += skb->len; > u64_stats_update_end(&tstats->syncp); > > + if (tunnel->net != dev_net(tunnel->dev)) > + dev_cleanup_skb(skb); > + > if (tunnel->dev->type == ARPHRD_ETHER) { > skb->protocol = eth_type_trans(skb, tunnel->dev); > skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, > tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph); > } > > - rt = ip_route_output_tunnel(dev_net(dev), &fl4, > + rt = ip_route_output_tunnel(tunnel->net, &fl4, > tunnel->parms.iph.protocol, > dst, tnl_params->saddr, > tunnel->parms.o_key, > @@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[], > if (ip_tunnel_find(itn, p, dev->type)) > return -EEXIST; > > + nt->net = net; > nt->parms = *p; > err = register_netdevice(dev); > if (err) > diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c > index f639866..4e03904 100644 > --- a/net/ipv6/sit.c > +++ b/net/ipv6/sit.c > @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t) > > static void ipip6_tunnel_uninit(struct net_device *dev) > { > - struct net *net = dev_net(dev); > - struct sit_net *sitn = net_generic(net, sit_net_id); > + struct ip_tunnel *tunnel = netdev_priv(dev); > + struct sit_net *sitn = net_generic(tunnel->net, sit_net_id); > > if (dev == sitn->fb_tunnel_dev) { > RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL); > } else { > - ipip6_tunnel_unlink(sitn, netdev_priv(dev)); > - ipip6_tunnel_del_prl(netdev_priv(dev), NULL); > + ipip6_tunnel_unlink(sitn, tunnel); > + ipip6_tunnel_del_prl(tunnel, NULL); > } > dev_put(dev); > } > @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb) > tstats->rx_packets++; > tstats->rx_bytes += skb->len; > > + if (tunnel->net != dev_net(tunnel->dev)) > + dev_cleanup_skb(skb); > netif_rx(skb); > > return 0; > @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, > goto tx_error; > } > > - rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, > + rt = ip_route_output_ports(tunnel->net, &fl4, NULL, > dst, tiph->saddr, > 0, 0, > IPPROTO_IPV6, RT_TOS(tos), > @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, > tunnel->err_count = 0; > } > > + if (tunnel->net != dev_net(dev)) > + dev_cleanup_skb(skb); > + > /* > * Okay, now see if we can stuff it in the buffer as-is. > */ > @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) > iph = &tunnel->parms.iph; > > if (iph->daddr) { > - struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, > + struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4, > + NULL, > iph->daddr, iph->saddr, > 0, 0, > IPPROTO_IPV6, > @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) > } > > if (!tdev && tunnel->parms.link) > - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); > + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); > > if (tdev) { > dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr); > @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) > > static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p) > { > - struct net *net = dev_net(t->dev); > + struct net *net = t->net; > struct sit_net *sitn = net_generic(net, sit_net_id); > > ipip6_tunnel_unlink(sitn, t); > @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev) > dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > dev->iflink = 0; > dev->addr_len = 4; > - dev->features |= NETIF_F_NETNS_LOCAL; > dev->features |= NETIF_F_LLTX; > } > > @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev) > struct ip_tunnel *tunnel = netdev_priv(dev); > > tunnel->dev = dev; > + tunnel->net = dev_net(dev); > > memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4); > memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4); > @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev) > struct sit_net *sitn = net_generic(net, sit_net_id); > > tunnel->dev = dev; > + tunnel->net = dev_net(dev); > strcpy(tunnel->parms.name, dev->name); > > iph->version = 4; > @@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __read_mostly = { > .priority = 2, > }; > > -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) > +static void __net_exit sit_destroy_tunnels(struct net *net, > + struct list_head *head) > { > - int prio; > + struct net_device *dev, *aux; > > - for (prio = 1; prio < 4; prio++) { > - int h; > - for (h = 0; h < HASH_SIZE; h++) { > - struct ip_tunnel *t; > - > - t = rtnl_dereference(sitn->tunnels[prio][h]); > - while (t != NULL) { > - unregister_netdevice_queue(t->dev, head); > - t = rtnl_dereference(t->next); > - } > - } > - } > + for_each_netdev_safe(net, dev, aux) > + if (dev->rtnl_link_ops && > + !strcmp(dev->rtnl_link_ops->kind, "sit")) > + unregister_netdevice_queue(dev, head); This entire idiom change is a bit ugly, and it is wrong. You need to look for two classes of tunnels to take down. Tunnels that originate in net and tunnels whose netdevice is in net. For tunnles that reside in net you should be able to just compare the rtnl_link_ops pointer, rather than an ascii name. Tunnels that originate in this network namespace most definitely need to be taken down as among other things you wisely do not keep a reference count on the originating network namespace. > } > > static int __net_init sit_init_net(struct net *net) > @@ -1598,6 +1598,7 @@ static int __net_init sit_init_net(struct net *net) > goto err_alloc_dev; > } > dev_net_set(sitn->fb_tunnel_dev, net); > + sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; > > err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev); > if (err) > @@ -1627,7 +1628,7 @@ static void __net_exit sit_exit_net(struct net *net) > LIST_HEAD(list); > > rtnl_lock(); > - sit_destroy_tunnels(sitn, &list); > + sit_destroy_tunnels(net, &list); > unregister_netdevice_queue(sitn->fb_tunnel_dev, &list); > unregister_netdevice_many(&list); > rtnl_unlock(); ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns 2013-06-24 19:28 ` Eric W. Biederman @ 2013-06-24 21:11 ` Nicolas Dichtel 2013-06-24 22:42 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-24 21:11 UTC (permalink / raw) To: Eric W. Biederman; +Cc: netdev, davem, bcrl, ravi.mlists Le 24/06/2013 21:28, Eric W. Biederman a écrit : > Nicolas Dichtel <nicolas.dichtel@6wind.com> writes: > >> This patch allows to switch the netns when packet is encapsulated or >> decapsulated. In other word, the encapsulated packet is received in a netns, >> where the lookup is done to find the tunnel. Once the tunnel is found, the >> packet is decapsulated and injecting into the corresponding interface which >> stands to another netns. >> >> When one of the two netns is removed, the tunnel is destroyed. > > I don't see any fundamental problems with this code. There are bugs > with the cleanup noted below. > > The primary sit interface is marked as NETNS_LOCAL which is good. A > comment might be nice explaining the reasonsing but for code > archeologists. Ok. > > Conditionally calling dev_cleanup_skb bugs me. The extra conditional > looks like a maintenance hazard. Unless I have missed some subtle > detail either we don't need the cleanup at all or actually it is a bug > that we aren't scrubbing our packets as they progress through tunnels > even in the same network namespace. > > Can we just make that function the skb scrubbing needed for packets to > traverse a tunnel? > > My concern going into this was that we would get code that would break > because it would not be tested enough. If we can remove the conditional > from dev_cleanup_skb we won't have any code that is conditionally run > and the logic looks simple enough not to bitrot in routine maintenance. My idea was to have the same level of cleanup/scrubbing that when a packet is sent from a netns to another netns through a veth. I cannot use dev_forward_skb() because this function expects to have an ethernet header, it's why I split it in the patch #1. If we leave all information attached to the skb, we may have, for example, an skb with a conntrack from netns1 and a netdevice from netns2. It seems not safe, but maybe I'm wrong. And in fact, the conntrack will not be created in the second netns (nf_conntrack_in() => skb->nfct is not null and not a template => stats ignore++). Another example is a socket from a netns and the netdevice or conntrack from another netns. I was thinking that when a packet enter a namespace, it must not be associated to any object from the previous namespace, it should be like if we just receive it on the host. Nicolas > > Eric > >> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> >> --- >> include/net/ip_tunnels.h | 1 + >> net/ipv4/ip_tunnel.c | 7 ++++++- >> net/ipv6/sit.c | 49 ++++++++++++++++++++++++------------------------ >> 3 files changed, 32 insertions(+), 25 deletions(-) >> >> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h >> index b0d9824..781b3cf 100644 >> --- a/include/net/ip_tunnels.h >> +++ b/include/net/ip_tunnels.h >> @@ -42,6 +42,7 @@ struct ip_tunnel { >> struct ip_tunnel __rcu *next; >> struct hlist_node hash_node; >> struct net_device *dev; >> + struct net *net; /* netns for packet i/o */ >> >> int err_count; /* Number of arrived ICMP errors */ >> unsigned long err_time; /* Time when the last ICMP error >> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c >> index bd227e5..b6a7704 100644 >> --- a/net/ipv4/ip_tunnel.c >> +++ b/net/ipv4/ip_tunnel.c >> @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net, >> >> tunnel = netdev_priv(dev); >> tunnel->parms = *parms; >> + tunnel->net = net; >> >> err = register_netdevice(dev); >> if (err) >> @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, >> tstats->rx_bytes += skb->len; >> u64_stats_update_end(&tstats->syncp); >> >> + if (tunnel->net != dev_net(tunnel->dev)) >> + dev_cleanup_skb(skb); >> + >> if (tunnel->dev->type == ARPHRD_ETHER) { >> skb->protocol = eth_type_trans(skb, tunnel->dev); >> skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); >> @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, >> tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph); >> } >> >> - rt = ip_route_output_tunnel(dev_net(dev), &fl4, >> + rt = ip_route_output_tunnel(tunnel->net, &fl4, >> tunnel->parms.iph.protocol, >> dst, tnl_params->saddr, >> tunnel->parms.o_key, >> @@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[], >> if (ip_tunnel_find(itn, p, dev->type)) >> return -EEXIST; >> >> + nt->net = net; >> nt->parms = *p; >> err = register_netdevice(dev); >> if (err) >> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c >> index f639866..4e03904 100644 >> --- a/net/ipv6/sit.c >> +++ b/net/ipv6/sit.c >> @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t) >> >> static void ipip6_tunnel_uninit(struct net_device *dev) >> { >> - struct net *net = dev_net(dev); >> - struct sit_net *sitn = net_generic(net, sit_net_id); >> + struct ip_tunnel *tunnel = netdev_priv(dev); >> + struct sit_net *sitn = net_generic(tunnel->net, sit_net_id); >> >> if (dev == sitn->fb_tunnel_dev) { >> RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL); >> } else { >> - ipip6_tunnel_unlink(sitn, netdev_priv(dev)); >> - ipip6_tunnel_del_prl(netdev_priv(dev), NULL); >> + ipip6_tunnel_unlink(sitn, tunnel); >> + ipip6_tunnel_del_prl(tunnel, NULL); >> } >> dev_put(dev); >> } >> @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb) >> tstats->rx_packets++; >> tstats->rx_bytes += skb->len; >> >> + if (tunnel->net != dev_net(tunnel->dev)) >> + dev_cleanup_skb(skb); >> netif_rx(skb); >> >> return 0; >> @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, >> goto tx_error; >> } >> >> - rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, >> + rt = ip_route_output_ports(tunnel->net, &fl4, NULL, >> dst, tiph->saddr, >> 0, 0, >> IPPROTO_IPV6, RT_TOS(tos), >> @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, >> tunnel->err_count = 0; >> } >> >> + if (tunnel->net != dev_net(dev)) >> + dev_cleanup_skb(skb); >> + >> /* >> * Okay, now see if we can stuff it in the buffer as-is. >> */ >> @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) >> iph = &tunnel->parms.iph; >> >> if (iph->daddr) { >> - struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, >> + struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4, >> + NULL, >> iph->daddr, iph->saddr, >> 0, 0, >> IPPROTO_IPV6, >> @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) >> } >> >> if (!tdev && tunnel->parms.link) >> - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); >> + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); >> >> if (tdev) { >> dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr); >> @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) >> >> static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p) >> { >> - struct net *net = dev_net(t->dev); >> + struct net *net = t->net; >> struct sit_net *sitn = net_generic(net, sit_net_id); >> >> ipip6_tunnel_unlink(sitn, t); >> @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev) >> dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; >> dev->iflink = 0; >> dev->addr_len = 4; >> - dev->features |= NETIF_F_NETNS_LOCAL; >> dev->features |= NETIF_F_LLTX; >> } >> >> @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev) >> struct ip_tunnel *tunnel = netdev_priv(dev); >> >> tunnel->dev = dev; >> + tunnel->net = dev_net(dev); >> >> memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4); >> memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4); >> @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev) >> struct sit_net *sitn = net_generic(net, sit_net_id); >> >> tunnel->dev = dev; >> + tunnel->net = dev_net(dev); >> strcpy(tunnel->parms.name, dev->name); >> >> iph->version = 4; >> @@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __read_mostly = { >> .priority = 2, >> }; >> >> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) >> +static void __net_exit sit_destroy_tunnels(struct net *net, >> + struct list_head *head) >> { >> - int prio; >> + struct net_device *dev, *aux; >> >> - for (prio = 1; prio < 4; prio++) { >> - int h; >> - for (h = 0; h < HASH_SIZE; h++) { >> - struct ip_tunnel *t; >> - >> - t = rtnl_dereference(sitn->tunnels[prio][h]); >> - while (t != NULL) { >> - unregister_netdevice_queue(t->dev, head); >> - t = rtnl_dereference(t->next); >> - } >> - } >> - } >> + for_each_netdev_safe(net, dev, aux) >> + if (dev->rtnl_link_ops && >> + !strcmp(dev->rtnl_link_ops->kind, "sit")) >> + unregister_netdevice_queue(dev, head); > > This entire idiom change is a bit ugly, and it is wrong. > > You need to look for two classes of tunnels to take down. Tunnels that > originate in net and tunnels whose netdevice is in net. > > For tunnles that reside in net you should be able to just compare the > rtnl_link_ops pointer, rather than an ascii name. > > Tunnels that originate in this network namespace most definitely need to > be taken down as among other things you wisely do not keep a reference > count on the originating network namespace. Yes sure. My beta version was doing the right things, but I change this code before sending the patch :/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns 2013-06-24 21:11 ` Nicolas Dichtel @ 2013-06-24 22:42 ` Eric W. Biederman 2013-06-25 14:10 ` Nicolas Dichtel 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2013-06-24 22:42 UTC (permalink / raw) To: nicolas.dichtel; +Cc: netdev, davem, bcrl, ravi.mlists Nicolas Dichtel <nicolas.dichtel@6wind.com> writes: > Le 24/06/2013 21:28, Eric W. Biederman a écrit : >> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes: >> >>> This patch allows to switch the netns when packet is encapsulated or >>> decapsulated. In other word, the encapsulated packet is received in a netns, >>> where the lookup is done to find the tunnel. Once the tunnel is found, the >>> packet is decapsulated and injecting into the corresponding interface which >>> stands to another netns. >>> >>> When one of the two netns is removed, the tunnel is destroyed. >> >> I don't see any fundamental problems with this code. There are bugs >> with the cleanup noted below. >> >> The primary sit interface is marked as NETNS_LOCAL which is good. A >> comment might be nice explaining the reasonsing but for code >> archeologists. > Ok. > >> >> Conditionally calling dev_cleanup_skb bugs me. The extra conditional >> looks like a maintenance hazard. Unless I have missed some subtle >> detail either we don't need the cleanup at all or actually it is a bug >> that we aren't scrubbing our packets as they progress through tunnels >> even in the same network namespace. >> >> Can we just make that function the skb scrubbing needed for packets to >> traverse a tunnel? >> >> My concern going into this was that we would get code that would break >> because it would not be tested enough. If we can remove the conditional >> from dev_cleanup_skb we won't have any code that is conditionally run >> and the logic looks simple enough not to bitrot in routine maintenance. > My idea was to have the same level of cleanup/scrubbing that when a packet is > sent from a netns to another netns through a veth. I cannot use > dev_forward_skb() because this function expects to have an ethernet header, it's > why I split it in the patch #1. > > If we leave all information attached to the skb, we may have, for example, an > skb with a conntrack from netns1 and a netdevice from netns2. It seems not safe, > but maybe I'm wrong. And in fact, the conntrack will not be created in the > second netns (nf_conntrack_in() => skb->nfct is not null and not a template => > stats ignore++). > Another example is a socket from a netns and the netdevice or conntrack from > another netns. All of that I agree with. I just don't see any need to make that scrubbing/cleaning of the packet conditional. Semantically going through a tunnel is the same as crossing between network namespaces. So you can change >>> + if (tunnel->net != dev_net(tunnel->dev)) >>> + dev_cleanup_skb(skb); to just: dev_cleanup_skb(skb); > I was thinking that when a packet enter a namespace, it must not be associated > to any object from the previous namespace, it should be like if we just receive > it on the host. Overall agree. Tunnels have the same properties. Which leads me to conclude either we are missing something or the current tunnel code is mildly buggy because it does not do this level of scrubbing. Eric >>> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) >>> +static void __net_exit sit_destroy_tunnels(struct net *net, >>> + struct list_head *head) >>> { >>> - int prio; >>> + struct net_device *dev, *aux; >>> >>> - for (prio = 1; prio < 4; prio++) { >>> - int h; >>> - for (h = 0; h < HASH_SIZE; h++) { >>> - struct ip_tunnel *t; >>> - >>> - t = rtnl_dereference(sitn->tunnels[prio][h]); >>> - while (t != NULL) { >>> - unregister_netdevice_queue(t->dev, head); >>> - t = rtnl_dereference(t->next); >>> - } >>> - } >>> - } >>> + for_each_netdev_safe(net, dev, aux) >>> + if (dev->rtnl_link_ops && >>> + !strcmp(dev->rtnl_link_ops->kind, "sit")) >>> + unregister_netdevice_queue(dev, head); >> >> This entire idiom change is a bit ugly, and it is wrong. >> >> You need to look for two classes of tunnels to take down. Tunnels that >> originate in net and tunnels whose netdevice is in net. >> >> For tunnles that reside in net you should be able to just compare the >> rtnl_link_ops pointer, rather than an ascii name. >> >> Tunnels that originate in this network namespace most definitely need to >> be taken down as among other things you wisely do not keep a reference >> count on the originating network namespace. > Yes sure. My beta version was doing the right things, but I change this code > before sending the patch :/ Bahahaha! The dangers of the last minute cleanup. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC PATCH net-next 2/2] sit: add support of x-netns 2013-06-24 22:42 ` Eric W. Biederman @ 2013-06-25 14:10 ` Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 0 siblings, 1 reply; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-25 14:10 UTC (permalink / raw) To: Eric W. Biederman; +Cc: netdev, davem, bcrl, ravi.mlists Le 25/06/2013 00:42, Eric W. Biederman a écrit : > Nicolas Dichtel <nicolas.dichtel@6wind.com> writes: > >> Le 24/06/2013 21:28, Eric W. Biederman a écrit : >>> Nicolas Dichtel <nicolas.dichtel@6wind.com> writes: >>> >>>> This patch allows to switch the netns when packet is encapsulated or >>>> decapsulated. In other word, the encapsulated packet is received in a netns, >>>> where the lookup is done to find the tunnel. Once the tunnel is found, the >>>> packet is decapsulated and injecting into the corresponding interface which >>>> stands to another netns. >>>> >>>> When one of the two netns is removed, the tunnel is destroyed. >>> >>> I don't see any fundamental problems with this code. There are bugs >>> with the cleanup noted below. >>> >>> The primary sit interface is marked as NETNS_LOCAL which is good. A >>> comment might be nice explaining the reasonsing but for code >>> archeologists. >> Ok. >> >>> >>> Conditionally calling dev_cleanup_skb bugs me. The extra conditional >>> looks like a maintenance hazard. Unless I have missed some subtle >>> detail either we don't need the cleanup at all or actually it is a bug >>> that we aren't scrubbing our packets as they progress through tunnels >>> even in the same network namespace. >>> >>> Can we just make that function the skb scrubbing needed for packets to >>> traverse a tunnel? >>> >>> My concern going into this was that we would get code that would break >>> because it would not be tested enough. If we can remove the conditional >>> from dev_cleanup_skb we won't have any code that is conditionally run >>> and the logic looks simple enough not to bitrot in routine maintenance. >> My idea was to have the same level of cleanup/scrubbing that when a packet is >> sent from a netns to another netns through a veth. I cannot use >> dev_forward_skb() because this function expects to have an ethernet header, it's >> why I split it in the patch #1. >> >> If we leave all information attached to the skb, we may have, for example, an >> skb with a conntrack from netns1 and a netdevice from netns2. It seems not safe, >> but maybe I'm wrong. And in fact, the conntrack will not be created in the >> second netns (nf_conntrack_in() => skb->nfct is not null and not a template => >> stats ignore++). >> Another example is a socket from a netns and the netdevice or conntrack from >> another netns. > > All of that I agree with. > > I just don't see any need to make that scrubbing/cleaning of the packet > conditional. > > Semantically going through a tunnel is the same as crossing between > network namespaces. So you can change > >>>> + if (tunnel->net != dev_net(tunnel->dev)) >>>> + dev_cleanup_skb(skb); > > to just: > > dev_cleanup_skb(skb); > >> I was thinking that when a packet enter a namespace, it must not be associated >> to any object from the previous namespace, it should be like if we just receive >> it on the host. > > Overall agree. Tunnels have the same properties. > > Which leads me to conclude either we are missing something or the > current tunnel code is mildly buggy because it does not do this level of > scrubbing. I'm afraid to break an existing scenario, but you're probably right. Let's remove this test. Nicolas > > Eric > >>>> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) >>>> +static void __net_exit sit_destroy_tunnels(struct net *net, >>>> + struct list_head *head) >>>> { >>>> - int prio; >>>> + struct net_device *dev, *aux; >>>> >>>> - for (prio = 1; prio < 4; prio++) { >>>> - int h; >>>> - for (h = 0; h < HASH_SIZE; h++) { >>>> - struct ip_tunnel *t; >>>> - >>>> - t = rtnl_dereference(sitn->tunnels[prio][h]); >>>> - while (t != NULL) { >>>> - unregister_netdevice_queue(t->dev, head); >>>> - t = rtnl_dereference(t->next); >>>> - } >>>> - } >>>> - } >>>> + for_each_netdev_safe(net, dev, aux) >>>> + if (dev->rtnl_link_ops && >>>> + !strcmp(dev->rtnl_link_ops->kind, "sit")) >>>> + unregister_netdevice_queue(dev, head); >>> >>> This entire idiom change is a bit ugly, and it is wrong. >>> >>> You need to look for two classes of tunnels to take down. Tunnels that >>> originate in net and tunnels whose netdevice is in net. >>> >>> For tunnles that reside in net you should be able to just compare the >>> rtnl_link_ops pointer, rather than an ascii name. >>> >>> Tunnels that originate in this network namespace most definitely need to >>> be taken down as among other things you wisely do not keep a reference >>> count on the originating network namespace. >> Yes sure. My beta version was doing the right things, but I change this code >> before sending the patch :/ > > Bahahaha! The dangers of the last minute cleanup. > > Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap 2013-06-25 14:10 ` Nicolas Dichtel @ 2013-06-25 14:24 ` Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 0 siblings, 2 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-25 14:24 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings This patch is a follow up of the thread "switching network namespace midway": http://marc.info/?t=135101459500004&r=1&w=2 The goal of this serie is to add x-netns support for the module sit, ie the encapsulation addresses and the network device are not owned by the same namespace. Example to configure a tunnel: modprobe sit ip netns add netns1 ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249 ip l s sit1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s sit1 up ip netns exec netns1 ip a a dev sit1 192.168.2.123 remote 192.168.2.121 ip netns exec netns1 ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121 Once this serie is approved, I will add the same feature for the module ipip and ip6_tunnel. v2: rename dev_cleanup_skb to skb_scrub_packet move skb_scrub_packet to skbuff.c fix netns cleanup remove string comparison in netns cleanup add a comment about FB device call skb_scrub_packet() unconditionnaly remove 'RFC' include/linux/skbuff.h | 1 + include/net/ip_tunnels.h | 1 + net/core/dev.c | 11 +---------- net/core/skbuff.c | 23 +++++++++++++++++++++++ net/ipv4/ip_tunnel.c | 6 +++++- net/ipv6/sit.c | 40 ++++++++++++++++++++++++++++++---------- 6 files changed, 61 insertions(+), 21 deletions(-) Comments are welcome. Regards, Nicolas ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() 2013-06-25 14:24 ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel @ 2013-06-25 14:24 ` Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 1 sibling, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-25 14:24 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, Nicolas Dichtel The goal of this new function is to perform all needed cleanup before sending an skb into another netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/linux/skbuff.h | 1 + net/core/dev.c | 11 +---------- net/core/skbuff.c | 23 +++++++++++++++++++++++ 3 files changed, 25 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a7393ad..6b06023 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2384,6 +2384,7 @@ extern void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); extern int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); +extern void skb_scrub_packet(struct sk_buff *skb); extern struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); diff --git a/net/core/dev.c b/net/core/dev.c index 722f633..370354a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1652,22 +1652,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } } - skb_orphan(skb); - if (unlikely(!is_skb_forwardable(dev, skb))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; } - skb->skb_iif = 0; - skb_dst_drop(skb); - skb->tstamp.tv64 = 0; - skb->pkt_type = PACKET_HOST; + skb_scrub_packet(skb); skb->protocol = eth_type_trans(skb, dev); - skb->mark = 0; - secpath_reset(skb); - nf_reset(skb); - nf_reset_trace(skb); return netif_rx(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9f73eca..b1fcb87 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3492,3 +3492,26 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, return true; } EXPORT_SYMBOL(skb_try_coalesce); + +/** + * skb_scrub_packet - scrub an skb before sending it to another netns + * + * @skb: buffer to clean + * + * skb_scrub_packet can be used to clean an skb before injecting it in + * another namespace. We have to clear all information in the skb that + * could impact namespace isolation. + */ +void skb_scrub_packet(struct sk_buff *skb) +{ + skb_orphan(skb); + skb->tstamp.tv64 = 0; + skb->pkt_type = PACKET_HOST; + skb->skb_iif = 0; + skb_dst_drop(skb); + skb->mark = 0; + secpath_reset(skb); + nf_reset(skb); + nf_reset_trace(skb); +} +EXPORT_SYMBOL_GPL(skb_scrub_packet); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-25 14:24 ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel @ 2013-06-25 14:24 ` Nicolas Dichtel 2013-06-25 23:56 ` David Miller 1 sibling, 1 reply; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-25 14:24 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip_tunnels.h | 1 + net/ipv4/ip_tunnel.c | 6 +++++- net/ipv6/sit.c | 40 ++++++++++++++++++++++++++++++---------- 3 files changed, 36 insertions(+), 11 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index b0d9824..781b3cf 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -42,6 +42,7 @@ struct ip_tunnel { struct ip_tunnel __rcu *next; struct hlist_node hash_node; struct net_device *dev; + struct net *net; /* netns for packet i/o */ int err_count; /* Number of arrived ICMP errors */ unsigned long err_time; /* Time when the last ICMP error diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index bd227e5..d375e4d 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net, tunnel = netdev_priv(dev); tunnel->parms = *parms; + tunnel->net = net; err = register_netdevice(dev); if (err) @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, tstats->rx_bytes += skb->len; u64_stats_update_end(&tstats->syncp); + skb_scrub_packet(skb); + if (tunnel->dev->type == ARPHRD_ETHER) { skb->protocol = eth_type_trans(skb, tunnel->dev); skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); @@ -541,7 +544,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph); } - rt = ip_route_output_tunnel(dev_net(dev), &fl4, + rt = ip_route_output_tunnel(tunnel->net, &fl4, tunnel->parms.iph.protocol, dst, tnl_params->saddr, tunnel->parms.o_key, @@ -888,6 +891,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[], if (ip_tunnel_find(itn, p, dev->type)) return -EEXIST; + nt->net = net; nt->parms = *p; err = register_netdevice(dev); if (err) diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index f639866..8765f4e 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t) static void ipip6_tunnel_uninit(struct net_device *dev) { - struct net *net = dev_net(dev); - struct sit_net *sitn = net_generic(net, sit_net_id); + struct ip_tunnel *tunnel = netdev_priv(dev); + struct sit_net *sitn = net_generic(tunnel->net, sit_net_id); if (dev == sitn->fb_tunnel_dev) { RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL); } else { - ipip6_tunnel_unlink(sitn, netdev_priv(dev)); - ipip6_tunnel_del_prl(netdev_priv(dev), NULL); + ipip6_tunnel_unlink(sitn, tunnel); + ipip6_tunnel_del_prl(tunnel, NULL); } dev_put(dev); } @@ -621,6 +621,7 @@ static int ipip6_rcv(struct sk_buff *skb) tstats->rx_packets++; tstats->rx_bytes += skb->len; + skb_scrub_packet(skb); netif_rx(skb); return 0; @@ -803,7 +804,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, goto tx_error; } - rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, + rt = ip_route_output_ports(tunnel->net, &fl4, NULL, dst, tiph->saddr, 0, 0, IPPROTO_IPV6, RT_TOS(tos), @@ -858,6 +859,8 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, tunnel->err_count = 0; } + skb_scrub_packet(skb); + /* * Okay, now see if we can stuff it in the buffer as-is. */ @@ -944,7 +947,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) iph = &tunnel->parms.iph; if (iph->daddr) { - struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, + struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4, + NULL, iph->daddr, iph->saddr, 0, 0, IPPROTO_IPV6, @@ -959,7 +963,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) } if (!tdev && tunnel->parms.link) - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); if (tdev) { dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr); @@ -972,7 +976,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p) { - struct net *net = dev_net(t->dev); + struct net *net = t->net; struct sit_net *sitn = net_generic(net, sit_net_id); ipip6_tunnel_unlink(sitn, t); @@ -1248,7 +1252,6 @@ static void ipip6_tunnel_setup(struct net_device *dev) dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->iflink = 0; dev->addr_len = 4; - dev->features |= NETIF_F_NETNS_LOCAL; dev->features |= NETIF_F_LLTX; } @@ -1257,6 +1260,7 @@ static int ipip6_tunnel_init(struct net_device *dev) struct ip_tunnel *tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4); memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4); @@ -1277,6 +1281,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev) struct sit_net *sitn = net_generic(net, sit_net_id); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); iph->version = 4; @@ -1564,8 +1569,14 @@ static struct xfrm_tunnel ipip_handler __read_mostly = { static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) { + struct net *net = dev_net(sitn->fb_tunnel_dev); + struct net_device *dev, *aux; int prio; + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops == &sit_link_ops) + unregister_netdevice_queue(dev, head); + for (prio = 1; prio < 4; prio++) { int h; for (h = 0; h < HASH_SIZE; h++) { @@ -1573,7 +1584,12 @@ static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_hea t = rtnl_dereference(sitn->tunnels[prio][h]); while (t != NULL) { - unregister_netdevice_queue(t->dev, head); + /* If dev is in the same netns, it has already + * been added to the list by the previous loop. + */ + if (dev_net(t->dev) != net) + unregister_netdevice_queue(t->dev, + head); t = rtnl_dereference(t->next); } } @@ -1598,6 +1614,10 @@ static int __net_init sit_init_net(struct net *net) goto err_alloc_dev; } dev_net_set(sitn->fb_tunnel_dev, net); + /* FB netdevice is special: we have one, and only one per netns. + * Allowing to move it to another netns is clearly unsafe. + */ + sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev); if (err) -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-25 14:24 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel @ 2013-06-25 23:56 ` David Miller 2013-06-26 1:35 ` Eric W. Biederman 2013-06-26 13:49 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 0 siblings, 2 replies; 52+ messages in thread From: David Miller @ 2013-06-25 23:56 UTC (permalink / raw) To: nicolas.dichtel; +Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings From: Nicolas Dichtel <nicolas.dichtel@6wind.com> Date: Tue, 25 Jun 2013 16:24:55 +0200 > @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, > tstats->rx_bytes += skb->len; > u64_stats_update_end(&tstats->syncp); > > + skb_scrub_packet(skb); > + > if (tunnel->dev->type == ARPHRD_ETHER) { > skb->protocol = eth_type_trans(skb, tunnel->dev); > skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); I can't see how this can be ok. If something in netfilter depends upon the state you are clearing out here, someone's packet filtering setup is going to break. I'm not applying these patches, sorry. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-25 23:56 ` David Miller @ 2013-06-26 1:35 ` Eric W. Biederman 2013-06-26 5:48 ` David Miller 2013-06-26 13:49 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 1 sibling, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2013-06-26 1:35 UTC (permalink / raw) To: David Miller; +Cc: nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings David Miller <davem@davemloft.net> writes: > From: Nicolas Dichtel <nicolas.dichtel@6wind.com> > Date: Tue, 25 Jun 2013 16:24:55 +0200 > >> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, >> tstats->rx_bytes += skb->len; >> u64_stats_update_end(&tstats->syncp); >> >> + skb_scrub_packet(skb); >> + >> if (tunnel->dev->type == ARPHRD_ETHER) { >> skb->protocol = eth_type_trans(skb, tunnel->dev); >> skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > I can't see how this can be ok. > > If something in netfilter depends upon the state you are clearing out > here, someone's packet filtering setup is going to break. > > I'm not applying these patches, sorry. How can netfilter depend on the state of a packet inside of a tunnel? How can it even make sense? Or is your concern that we unintentionally allowed this in the past so to avoid breaking binary compatibility we should continue in case someone somewhere cares? I really can't see how this could possibly be an intentional feature. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-26 1:35 ` Eric W. Biederman @ 2013-06-26 5:48 ` David Miller 2013-06-26 10:03 ` Eric W. Biederman 0 siblings, 1 reply; 52+ messages in thread From: David Miller @ 2013-06-26 5:48 UTC (permalink / raw) To: ebiederm; +Cc: nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 25 Jun 2013 18:35:30 -0700 > David Miller <davem@davemloft.net> writes: > >> From: Nicolas Dichtel <nicolas.dichtel@6wind.com> >> Date: Tue, 25 Jun 2013 16:24:55 +0200 >> >>> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, >>> tstats->rx_bytes += skb->len; >>> u64_stats_update_end(&tstats->syncp); >>> >>> + skb_scrub_packet(skb); >>> + >>> if (tunnel->dev->type == ARPHRD_ETHER) { >>> skb->protocol = eth_type_trans(skb, tunnel->dev); >>> skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); >> >> I can't see how this can be ok. >> >> If something in netfilter depends upon the state you are clearing out >> here, someone's packet filtering setup is going to break. >> >> I'm not applying these patches, sorry. > > How can netfilter depend on the state of a packet inside of a tunnel? > > How can it even make sense? > > Or is your concern that we unintentionally allowed this in the past so > to avoid breaking binary compatibility we should continue in case > someone somewhere cares? > > I really can't see how this could possibly be an intentional feature. You can make all of these issues go away by only clearing the SKB meta state when namespaces are actually changing as we go through the tunnel. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-26 5:48 ` David Miller @ 2013-06-26 10:03 ` Eric W. Biederman 2013-06-26 10:22 ` Eric Dumazet 0 siblings, 1 reply; 52+ messages in thread From: Eric W. Biederman @ 2013-06-26 10:03 UTC (permalink / raw) To: David Miller; +Cc: nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings David Miller <davem@davemloft.net> writes: > From: ebiederm@xmission.com (Eric W. Biederman) > Date: Tue, 25 Jun 2013 18:35:30 -0700 > >> David Miller <davem@davemloft.net> writes: >> >>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com> >>> Date: Tue, 25 Jun 2013 16:24:55 +0200 >>> >>>> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, >>>> tstats->rx_bytes += skb->len; >>>> u64_stats_update_end(&tstats->syncp); >>>> >>>> + skb_scrub_packet(skb); >>>> + >>>> if (tunnel->dev->type == ARPHRD_ETHER) { >>>> skb->protocol = eth_type_trans(skb, tunnel->dev); >>>> skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); >>> >>> I can't see how this can be ok. >>> >>> If something in netfilter depends upon the state you are clearing out >>> here, someone's packet filtering setup is going to break. >>> >>> I'm not applying these patches, sorry. >> >> How can netfilter depend on the state of a packet inside of a tunnel? >> >> How can it even make sense? >> >> Or is your concern that we unintentionally allowed this in the past so >> to avoid breaking binary compatibility we should continue in case >> someone somewhere cares? >> >> I really can't see how this could possibly be an intentional feature. > > You can make all of these issues go away by only clearing the SKB > meta state when namespaces are actually changing as we go through > the tunnel. I have spent some time thinking about the cases where I have had an opportunity to use the marks on packets and it turns out that if I had been using a tunnel with any of those configurations leaving the marks on would have either broken my configuration or at the very least have required me to make certain I changed those marks. So I really think this is a bug fix, for a long standing bug in a rare corner case of kernel behavior that people just haven't noticed. Which is why I suggested to Nicolas Ditchtel that he remove the test to see if we were changing network namespaces before scrubbing the packet. That said I won't object if Nocolas Ditchel resends his patches with that test put back in. I just think it is silly and when someone finally gets bit by the bug and complains we will have to go through and remove the test. Eric ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-26 10:03 ` Eric W. Biederman @ 2013-06-26 10:22 ` Eric Dumazet 2013-06-26 12:15 ` Nicolas Dichtel 0 siblings, 1 reply; 52+ messages in thread From: Eric Dumazet @ 2013-06-26 10:22 UTC (permalink / raw) To: Eric W. Biederman Cc: David Miller, nicolas.dichtel, netdev, bcrl, ravi.mlists, bhutchings On Wed, 2013-06-26 at 03:03 -0700, Eric W. Biederman wrote: > That said I won't object if Nocolas Ditchel resends his patches with > that test put back in. I just think it is silly and when someone > finally gets bit by the bug and complains we will have to go through and > remove the test. Well, what is the reason skb_orphan() must be called in a tunnel xmit path ? This patch changes more things than what advertised in changelog :( ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-26 10:22 ` Eric Dumazet @ 2013-06-26 12:15 ` Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 0 siblings, 1 reply; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-26 12:15 UTC (permalink / raw) To: Eric Dumazet Cc: Eric W. Biederman, David Miller, netdev, bcrl, ravi.mlists, bhutchings Le 26/06/2013 12:22, Eric Dumazet a écrit : > On Wed, 2013-06-26 at 03:03 -0700, Eric W. Biederman wrote: > > >> That said I won't object if Nocolas Ditchel resends his patches with >> that test put back in. I just think it is silly and when someone >> finally gets bit by the bug and complains we will have to go through and >> remove the test. > > Well, what is the reason skb_orphan() must be called in a tunnel xmit > path ? > > This patch changes more things than what advertised in changelog :( In fact, this is true. If we finally found that skb_scrub_packet() is needed in all cases (not only when changing namespace), this will be another patch. I will resend the serie with the test put back. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap 2013-06-26 12:15 ` Nicolas Dichtel @ 2013-06-26 14:11 ` Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel ` (2 more replies) 0 siblings, 3 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-26 14:11 UTC (permalink / raw) To: davem; +Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet This patch is a follow up of the thread "switching network namespace midway": http://marc.info/?t=135101459500004&r=1&w=2 The goal of this serie is to add x-netns support for the module sit, ie the encapsulation addresses and the network device are not owned by the same namespace. Example to configure a tunnel: modprobe sit ip netns add netns1 ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249 ip l s sit1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s sit1 up ip netns exec netns1 ip a a dev sit1 192.168.2.123 remote 192.168.2.121 ip netns exec netns1 ip -6 a a dev sit1 2001:1234::123 remote 2001:1234::121 Once this serie is approved, I will add the same feature for the module ipip and ip6_tunnel. v3: put again the test about netns before calling skb_scrub_packet() add a missing skb_scrub_packet() call in ip_tunnel_xmit() v2: rename dev_cleanup_skb to skb_scrub_packet move skb_scrub_packet to skbuff.c fix netns cleanup remove string comparison in netns cleanup add a comment about FB device call skb_scrub_packet() unconditionnaly remove 'RFC' include/linux/skbuff.h | 1 + include/net/ip_tunnels.h | 1 + net/core/dev.c | 11 +---------- net/core/skbuff.c | 23 +++++++++++++++++++++++ net/ipv4/ip_tunnel.c | 10 +++++++++- net/ipv6/sit.c | 42 ++++++++++++++++++++++++++++++++---------- 6 files changed, 67 insertions(+), 21 deletions(-) Comments are welcome. Regards, Nicolas ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() 2013-06-26 14:11 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel @ 2013-06-26 14:11 ` Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 2013-06-28 5:36 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller 2 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-26 14:11 UTC (permalink / raw) To: davem Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel The goal of this new function is to perform all needed cleanup before sending an skb into another netns. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/linux/skbuff.h | 1 + net/core/dev.c | 11 +---------- net/core/skbuff.c | 23 +++++++++++++++++++++++ 3 files changed, 25 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a7393ad..6b06023 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2384,6 +2384,7 @@ extern void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); extern int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); +extern void skb_scrub_packet(struct sk_buff *skb); extern struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); diff --git a/net/core/dev.c b/net/core/dev.c index 722f633..370354a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1652,22 +1652,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } } - skb_orphan(skb); - if (unlikely(!is_skb_forwardable(dev, skb))) { atomic_long_inc(&dev->rx_dropped); kfree_skb(skb); return NET_RX_DROP; } - skb->skb_iif = 0; - skb_dst_drop(skb); - skb->tstamp.tv64 = 0; - skb->pkt_type = PACKET_HOST; + skb_scrub_packet(skb); skb->protocol = eth_type_trans(skb, dev); - skb->mark = 0; - secpath_reset(skb); - nf_reset(skb); - nf_reset_trace(skb); return netif_rx(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9f73eca..b1fcb87 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3492,3 +3492,26 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, return true; } EXPORT_SYMBOL(skb_try_coalesce); + +/** + * skb_scrub_packet - scrub an skb before sending it to another netns + * + * @skb: buffer to clean + * + * skb_scrub_packet can be used to clean an skb before injecting it in + * another namespace. We have to clear all information in the skb that + * could impact namespace isolation. + */ +void skb_scrub_packet(struct sk_buff *skb) +{ + skb_orphan(skb); + skb->tstamp.tv64 = 0; + skb->pkt_type = PACKET_HOST; + skb->skb_iif = 0; + skb_dst_drop(skb); + skb->mark = 0; + secpath_reset(skb); + nf_reset(skb); + nf_reset_trace(skb); +} +EXPORT_SYMBOL_GPL(skb_scrub_packet); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH v3 net-next 2/2] sit: add support of x-netns 2013-06-26 14:11 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel @ 2013-06-26 14:11 ` Nicolas Dichtel 2013-06-28 5:36 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller 2 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-26 14:11 UTC (permalink / raw) To: davem Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip_tunnels.h | 1 + net/ipv4/ip_tunnel.c | 10 +++++++++- net/ipv6/sit.c | 42 ++++++++++++++++++++++++++++++++---------- 3 files changed, 42 insertions(+), 11 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index b0d9824..781b3cf 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -42,6 +42,7 @@ struct ip_tunnel { struct ip_tunnel __rcu *next; struct hlist_node hash_node; struct net_device *dev; + struct net *net; /* netns for packet i/o */ int err_count; /* Number of arrived ICMP errors */ unsigned long err_time; /* Time when the last ICMP error diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index bd227e5..5ae01e3 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(struct net *net, tunnel = netdev_priv(dev); tunnel->parms = *parms; + tunnel->net = net; err = register_netdevice(dev); if (err) @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, tstats->rx_bytes += skb->len; u64_stats_update_end(&tstats->syncp); + if (tunnel->net != dev_net(tunnel->dev)) + skb_scrub_packet(skb); + if (tunnel->dev->type == ARPHRD_ETHER) { skb->protocol = eth_type_trans(skb, tunnel->dev); skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, tos = ipv6_get_dsfield((const struct ipv6hdr *)inner_iph); } - rt = ip_route_output_tunnel(dev_net(dev), &fl4, + rt = ip_route_output_tunnel(tunnel->net, &fl4, tunnel->parms.iph.protocol, dst, tnl_params->saddr, tunnel->parms.o_key, @@ -602,6 +606,9 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, } #endif + if (tunnel->net != dev_net(dev)) + skb_scrub_packet(skb); + if (tunnel->err_count > 0) { if (time_before(jiffies, tunnel->err_time + IPTUNNEL_ERR_TIMEO)) { @@ -888,6 +895,7 @@ int ip_tunnel_newlink(struct net_device *dev, struct nlattr *tb[], if (ip_tunnel_find(itn, p, dev->type)) return -EEXIST; + nt->net = net; nt->parms = *p; err = register_netdevice(dev); if (err) diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index f639866..97a0bfe 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struct iphdr *iph, struct ip_tunnel *t) static void ipip6_tunnel_uninit(struct net_device *dev) { - struct net *net = dev_net(dev); - struct sit_net *sitn = net_generic(net, sit_net_id); + struct ip_tunnel *tunnel = netdev_priv(dev); + struct sit_net *sitn = net_generic(tunnel->net, sit_net_id); if (dev == sitn->fb_tunnel_dev) { RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL); } else { - ipip6_tunnel_unlink(sitn, netdev_priv(dev)); - ipip6_tunnel_del_prl(netdev_priv(dev), NULL); + ipip6_tunnel_unlink(sitn, tunnel); + ipip6_tunnel_del_prl(tunnel, NULL); } dev_put(dev); } @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb) tstats->rx_packets++; tstats->rx_bytes += skb->len; + if (tunnel->net != dev_net(tunnel->dev)) + skb_scrub_packet(skb); netif_rx(skb); return 0; @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, goto tx_error; } - rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, + rt = ip_route_output_ports(tunnel->net, &fl4, NULL, dst, tiph->saddr, 0, 0, IPPROTO_IPV6, RT_TOS(tos), @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, tunnel->err_count = 0; } + if (tunnel->net != dev_net(dev)) + skb_scrub_packet(skb); + /* * Okay, now see if we can stuff it in the buffer as-is. */ @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) iph = &tunnel->parms.iph; if (iph->daddr) { - struct rtable *rt = ip_route_output_ports(dev_net(dev), &fl4, NULL, + struct rtable *rt = ip_route_output_ports(tunnel->net, &fl4, + NULL, iph->daddr, iph->saddr, 0, 0, IPPROTO_IPV6, @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) } if (!tdev && tunnel->parms.link) - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); if (tdev) { dev->hard_header_len = tdev->hard_header_len + sizeof(struct iphdr); @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_device *dev) static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tunnel_parm *p) { - struct net *net = dev_net(t->dev); + struct net *net = t->net; struct sit_net *sitn = net_generic(net, sit_net_id); ipip6_tunnel_unlink(sitn, t); @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_device *dev) dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->iflink = 0; dev->addr_len = 4; - dev->features |= NETIF_F_NETNS_LOCAL; dev->features |= NETIF_F_LLTX; } @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device *dev) struct ip_tunnel *tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4); memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4); @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(struct net_device *dev) struct sit_net *sitn = net_generic(net, sit_net_id); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); iph->version = 4; @@ -1564,8 +1571,14 @@ static struct xfrm_tunnel ipip_handler __read_mostly = { static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_head *head) { + struct net *net = dev_net(sitn->fb_tunnel_dev); + struct net_device *dev, *aux; int prio; + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops == &sit_link_ops) + unregister_netdevice_queue(dev, head); + for (prio = 1; prio < 4; prio++) { int h; for (h = 0; h < HASH_SIZE; h++) { @@ -1573,7 +1586,12 @@ static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_hea t = rtnl_dereference(sitn->tunnels[prio][h]); while (t != NULL) { - unregister_netdevice_queue(t->dev, head); + /* If dev is in the same netns, it has already + * been added to the list by the previous loop. + */ + if (dev_net(t->dev) != net) + unregister_netdevice_queue(t->dev, + head); t = rtnl_dereference(t->next); } } @@ -1598,6 +1616,10 @@ static int __net_init sit_init_net(struct net *net) goto err_alloc_dev; } dev_net_set(sitn->fb_tunnel_dev, net); + /* FB netdevice is special: we have one, and only one per netns. + * Allowing to move it to another netns is clearly unsafe. + */ + sitn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; err = ipip6_fb_tunnel_init(sitn->fb_tunnel_dev); if (err) -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap 2013-06-26 14:11 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 2/2] sit: add support of x-netns Nicolas Dichtel @ 2013-06-28 5:36 ` David Miller 2013-07-03 15:00 ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel 2 siblings, 1 reply; 52+ messages in thread From: David Miller @ 2013-06-28 5:36 UTC (permalink / raw) To: nicolas.dichtel Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet From: Nicolas Dichtel <nicolas.dichtel@6wind.com> Date: Wed, 26 Jun 2013 16:11:26 +0200 > This patch is a follow up of the thread "switching network namespace midway": > http://marc.info/?t=135101459500004&r=1&w=2 > > The goal of this serie is to add x-netns support for the module sit, ie the > encapsulation addresses and the network device are not owned by the same > namespace. Ok, applied. And yes I agree we should look into making tunnel's behave consistently wrt. SKB orphaning, cleaning netfilter state, etc. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap 2013-06-28 5:36 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller @ 2013-07-03 15:00 ` Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel ` (3 more replies) 0 siblings, 4 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet This patch is a follow up of the previous serie witch add this functionality for sit tunnels. The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the encapsulation addresses and the network device are not owned by the same namespace. Note that the first patch is a fix of the previous serie. Example to configure an ipip tunnel: modprobe ipip ip netns add netns1 ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 ip l s ipip1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s ipip1 up ip netns exec netns1 ip a a dev ipip1 192.168.2.123 remote 192.168.2.121 or an ip6_tunnel: modprobe ip6_tunnel ip netns add netns1 ip link add ip6tnl1 type ip6tnl remote 2001:660:3008:c1c3::121 local 2001:660:3008:c1c3::123 ip l s ip6tnl1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s ip6tnl1 up ip netns exec netns1 ip a a dev ip6tnl1 192.168.1.123 remote 192.168.1.121 ip netns exec netns1 ip -6 a a dev ip6tnl1 2001:1235::123 remote 2001:1235::121 include/net/ip6_tunnel.h | 1 + include/net/ip_tunnels.h | 2 +- net/ipv4/ip_gre.c | 4 ++-- net/ipv4/ip_tunnel.c | 42 +++++++++++++++++++++++++++--------------- net/ipv4/ipip.c | 3 +-- net/ipv6/ip6_gre.c | 5 +++++ net/ipv6/ip6_tunnel.c | 41 +++++++++++++++++++++++++++++++---------- net/ipv6/sit.c | 4 ++-- 8 files changed, 70 insertions(+), 32 deletions(-) Comments are welcome. Regards, Nicolas ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH net-next 1/3] sit: fix tunnel update via netlink 2013-07-03 15:00 ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel @ 2013-07-03 15:00 ` Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 2/3] ipip: add x-netns support Nicolas Dichtel ` (2 subsequent siblings) 3 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel The device can stand in another netns, hence we need to do the lookup in netns tunnel->net. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- net/ipv6/sit.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 85ff37b..a3437a4 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1426,9 +1426,9 @@ static int ipip6_newlink(struct net *src_net, struct net_device *dev, static int ipip6_changelink(struct net_device *dev, struct nlattr *tb[], struct nlattr *data[]) { - struct ip_tunnel *t; + struct ip_tunnel *t = netdev_priv(dev); struct ip_tunnel_parm p; - struct net *net = dev_net(dev); + struct net *net = t->net; struct sit_net *sitn = net_generic(net, sit_net_id); #ifdef CONFIG_IPV6_SIT_6RD struct ip_tunnel_6rd ip6rd; -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH net-next 2/3] ipip: add x-netns support 2013-07-03 15:00 ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel @ 2013-07-03 15:00 ` Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 3/3] ip6tnl: " Nicolas Dichtel 2013-07-04 21:56 ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller 3 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip_tunnels.h | 2 +- net/ipv4/ip_gre.c | 4 ++-- net/ipv4/ip_tunnel.c | 42 +++++++++++++++++++++++++++--------------- net/ipv4/ipip.c | 3 +-- 4 files changed, 31 insertions(+), 20 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index 781b3cf..189fcac 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -102,7 +102,7 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head); int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, struct rtnl_link_ops *ops, char *devname); -void ip_tunnel_delete_net(struct ip_tunnel_net *itn); +void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops); void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, const struct iphdr *tnl_params, const u8 protocol); diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 1f6eab6..bc3a765 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -534,7 +534,7 @@ static int __net_init ipgre_init_net(struct net *net) static void __net_exit ipgre_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, ipgre_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &ipgre_link_ops); } static struct pernet_operations ipgre_net_ops = { @@ -767,7 +767,7 @@ static int __net_init ipgre_tap_init_net(struct net *net) static void __net_exit ipgre_tap_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, gre_tap_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &ipgre_tap_ops); } static struct pernet_operations ipgre_tap_net_ops = { diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 945734b..b470d78 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -350,7 +350,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev) struct flowi4 fl4; struct rtable *rt; - rt = ip_route_output_tunnel(dev_net(dev), &fl4, + rt = ip_route_output_tunnel(tunnel->net, &fl4, tunnel->parms.iph.protocol, iph->daddr, iph->saddr, tunnel->parms.o_key, @@ -365,7 +365,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev) } if (!tdev && tunnel->parms.link) - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); if (tdev) { hlen = tdev->hard_header_len + tdev->needed_headroom; @@ -653,7 +653,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, } } - err = iptunnel_xmit(dev_net(dev), rt, skb, + err = iptunnel_xmit(tunnel->net, rt, skb, fl4.saddr, fl4.daddr, protocol, ip_tunnel_ecn_encap(tos, inner_iph, skb), ttl, df); iptunnel_xmit_stats(err, &dev->stats, dev->tstats); @@ -820,11 +820,10 @@ static void ip_tunnel_dev_free(struct net_device *dev) void ip_tunnel_dellink(struct net_device *dev, struct list_head *head) { - struct net *net = dev_net(dev); struct ip_tunnel *tunnel = netdev_priv(dev); struct ip_tunnel_net *itn; - itn = net_generic(net, tunnel->ip_tnl_net_id); + itn = net_generic(tunnel->net, tunnel->ip_tnl_net_id); if (itn->fb_tunnel_dev != dev) { ip_tunnel_del(netdev_priv(dev)); @@ -853,6 +852,10 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, rtnl_lock(); itn->fb_tunnel_dev = __ip_tunnel_create(net, ops, &parms); + /* FB netdevice is special: we have one, and only one per netns. + * Allowing to move it to another netns is clearly unsafe. + */ + itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; rtnl_unlock(); if (IS_ERR(itn->fb_tunnel_dev)) { kfree(itn->tunnels); @@ -863,28 +866,39 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, } EXPORT_SYMBOL_GPL(ip_tunnel_init_net); -static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head) +static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head, + struct rtnl_link_ops *ops) { + struct net *net = dev_net(itn->fb_tunnel_dev); + struct net_device *dev, *aux; int h; + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops == ops) + unregister_netdevice_queue(dev, head); + for (h = 0; h < IP_TNL_HASH_SIZE; h++) { struct ip_tunnel *t; struct hlist_node *n; struct hlist_head *thead = &itn->tunnels[h]; hlist_for_each_entry_safe(t, n, thead, hash_node) - unregister_netdevice_queue(t->dev, head); + /* If dev is in the same netns, it has already + * been added to the list by the previous loop. + */ + if (dev_net(t->dev) != net) + unregister_netdevice_queue(t->dev, head); } if (itn->fb_tunnel_dev) unregister_netdevice_queue(itn->fb_tunnel_dev, head); } -void ip_tunnel_delete_net(struct ip_tunnel_net *itn) +void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops) { LIST_HEAD(list); rtnl_lock(); - ip_tunnel_destroy(itn, &list); + ip_tunnel_destroy(itn, &list, ops); unregister_netdevice_many(&list); rtnl_unlock(); kfree(itn->tunnels); @@ -929,23 +943,21 @@ EXPORT_SYMBOL_GPL(ip_tunnel_newlink); int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[], struct ip_tunnel_parm *p) { - struct ip_tunnel *t, *nt; - struct net *net = dev_net(dev); + struct ip_tunnel *t; struct ip_tunnel *tunnel = netdev_priv(dev); + struct net *net = tunnel->net; struct ip_tunnel_net *itn = net_generic(net, tunnel->ip_tnl_net_id); if (dev == itn->fb_tunnel_dev) return -EINVAL; - nt = netdev_priv(dev); - t = ip_tunnel_find(itn, p, dev->type); if (t) { if (t->dev != dev) return -EEXIST; } else { - t = nt; + t = tunnel; if (dev->type != ARPHRD_ETHER) { unsigned int nflags = 0; @@ -994,8 +1006,8 @@ EXPORT_SYMBOL_GPL(ip_tunnel_init); void ip_tunnel_uninit(struct net_device *dev) { - struct net *net = dev_net(dev); struct ip_tunnel *tunnel = netdev_priv(dev); + struct net *net = tunnel->net; struct ip_tunnel_net *itn; itn = net_generic(net, tunnel->ip_tnl_net_id); diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index 51fc2a1..87bd295 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -286,7 +286,6 @@ static void ipip_tunnel_setup(struct net_device *dev) dev->flags = IFF_NOARP; dev->iflink = 0; dev->addr_len = 4; - dev->features |= NETIF_F_NETNS_LOCAL; dev->features |= NETIF_F_LLTX; dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; @@ -437,7 +436,7 @@ static int __net_init ipip_init_net(struct net *net) static void __net_exit ipip_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, ipip_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &ipip_link_ops); } static struct pernet_operations ipip_net_ops = { -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH net-next 3/3] ip6tnl: add x-netns support 2013-07-03 15:00 ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 2/3] ipip: add x-netns support Nicolas Dichtel @ 2013-07-03 15:00 ` Nicolas Dichtel 2013-07-04 21:56 ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller 3 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-07-03 15:00 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip6_tunnel.h | 1 + net/ipv6/ip6_gre.c | 5 +++++ net/ipv6/ip6_tunnel.c | 41 +++++++++++++++++++++++++++++++---------- 3 files changed, 37 insertions(+), 10 deletions(-) diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h index 4da5de1..2265b0b 100644 --- a/include/net/ip6_tunnel.h +++ b/include/net/ip6_tunnel.h @@ -36,6 +36,7 @@ struct __ip6_tnl_parm { struct ip6_tnl { struct ip6_tnl __rcu *next; /* next tunnel in list */ struct net_device *dev; /* virtual device associated with tunnel */ + struct net *net; /* netns for packet i/o */ struct __ip6_tnl_parm parms; /* tunnel configuration parameters */ struct flowi fl; /* flowi template for xmit */ struct dst_entry *dst_cache; /* cached dst */ diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index ecd6073..f2d0a42 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -335,6 +335,7 @@ static struct ip6_tnl *ip6gre_tunnel_locate(struct net *net, dev->rtnl_link_ops = &ip6gre_link_ops; nt->dev = dev; + nt->net = dev_net(dev); ip6gre_tnl_link_config(nt, 1); if (register_netdevice(dev) < 0) @@ -1255,6 +1256,7 @@ static int ip6gre_tunnel_init(struct net_device *dev) tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr)); @@ -1275,6 +1277,7 @@ static void ip6gre_fb_tunnel_init(struct net_device *dev) struct ip6_tnl *tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); tunnel->hlen = sizeof(struct ipv6hdr) + 4; @@ -1450,6 +1453,7 @@ static int ip6gre_tap_init(struct net_device *dev) tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); ip6gre_tnl_link_config(tunnel, 1); @@ -1501,6 +1505,7 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev, eth_hw_addr_random(dev); nt->dev = dev; + nt->net = dev_net(dev); ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]); /* Can use a lockless transmit, unless we generate output sequences */ diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 1e55866..d773a3c 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -315,6 +315,7 @@ static struct ip6_tnl *ip6_tnl_create(struct net *net, struct __ip6_tnl_parm *p) t = netdev_priv(dev); t->parms = *p; + t->net = dev_net(dev); err = ip6_tnl_create2(dev); if (err < 0) goto failed_free; @@ -374,7 +375,7 @@ static void ip6_tnl_dev_uninit(struct net_device *dev) { struct ip6_tnl *t = netdev_priv(dev); - struct net *net = dev_net(dev); + struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); if (dev == ip6n->fb_tnl_dev) @@ -741,7 +742,7 @@ int ip6_tnl_rcv_ctl(struct ip6_tnl *t, { struct __ip6_tnl_parm *p = &t->parms; int ret = 0; - struct net *net = dev_net(t->dev); + struct net *net = t->net; if ((p->flags & IP6_TNL_F_CAP_RCV) || ((p->flags & IP6_TNL_F_CAP_PER_PACKET) && @@ -827,6 +828,9 @@ static int ip6_tnl_rcv(struct sk_buff *skb, __u16 protocol, tstats->rx_packets++; tstats->rx_bytes += skb->len; + if (t->net != dev_net(t->dev)) + skb_scrub_packet(skb); + netif_rx(skb); rcu_read_unlock(); @@ -895,7 +899,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t) { struct __ip6_tnl_parm *p = &t->parms; int ret = 0; - struct net *net = dev_net(t->dev); + struct net *net = t->net; if (p->flags & IP6_TNL_F_CAP_XMIT) { struct net_device *ldev = NULL; @@ -945,8 +949,8 @@ static int ip6_tnl_xmit2(struct sk_buff *skb, int encap_limit, __u32 *pmtu) { - struct net *net = dev_net(dev); struct ip6_tnl *t = netdev_priv(dev); + struct net *net = t->net; struct net_device_stats *stats = &t->dev->stats; struct ipv6hdr *ipv6h = ipv6_hdr(skb); struct ipv6_tel_txoption opt; @@ -996,6 +1000,9 @@ static int ip6_tnl_xmit2(struct sk_buff *skb, goto tx_err_dst_release; } + if (t->net != dev_net(dev)) + skb_scrub_packet(skb); + /* * Okay, now see if we can stuff it in the buffer as-is. */ @@ -1202,7 +1209,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t) int strict = (ipv6_addr_type(&p->raddr) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL)); - struct rt6_info *rt = rt6_lookup(dev_net(dev), + struct rt6_info *rt = rt6_lookup(t->net, &p->raddr, &p->laddr, p->link, strict); @@ -1251,7 +1258,7 @@ ip6_tnl_change(struct ip6_tnl *t, const struct __ip6_tnl_parm *p) static int ip6_tnl_update(struct ip6_tnl *t, struct __ip6_tnl_parm *p) { - struct net *net = dev_net(t->dev); + struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); int err; @@ -1463,7 +1470,6 @@ static void ip6_tnl_dev_setup(struct net_device *dev) dev->mtu-=8; dev->flags |= IFF_NOARP; dev->addr_len = sizeof(struct in6_addr); - dev->features |= NETIF_F_NETNS_LOCAL; dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } @@ -1479,6 +1485,7 @@ ip6_tnl_dev_init_gen(struct net_device *dev) struct ip6_tnl *t = netdev_priv(dev); t->dev = dev; + t->net = dev_net(dev); dev->tstats = alloc_percpu(struct pcpu_tstats); if (!dev->tstats) return -ENOMEM; @@ -1596,9 +1603,9 @@ static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev, static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[], struct nlattr *data[]) { - struct ip6_tnl *t; + struct ip6_tnl *t = netdev_priv(dev); struct __ip6_tnl_parm p; - struct net *net = dev_net(dev); + struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); if (dev == ip6n->fb_tnl_dev) @@ -1699,14 +1706,24 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly = { static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n) { + struct net *net = dev_net(ip6n->fb_tnl_dev); + struct net_device *dev, *aux; int h; struct ip6_tnl *t; LIST_HEAD(list); + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops == &ip6_link_ops) + unregister_netdevice_queue(dev, &list); + for (h = 0; h < HASH_SIZE; h++) { t = rtnl_dereference(ip6n->tnls_r_l[h]); while (t != NULL) { - unregister_netdevice_queue(t->dev, &list); + /* If dev is in the same netns, it has already + * been added to the list by the previous loop. + */ + if (dev_net(t->dev) != net) + unregister_netdevice_queue(t->dev, &list); t = rtnl_dereference(t->next); } } @@ -1732,6 +1749,10 @@ static int __net_init ip6_tnl_init_net(struct net *net) if (!ip6n->fb_tnl_dev) goto err_alloc_dev; dev_net_set(ip6n->fb_tnl_dev, net); + /* FB netdevice is special: we have one, and only one per netns. + * Allowing to move it to another netns is clearly unsafe. + */ + ip6n->fb_tnl_dev->features |= NETIF_F_NETNS_LOCAL; err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev); if (err < 0) -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap 2013-07-03 15:00 ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel ` (2 preceding siblings ...) 2013-07-03 15:00 ` [PATCH net-next 3/3] ip6tnl: " Nicolas Dichtel @ 2013-07-04 21:56 ` David Miller 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel 3 siblings, 1 reply; 52+ messages in thread From: David Miller @ 2013-07-04 21:56 UTC (permalink / raw) To: nicolas.dichtel Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet From: Nicolas Dichtel <nicolas.dichtel@6wind.com> Date: Wed, 3 Jul 2013 17:00:33 +0200 > > This patch is a follow up of the previous serie witch add this functionality > for sit tunnels. > > The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the > encapsulation addresses and the network device are not owned by the same > namespace. > > Note that the first patch is a fix of the previous serie. The first patch, as it is a bug fix, is fine and is applied. The rest will have to wait until the net-next tree opens again, sorry. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap 2013-07-04 21:56 ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller @ 2013-08-13 15:51 ` Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel ` (4 more replies) 0 siblings, 5 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw) To: netdev; +Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet This serie is a follow up of the previous serie witch adds this functionality for sit tunnels. The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the encapsulation addresses and the network device are not owned by the same namespace. Note that the two first patches are cleanup. Example to configure an ipip tunnel: modprobe ipip ip netns add netns1 ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 ip l s ipip1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s ipip1 up ip netns exec netns1 ip a a dev ipip1 192.168.2.123 remote 192.168.2.121 or an ip6_tunnel: modprobe ip6_tunnel ip netns add netns1 ip link add ip6tnl1 type ip6tnl remote 2001:660:3008:c1c3::121 local 2001:660:3008:c1c3::123 ip l s ip6tnl1 netns netns1 ip netns exec netns1 ip l s lo up ip netns exec netns1 ip l s ip6tnl1 up ip netns exec netns1 ip a a dev ip6tnl1 192.168.1.123 remote 192.168.1.121 ip netns exec netns1 ip -6 a a dev ip6tnl1 2001:1235::123 remote 2001:1235::121 v2: remove the patch 1/3 of the v1 serie (already included) use net_eq() add patch 1/4 and 2/4 include/net/ip6_tunnel.h | 1 + include/net/ip_tunnels.h | 2 +- net/core/dev.c | 6 +++--- net/ipv4/ip_gre.c | 4 ++-- net/ipv4/ip_tunnel.c | 52 ++++++++++++++++++++++++++++++------------------ net/ipv4/ip_vti.c | 2 +- net/ipv4/ipip.c | 3 +-- net/ipv6/ip6_gre.c | 5 +++++ net/ipv6/ip6_tunnel.c | 41 ++++++++++++++++++++++++++++---------- net/ipv6/sit.c | 6 +++--- 10 files changed, 81 insertions(+), 41 deletions(-) Comments are welcome. Regards, Nicolas ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel @ 2013-08-13 15:51 ` Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns Nicolas Dichtel ` (3 subsequent siblings) 4 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans() set pkt_type. In fact, we should force pkt_type to PACKET_HOST, so move the call after eth_type_trans(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- net/core/dev.c | 6 +++--- net/ipv4/ip_tunnel.c | 7 ++++--- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index 58eb802584b9..1ed2b66a10a6 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1691,13 +1691,13 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) kfree_skb(skb); return NET_RX_DROP; } - skb_scrub_packet(skb); skb->protocol = eth_type_trans(skb, dev); /* eth_type_trans() can set pkt_type. - * clear pkt_type _after_ calling eth_type_trans() + * call skb_scrub_packet() after it to clear pkt_type _after_ calling + * eth_type_trans(). */ - skb->pkt_type = PACKET_HOST; + skb_scrub_packet(skb); return netif_rx(skb); } diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index 9fdf8a6d95f3..fbc1094964bf 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -454,15 +454,16 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, tstats->rx_bytes += skb->len; u64_stats_update_end(&tstats->syncp); - if (tunnel->net != dev_net(tunnel->dev)) - skb_scrub_packet(skb); - if (tunnel->dev->type == ARPHRD_ETHER) { skb->protocol = eth_type_trans(skb, tunnel->dev); skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); } else { skb->dev = tunnel->dev; } + + if (tunnel->net != dev_net(tunnel->dev)) + skb_scrub_packet(skb); + gro_cells_receive(&tunnel->gro_cells, skb); return 0; -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel @ 2013-08-13 15:51 ` Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 3/4] ipip: add x-netns support Nicolas Dichtel ` (2 subsequent siblings) 4 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel It's better to use available helpers for these tests. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- net/ipv4/ip_tunnel.c | 4 ++-- net/ipv6/sit.c | 6 +++--- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index fbc1094964bf..a351a003ee6b 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -461,7 +461,7 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, skb->dev = tunnel->dev; } - if (tunnel->net != dev_net(tunnel->dev)) + if (!net_eq(tunnel->net, dev_net(tunnel->dev))) skb_scrub_packet(skb); gro_cells_receive(&tunnel->gro_cells, skb); @@ -614,7 +614,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, goto tx_error; } - if (tunnel->net != dev_net(dev)) + if (!net_eq(tunnel->net, dev_net(dev))) skb_scrub_packet(skb); if (tunnel->err_count > 0) { diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index a3437a4cd07e..f18f842ac893 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -621,7 +621,7 @@ static int ipip6_rcv(struct sk_buff *skb) tstats->rx_packets++; tstats->rx_bytes += skb->len; - if (tunnel->net != dev_net(tunnel->dev)) + if (!net_eq(tunnel->net, dev_net(tunnel->dev))) skb_scrub_packet(skb); netif_rx(skb); @@ -860,7 +860,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb, tunnel->err_count = 0; } - if (tunnel->net != dev_net(dev)) + if (!net_eq(tunnel->net, dev_net(dev))) skb_scrub_packet(skb); /* @@ -1589,7 +1589,7 @@ static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, struct list_hea /* If dev is in the same netns, it has already * been added to the list by the previous loop. */ - if (dev_net(t->dev) != net) + if (!net_eq(dev_net(t->dev), net)) unregister_netdevice_queue(t->dev, head); t = rtnl_dereference(t->next); -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH net-next v2 3/4] ipip: add x-netns support 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns Nicolas Dichtel @ 2013-08-13 15:51 ` Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 4/4] ip6tnl: " Nicolas Dichtel 2013-08-15 8:01 ` [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap David Miller 4 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip_tunnels.h | 2 +- net/ipv4/ip_gre.c | 4 ++-- net/ipv4/ip_tunnel.c | 43 ++++++++++++++++++++++++++++--------------- net/ipv4/ip_vti.c | 2 +- net/ipv4/ipip.c | 3 +-- 5 files changed, 33 insertions(+), 21 deletions(-) diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index c6acd9f8f877..5a76f2bef822 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -102,7 +102,7 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head); int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, struct rtnl_link_ops *ops, char *devname); -void ip_tunnel_delete_net(struct ip_tunnel_net *itn); +void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops); void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, const struct iphdr *tnl_params, const u8 protocol); diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 1f6eab66f7ce..bc3a76521deb 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -534,7 +534,7 @@ static int __net_init ipgre_init_net(struct net *net) static void __net_exit ipgre_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, ipgre_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &ipgre_link_ops); } static struct pernet_operations ipgre_net_ops = { @@ -767,7 +767,7 @@ static int __net_init ipgre_tap_init_net(struct net *net) static void __net_exit ipgre_tap_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, gre_tap_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &ipgre_tap_ops); } static struct pernet_operations ipgre_tap_net_ops = { diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c index a351a003ee6b..a4d9126c7b51 100644 --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@ -350,7 +350,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev) struct flowi4 fl4; struct rtable *rt; - rt = ip_route_output_tunnel(dev_net(dev), &fl4, + rt = ip_route_output_tunnel(tunnel->net, &fl4, tunnel->parms.iph.protocol, iph->daddr, iph->saddr, tunnel->parms.o_key, @@ -365,7 +365,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev) } if (!tdev && tunnel->parms.link) - tdev = __dev_get_by_index(dev_net(dev), tunnel->parms.link); + tdev = __dev_get_by_index(tunnel->net, tunnel->parms.link); if (tdev) { hlen = tdev->hard_header_len + tdev->needed_headroom; @@ -654,7 +654,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev, } } - err = iptunnel_xmit(dev_net(dev), rt, skb, + err = iptunnel_xmit(tunnel->net, rt, skb, fl4.saddr, fl4.daddr, protocol, ip_tunnel_ecn_encap(tos, inner_iph, skb), ttl, df); iptunnel_xmit_stats(err, &dev->stats, dev->tstats); @@ -821,11 +821,10 @@ static void ip_tunnel_dev_free(struct net_device *dev) void ip_tunnel_dellink(struct net_device *dev, struct list_head *head) { - struct net *net = dev_net(dev); struct ip_tunnel *tunnel = netdev_priv(dev); struct ip_tunnel_net *itn; - itn = net_generic(net, tunnel->ip_tnl_net_id); + itn = net_generic(tunnel->net, tunnel->ip_tnl_net_id); if (itn->fb_tunnel_dev != dev) { ip_tunnel_del(netdev_priv(dev)); @@ -855,6 +854,10 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, rtnl_lock(); itn->fb_tunnel_dev = __ip_tunnel_create(net, ops, &parms); + /* FB netdevice is special: we have one, and only one per netns. + * Allowing to move it to another netns is clearly unsafe. + */ + itn->fb_tunnel_dev->features |= NETIF_F_NETNS_LOCAL; rtnl_unlock(); if (IS_ERR(itn->fb_tunnel_dev)) @@ -864,28 +867,39 @@ int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id, } EXPORT_SYMBOL_GPL(ip_tunnel_init_net); -static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head) +static void ip_tunnel_destroy(struct ip_tunnel_net *itn, struct list_head *head, + struct rtnl_link_ops *ops) { + struct net *net = dev_net(itn->fb_tunnel_dev); + struct net_device *dev, *aux; int h; + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops == ops) + unregister_netdevice_queue(dev, head); + for (h = 0; h < IP_TNL_HASH_SIZE; h++) { struct ip_tunnel *t; struct hlist_node *n; struct hlist_head *thead = &itn->tunnels[h]; hlist_for_each_entry_safe(t, n, thead, hash_node) - unregister_netdevice_queue(t->dev, head); + /* If dev is in the same netns, it has already + * been added to the list by the previous loop. + */ + if (!net_eq(dev_net(t->dev), net)) + unregister_netdevice_queue(t->dev, head); } if (itn->fb_tunnel_dev) unregister_netdevice_queue(itn->fb_tunnel_dev, head); } -void ip_tunnel_delete_net(struct ip_tunnel_net *itn) +void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops) { LIST_HEAD(list); rtnl_lock(); - ip_tunnel_destroy(itn, &list); + ip_tunnel_destroy(itn, &list, ops); unregister_netdevice_many(&list); rtnl_unlock(); } @@ -929,23 +943,21 @@ EXPORT_SYMBOL_GPL(ip_tunnel_newlink); int ip_tunnel_changelink(struct net_device *dev, struct nlattr *tb[], struct ip_tunnel_parm *p) { - struct ip_tunnel *t, *nt; - struct net *net = dev_net(dev); + struct ip_tunnel *t; struct ip_tunnel *tunnel = netdev_priv(dev); + struct net *net = tunnel->net; struct ip_tunnel_net *itn = net_generic(net, tunnel->ip_tnl_net_id); if (dev == itn->fb_tunnel_dev) return -EINVAL; - nt = netdev_priv(dev); - t = ip_tunnel_find(itn, p, dev->type); if (t) { if (t->dev != dev) return -EEXIST; } else { - t = nt; + t = tunnel; if (dev->type != ARPHRD_ETHER) { unsigned int nflags = 0; @@ -984,6 +996,7 @@ int ip_tunnel_init(struct net_device *dev) } tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); iph->version = 4; iph->ihl = 5; @@ -994,8 +1007,8 @@ EXPORT_SYMBOL_GPL(ip_tunnel_init); void ip_tunnel_uninit(struct net_device *dev) { - struct net *net = dev_net(dev); struct ip_tunnel *tunnel = netdev_priv(dev); + struct net *net = tunnel->net; struct ip_tunnel_net *itn; itn = net_generic(net, tunnel->ip_tnl_net_id); diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c index 79b263da4168..e805e7b3030e 100644 --- a/net/ipv4/ip_vti.c +++ b/net/ipv4/ip_vti.c @@ -318,7 +318,7 @@ static int __net_init vti_init_net(struct net *net) static void __net_exit vti_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, vti_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &vti_link_ops); } static struct pernet_operations vti_net_ops = { diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c index 51fc2a1dcdd3..87bd2952c733 100644 --- a/net/ipv4/ipip.c +++ b/net/ipv4/ipip.c @@ -286,7 +286,6 @@ static void ipip_tunnel_setup(struct net_device *dev) dev->flags = IFF_NOARP; dev->iflink = 0; dev->addr_len = 4; - dev->features |= NETIF_F_NETNS_LOCAL; dev->features |= NETIF_F_LLTX; dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; @@ -437,7 +436,7 @@ static int __net_init ipip_init_net(struct net *net) static void __net_exit ipip_exit_net(struct net *net) { struct ip_tunnel_net *itn = net_generic(net, ipip_net_id); - ip_tunnel_delete_net(itn); + ip_tunnel_delete_net(itn, &ipip_link_ops); } static struct pernet_operations ipip_net_ops = { -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* [PATCH net-next v2 4/4] ip6tnl: add x-netns support 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel ` (2 preceding siblings ...) 2013-08-13 15:51 ` [PATCH net-next v2 3/4] ipip: add x-netns support Nicolas Dichtel @ 2013-08-13 15:51 ` Nicolas Dichtel 2013-08-15 8:01 ` [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap David Miller 4 siblings, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-08-13 15:51 UTC (permalink / raw) To: netdev Cc: davem, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet, Nicolas Dichtel This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> --- include/net/ip6_tunnel.h | 1 + net/ipv6/ip6_gre.c | 5 +++++ net/ipv6/ip6_tunnel.c | 41 +++++++++++++++++++++++++++++++---------- 3 files changed, 37 insertions(+), 10 deletions(-) diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h index 4da5de10d1d4..2265b0bf97e5 100644 --- a/include/net/ip6_tunnel.h +++ b/include/net/ip6_tunnel.h @@ -36,6 +36,7 @@ struct __ip6_tnl_parm { struct ip6_tnl { struct ip6_tnl __rcu *next; /* next tunnel in list */ struct net_device *dev; /* virtual device associated with tunnel */ + struct net *net; /* netns for packet i/o */ struct __ip6_tnl_parm parms; /* tunnel configuration parameters */ struct flowi fl; /* flowi template for xmit */ struct dst_entry *dst_cache; /* cached dst */ diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index ecd60733e5e2..f2d0a42f8057 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -335,6 +335,7 @@ static struct ip6_tnl *ip6gre_tunnel_locate(struct net *net, dev->rtnl_link_ops = &ip6gre_link_ops; nt->dev = dev; + nt->net = dev_net(dev); ip6gre_tnl_link_config(nt, 1); if (register_netdevice(dev) < 0) @@ -1255,6 +1256,7 @@ static int ip6gre_tunnel_init(struct net_device *dev) tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr)); @@ -1275,6 +1277,7 @@ static void ip6gre_fb_tunnel_init(struct net_device *dev) struct ip6_tnl *tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); tunnel->hlen = sizeof(struct ipv6hdr) + 4; @@ -1450,6 +1453,7 @@ static int ip6gre_tap_init(struct net_device *dev) tunnel = netdev_priv(dev); tunnel->dev = dev; + tunnel->net = dev_net(dev); strcpy(tunnel->parms.name, dev->name); ip6gre_tnl_link_config(tunnel, 1); @@ -1501,6 +1505,7 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev, eth_hw_addr_random(dev); nt->dev = dev; + nt->net = dev_net(dev); ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]); /* Can use a lockless transmit, unless we generate output sequences */ diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 1e55866cead7..cc3bb201b8b0 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -315,6 +315,7 @@ static struct ip6_tnl *ip6_tnl_create(struct net *net, struct __ip6_tnl_parm *p) t = netdev_priv(dev); t->parms = *p; + t->net = dev_net(dev); err = ip6_tnl_create2(dev); if (err < 0) goto failed_free; @@ -374,7 +375,7 @@ static void ip6_tnl_dev_uninit(struct net_device *dev) { struct ip6_tnl *t = netdev_priv(dev); - struct net *net = dev_net(dev); + struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); if (dev == ip6n->fb_tnl_dev) @@ -741,7 +742,7 @@ int ip6_tnl_rcv_ctl(struct ip6_tnl *t, { struct __ip6_tnl_parm *p = &t->parms; int ret = 0; - struct net *net = dev_net(t->dev); + struct net *net = t->net; if ((p->flags & IP6_TNL_F_CAP_RCV) || ((p->flags & IP6_TNL_F_CAP_PER_PACKET) && @@ -827,6 +828,9 @@ static int ip6_tnl_rcv(struct sk_buff *skb, __u16 protocol, tstats->rx_packets++; tstats->rx_bytes += skb->len; + if (!net_eq(t->net, dev_net(t->dev))) + skb_scrub_packet(skb); + netif_rx(skb); rcu_read_unlock(); @@ -895,7 +899,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t) { struct __ip6_tnl_parm *p = &t->parms; int ret = 0; - struct net *net = dev_net(t->dev); + struct net *net = t->net; if (p->flags & IP6_TNL_F_CAP_XMIT) { struct net_device *ldev = NULL; @@ -945,8 +949,8 @@ static int ip6_tnl_xmit2(struct sk_buff *skb, int encap_limit, __u32 *pmtu) { - struct net *net = dev_net(dev); struct ip6_tnl *t = netdev_priv(dev); + struct net *net = t->net; struct net_device_stats *stats = &t->dev->stats; struct ipv6hdr *ipv6h = ipv6_hdr(skb); struct ipv6_tel_txoption opt; @@ -996,6 +1000,9 @@ static int ip6_tnl_xmit2(struct sk_buff *skb, goto tx_err_dst_release; } + if (!net_eq(t->net, dev_net(dev))) + skb_scrub_packet(skb); + /* * Okay, now see if we can stuff it in the buffer as-is. */ @@ -1202,7 +1209,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t) int strict = (ipv6_addr_type(&p->raddr) & (IPV6_ADDR_MULTICAST|IPV6_ADDR_LINKLOCAL)); - struct rt6_info *rt = rt6_lookup(dev_net(dev), + struct rt6_info *rt = rt6_lookup(t->net, &p->raddr, &p->laddr, p->link, strict); @@ -1251,7 +1258,7 @@ ip6_tnl_change(struct ip6_tnl *t, const struct __ip6_tnl_parm *p) static int ip6_tnl_update(struct ip6_tnl *t, struct __ip6_tnl_parm *p) { - struct net *net = dev_net(t->dev); + struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); int err; @@ -1463,7 +1470,6 @@ static void ip6_tnl_dev_setup(struct net_device *dev) dev->mtu-=8; dev->flags |= IFF_NOARP; dev->addr_len = sizeof(struct in6_addr); - dev->features |= NETIF_F_NETNS_LOCAL; dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } @@ -1479,6 +1485,7 @@ ip6_tnl_dev_init_gen(struct net_device *dev) struct ip6_tnl *t = netdev_priv(dev); t->dev = dev; + t->net = dev_net(dev); dev->tstats = alloc_percpu(struct pcpu_tstats); if (!dev->tstats) return -ENOMEM; @@ -1596,9 +1603,9 @@ static int ip6_tnl_newlink(struct net *src_net, struct net_device *dev, static int ip6_tnl_changelink(struct net_device *dev, struct nlattr *tb[], struct nlattr *data[]) { - struct ip6_tnl *t; + struct ip6_tnl *t = netdev_priv(dev); struct __ip6_tnl_parm p; - struct net *net = dev_net(dev); + struct net *net = t->net; struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); if (dev == ip6n->fb_tnl_dev) @@ -1699,14 +1706,24 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly = { static void __net_exit ip6_tnl_destroy_tunnels(struct ip6_tnl_net *ip6n) { + struct net *net = dev_net(ip6n->fb_tnl_dev); + struct net_device *dev, *aux; int h; struct ip6_tnl *t; LIST_HEAD(list); + for_each_netdev_safe(net, dev, aux) + if (dev->rtnl_link_ops == &ip6_link_ops) + unregister_netdevice_queue(dev, &list); + for (h = 0; h < HASH_SIZE; h++) { t = rtnl_dereference(ip6n->tnls_r_l[h]); while (t != NULL) { - unregister_netdevice_queue(t->dev, &list); + /* If dev is in the same netns, it has already + * been added to the list by the previous loop. + */ + if (!net_eq(dev_net(t->dev), net)) + unregister_netdevice_queue(t->dev, &list); t = rtnl_dereference(t->next); } } @@ -1732,6 +1749,10 @@ static int __net_init ip6_tnl_init_net(struct net *net) if (!ip6n->fb_tnl_dev) goto err_alloc_dev; dev_net_set(ip6n->fb_tnl_dev, net); + /* FB netdevice is special: we have one, and only one per netns. + * Allowing to move it to another netns is clearly unsafe. + */ + ip6n->fb_tnl_dev->features |= NETIF_F_NETNS_LOCAL; err = ip6_fb_tnl_dev_init(ip6n->fb_tnl_dev); if (err < 0) -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel ` (3 preceding siblings ...) 2013-08-13 15:51 ` [PATCH net-next v2 4/4] ip6tnl: " Nicolas Dichtel @ 2013-08-15 8:01 ` David Miller 4 siblings, 0 replies; 52+ messages in thread From: David Miller @ 2013-08-15 8:01 UTC (permalink / raw) To: nicolas.dichtel Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, eric.dumazet From: Nicolas Dichtel <nicolas.dichtel@6wind.com> Date: Tue, 13 Aug 2013 17:51:08 +0200 > > This serie is a follow up of the previous serie witch adds this functionality > for sit tunnels. > > The goal is to add x-netns support for the module ipip and ip6_tunnel, ie the > encapsulation addresses and the network device are not owned by the same > namespace. > > Note that the two first patches are cleanup. Looks good, series applied, thanks. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v2 net-next 2/2] sit: add support of x-netns 2013-06-25 23:56 ` David Miller 2013-06-26 1:35 ` Eric W. Biederman @ 2013-06-26 13:49 ` Nicolas Dichtel 1 sibling, 0 replies; 52+ messages in thread From: Nicolas Dichtel @ 2013-06-26 13:49 UTC (permalink / raw) To: David Miller Cc: netdev, ebiederm, bcrl, ravi.mlists, bhutchings, Eric Dumazet Le 26/06/2013 01:56, David Miller a écrit : > From: Nicolas Dichtel <nicolas.dichtel@6wind.com> > Date: Tue, 25 Jun 2013 16:24:55 +0200 > >> @@ -453,6 +454,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb, >> tstats->rx_bytes += skb->len; >> u64_stats_update_end(&tstats->syncp); >> >> + skb_scrub_packet(skb); >> + >> if (tunnel->dev->type == ARPHRD_ETHER) { >> skb->protocol = eth_type_trans(skb, tunnel->dev); >> skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); > > I can't see how this can be ok. > > If something in netfilter depends upon the state you are clearing out > here, someone's packet filtering setup is going to break. Just for the record, note that nf_reset() is already called in iptunnel_pull_header() and iptunnel_xmit(). Hence 4in4 (ipip and sit) and gre tunnels are already reseting netfilter state. 6in4 (sit) do it only in xmit path and Xin6 (ip6_tunnel) never. We can also notice that nf_reset() was added by the commit 3d7b46cd20e3 "ip_tunnel: push generic protocol handling to ip_tunnel module." (net-next only) in rx path. The nf_reset() of xmit path of 4in4 (ipip) is here for years (at least 2.6.12). For gre, it has been added by c54419321455 "GRE: Refactor GRE tunneling code." (v3.10-rc1). It seems that the code is different depending of the type of the tunnel. If we omit skb_orphan() (and maybe another one?, to be done only when changing namespace), it can be good to have a common function to have the same behavior for each tunnel. Maybe something like: void skb_scrub_packet(bool netnschange) { if (netnschange) skb_orphan(skb); skb->tstamp.tv64 = 0; skb->pkt_type = PACKET_HOST; skb->skb_iif = 0; skb_dst_drop(skb); skb->mark = 0; secpath_reset(skb); nf_reset(skb); nf_reset_trace(skb); } What's your opinion? ^ permalink raw reply [flat|nested] 52+ messages in thread
end of thread, other threads:[~2013-08-15 8:01 UTC | newest] Thread overview: 52+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-10-23 17:49 switching network namespace midway rsa 2012-10-24 21:11 ` Eric W. Biederman 2012-10-24 21:21 ` Benjamin LaHaise 2012-10-25 1:37 ` Eric W. Biederman 2012-10-25 14:38 ` Benjamin LaHaise 2012-10-25 16:21 ` Stephen Hemminger 2012-10-28 5:43 ` Eric W. Biederman 2012-10-29 14:23 ` Stephen Hemminger 2012-10-30 0:21 ` Eric W. Biederman 2012-10-30 8:55 ` James Chapman 2012-10-25 15:12 ` rsa 2012-10-25 15:29 ` rsa 2012-10-25 15:59 ` Benjamin LaHaise 2012-10-25 16:15 ` Eric W. Biederman 2012-11-02 2:25 ` Benjamin LaHaise 2012-11-02 6:18 ` Eric W. Biederman 2012-11-02 14:03 ` Benjamin LaHaise 2012-11-02 20:45 ` Eric W. Biederman 2013-06-24 14:13 ` [RFC PATCH net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-24 14:13 ` [RFC PATCH net-next 1/2] dev: introduce dev_cleanup_skb() Nicolas Dichtel 2013-06-24 18:13 ` Ben Hutchings 2013-06-24 19:05 ` Eric W. Biederman 2013-06-24 14:13 ` [RFC PATCH net-next 2/2] sit: add support of x-netns Nicolas Dichtel 2013-06-24 19:28 ` Eric W. Biederman 2013-06-24 21:11 ` Nicolas Dichtel 2013-06-24 22:42 ` Eric W. Biederman 2013-06-25 14:10 ` Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel 2013-06-25 14:24 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 2013-06-25 23:56 ` David Miller 2013-06-26 1:35 ` Eric W. Biederman 2013-06-26 5:48 ` David Miller 2013-06-26 10:03 ` Eric W. Biederman 2013-06-26 10:22 ` Eric Dumazet 2013-06-26 12:15 ` Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 1/2] dev: introduce skb_scrub_packet() Nicolas Dichtel 2013-06-26 14:11 ` [PATCH v3 net-next 2/2] sit: add support of x-netns Nicolas Dichtel 2013-06-28 5:36 ` [PATCH v3 net-next 0/2] sit: allow to switch netns during encap/decap David Miller 2013-07-03 15:00 ` [PATCH net-next 0/3] ipip/ip6tnl: " Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 1/3] sit: fix tunnel update via netlink Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 2/3] ipip: add x-netns support Nicolas Dichtel 2013-07-03 15:00 ` [PATCH net-next 3/3] ip6tnl: " Nicolas Dichtel 2013-07-04 21:56 ` [PATCH net-next 0/3] ipip/ip6tnl: allow to switch netns during encap/decap David Miller 2013-08-13 15:51 ` [PATCH net-next v2 0/4] " Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 1/4] dev: move skb_scrub_packet() after eth_type_trans() Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 2/4] ipv4 tunnels: use net_eq() helper to check netns Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 3/4] ipip: add x-netns support Nicolas Dichtel 2013-08-13 15:51 ` [PATCH net-next v2 4/4] ip6tnl: " Nicolas Dichtel 2013-08-15 8:01 ` [PATCH net-next v2 0/4] ipip/ip6tnl: allow to switch netns during encap/decap David Miller 2013-06-26 13:49 ` [PATCH v2 net-next 2/2] sit: add support of x-netns Nicolas Dichtel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).