From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nicolas Dichtel Subject: Re: [RFC PATCH net-next 2/2] sit: add support of x-netns Date: Mon, 24 Jun 2013 23:11:18 +0200 Message-ID: <51C8B5F6.7020603@6wind.com> References: <87y5ijd98e.fsf@xmission.com> <1372083239-9451-1-git-send-email-nicolas.dichtel@6wind.com> <1372083239-9451-3-git-send-email-nicolas.dichtel@6wind.com> <878v1zjokr.fsf@xmission.com> Reply-To: nicolas.dichtel@6wind.com Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: netdev@vger.kernel.org, davem@davemloft.net, bcrl@kvack.org, ravi.mlists@gmail.com To: "Eric W. Biederman" Return-path: Received: from mail-we0-f178.google.com ([74.125.82.178]:63483 "EHLO mail-we0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750797Ab3FXVLV (ORCPT ); Mon, 24 Jun 2013 17:11:21 -0400 Received: by mail-we0-f178.google.com with SMTP id u53so8726573wes.9 for ; Mon, 24 Jun 2013 14:11:20 -0700 (PDT) In-Reply-To: <878v1zjokr.fsf@xmission.com> Sender: netdev-owner@vger.kernel.org List-ID: Le 24/06/2013 21:28, Eric W. Biederman a =C3=A9crit : > Nicolas Dichtel writes: > >> This patch allows to switch the netns when packet is encapsulated or >> decapsulated. In other word, the encapsulated packet is received in = a netns, >> where the lookup is done to find the tunnel. Once the tunnel is foun= d, the >> packet is decapsulated and injecting into the corresponding interfac= e which >> stands to another netns. >> >> When one of the two netns is removed, the tunnel is destroyed. > > I don't see any fundamental problems with this code. There are bugs > with the cleanup noted below. > > The primary sit interface is marked as NETNS_LOCAL which is good. A > comment might be nice explaining the reasonsing but for code > archeologists. Ok. > > Conditionally calling dev_cleanup_skb bugs me. The extra conditional > looks like a maintenance hazard. Unless I have missed some subtle > detail either we don't need the cleanup at all or actually it is a bu= g > that we aren't scrubbing our packets as they progress through tunnels > even in the same network namespace. > > Can we just make that function the skb scrubbing needed for packets t= o > traverse a tunnel? > > My concern going into this was that we would get code that would brea= k > because it would not be tested enough. If we can remove the conditio= nal > from dev_cleanup_skb we won't have any code that is conditionally run > and the logic looks simple enough not to bitrot in routine maintenanc= e. My idea was to have the same level of cleanup/scrubbing that when a pac= ket is=20 sent from a netns to another netns through a veth. I cannot use=20 dev_forward_skb() because this function expects to have an ethernet hea= der, it's=20 why I split it in the patch #1. If we leave all information attached to the skb, we may have, for examp= le, an=20 skb with a conntrack from netns1 and a netdevice from netns2. It seems = not safe,=20 but maybe I'm wrong. And in fact, the conntrack will not be created in = the=20 second netns (nf_conntrack_in() =3D> skb->nfct is not null and not a te= mplate =3D>=20 stats ignore++). Another example is a socket from a netns and the netdevice or conntrack= from=20 another netns. I was thinking that when a packet enter a namespace, it must not be ass= ociated=20 to any object from the previous namespace, it should be like if we just= receive=20 it on the host. Nicolas > > Eric > >> Signed-off-by: Nicolas Dichtel >> --- >> include/net/ip_tunnels.h | 1 + >> net/ipv4/ip_tunnel.c | 7 ++++++- >> net/ipv6/sit.c | 49 ++++++++++++++++++++++++------------= ------------ >> 3 files changed, 32 insertions(+), 25 deletions(-) >> >> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h >> index b0d9824..781b3cf 100644 >> --- a/include/net/ip_tunnels.h >> +++ b/include/net/ip_tunnels.h >> @@ -42,6 +42,7 @@ struct ip_tunnel { >> struct ip_tunnel __rcu *next; >> struct hlist_node hash_node; >> struct net_device *dev; >> + struct net *net; /* netns for packet i/o */ >> >> int err_count; /* Number of arrived ICMP errors */ >> unsigned long err_time; /* Time when the last ICMP error >> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c >> index bd227e5..b6a7704 100644 >> --- a/net/ipv4/ip_tunnel.c >> +++ b/net/ipv4/ip_tunnel.c >> @@ -304,6 +304,7 @@ static struct net_device *__ip_tunnel_create(str= uct net *net, >> >> tunnel =3D netdev_priv(dev); >> tunnel->parms =3D *parms; >> + tunnel->net =3D net; >> >> err =3D register_netdevice(dev); >> if (err) >> @@ -453,6 +454,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, stru= ct sk_buff *skb, >> tstats->rx_bytes +=3D skb->len; >> u64_stats_update_end(&tstats->syncp); >> >> + if (tunnel->net !=3D dev_net(tunnel->dev)) >> + dev_cleanup_skb(skb); >> + >> if (tunnel->dev->type =3D=3D ARPHRD_ETHER) { >> skb->protocol =3D eth_type_trans(skb, tunnel->dev); >> skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN); >> @@ -541,7 +545,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct = net_device *dev, >> tos =3D ipv6_get_dsfield((const struct ipv6hdr *)inner_iph); >> } >> >> - rt =3D ip_route_output_tunnel(dev_net(dev), &fl4, >> + rt =3D ip_route_output_tunnel(tunnel->net, &fl4, >> tunnel->parms.iph.protocol, >> dst, tnl_params->saddr, >> tunnel->parms.o_key, >> @@ -888,6 +892,7 @@ int ip_tunnel_newlink(struct net_device *dev, st= ruct nlattr *tb[], >> if (ip_tunnel_find(itn, p, dev->type)) >> return -EEXIST; >> >> + nt->net =3D net; >> nt->parms =3D *p; >> err =3D register_netdevice(dev); >> if (err) >> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c >> index f639866..4e03904 100644 >> --- a/net/ipv6/sit.c >> +++ b/net/ipv6/sit.c >> @@ -466,14 +466,14 @@ isatap_chksrc(struct sk_buff *skb, const struc= t iphdr *iph, struct ip_tunnel *t) >> >> static void ipip6_tunnel_uninit(struct net_device *dev) >> { >> - struct net *net =3D dev_net(dev); >> - struct sit_net *sitn =3D net_generic(net, sit_net_id); >> + struct ip_tunnel *tunnel =3D netdev_priv(dev); >> + struct sit_net *sitn =3D net_generic(tunnel->net, sit_net_id); >> >> if (dev =3D=3D sitn->fb_tunnel_dev) { >> RCU_INIT_POINTER(sitn->tunnels_wc[0], NULL); >> } else { >> - ipip6_tunnel_unlink(sitn, netdev_priv(dev)); >> - ipip6_tunnel_del_prl(netdev_priv(dev), NULL); >> + ipip6_tunnel_unlink(sitn, tunnel); >> + ipip6_tunnel_del_prl(tunnel, NULL); >> } >> dev_put(dev); >> } >> @@ -621,6 +621,8 @@ static int ipip6_rcv(struct sk_buff *skb) >> tstats->rx_packets++; >> tstats->rx_bytes +=3D skb->len; >> >> + if (tunnel->net !=3D dev_net(tunnel->dev)) >> + dev_cleanup_skb(skb); >> netif_rx(skb); >> >> return 0; >> @@ -803,7 +805,7 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_b= uff *skb, >> goto tx_error; >> } >> >> - rt =3D ip_route_output_ports(dev_net(dev), &fl4, NULL, >> + rt =3D ip_route_output_ports(tunnel->net, &fl4, NULL, >> dst, tiph->saddr, >> 0, 0, >> IPPROTO_IPV6, RT_TOS(tos), >> @@ -858,6 +860,9 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_b= uff *skb, >> tunnel->err_count =3D 0; >> } >> >> + if (tunnel->net !=3D dev_net(dev)) >> + dev_cleanup_skb(skb); >> + >> /* >> * Okay, now see if we can stuff it in the buffer as-is. >> */ >> @@ -944,7 +949,8 @@ static void ipip6_tunnel_bind_dev(struct net_dev= ice *dev) >> iph =3D &tunnel->parms.iph; >> >> if (iph->daddr) { >> - struct rtable *rt =3D ip_route_output_ports(dev_net(dev), &fl4, N= ULL, >> + struct rtable *rt =3D ip_route_output_ports(tunnel->net, &fl4, >> + NULL, >> iph->daddr, iph->saddr, >> 0, 0, >> IPPROTO_IPV6, >> @@ -959,7 +965,7 @@ static void ipip6_tunnel_bind_dev(struct net_dev= ice *dev) >> } >> >> if (!tdev && tunnel->parms.link) >> - tdev =3D __dev_get_by_index(dev_net(dev), tunnel->parms.link); >> + tdev =3D __dev_get_by_index(tunnel->net, tunnel->parms.link); >> >> if (tdev) { >> dev->hard_header_len =3D tdev->hard_header_len + sizeof(struct i= phdr); >> @@ -972,7 +978,7 @@ static void ipip6_tunnel_bind_dev(struct net_dev= ice *dev) >> >> static void ipip6_tunnel_update(struct ip_tunnel *t, struct ip_tun= nel_parm *p) >> { >> - struct net *net =3D dev_net(t->dev); >> + struct net *net =3D t->net; >> struct sit_net *sitn =3D net_generic(net, sit_net_id); >> >> ipip6_tunnel_unlink(sitn, t); >> @@ -1248,7 +1254,6 @@ static void ipip6_tunnel_setup(struct net_devi= ce *dev) >> dev->priv_flags &=3D ~IFF_XMIT_DST_RELEASE; >> dev->iflink =3D 0; >> dev->addr_len =3D 4; >> - dev->features |=3D NETIF_F_NETNS_LOCAL; >> dev->features |=3D NETIF_F_LLTX; >> } >> >> @@ -1257,6 +1262,7 @@ static int ipip6_tunnel_init(struct net_device= *dev) >> struct ip_tunnel *tunnel =3D netdev_priv(dev); >> >> tunnel->dev =3D dev; >> + tunnel->net =3D dev_net(dev); >> >> memcpy(dev->dev_addr, &tunnel->parms.iph.saddr, 4); >> memcpy(dev->broadcast, &tunnel->parms.iph.daddr, 4); >> @@ -1277,6 +1283,7 @@ static int __net_init ipip6_fb_tunnel_init(str= uct net_device *dev) >> struct sit_net *sitn =3D net_generic(net, sit_net_id); >> >> tunnel->dev =3D dev; >> + tunnel->net =3D dev_net(dev); >> strcpy(tunnel->parms.name, dev->name); >> >> iph->version =3D 4; >> @@ -1562,22 +1569,15 @@ static struct xfrm_tunnel ipip_handler __rea= d_mostly =3D { >> .priority =3D 2, >> }; >> >> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, st= ruct list_head *head) >> +static void __net_exit sit_destroy_tunnels(struct net *net, >> + struct list_head *head) >> { >> - int prio; >> + struct net_device *dev, *aux; >> >> - for (prio =3D 1; prio < 4; prio++) { >> - int h; >> - for (h =3D 0; h < HASH_SIZE; h++) { >> - struct ip_tunnel *t; >> - >> - t =3D rtnl_dereference(sitn->tunnels[prio][h]); >> - while (t !=3D NULL) { >> - unregister_netdevice_queue(t->dev, head); >> - t =3D rtnl_dereference(t->next); >> - } >> - } >> - } >> + for_each_netdev_safe(net, dev, aux) >> + if (dev->rtnl_link_ops && >> + !strcmp(dev->rtnl_link_ops->kind, "sit")) >> + unregister_netdevice_queue(dev, head); > > This entire idiom change is a bit ugly, and it is wrong. > > You need to look for two classes of tunnels to take down. Tunnels th= at > originate in net and tunnels whose netdevice is in net. > > For tunnles that reside in net you should be able to just compare the > rtnl_link_ops pointer, rather than an ascii name. > > Tunnels that originate in this network namespace most definitely need= to > be taken down as among other things you wisely do not keep a referenc= e > count on the originating network namespace. Yes sure. My beta version was doing the right things, but I change this= code=20 before sending the patch :/