From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nicolas Dichtel Subject: Re: [RFC PATCH net-next 2/2] sit: add support of x-netns Date: Tue, 25 Jun 2013 16:10:43 +0200 Message-ID: <51C9A4E3.2060906@6wind.com> References: <87y5ijd98e.fsf@xmission.com> <1372083239-9451-1-git-send-email-nicolas.dichtel@6wind.com> <1372083239-9451-3-git-send-email-nicolas.dichtel@6wind.com> <878v1zjokr.fsf@xmission.com> <51C8B5F6.7020603@6wind.com> <874ncni114.fsf@xmission.com> Reply-To: nicolas.dichtel@6wind.com Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: netdev@vger.kernel.org, davem@davemloft.net, bcrl@kvack.org, ravi.mlists@gmail.com To: "Eric W. Biederman" Return-path: Received: from mail-we0-f174.google.com ([74.125.82.174]:42117 "EHLO mail-we0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751102Ab3FYOKr (ORCPT ); Tue, 25 Jun 2013 10:10:47 -0400 Received: by mail-we0-f174.google.com with SMTP id q58so9257374wes.19 for ; Tue, 25 Jun 2013 07:10:46 -0700 (PDT) In-Reply-To: <874ncni114.fsf@xmission.com> Sender: netdev-owner@vger.kernel.org List-ID: Le 25/06/2013 00:42, Eric W. Biederman a =C3=A9crit : > Nicolas Dichtel writes: > >> Le 24/06/2013 21:28, Eric W. Biederman a =C3=A9crit : >>> Nicolas Dichtel writes: >>> >>>> This patch allows to switch the netns when packet is encapsulated = or >>>> decapsulated. In other word, the encapsulated packet is received i= n a netns, >>>> where the lookup is done to find the tunnel. Once the tunnel is fo= und, the >>>> packet is decapsulated and injecting into the corresponding interf= ace which >>>> stands to another netns. >>>> >>>> When one of the two netns is removed, the tunnel is destroyed. >>> >>> I don't see any fundamental problems with this code. There are bug= s >>> with the cleanup noted below. >>> >>> The primary sit interface is marked as NETNS_LOCAL which is good. = A >>> comment might be nice explaining the reasonsing but for code >>> archeologists. >> Ok. >> >>> >>> Conditionally calling dev_cleanup_skb bugs me. The extra condition= al >>> looks like a maintenance hazard. Unless I have missed some subtle >>> detail either we don't need the cleanup at all or actually it is a = bug >>> that we aren't scrubbing our packets as they progress through tunne= ls >>> even in the same network namespace. >>> >>> Can we just make that function the skb scrubbing needed for packets= to >>> traverse a tunnel? >>> >>> My concern going into this was that we would get code that would br= eak >>> because it would not be tested enough. If we can remove the condit= ional >>> from dev_cleanup_skb we won't have any code that is conditionally r= un >>> and the logic looks simple enough not to bitrot in routine maintena= nce. >> My idea was to have the same level of cleanup/scrubbing that when a = packet is >> sent from a netns to another netns through a veth. I cannot use >> dev_forward_skb() because this function expects to have an ethernet = header, it's >> why I split it in the patch #1. >> >> If we leave all information attached to the skb, we may have, for ex= ample, an >> skb with a conntrack from netns1 and a netdevice from netns2. It see= ms not safe, >> but maybe I'm wrong. And in fact, the conntrack will not be created = in the >> second netns (nf_conntrack_in() =3D> skb->nfct is not null and not a= template =3D> >> stats ignore++). >> Another example is a socket from a netns and the netdevice or conntr= ack from >> another netns. > > All of that I agree with. > > I just don't see any need to make that scrubbing/cleaning of the pack= et > conditional. > > Semantically going through a tunnel is the same as crossing between > network namespaces. So you can change > >>>> + if (tunnel->net !=3D dev_net(tunnel->dev)) >>>> + dev_cleanup_skb(skb); > > to just: > > dev_cleanup_skb(skb); > >> I was thinking that when a packet enter a namespace, it must not be = associated >> to any object from the previous namespace, it should be like if we j= ust receive >> it on the host. > > Overall agree. Tunnels have the same properties. > > Which leads me to conclude either we are missing something or the > current tunnel code is mildly buggy because it does not do this level= of > scrubbing. I'm afraid to break an existing scenario, but you're probably right. Le= t's=20 remove this test. Nicolas > > Eric > >>>> -static void __net_exit sit_destroy_tunnels(struct sit_net *sitn, = struct list_head *head) >>>> +static void __net_exit sit_destroy_tunnels(struct net *net, >>>> + struct list_head *head) >>>> { >>>> - int prio; >>>> + struct net_device *dev, *aux; >>>> >>>> - for (prio =3D 1; prio < 4; prio++) { >>>> - int h; >>>> - for (h =3D 0; h < HASH_SIZE; h++) { >>>> - struct ip_tunnel *t; >>>> - >>>> - t =3D rtnl_dereference(sitn->tunnels[prio][h]); >>>> - while (t !=3D NULL) { >>>> - unregister_netdevice_queue(t->dev, head); >>>> - t =3D rtnl_dereference(t->next); >>>> - } >>>> - } >>>> - } >>>> + for_each_netdev_safe(net, dev, aux) >>>> + if (dev->rtnl_link_ops && >>>> + !strcmp(dev->rtnl_link_ops->kind, "sit")) >>>> + unregister_netdevice_queue(dev, head); >>> >>> This entire idiom change is a bit ugly, and it is wrong. >>> >>> You need to look for two classes of tunnels to take down. Tunnels = that >>> originate in net and tunnels whose netdevice is in net. >>> >>> For tunnles that reside in net you should be able to just compare t= he >>> rtnl_link_ops pointer, rather than an ascii name. >>> >>> Tunnels that originate in this network namespace most definitely ne= ed to >>> be taken down as among other things you wisely do not keep a refere= nce >>> count on the originating network namespace. >> Yes sure. My beta version was doing the right things, but I change t= his code >> before sending the patch :/ > > Bahahaha! The dangers of the last minute cleanup. > > Eric