From mboxrd@z Thu Jan  1 00:00:00 1970
From: Thomas Graf
Subject: Re: [RFC net-next 0/3] Proposal for VRF-lite
Date: Tue, 9 Jun 2015 12:15:50 +0200
Message-ID: <20150609101550.GA10411@pox.localdomain>
To: Shrijeet Mukherjee
Cc: hannes@stressinduktion.org, nicolas.dichtel@6wind.com,
	dsahern@gmail.com, ebiederm@xmission.com, hadi@mojatatu.com,
	davem@davemloft.net, stephen@networkplumber.org,
	netdev@vger.kernel.org, roopa@cumulusnetworks.com,
	gospo@cumulusnetworks.com, jtoppins@cumulusnetworks.com,
	nikolay@cumulusnetworks.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On 06/08/15 at 11:35am, Shrijeet Mukherjee wrote:
[...]
> model with some performance paths that need optimization. (Specifically
> the output route selector that Roopa, Robert, Thomas and EricB are
> currently discussing on the MPLS thread)

Thanks for posting these patches just in time. This explains how you
intend to deploy Roopa's patches in a scalable manner.

> High Level points
>
> 1. Simple overlay driver (minimal changes to current stack)
>    * uses the existing fib tables and fib rules infrastructure
> 2. Modelled closely after the ipvlan driver
> 3. Uses current API and infrastructure.
>    * Applications can use SO_BINDTODEVICE or cmsg device identifiers
>      to pick VRF (ping, traceroute just work)

I like the aspect of reusing existing user interfaces. We might need to
introduce a more fine-grained capability than CAP_NET_RAW to give
containers the privilege to bind to a VRF without allowing them to
inject raw frames.
If I understand this correctly: if my intent was to run a process in
multiple VRFs, then I would need to run that process in the host
network namespace, which contains the VRF devices as well as the
physical devices. While I might want to grant my process the ability
to bind to VRFs, I may not want to give it the privileges to bind to
any device. So we could consider introducing CAP_NET_VRF, which would
allow binding to VRF devices.

> * Standard IP Rules work, and since they are aggregated against the
>   device, scale is manageable
> 4. Completely orthogonal to Namespaces and only provides separation in
>    the routing plane (and ARP)
> 5. Debugging is built-in as tcpdump and counters on the VRF device
>    work as is.
>
>                                              N2
>    N1 (all configs here)              +---------------+
> +--------------+                      |               |
> |swp1 :10.0.1.1+----------------------+swp1 :10.0.1.2 |
> |              |                      |               |
> |swp2 :10.0.2.1+----------------------+swp2 :10.0.2.2 |
> |              |                      +---------------+
> |    VRF 0     |
> |   table 5    |
> |              |
> +--------------+
> |              |
> |    VRF 1     |                             N3
> |   table 6    |                      +---------------+
> |              |                      |               |
> |swp3 :10.0.2.1+----------------------+swp1 :10.0.2.2 |
> |              |                      |               |
> |swp4 :10.0.3.1+----------------------+swp2 :10.0.3.2 |
> +--------------+                      +---------------+

Do I understand this correctly that swp* represent veth pairs? Why do
you have distinct addresses on each peer of the pair? Are the addresses
in N2 and N3 considered private and NATed?

[...]

> # Install the lookup rules that map table to VRF domain
> ip rule add pref 200 oif vrf0 lookup 5
> ip rule add pref 200 iif vrf0 lookup 5
> ip rule add pref 200 oif vrf1 lookup 6
> ip rule add pref 200 iif vrf1 lookup 6

I think this is a good start but we all know the scalability
constraints of this. Depending on the number of L3 domains, an eBPF
classifier utilizing a map to translate origin to routing table and
vice versa might address the scale requirement long term.

[...]
I will comment on the implementation specifics once I have a good
understanding of what your desired end state looks like.