From mboxrd@z Thu Jan 1 00:00:00 1970
From: roopa
Subject: Re: [PATCH net] mpls: modify RTA_NEWDST netlink attribute to include family
Date: Fri, 15 May 2015 11:18:02 -0700
Message-ID: <5556385A.1070709@cumulusnetworks.com>
References: <1431664722-59539-1-git-send-email-roopa@cumulusnetworks.com> <877fsa2vs1.fsf@x220.int.ebiederm.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: davem@davemloft.net, rshearma@brocade.com, netdev@vger.kernel.org, vivek@cumulusnetworks.com
To: "Eric W. Biederman"
Return-path:
Received: from mail-pa0-f49.google.com ([209.85.220.49]:32932 "EHLO mail-pa0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754957AbbEOSSE (ORCPT ); Fri, 15 May 2015 14:18:04 -0400
Received: by padbw4 with SMTP id bw4so21426304pad.0 for ; Fri, 15 May 2015 11:18:03 -0700 (PDT)
In-Reply-To: <877fsa2vs1.fsf@x220.int.ebiederm.org>
Sender: netdev-owner@vger.kernel.org
List-ID:

On 5/14/15, 11:35 PM, Eric W. Biederman wrote:
> roopa@cumulusnetworks.com writes:
>
>> From: Roopa Prabhu
>>
>> RTA_NEWDST netlink attribute today is used to carry mpls
>> labels. This patch encodes family in RTA_NEWDST.
>>
>> RTA_NEWDST by its name and its use in iproute2 can be
>> used as a generic new dst. But it is currently used only for
>> mpls labels ie with family AF_MPLS. Encoding family in the
>> attribute will help its reuse in the future.
>>
>> One usecase where family with RTA_NEWDST becomes necessary
>> is when we implement mpls label edge router function.
> I don't think this makes any sense.
>
> How do you change the destination address on a packet to a value in
> another protocol? None of IPv4, IPv6, and MPLS support that.
>
> Aka this attribute represents DNAT.
thanks for that clarification (some details on what i was trying to do
is at the end of this email).
>
>
>> This is a uapi change but RTA_NEWDST has not made
>> into any release yet.
>> so, trying to rush this change into
>> 4.1 if acceptable.
>>
>> (iproute2 patch will follow)
>>
>> Signed-off-by: Roopa Prabhu
>> ---
>> eric, if you had already thought about other ways to represent
>> labels for LER function, pls let me know. I am looking for suggestions.
> I have to some extent, nothing I am completely pleased with yet but
> enough that I can narrow things down to some extent.
>
> I believe you are referring to the case where we have an ipv4 packet
> or an ipv6 packet and we are inserting it into an mpls tunnel for the
> next step of its travel. Egress from mpls appears to already be
> covered.
yes, correct.
>
> The bounding set of challenges looks something like this:
> - We might be placing a full routing table into mpls with
>   a different mpls tunnel for each different route.
>   A full routing table today runs about 1 million routes
>   so we need to support inserting into the ballpark of 1 million
>   different mpls tunnels.
>   As it happens 1 million is also 2^20 or the number of mpls labels.
>
> At 1 million tunnels that rules out using network devices.
>
> Network devices have two basic things that cause scalability problems.
> - struct netdevice and all of sysfs and sysctl overheads fixable
>   but they run at about 32K today.
> - The accounting of ingress and egress packets.
>   It takes a lot of percpu counters to make accounting fast
>   so I think fundamentally we want something without counters.
agreed. And we have the same conclusions. device is not an option.
> Which led me to look at the kernel xfrm subsystem. xfrm is a close
> match in requirements. But having to do a second inefficient lookup and
> lookup on more than what we normally used to route a packet seems
> wrong. Not hooking into the routing tables seems wrong. The xfrm data
> structures themselves seem heavy weight for simple low cost
> encapsulation.
I have not looked at the xfrm infrastructure in detail. will do so.
>
>
> So I think we need to build yet another infrastructure for dealing with
> light weight tunnels (not just mpls).
ok, I was looking for a word to describe tunnels like mpls..., 'light
weight tunnels' sounds good.
>
> What I would propose would be a new infrastructure for dealing with
> simple stateless tunnels. (AKA tunneling over IP or UDP or MPLS is fine
> but tunneling over TCP or otherwise needing smarts to insert a packet
> into a tunnel is a no-go).
>
> To support entering these tunnels and egressing from these tunnels we
> need a number that would represent the tunnel type that is linux
> specific. This tunnel type would be a superset of the ipv4/ipv6
> protocol number that is stored in /etc/protocols and
> http://www.iana.org/assignments/protocol-numbers As well as being a
> superset of the pseudo wire types
> http://www.iana.org/assignments/pwe3-parameters
> There are mpls tunnels that are not pseudo wires and there are
> tunnels over ip that are encoded in udp or something else as well.
>
> I believe I would represent this in rtnetlink with a new attribute
> RTA_ENCAP. The current idea in my mind is that RTA_ENCAP would include
> the encapsulation type, a set of fixed headers and possibly some nested
> attributes (like output device), probably RTA_ENCAP and possibly
> RTA_DST.
ok..
>
> At an implementation level I would hook these to the ipv4 and ipv6
> routing tables at the same place as the destination network device,
> possibly sharing storage with where we put the destination network
> device today.
>
> We should be able to use dst->output to do all of the work and thus be
> able to use many if not all of the same hooks as the fast path of xfrm.
>
> We definitely need an encapsulation method because we need to deal with
> things like the ttl, mtu and fragmentation and so we need to propagate
> bits algorithmically between the different layers.
>
> There is also the complication that ip over mpls natively vs ip over an
> mpls pseudo wire while in practice have the same encoding of the mpls
> labels they appear to propagate the ttl differently. In one case the ttl
> from the inner packet propagates to the outer packet during
> encapsulation and propagates to the inner packet when decapsulating,
> and in the other case the mpls tunnel is treated as a single hop
> by the ip layer.
>
>
> So I think the right solution is to do the leg work and come up with
> an RTA_ENCAP netlink option, and the associated
>
>
> The cheap hack version of this is to use RTA_FLOW and encode a 32bit
> number in the routing table and use a magic device to look up that 32bit
> number in the mpls routing table (or possibly an mpls flow table)
> and use that to generate the mpls labels.
>
> I don't think we want to add the cheap hack. I think we want a good
> version that can work for all simple well defined tunnel types like
> mpls, gre, ipip, vxlan?, etc.
>
>
> I think we also will want a small layer of indirection in the
> implementation of RTA_ENCAP such that we can define a simple
> encapsulation separately from defining the route. For IPv4 with in some
> cases 8 different prefixes for a single destination address, in the
> general case, and internal to a company's network I suspect the
> aggregation level can be much higher.
>
> What such an encapsulation would be is that we would have a tunnel
> table with simple integer index, and RTA_ENCAP would just hold
> that index to that tunnel. The routing table would hold a reference
> counted pointer to the tunnel (so no extra lookups required in the fast
> path), and some other bits of netlink would create and destroy the
> light-weight encapsulations.
ok, thanks for all the thoughts on this. I was not thinking separate
tunnel table.
>
> Anyway that is my brainstorm on how things should look, and I really
> don't think extending RTA_NEWDST makes much if any sense at all.
> RTA_NEWDST is just DNAT.

Let me tell you where I was going with RTA_NEWDST:
I was completely on board with all your hints on a separate generic
encapsulation layer for such "light weight tunnels" in your previous
emails on this. The part that wasn't clear was a separate tunnel table.

As far as I could see, mpls today was the only such light weight tunnel.
And, to me RTA_NEWDST was to some extent the RTA_ENCAP you were talking
about. Clearly i seem to have ignored all the other encapsulation
parameters that may need to go into it :). But, i guess in my mind i was
thinking those would be additional attributes. But agree, a new nested
attribute could be a better option.

For IPv4, for example, to me this looked something like adding the below.

    ip route add 10.1.1.0/30 as mpls 200 via inet 10.1.1.2 dev swp1

the 'mpls 200' goes into RTA_NEWDST. And from ipv4 code you look at the
encap family and pass it on to the respective output func (i was looking
at a possible abstraction layer here...maybe something like xfrm
covering different tunnel types like you mention above).

In the hacked up version of my patch (which i was not going to post if
it looked like a hack anyways), i essentially set the dst->output to
mpls_output.

I will see if I can come up with something along the lines of the
RTA_ENCAP you describe above.

Thanks for the details eric! appreciate it.
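
To make the tunnel-table idea above concrete, here is a rough userspace
sketch in C. It is only an illustration under stated assumptions, not
code from any patch: every name in it (struct lwtunnel, tunnel_create,
route_set_encap, route_encap_lse, the ttl_mode values, the table size)
is hypothetical, the reference count is a plain integer rather than
anything safe for concurrent use, and the fixed outer TTL of 255 for the
single-hop case is an arbitrary choice. The only part taken from a
specification is the MPLS label stack entry layout (20-bit label, 3-bit
traffic class, 1-bit bottom-of-stack, 8-bit TTL), and the value is built
in host byte order, so it would still need converting before going on
the wire.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* All names below are hypothetical; none of this is existing kernel API. */

enum ttl_mode {
    TTL_PROPAGATE,  /* copy the inner IP TTL into the MPLS header */
    TTL_SINGLE_HOP  /* tunnel looks like one hop; use a fixed outer TTL */
};

struct lwtunnel {
    uint32_t index;      /* key in the tunnel table */
    uint32_t label;      /* 20-bit MPLS label to push */
    enum ttl_mode mode;
    unsigned refcnt;     /* routes currently pointing at this tunnel */
};

struct route {
    struct lwtunnel *tun; /* resolved once when the route is added */
};

#define TUNNEL_TABLE_SIZE 16
static struct lwtunnel *tunnel_table[TUNNEL_TABLE_SIZE];

/* Create a tunnel at the given index; NULL if the slot is taken or invalid. */
struct lwtunnel *tunnel_create(uint32_t index, uint32_t label, enum ttl_mode mode)
{
    if (index >= TUNNEL_TABLE_SIZE || tunnel_table[index])
        return NULL;
    assert(label < (1u << 20)); /* MPLS labels are 20 bits wide */

    struct lwtunnel *t = calloc(1, sizeof(*t));
    if (!t)
        return NULL;
    t->index = index;
    t->label = label;
    t->mode = mode;
    tunnel_table[index] = t;
    return t;
}

/* Route add: look the index up once and take a reference. */
int route_set_encap(struct route *rt, uint32_t index)
{
    if (index >= TUNNEL_TABLE_SIZE || !tunnel_table[index])
        return -1;
    rt->tun = tunnel_table[index];
    rt->tun->refcnt++;
    return 0;
}

/* Route delete: drop the reference taken in route_set_encap(). */
void route_clear_encap(struct route *rt)
{
    if (rt->tun) {
        rt->tun->refcnt--;
        rt->tun = NULL;
    }
}

/* Destroy a tunnel only when no route still references it. */
int tunnel_destroy(uint32_t index)
{
    if (index >= TUNNEL_TABLE_SIZE || !tunnel_table[index])
        return -1;
    if (tunnel_table[index]->refcnt)
        return -1;
    free(tunnel_table[index]);
    tunnel_table[index] = NULL;
    return 0;
}

/*
 * Fast path: build one label stack entry from the cached pointer, with
 * no table lookup. Layout is label(20) | TC(3) | S(1) | TTL(8), here
 * with TC = 0 and S = 1 (single label), in host byte order.
 */
uint32_t route_encap_lse(const struct route *rt, uint8_t inner_ttl)
{
    const struct lwtunnel *t = rt->tun;
    uint8_t ttl = (t->mode == TTL_PROPAGATE) ? inner_ttl : 255;

    return (t->label << 12) | (1u << 8) | ttl;
}
```

With this shape, the route holds only the pointer, so several prefixes
can share one tunnel entry, and a tunnel cannot be destroyed while a
route still refers to it. The ttl_mode field is one possible place to
record the difference between native ip over mpls and the pseudo wire
case mentioned above.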