Re: [PATCH net] mpls: modify RTA_NEWDST netlink attribute to include family

From: roopa <roopa@cumulusnetworks.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: davem@davemloft.net, rshearma@brocade.com,
	netdev@vger.kernel.org, vivek@cumulusnetworks.com
Subject: Re: [PATCH net] mpls: modify RTA_NEWDST netlink attribute to include family
Date: Fri, 15 May 2015 11:18:02 -0700
Message-ID: <5556385A.1070709@cumulusnetworks.com>
In-Reply-To: <877fsa2vs1.fsf@x220.int.ebiederm.org>

On 5/14/15, 11:35 PM, Eric W. Biederman wrote:
> roopa@cumulusnetworks.com writes:
>
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> RTA_NEWDST netlink attribute today is used to carry mpls
>> labels. This patch encodes family in RTA_NEWDST.
>>
>> RTA_NEWDST by its name and its use in iproute2 can be
>> used as a generic new dst. But it is currently used only for
>> mpls labels ie with family AF_MPLS. Encoding family in the
>> attribute will help its reuse in the future.
>>
>> One usecase where family with RTA_NEWDST becomes necessary
>> is when we implement mpls label edge router function.
> I don't think this makes any sense.
>
> How do you change the destination address on a packet to a value in
> another protocol?  None of IPv4, IPv6, and MPLS support that.
>
> Aka this attribute represents DNAT.

Thanks for that clarification (some details on what I was trying to do
are at the end of this email).
>
>
>> This is a uapi change but RTA_NEWDST has not made it
>> into any release yet. So, trying to rush this change into
>> 4.1 if acceptable.
>>
>> (iproute2 patch will follow)
>>
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>> eric, if you had already thought about other ways to represent
>> labels for LER function, pls let me know. I am looking for suggestions.
> I have to some extent, nothing I am completely pleased with yet but
> enough that I can narrow things down to some extent.
>
> I believe you are referring to the case where we have an ipv4 packet
> or an ipv6 packet and we are inserting it into an mpls tunnel for the
> next step of its travel.  Egress from mpls appears to already be
> covered.
yes, correct.
>
> The bounding set of challenges looks something like this:
> - We might be placing a full routing table into mpls with
>    a different mpls tunnel for each different route.
>    A full routing table today runs about 1 million routes
>    so we need to support inserting into the ballpark of 1 million
>    different mpls tunnels.
>    As it happens 1 million is also 2^20 or the number of mpls labels.
>
> At 1 million tunnels that rules out using network devices.
>
> Network devices have two basic things that cause scalability problems.
> - struct net_device and all of the sysfs and sysctl overheads. These are
>    fixable, but they run at about 32K today.
> - The accounting of ingress and egress packets.
>    It takes a lot of percpu counters to make accounting fast
>    so I think fundamentally we want something without counters.

agreed. And we have the same conclusions. device is not an option.
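
For reference, the 2^20 figure quoted above follows from the 20-bit label field in an MPLS label stack entry. A minimal sketch of packing one entry, assuming the RFC 3032 field layout; the `lse_*` names are made up for this example and are not kernel API:

```c
#include <stdint.h>

/* One 32-bit MPLS label stack entry, per RFC 3032:
 * label (20 bits) | traffic class (3 bits) | bottom-of-stack (1 bit) | TTL (8 bits)
 */
#define LSE_LABEL_SHIFT 12
#define LSE_TC_SHIFT     9
#define LSE_S_SHIFT      8
#define LSE_LABEL_MAX   ((1u << 20) - 1)  /* largest label value: 2^20 - 1 */

/* Pack the four fields into a host-order 32-bit value.
 * A real sender would convert with htonl() before writing it to the wire. */
static uint32_t lse_encode(uint32_t label, uint8_t tc, int bos, uint8_t ttl)
{
    return ((label & LSE_LABEL_MAX) << LSE_LABEL_SHIFT) |
           ((uint32_t)(tc & 0x7) << LSE_TC_SHIFT) |
           ((uint32_t)(bos ? 1 : 0) << LSE_S_SHIFT) |
           ttl;
}

/* Extract the label from a host-order entry. */
static uint32_t lse_label(uint32_t lse)
{
    return lse >> LSE_LABEL_SHIFT;
}
```

With this layout, label 200 at the bottom of the stack with TTL 64 packs to 0x000C8140.
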
> Which led me to look at the kernel xfrm subsystem.  xfrm is a close
> match in requirements.  But having to do a second inefficient lookup and
> lookup on more than what we normally used to route a packet seems
> wrong. Not hooking into the routing tables seems wrong.  The xfrm data
> structures themselves seem heavy weight for simple low cost
> encapsulation.
I have not looked at the xfrm infrastructure in detail. will do so.
>
>
> So I think we need to build yet another infrastructure for dealing with
> light weight tunnels (not just mpls).

ok, I was looking for a word to describe tunnels like mpls..., 'light 
weight tunnels'
sounds good.
>
> What I would propose would be a new infrastructure for dealing with
> simple stateless tunnels.  (AKA tunneling over IP or UDP or MPLS is fine
> but tunneling over TCP or otherwise needing smarts to insert a packet
> into a tunnel is a no-go).
>
> To support entering these tunnels and egressing from these tunnels we
> need a number that would represent the tunnel type that is linux
> specific.  This tunnel type would be a superset of the ipv4/ipv6
> protocol numbers that are stored in /etc/protocols and
> http://www.iana.org/assignments/protocol-numbers as well as being a
> superset of the pseudo wire types
> http://www.iana.org/assignments/pwe3-parameters
> There are mpls tunnels that are not pseudo wires and there are
> tunnels over ip that are encoded in udp or something else as well.
>
> I believe I would represent this in rtnetlink with a new attribute
> RTA_ENCAP.  The current idea in my mind is that RTA_ENCAP would include
> the encapsulation type, a set of fixed headers and possibly some nested
> attributes (like output device), probably RTA_ENCAP and possibly
> RTA_DST.
ok..
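
To make the shape of such an attribute concrete, here is a rough sketch of a nested attribute carrying an encapsulation type, a fixed header and an output device. It only assumes the usual rtattr-style TLV framing (16-bit length, 16-bit type, 4-byte padding); every `SK_*` and `sk_*` name and every attribute number is hypothetical, and nothing here is an existing uapi definition:

```c
#include <stdint.h>
#include <string.h>

/* Same shape as the kernel's struct rtattr, redefined here so the
 * sketch builds without Linux headers. */
struct sk_attr {
    uint16_t len;   /* header + payload, excluding trailing padding */
    uint16_t type;
};

#define SK_ALIGN(n)  (((n) + 3u) & ~3u)
#define SK_HDRLEN    ((unsigned)sizeof(struct sk_attr))

/* Hypothetical attribute numbers, chosen only for this example. */
enum { SK_RTA_ENCAP = 100, SK_ENCAP_TYPE = 1, SK_ENCAP_HDR = 2, SK_ENCAP_OIF = 3 };
enum { SK_ENCAP_MPLS = 1 };

/* Append one attribute at offset off.
 * Returns the new offset, or 0 if the buffer is too small. */
static size_t sk_put(uint8_t *buf, size_t cap, size_t off,
                     uint16_t type, const void *data, uint16_t dlen)
{
    struct sk_attr a = { (uint16_t)(SK_HDRLEN + dlen), type };
    size_t end = off + SK_ALIGN(a.len);

    if (end > cap)
        return 0;
    memset(buf + off, 0, end - off);       /* zero the padding as well */
    memcpy(buf + off, &a, sizeof(a));
    if (dlen)
        memcpy(buf + off + SK_HDRLEN, data, dlen);
    return end;
}

/* Build SK_RTA_ENCAP { encap type, fixed header bytes, output ifindex }.
 * Returns the total attribute length, or 0 on failure. */
static size_t sk_build_encap(uint8_t *buf, size_t cap, uint16_t encap_type,
                             const void *hdr, uint16_t hdrlen, uint32_t oif)
{
    size_t off = SK_HDRLEN;                /* room for the outer header */
    struct sk_attr outer;

    off = sk_put(buf, cap, off, SK_ENCAP_TYPE, &encap_type,
                 (uint16_t)sizeof(encap_type));
    if (off)
        off = sk_put(buf, cap, off, SK_ENCAP_HDR, hdr, hdrlen);
    if (off)
        off = sk_put(buf, cap, off, SK_ENCAP_OIF, &oif, (uint16_t)sizeof(oif));
    if (!off || off > UINT16_MAX)
        return 0;

    outer.len = (uint16_t)off;             /* outer length covers the nested attrs */
    outer.type = SK_RTA_ENCAP;
    memcpy(buf, &outer, sizeof(outer));
    return off;
}
```

For an mpls encap with a single 4-byte label stack entry, this yields a 28-byte attribute: a 4-byte outer header plus three 8-byte nested attributes.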

>
> At an implementation level I would hook these to the ipv4 and ipv6
> routing tables at the same place as the destination network device,
> possibly sharing storage with where we put the destination network
> device today.
>
> We should be able to use dst->output to do all of the work and thus be
> able to use many if not all of the same hooks as the fast path of xfrm.
>
> We definitely need an encapsulation method because we need to deal with
> things like the ttl, mtu and fragmentation and so we need to propagate
> bits algorithmically between the different layers.
>
> There is also the complication that ip over mpls natively vs ip over an
> mpls pseudo wire, while in practice having the same encoding of the mpls
> labels, appear to propagate the ttl differently.  In one case the ttl
> from the inner packet propagates to the outer packet during
> encapsulation and propagates to the inner packet when decapsulating,
> and in the other case the mpls tunnel is treated as a single hop
> by the ip layer.
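
The two ttl behaviours described above can be sketched as a per-tunnel mode. This is only an illustration: the enum, the function names and the initial outer TTL of 255 are assumptions made for this example, and the per-hop decrement itself is left out.

```c
#include <stdint.h>

enum ttl_mode {
    TTL_PROPAGATE,   /* inner ttl is carried into and back out of the tunnel */
    TTL_SINGLE_HOP,  /* the ip layer sees the whole tunnel as one hop */
};

#define TUNNEL_DEFAULT_TTL 255  /* assumed initial outer ttl in single-hop mode */

/* TTL to write into the outer (mpls) header on encapsulation. */
static uint8_t encap_outer_ttl(enum ttl_mode mode, uint8_t inner_ttl)
{
    return mode == TTL_PROPAGATE ? inner_ttl : TUNNEL_DEFAULT_TTL;
}

/* TTL to leave in the inner (ip) header after decapsulation. */
static uint8_t decap_inner_ttl(enum ttl_mode mode, uint8_t inner_ttl,
                               uint8_t outer_ttl)
{
    return mode == TTL_PROPAGATE ? outer_ttl : inner_ttl;
}
```
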
>
>
> So I think the right solution is to do the leg work and come up with
> an RTA_ENCAP netlink option, and the associated
>
>
> The cheap hack version of this is to use RTA_FLOW and encode a 32bit
> number in the routing table and use a magic device to look up that 32bit
> number in the mpls routing table (or possibly an mpls flow table)
> and use that to generate the mpls labels.
>
> I don't think we want to add the cheap hack.  I think we want a good
> version that can work for all simple well defined tunnel types like
> mpls, gre, ipip, vxlan?, etc.
>
>
> I think we also will want a small layer of indirection in the
> implementation of RTA_ENCAP such that we can define a simple
> encapsulation separately from defining the route.  For IPv4 with in some
> cases 8 different prefixes for a single destination address, in the
> general case, and internal to a company's network I suspect the
> aggregation level can be much higher.
>
> What such an encapsulation would be is that we would have a tunnel
> table with simple integer index, and RTA_ENCAP would just hold
> that index to that tunnel.  The routing table would hold a reference
> counted pointer to the tunnel (so no extra lookups required in the fast
> path), and some other bits of netlink would create and destroy the
> light-weight encapsulations.

ok, thanks for all the thoughts on this. I was not thinking of a separate
tunnel table.
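
A rough sketch of how an index-addressed tunnel table with reference-counted entries could behave. All names are hypothetical, the table is tiny, and there is no locking, so this only shows the lifetime rules: the route resolves the index once when it is added and afterwards follows a pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define TUN_TABLE_SIZE 1024       /* small, for illustration only */

struct lw_tunnel {
    unsigned refcnt;      /* 0 means the slot is free */
    int      listed;      /* still visible to lookups by index */
    uint32_t mpls_label;  /* stand-in for the real encap parameters */
};

struct route {
    struct lw_tunnel *tun;  /* resolved once, when the route is added */
};

static struct lw_tunnel tun_table[TUN_TABLE_SIZE];

static void tun_put(struct lw_tunnel *t)
{
    if (t && t->refcnt > 0)
        t->refcnt--;        /* slot becomes reusable when this hits 0 */
}

/* Create a tunnel at idx; the table itself holds one reference. */
static int tun_create(unsigned idx, uint32_t label)
{
    if (idx >= TUN_TABLE_SIZE || tun_table[idx].refcnt != 0)
        return -1;
    tun_table[idx] = (struct lw_tunnel){ .refcnt = 1, .listed = 1,
                                         .mpls_label = label };
    return 0;
}

/* Unlist the tunnel and drop the table's reference; routes that still
 * point at it keep the entry alive until they let go. */
static int tun_delete(unsigned idx)
{
    if (idx >= TUN_TABLE_SIZE || !tun_table[idx].listed)
        return -1;
    tun_table[idx].listed = 0;
    tun_put(&tun_table[idx]);
    return 0;
}

/* Route add: the only place the index is looked up. */
static int route_set_encap(struct route *rt, unsigned idx)
{
    if (idx >= TUN_TABLE_SIZE || !tun_table[idx].listed)
        return -1;
    tun_table[idx].refcnt++;
    rt->tun = &tun_table[idx];
    return 0;
}

static void route_clear_encap(struct route *rt)
{
    tun_put(rt->tun);
    rt->tun = NULL;
}
```

The fast path would only ever dereference `rt->tun`, so many routes can share one encapsulation without an extra lookup per packet.
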
>
> Anyway that is my brainstorm on how things should look, and I really
> don't think extending RTA_NEWDST makes much if any sense at all.
> RTA_NEWDST is just DNAT.
Let me tell you where I was going with RTA_NEWDST: I was completely on
board with all your hints on a separate generic encapsulation layer for
such "light weight tunnels" in your previous emails on this. The part
that wasn't clear was a separate tunnel table.

From what I saw, mpls today was the only such light weight tunnel. And,
to me, RTA_NEWDST was to some extent the RTA_ENCAP you were talking about.
Clearly I seem to have ignored all the other encapsulation parameters
that may need to go into it :). But I guess in my mind I was thinking
those would be additional attributes. But agreed, a new nested attribute
could be a better option.

From IPv4, for example, to me this looked something like adding the below.

ip route add 10.1.1.0/30 as mpls 200 via inet 10.1.1.2 dev swp1

the 'mpls 200' goes into RTA_NEWDST.

And from the ipv4 code you look at the encap family and pass it on to the
respective output func (I was looking at a possible abstraction layer
here... maybe something like xfrm covering different tunnel types like
you mention above).

In the hacked up version of my patch (which I was not going to post if
it looked like a hack anyway), I essentially set dst->output to
mpls_output.
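
As an illustration of that idea, a simplified sketch of picking an output function per encap family when the route is set up. The types and signatures are cut down for the example and do not match the kernel's struct dst_entry or the real mpls output path; all names are hypothetical.

```c
#include <stddef.h>

enum encap_family { ENCAP_NONE, ENCAP_MPLS };

struct packet {
    int encap_applied;   /* records which output path handled the packet */
};

struct dst;
typedef int (*output_fn)(struct dst *dst, struct packet *pkt);

struct dst {
    output_fn output;    /* chosen once, at route setup */
};

/* Plain ip output: no encapsulation. */
static int plain_output(struct dst *dst, struct packet *pkt)
{
    (void)dst;
    pkt->encap_applied = ENCAP_NONE;
    return 0;
}

/* Stand-in for an mpls output function that would push the labels. */
static int mpls_output_sketch(struct dst *dst, struct packet *pkt)
{
    (void)dst;
    pkt->encap_applied = ENCAP_MPLS;
    return 0;
}

/* Route setup: map the encap family to its output function. */
static int dst_set_encap(struct dst *dst, enum encap_family fam)
{
    switch (fam) {
    case ENCAP_NONE:
        dst->output = plain_output;
        return 0;
    case ENCAP_MPLS:
        dst->output = mpls_output_sketch;
        return 0;
    }
    return -1;           /* unknown family */
}

/* Fast path: no family check, just call through the pointer. */
static int dst_output_sketch(struct dst *dst, struct packet *pkt)
{
    return dst->output(dst, pkt);
}
```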

I will see if I can come up with something along the lines of the RTA_ENCAP
you describe above.

Thanks for the details eric! appreciate it.

Thread overview: 4+ messages
2015-05-15  4:38 [PATCH net] mpls: modify RTA_NEWDST netlink attribute to include family roopa
2015-05-15  6:35 ` Eric W. Biederman
2015-05-15 18:18   ` roopa [this message]
2015-05-19 10:15   ` Robert Shearman
