From: David Ahern <dsahern@gmail.com>
To: Thomas Graf <tgraf@suug.ch>
Cc: netdev@vger.kernel.org, ebiederm@xmission.com
Subject: Re: [RFC PATCH 00/29] net: VRF support
Date: Tue, 10 Feb 2015 13:54:05 -0700	[thread overview]
Message-ID: <54DA6FED.5020907@gmail.com> (raw)
In-Reply-To: <20150210005344.GA6293@casper.infradead.org>

On 2/9/15 5:53 PM, Thomas Graf wrote:
> On 02/04/15 at 06:34pm, David Ahern wrote:
>> Namespaces provide excellent separation of the networking stack from the
>> netdevices and up. The intent of VRFs is to provide an additional,
>> logical separation at the L3 layer within a namespace.
>
> What you ask for seems to be L3 micro segmentation inside netns. I

I would not label it 'micro', but yes: an L3 separation within an L1 separation.

> would argue that we already support this through multiple routing
> tables. I would prefer improving the existing architecture to cover
> your use cases: Increase the number of supported tables, extend
> routing rules as needed, ...

I've seen that response for VRFs as well. I have not personally tried 
it, but from what I have read it does not work well. I think Roopa 
responded that Cumulus has spent time on that path and has hit some 
roadblocks.

>
>> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
>> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
>> to this file (if preferred this can be made a prctl to change the VRF id).
>> This allows services to be launched in a VRF context using ip, similar to
>> what is done for network namespaces.
>>      e.g., ip vrf exec 99 /usr/sbin/sshd
>
> I think such a classification should occur through cgroups instead
> of touching PIDs directly.

That is an interesting idea -- using cgroups for task labeling. It 
presents a creation / deletion event for VRFs which I was trying to 
avoid, and there will be some amount of overhead with a cgroup. I'll 
take a look at that option when I get some time.

As for the current proposal, I am treating VRF as part of a network 
context. Today 'ip netns' is used to run a command in a specific network 
namespace; the proposal, with the VRF layering, adds a VRF context 
within a namespace. In keeping with how 'ip netns' works, the above 
syntax allows a user to supply both a network namespace and a VRF when 
running a command.
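To make the proposed task semantics concrete, here is a toy Python sketch (illustrative only, not kernel code; the class and names are mine) of the default-to-1 and parent-to-child inheritance behavior described in the cover letter:

```python
# Toy model of the proposed per-task VRF id (illustrative, not kernel code).
DEFAULT_VRF = 1

class Task:
    def __init__(self, parent=None):
        # A child inherits its parent's VRF; the initial task gets the default.
        self.vrf = parent.vrf if parent else DEFAULT_VRF

    def set_vrf(self, vrf_id):
        # Analogous to writing /proc/<pid>/vrf; can be changed at any time.
        self.vrf = vrf_id

init = Task()             # vrf == 1 (default)
child = Task(init)        # inherits vrf 1
child.set_vrf(99)         # like: echo 99 > /proc/<pid>/vrf
grandchild = Task(child)  # inherits vrf 99, as 'ip vrf exec 99 ...' would give
```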

>
>> Network devices belong to a single VRF context which defaults to VRF 1.
>> They can be assigned to another VRF using IFLA_VRF attribute in link
>> messages. Similarly the VRF assignment is returned in the IFLA_VRF
>> attribute. The ip command has been modified to display the VRF id of a
>> device. L2 applications like lldp are not VRF aware and still work
>> through all network devices within the namespace.
>
> I believe that binding net_devices to VRFs is misleading and the
> concept by itself is non-scalable. You do not want to create 10k
> net_devices for your overlay of choice just to tie them to a
> particular VRF. You want to store the VRF identifier as metadata and
> have a stateless classifier include it in the VRF decision. See the
> recent VXLAN-GBP work.

I'll take a look when I get time.

I have not seen scalability issues creating 1,000+ net_devices. 
Certainly the roughly 40 KB of memory per net_device is noticeable, but 
I believe that can be improved (e.g., a number of entries can be moved 
under proper CONFIG_ checks). I do need to repeat the tests on newer 
kernels.

>
> You could either map whatever selects the VRF to the mark or support it
> natively in the routing rules classifier.
>
> An obvious alternative is OVS. What you describe can be implemented in
> a scalable manner using OVS and mark. I understand that OVS is not for
> everybody but it gets a fundamental principle right: Scalability
> demands for programmability.
>
> I don’t think we should be adding a new single purpose metadata field
> to arbitrary structures for every new use case that comes up. We
> should work on programmability which increases flexibility and allows
> decoupling application interest from networking details.
>
>> On RX skbs get their VRF context from the netdevice the packet is received
>> on. For TX the VRF context for an skb is taken from the socket. The
>> intention is for L3/raw sockets to be able to set the VRF context for a
>> packet TX using cmsg (not coded in this patch set).
>
> Specifying L3 context in cmsg seems very broken to me. We do not want
> to bind applications any closer to underlying networking infrastructure.
> In fact, we should do the opposite and decouple this completely.

That suggestion is in line with what is done today for other L3 
parameters -- TOS, TTL, and a few others.
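For reference, the per-socket form of this precedent is plain setsockopt; the per-packet form uses cmsg ancillary data on sendmsg. A minimal Python sketch of the per-socket TOS case (real, existing API -- the VRF suggestion would follow the same pattern):

```python
import socket

# Existing practice for L3 parameters: TOS is settable per-socket via
# setsockopt (and per-packet via a cmsg on sendmsg). The cmsg-based VRF
# suggestion in the cover letter mirrors this model.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0x10)  # IPTOS_LOWDELAY
tos = s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
s.close()
```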

>
>> The 'any' context applies to listen sockets only; connected sockets are in
>> a VRF context. Child sockets accepted by the daemon acquire the VRF context
>> of the network device the connection originated on.
>
> Linux considers an address local regardless of the interface the packet
> was received on.  So you would accept the packet on any interface and
> then bind it to the VRF of that interface even though the route for it
> might be on a different interface.
>
> This really belongs into routing rules from my perspective which takes
> mark and the cgroup context into account.

Expanding the current network namespace checks to a networking context 
is a very simple and clean way of implementing VRFs versus cobbling 
together a 'VRF like' capability using marks, multiple tables, etc. 
(i.e., the existing capabilities). Further, the VRF tagging of 
net_devices seems to readily fit into the hardware offload and 
switchdev capabilities (e.g., add an ndo operation for setting the VRF 
tag on a device which passes it to the driver).
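A toy sketch (illustrative Python, not the patch set's C) of what "expanding the namespace checks to a network context" means: today's net_eq() namespace comparison widened to a (namespace, VRF) pair, with an 'any' wildcard modeling the listen-socket case from the cover letter. The names here are mine, not the patch's:

```python
from collections import namedtuple

# Toy model of the proposed net_ctx: the existing namespace equality
# check extended to a (namespace, vrf) pair. VRF_ANY models the 'any'
# context the cover letter permits for listen sockets.
NetCtx = namedtuple("NetCtx", ["net", "vrf"])
VRF_ANY = 0

def net_ctx_eq(a, b):
    """Contexts match when the namespaces match and the VRFs match
    exactly or either side is the 'any' wildcard."""
    if a.net != b.net:
        return False
    return a.vrf == b.vrf or VRF_ANY in (a.vrf, b.vrf)

listener = NetCtx(net="ns0", vrf=VRF_ANY)  # listen socket in 'any' context
pkt = NetCtx(net="ns0", vrf=7)             # skb tagged from its ingress device
matched = net_ctx_eq(listener, pkt)        # True: accepted child would get vrf 7
```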

Big picture, where are OCP and switchdev headed? Top-of-rack switches 
seem to be the first target, but after that? Will the kernel ever 
support MPLS? Will the kernel attain the richer feature set of high-end 
routers? If so, how does VRF support fit into the design? As I 
understand it, a scalable VRF solution is a fundamental building block. 
Will a cobbled-together solution of cgroups, marks, rules, and multiple 
tables really work versus the simplicity of an expanded network context?

David
