netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Ahern <dsahern@gmail.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>,
	netdev@vger.kernel.org,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	roopa <roopa@cumulusnetworks.com>,
	hannes@stressinduktion.org,
	Dinesh Dutt <ddutt@cumulusnetworks.com>,
	Vipin Kumar <vipin@cumulusnetworks.com>,
	Shmulik Ladkani <shmulik.ladkani@gmail.com>
Subject: Re: [RFC PATCH 00/29] net: VRF support
Date: Sun, 08 Feb 2015 17:36:05 -0700	[thread overview]
Message-ID: <54D800F5.60603@gmail.com> (raw)
In-Reply-To: <87k2zubw7l.fsf@x220.int.ebiederm.org>

On 2/6/15 1:50 PM, Eric W. Biederman wrote:
>
> David looking at your patches and reading through your code I think I
> understand what you are proposing, let me see if I can sum it up.
>
> Semantics:
>     - The same as a network namespace.
>     - Addition of a VRF-ANY context
>     - No per VRF settings
>     - No creation or removal operation.

Yes. Plus what I see is an important feature: the ability to layer VRFs 
and network namespaces.

Network namespaces can be used to create smaller, logical switches 
within a single physical switch. (i.e., A network namespace is on par 
with what Cisco calls a VDC, virtual device context, for its Nexus 7000 
switches -- logical separation at the device layer front panel port).

Layering VRFs (L3 separation) within a network namespace (L1 separation) 
provides some nice end-user features.

>
> Implementation:
>     - Add a VRF-id to every affected data structure.
>
> This implies that you see the current implementation of network
> namespaces to be space inefficient, and that you think you can remove
> this inefficiency by simply removing the settings and the associated
> proc files.

Not exactly. I see the current namespace implementation as an excellent 
L1 separation construct, but not an L3 construct.

>
> Given that you have chosen to keep the same semantics as network
> namespaces for everything except for settings this raises the questions:
> - Are the settings and their associated proc files what actually cause
>    the size cost you see in network namespaces?
> - Can we instead of reimplementing network namespaces instead optimize
>    the current implementation?

The namespace memory consumption is side problem to the bigger problem 
of how the isolation of namespaces affect processes (the need to have a 
presence in each namespace).

What I was targeting is a trade-off. To make an L3 separation efficient 
from a large scale* perspective one needs to give up something -- here 
it is per-VRF procfs settings. Replicating procfs tree for each 
namespace does have a high cost.

* scale here meaning VRFs from 1 to N, N for current products goes up to 
4000, though I know of that has mentioned 16k VRFs.

>
> We need measurements to answer either of those questions and I think
> before proceeding we need to answer those questions.

agreed.


> Beyond that I want to point out that in general a data structure that
> has a tag on every member is going to have a larger memory foot print
> per entry, contain more entries, and by virtue of both of those be less
> memory efficient and less time efficient to use.   So it is not clear
> that a implementation that tags everything with a vrf-id will actually
> be lighter weight.

The memory hit for a network namespace is >100k (yes, CONFIG option 
dependent and that 100k is based on a 3.10 kernel which is higher than 
what was measured for a 3.4 kernel).

This proposal puts a 4-byte tag to netdevices, sockets and tasks (skb's 
are out per a prior email). So yes there will be a point that the number 
of netdevices (logical + physical), plus tasks, and sockets will make 
the memory hit of a VRF tag on par with namespace overhead. But the VRF 
tagging alleviates the need to replicate processes/multiple 
sockets/threads so in the big picture I can;t see how the overall hit to 
memory is higher with a VRF id tag.

>
> Also there is a concern that placing tags in every data structure may be
> significantly more error prone to implement (as it is more more thing to
> keep trace of), and that can impact the maintainability and the
> correctness of the code for everyone.

I don't agree with this. You have already done the groundwork here by 
plumbing through the namespace checks. Adding a vrf id has not proven to 
be a huge problem. The patch changes are highly repetitive because again 
it can leverage the namespace changes you have done.

> The standard that was applied to the network namespace was that
> it did not have any measurable performance impact when enabled.  The
> measurments taken at the time did not show a slow down when a 1Gig
> interface was place in a network namespace.  Compared to running an
> unpatched kernel.

Sure. I will build kernels at the commit id my patches are based on and 
one with my changes and do a comparison. Virtual machines on KVM 
emphasize the performance effects so I will compare a few netperf runs 
with and without my changes. On a newer 3.x kernel I typically see 
network throughput rates in 15 to 16 Gbps range (though H/W dependent), 
so this far exceeds the 1G rate. Does that sound reasonable?

>
> I suspect your extra layer of indirection to get to struct net in
> addition to touching struct skb will give you a noticable performance
> impact.

I don't understand the 'extra layer of indirection' comment. I don't see 
the indirection, I see an extra comparison. ie., from net_eq to net_eq + 
(vrf_eq || vrf_eq_any).

 From a struct comparison it has gone from:

struct net_device {
     ...
     struct net              *nd_net;
     ...
};

to

struct net_device {
     ...
     struct net_ctx          net_ctx {
         struct net *net;
         __u32 vrf;
    };
...
};


>
>
> I have another concern.  I don't think it is wise to have a data
> structure modified two different ways to deal with network namespaces
> and vrfs.  For maintainability and our own sanity we should pick which
> version that we judge to be the most efficient implementation and go
> with it.

you lost me. What data structure is modified 2 different ways? VRFs are 
sub context to a namespace.

>
>
>
> The architecture I imagine for using network namespaces as vrfs for
> devices that perform layer 2 bridging and layer 3 routing.
>
> port1 port2 port3 port4 port5 port6 port7 port8 port9 port10
>    |     |     |     |     |     |     |     |     |     |
>    +-----+-----+-----+-----+-----+-----+-----+-----+-----+
>   /                   Link Aggregation                    \
> +                                                         +
> |                        Bridging                         |
> +----------------------------+----------------------------+
>                               |
>                            cpu port
>                               |
>         +---------------------+---------------------+
>        /     +---------------/ \---------------+     \
>       /     /     +---------/   \---------+     \     \
>      /     /     /     +---/     \---+     \     \     \
>     /     /     /     /    |     |    \     \     \     \
>    |     |     |     |     |     |     |     |     |     |
> vlan1 vlan2 vlan3 vlan4 vlan5 vlan6 vlan7 vlan8 vlan9 vlan10
>    |     |     |     |     |     |     |     |     |     |
> +-+-----+-----+-----+-----+-+ +-+-----+-----+-----+-----+-+
> |    network namespace 1    | |    network namespace2     |
> +---------------------------+ +---------------------------+
>
> Traffic to and from the rest of the world comes through the
> external ports.
>
> The traffic is then processed at layer two including link
> aggregation, bridging and classifying which vlan the traffic
> belongs in.
>
> If the traffic needs to be routed it then comes up to the cpu port.
> The cpu port looks at the tags on the traffic and places it into
> the appropriate vlan device.
>
>  From the various vlans the traffic is then routed according
> to the routing table of whichever network namespace the vlan
> device is in.
>
> There are stateless offloads to this in modern hardware but this is a
> reasonable model how all of this works semantically.
>
> As such the vlan devices can be moved between network namespaces without
> affecting any layer two monitoring or work that happens on the lower
> level devices.  The practical restriction is that L2 and L3 need to be
> handled on different network devices.
>
> This split of network devices ensures that L2 code that works today
> should not need any changes or in any way be concerned about network
> namespaces or that the parent devices are in.

9+ months ago I had considered something similar. I'll try to amend your 
picture to show my concept:


port1 port2 port3 port4 port5 port6 port7 port8 port9 port10

   |     |     |     |     |     |     |     |     |     |
   +-----+-----+-----+-----+-----+-----+-----+-----+-----+
  /                    Link Aggregation                   \
+                                                         +
|                         Bridging                        |
+-----------------------------+---------------------------+
                               |
+-----------------------------+----------------------------+
|default namespace            |                            |
| (init_net)              NIC driver                       |
|                             |                            |
|  +----+----+-----+-----+-------+----+----+----+----+     |
| eth1 eth2 eth3  eth4 eth5     eth6 eth7 eth8 eth9 eth10  |
|  |    |    |     |     |        |    |    |    |    |    |
+-----------------------------+----------------------------+
    |    |    |     |     |        |    |    |    |    |
+--+----+----+-----+-----+---+ +--+----+----+----+----+----+
|  |    |    |     |     |   | |  |    |    |    |    |    |
| seth1 |   seth3  |   seth5 | | seth6 |   seth8 |   seth10|
|      seth2      seth4      | |      seth7     seth9      |
|                            | |                           |
|    network namespace 1     | |    network namespace 2    |
+----------------------------+ +---------------------------+

Essentially netdevices for front panel ports exist in the default 
namespace (init_net). L2 processes, monitoring processes (collectd, snmp 
agents for devices, etc) and such would run here. From there "shadow 
devices" (the 's' on the eth pairs) are created for namespaces where the 
path between real and shadow is similar to how veth pairs work. In the 
end this approach seemed to be a rather complex solution playing a lot 
of games so I abandoned it in favor of the approach in this patch set -- 
adding a VRF id to a network context.

The patch diff might be large but almost all of it is converting the 
existing struct net passing and net_eq checks to the broader struct 
net_ctx and net_ctx_eq comparisons. Really the change rides on top of 
what you have done for namespaces.

As for your proposal with VLAN based tagging, I do not understand the 
packet path from driver for front panel ports to namespace based 
netdevices. The VLAN sorting, and hence VRF sorting, is done in H/W? So 
there are netdevices in init_net the driver uses and then VLAN devices 
in the namespaces -- so those would correspond to what I called a shadow 
device? If the packets are also VLAN tagged we have nested tagging -- 
one for the port and one for the VRF?

Also, doesn't the VLAN design limit number of VRFs to 4096? My current 
patch set might limit it to 4096 but fix the genid piece (IPv6 seems to 
have removed genid comparisons betweeen 3.17 and 3.19 -- need to look 
into that) and it becomes a 32-bit tag which is a huge range for VRFs.

David

  reply	other threads:[~2015-02-09  0:36 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-05  1:34 [RFC PATCH 00/29] net: VRF support David Ahern
2015-02-05  1:34 ` [RFC PATCH 01/29] net: Introduce net_ctx and macro for context comparison David Ahern
2015-02-05  1:34 ` [RFC PATCH 02/29] net: Flip net_device to use net_ctx David Ahern
2015-02-05 13:47   ` Nicolas Dichtel
2015-02-06  0:45     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 03/29] net: Flip sock_common to net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 04/29] net: Add net_ctx macros for skbuffs David Ahern
2015-02-05  1:34 ` [RFC PATCH 05/29] net: Flip seq_net_private to net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 06/29] net: Flip fib_rules and fib_rules_ops to use net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 07/29] net: Flip inet_bind_bucket to net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 08/29] net: Flip fib_info " David Ahern
2015-02-05  1:34 ` [RFC PATCH 09/29] net: Flip ip6_flowlabel " David Ahern
2015-02-05  1:34 ` [RFC PATCH 10/29] net: Flip neigh structs " David Ahern
2015-02-05  1:34 ` [RFC PATCH 11/29] net: Flip nl_info " David Ahern
2015-02-05  1:34 ` [RFC PATCH 12/29] net: Add device lookups by net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 13/29] net: Convert function arg from struct net to struct net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 14/29] net: vrf: Introduce vrf header file David Ahern
2015-02-05 13:44   ` Nicolas Dichtel
2015-02-06  0:52     ` David Ahern
2015-02-06  8:53       ` Nicolas Dichtel
2015-02-05  1:34 ` [RFC PATCH 15/29] net: vrf: Add vrf to net_ctx struct David Ahern
2015-02-05  1:34 ` [RFC PATCH 16/29] net: vrf: Set default vrf David Ahern
2015-02-05  1:34 ` [RFC PATCH 17/29] net: vrf: Add vrf context to task struct David Ahern
2015-02-05  1:34 ` [RFC PATCH 18/29] net: vrf: Plumbing for vrf context on a socket David Ahern
2015-02-05 13:44   ` Nicolas Dichtel
2015-02-06  1:18     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 19/29] net: vrf: Add vrf context to skb David Ahern
2015-02-05 13:45   ` Nicolas Dichtel
2015-02-06  1:21     ` David Ahern
2015-02-06  3:54   ` Eric W. Biederman
2015-02-06  6:00     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 20/29] net: vrf: Add vrf context to flow struct David Ahern
2015-02-05  1:34 ` [RFC PATCH 21/29] net: vrf: Add vrf context to genid's David Ahern
2015-02-05  1:34 ` [RFC PATCH 22/29] net: vrf: Set VRF id in various network structs David Ahern
2015-02-05  1:34 ` [RFC PATCH 23/29] net: vrf: Enable vrf checks David Ahern
2015-02-05  1:34 ` [RFC PATCH 24/29] net: vrf: Add support to get/set vrf context on a device David Ahern
2015-02-05  1:34 ` [RFC PATCH 25/29] net: vrf: Handle VRF any context David Ahern
2015-02-05 13:46   ` Nicolas Dichtel
2015-02-06  1:23     ` David Ahern
2015-02-05  1:34 ` [RFC PATCH 26/29] net: vrf: Change single_open_net to pass net_ctx David Ahern
2015-02-05  1:34 ` [RFC PATCH 27/29] net: vrf: Add vrf checks and context to ipv4 proc files David Ahern
2015-02-05  1:34 ` [RFC PATCH 28/29] iproute2: vrf: Add vrf subcommand David Ahern
2015-02-05  1:34 ` [RFC PATCH 29/29] iproute2: Add vrf option to ip link command David Ahern
2015-02-05  5:17 ` [RFC PATCH 00/29] net: VRF support roopa
2015-02-05 13:44 ` Nicolas Dichtel
2015-02-06  1:32   ` David Ahern
2015-02-06  8:53     ` Nicolas Dichtel
2015-02-05 23:12 ` roopa
2015-02-06  2:19   ` David Ahern
2015-02-09 16:38     ` roopa
2015-02-10 10:43     ` Derek Fawcus
2015-02-06  6:10   ` Shmulik Ladkani
2015-02-09 15:54     ` roopa
2015-02-11  7:42       ` Shmulik Ladkani
2015-02-06  1:33 ` Stephen Hemminger
2015-02-06  2:10   ` David Ahern
2015-02-06  4:14     ` Eric W. Biederman
2015-02-06  6:15       ` David Ahern
2015-02-06 15:08         ` Nicolas Dichtel
     [not found]         ` <87iofe7n1x.fsf@x220.int.ebiederm.org>
2015-02-09 20:48           ` Nicolas Dichtel
2015-02-11  4:14           ` David Ahern
2015-02-06 15:10 ` Nicolas Dichtel
2015-02-06 20:50 ` Eric W. Biederman
2015-02-09  0:36   ` David Ahern [this message]
2015-02-09 11:30     ` Derek Fawcus
     [not found]   ` <871tlxtbhd.fsf_-_@x220.int.ebiederm.org>
2015-02-11  2:55     ` network namespace bloat Eric Dumazet
2015-02-11  3:18       ` Eric W. Biederman
2015-02-19 19:49         ` David Miller
2015-03-09 18:22           ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction Eric W. Biederman
2015-03-09 18:27             ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics can not be allocated Eric W. Biederman
2015-03-09 18:50               ` Sergei Shtylyov
2015-03-11 19:22                 ` Sergei Shtylyov
2015-03-09 18:27             ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
2015-03-09 18:29             ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
2015-03-09 20:25               ` Julian Anastasov
2015-03-10  6:59                 ` Eric W. Biederman
2015-03-10  8:23                   ` Julian Anastasov
2015-03-11  0:58                     ` Eric W. Biederman
2015-03-10 16:36                   ` David Miller
2015-03-10 17:06                     ` Eric W. Biederman
2015-03-10 17:29                       ` David Miller
2015-03-10 17:56                         ` Eric W. Biederman
2015-03-09 18:30             ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
2015-03-09 18:30             ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
2015-03-09 18:31             ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
2015-03-09 18:43               ` Eric Dumazet
2015-03-09 18:47               ` Eric Dumazet
2015-03-09 19:35                 ` Eric W. Biederman
2015-03-09 20:21                   ` Eric Dumazet
2015-03-09 20:09             ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction David Miller
2015-03-09 20:21               ` Eric W. Biederman
2015-03-11 16:33             ` [PATCH net-next 0/8] tcp_metrics: Network namespace bloat reduction v2 Eric W. Biederman
2015-03-11 16:35               ` [PATCH net-next 1/8] net: Kill hold_net release_net Eric W. Biederman
2015-03-11 16:55                 ` Eric Dumazet
2015-03-11 17:34                   ` Eric W. Biederman
2015-03-11 17:07                 ` Eric Dumazet
2015-03-11 17:08                   ` Eric Dumazet
2015-03-11 17:10                 ` Eric Dumazet
2015-03-11 17:36                   ` Eric W. Biederman
2015-03-11 16:36               ` [PATCH net-next 2/8] net: Introduce possible_net_t Eric W. Biederman
2015-03-11 16:38               ` [PATCH net-next 3/8] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
2015-03-11 16:38               ` [PATCH net-next 4/8] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
2015-03-11 16:40               ` [PATCH net-next 5/8] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
2015-03-11 16:41               ` [PATCH net-next 6/8] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
2015-03-11 16:43               ` [PATCH net-next 7/8] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
2015-03-11 16:43               ` [PATCH net-next 8/8] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
2015-03-13  5:04               ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 Eric W. Biederman
2015-03-13  5:04                 ` [PATCH net-next 1/6] tcp_metrics: panic when tcp_metrics_init fails Eric W. Biederman
2015-03-13  5:05                 ` [PATCH net-next 2/6] tcp_metrics: Mix the network namespace into the hash function Eric W. Biederman
2015-03-13  5:05                 ` [PATCH net-next 3/6] tcp_metrics: Add a field tcpm_net and verify it matches on lookup Eric W. Biederman
2015-03-13  5:06                 ` [PATCH net-next 4/6] tcp_metrics: Remove the unused return code from tcp_metrics_flush_all Eric W. Biederman
2015-03-13  5:07                 ` [PATCH net-next 5/6] tcp_metrics: Rewrite tcp_metrics_flush_all Eric W. Biederman
2015-03-13  5:07                 ` [PATCH net-next 6/6] tcp_metrics: Use a single hash table for all network namespaces Eric W. Biederman
2015-03-13  5:57                 ` [PATCH net-next 0/6] tcp_metrics: Network namespace bloat reduction v3 David Miller
2015-02-11 17:09     ` network namespace bloat Nicolas Dichtel
2015-02-10  0:53 ` [RFC PATCH 00/29] net: VRF support Thomas Graf
2015-02-10 20:54   ` David Ahern
2016-05-25 16:04 ` Chenna
2016-05-25 19:04   ` David Ahern

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54D800F5.60603@gmail.com \
    --to=dsahern@gmail.com \
    --cc=ddutt@cumulusnetworks.com \
    --cc=ebiederm@xmission.com \
    --cc=hannes@stressinduktion.org \
    --cc=netdev@vger.kernel.org \
    --cc=nicolas.dichtel@6wind.com \
    --cc=roopa@cumulusnetworks.com \
    --cc=shmulik.ladkani@gmail.com \
    --cc=stephen@networkplumber.org \
    --cc=vipin@cumulusnetworks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).