netdev.vger.kernel.org archive mirror
* VRFs and the scalability of namespaces
@ 2014-09-26 22:37 David Ahern
  2014-09-26 23:52 ` Stephen Hemminger
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: David Ahern @ 2014-09-26 22:37 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: nicolas.dichtel, netdev@vger.kernel.org

Hi Eric:

As you suggested [1], I am starting a new thread to discuss the 
scalability problems of using namespaces for VRFs.

Background
----------
Consider a single system that wants to provide VRF-based features with 
support for N VRFs. N could easily be 2048 (e.g., 6Wind, [2]), 4000 
(e.g., Cisco, [3]) or even higher.

The single system with support for N VRFs runs M services (e.g., quagga, 
cdp, lldp, stp, strongswan, some homegrown routing protocol), including 
standard system services like sshd and monitoring programs like snmpd 
and tcollector. In short, M is easily 20 processes, each of which needs 
a presence across all VRFs.


Network Namespaces for VRFs
---------------------------
For the past 4 years or so, the response to VRF questions has been a 
drumbeat of "use network namespaces". But namespaces are not a good 
match for VRFs.

1. Network namespaces are a complete separation of the networking stack 
from network devices up. VRFs are an L3 concept. Using namespaces forces 
an L3 separation concept onto L2 apps -- lldp, cdp, etc.

There are use cases where you want device-level separation, use cases 
where you want only L3-and-up separation, and cases where you want both 
(e.g., divvy up the netdevices in a system across some small number of 
namespaces and then provide VRF-based features within each namespace).


2. Scalability of apps providing services as namespaces are created: how 
do you establish a presence for each service in every network namespace?

a. Spawn a new process per namespace? A brute-force approach that is 
extremely resource intensive -- e.g., the quagga example [4].

b. Spawn a thread per namespace? Better than a full process, but still a 
heavyweight solution.

c. Create a socket per namespace? Better, but still resource intensive 
-- N listen sockets per service -- and each service must be modified for 
namespace support. For open-source software that means each project has 
to agree that namespace awareness is relevant and agree to take the 
patches.


3. Just creating a network namespace consumes a non-negligible amount of 
memory -- ~200kB for the 3.10 kernel. I believe the /proc entries are 
the bulk of that usage. At thousands of namespaces, 200kB each is again 
a lot of wasted memory and overhead.


4. For a single process to straddle multiple namespaces it has to run 
with full root privileges -- CAP_SYS_ADMIN -- to use setns. Using 
network sockets does not require a process to run as root at all unless 
it wants privileged ports, in which case CAP_NET_BIND_SERVICE is 
sufficient, not full root.


The Linux kernel needs proper VRF support -- as an L3 concept. A 
capability to run a process in a "VRF any" context would provide a 
resource-efficient solution: a single process with a single listen 
socket works across all VRFs in a namespace, and connected sockets then 
carry a specific VRF context.

Before droning on even more, does the above provide better context on 
the general problem?

Thanks,
David


[1] https://lkml.org/lkml/2014/9/26/840

[2] http://www.6wind.com/6windgate-performance/ip-forwarding

[3] http://www.cisco.com/c/en/us/td/docs/switches/datacenter/sw/verified_scalability/b_Cisco_Nexus_7000_Series_NX-OS_Verified_Scalability_Guide.html

[4] https://lists.quagga.net/pipermail/quagga-users/2010-February/011351.html
