Netdev List

* Re: Difference between Net and Net-Next
From: Jim Baxter @ 2013-03-28 12:52 UTC (permalink / raw)
  To: netdev
In-Reply-To: <5153FA03.6010005@gmail.com>

Thank you both, that is much clearer.

Regards,
Jim

* Re: /128 link-local subnet on 6in4 (sit) tunnels?
From: Wilco Baan Hofman @ 2013-03-28 13:00 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, YOSHIFUJI Hideaki
In-Reply-To: <20130327183558.GC23223@order.stressinduktion.org>

On Wed, 2013-03-27 at 19:35 +0100, Hannes Frederic Sowa wrote:
> On Wed, Mar 27, 2013 at 07:20:54PM +0100, Wilco Baan Hofman wrote:
> > http://tools.ietf.org/html/rfc4213
> 
> Thanks, I have seen that already. The sit driver is used for more than 6in4
> (6to4, isatap, 6rd). So such a change has to be ok with all the other
> protocols implemented by sit. I also looked in the historic git archive for a
> rationale of this but couldn't find one. Commit messages 2002 where not as
> descriptive as today("Import changeset"). :)
> 
> I also added YOSHIFUJI Hideaki as Cc, perhaps he knows the reason.
> 

I've been doing some RFC checking of my own..

As far as 6to4 and 6rd go, a link-local address is optional and not very
useful at all. ISATAP should have a /64 subnet configured as far as I
can tell, same for 6in4.

From rfc3056 section 3.1 [1]:

   The link-local address of a 6to4 pseudo-interface performing 6to4
   encapsulation would, if needed, be formed as described in Section 3.7
   of [MECH].  However, no scenario is known in which such an address
   would be useful, since a peer 6to4 gateway cannot determine the
   appropriate link-layer (IPv4) address to send to.

For 6rd, rfc5969 section 9 specifies that a link *may*, if needed, have
a non-used link-local address [2], this may be where the /128 comes in:

   The 6rd link is modeled as an NBMA link similar to other automatic
   IPv6 in IPv4 tunneling mechanisms like [RFC5214], with all 6rd CEs
   and BRs defined as off-link neighbors from one other.  The link-local
   address of a 6rd virtual interface performing the 6rd encapsulation
   would, if needed, be formed as described in Section 3.7 of [RFC4213].
   However, no communication using link-local addresses will occur.

For ISATAP, it basically states that a link-local address should have a
"subnet of appropriate length".
rfc5214 section 6.2 [3] refers to rfc4862 section 5.3 [4] for link-local
addressing:

   ISATAP interfaces form ISATAP interface identifiers from IPv4
   addresses in their locator set and use them to create link-local
   ISATAP addresses (Section 5.3 of [RFC4862]).

Which states:

   A link-local address is formed by combining the well-known link-local
   prefix FE80::0 [RFC4291] (of appropriate length) with an interface
   identifier as follows: >snip<

[1] http://tools.ietf.org/html/rfc3056#section-3.1
[2] http://tools.ietf.org/html/rfc5969#section-9
[3] http://tools.ietf.org/html/rfc5214#section-6.2
[4] http://tools.ietf.org/html/rfc4862#section-5.3

* Re: [PATCH] core: fix the use of this_cpu_ptr
From: Eric Dumazet @ 2013-03-28 13:05 UTC (permalink / raw)
  To: roy.qing.li, Shan Wei, Christoph Lameter; +Cc: netdev
In-Reply-To: <1364463761-32510-1-git-send-email-roy.qing.li@gmail.com>

On Thu, 2013-03-28 at 17:42 +0800, roy.qing.li@gmail.com wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
> 
> flush_tasklet is not percpu var, and percpu is percpu var, and
> 	this_cpu_ptr(&info->cache->percpu->flush_tasklet)
> is not equal to
> 	&this_cpu_ptr(info->cache->percpu)->flush_tasklet
> 
> 1f743b076(use this_cpu_ptr per-cpu helper) introduced this bug.
> 
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
> ---
>  net/core/flow.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/flow.c b/net/core/flow.c
> index 7fae135..e8084b8 100644
> --- a/net/core/flow.c
> +++ b/net/core/flow.c
> @@ -346,7 +346,7 @@ static void flow_cache_flush_per_cpu(void *data)
>  	struct flow_flush_info *info = data;
>  	struct tasklet_struct *tasklet;
>  
> -	tasklet = this_cpu_ptr(&info->cache->percpu->flush_tasklet);
> +	tasklet = &this_cpu_ptr(info->cache->percpu)->flush_tasklet;
>  	tasklet->data = (unsigned long)info;
>  	tasklet_schedule(tasklet);
>  }

Hi

Any reason you don't Cc Shan Wei & Christoph Lameter?

Christoph, could this kind of error be detected by the compiler or
sparse ?

Thanks

* Re: /128 link-local subnet on 6in4 (sit) tunnels?
From: Hannes Frederic Sowa @ 2013-03-28 13:12 UTC (permalink / raw)
  To: Wilco Baan Hofman; +Cc: netdev, YOSHIFUJI Hideaki
In-Reply-To: <1364475638.5000.19.camel@localhost>

On Thu, Mar 28, 2013 at 02:00:38PM +0100, Wilco Baan Hofman wrote:
> For 6rd, rfc5969 section 9 specifies that a link *may*, if needed, have
> a non-used link-local address [2], this may be where the /128 comes in:
> 
>    The 6rd link is modeled as an NBMA link similar to other automatic
>    IPv6 in IPv4 tunneling mechanisms like [RFC5214], with all 6rd CEs
>    and BRs defined as off-link neighbors from one other.  The link-local
>    address of a 6rd virtual interface performing the 6rd encapsulation
>    would, if needed, be formed as described in Section 3.7 of [RFC4213].
>    However, no communication using link-local addresses will occur.
> 

Hm, perhaps this is the reason. Also, RFC3964 ("Security Considerations for
6to4") states that the use of non-global addresses on a 6to4 link should be
prohibited:

|   o  Disallow traffic in which the destination IPv6 address is not a
|      global address; in particular, link-local addresses, mapped
|      addresses, and such should not be used.

Could you check if the creation of a /128 ll address does act as a guard
against that and does suppress ll traffic? I am not sure.

Perhaps a patch where we check the IFF_POINTTOPOINT flag and selectively
create a /128 or /64 would be a solution.

Thanks,

  Hannes

* Re: [PATCH net-next] core: simplify the getting percpu of flow_cache
From: Eric Dumazet @ 2013-03-28 13:15 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev, Christoph Lameter
In-Reply-To: <1364473451-3640-1-git-send-email-roy.qing.li@gmail.com>

On Thu, 2013-03-28 at 20:24 +0800, roy.qing.li@gmail.com wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
> 
> replace per_cpu with per_cpu_ptr to save conversion between address and pointer
> 
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
> ---
>  net/core/flow.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/flow.c b/net/core/flow.c
> index 7fae135..707fb7b 100644
> --- a/net/core/flow.c
> +++ b/net/core/flow.c
> @@ -334,7 +334,7 @@ static int flow_cache_percpu_empty(struct flow_cache *fc, int cpu)
>  	struct flow_cache_percpu *fcp;
>  	int i;
>  
> -	fcp = &per_cpu(*fc->percpu, cpu);
> +	fcp = per_cpu_ptr(fc->percpu, cpu);
>  	for (i = 0; i < flow_cache_hash_size(fc); i++)
>  		if (!hlist_empty(&fcp->hash_table[i]))
>  			return 0;


This makes no difference at all, at least on x86

Care to elaborate ?

* Re: [PATCH net-next] vxlan: Provide means for obtaining port information
From: David Stevens @ 2013-03-28 13:23 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, jeffrey.t.kirsher, jesse, joseph.gasparakis, netdev,
	netdev-owner, pshelar, shemminger, Stephen Hemminger
In-Reply-To: <51536929.9090109@intel.com>

> From: Alexander Duyck <alexander.h.duyck@intel.com>

> Yes this will break if we ended up supporting a different port per
> VXLAN.  The problem is a port per VXLAN is going to be much more
> difficult to support for any sort of hardware offload as well.  What we
> end up having to do is add a significant number of filters just to
> identify all of the possible ports that may be used. 

        I assume this would only be an issue if you actually have lots
of ports. If you're using only one, or a couple, which I'd expect to
be the common case, your filtering needs should be the same, right?

> In addition it
> will mean having to add some sort of notifier as I mentioned in the
> patch description since we will have VXLANs going into and out of
> existence, each one with their own port number..

        I don't know the context in which you're planning to call this,
but regardless of the port or ports in use by VXLAN, the devices will
come and go and the module may be unloaded, so I don't see how what
you've said here applies to one case and not the other.

> This was essentially a "good enough for now" fix to address the fact
> that I needed the port number for offload testing, but I will sit back
> and wait for the port per VXLAN stuff to be pushed in before I take any
> steps to try to sort that out.

        What I was suggesting "for now" is that you create an interface
that need not change to support multiple ports. For example:

int vxlan_get_ports(struct net_device *dev, int *ports, int portcount)
{
        if (portcount <= 0 || vxlan_port == 0)
                return 0;
        if (ports)
                ports[0] = vxlan_port;
        return 1;
}

The return value would be the count of ports in use, and we'd fill up to a
maximum of portcount in the "ports" array. If you don't have a "dev" you're
interested in, pass it as NULL and in the future this code would match dev
only if it is non-NULL; otherwise it would go through the vxlan device list
adding ports not in the list to it.

And for the right functionality now, change the VXLAN driver to leave
vxlan_port zero until a device is instantiated, so you can detect no ports
at the moment, and set it to zero again when the last one is deleted.

You need a mechanism to delete your filters if there are no VXLAN devices,
whether or not multiple ports are supported, because "0" is a legal count
now too. If I haven't created any vxlan devices, I can use port 8472 for
something else and your loading of the vxlan driver as a side-effect of
checking the port shouldn't prevent that. That's especially true in a
distro release where, while I might be able to change the port, I'd have
to burn one regardless if your driver were loaded, if it can't go without
a binding when there are no vxlan devices on the system.
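
A minimal userspace sketch of the calling convention described above, not
driver code. The names demo_ports, demo_add_port() and demo_get_ports() are
hypothetical, the "dev" argument is left out, and a plain array stands in
for whatever the vxlan driver would actually track. It only illustrates the
contract: the return value is the number of ports in use, at most portcount
entries are written, and 0 means no device exists.

```c
#include <stddef.h>

/* Hypothetical stand-in for driver state; not real vxlan code. */
#define DEMO_MAX_PORTS 4
static int demo_ports[DEMO_MAX_PORTS];
static int demo_nports;         /* 0 means no VXLAN device exists yet */

/* Record the port of a newly created (simulated) device. */
static int demo_add_port(int port)
{
        int i;

        for (i = 0; i < demo_nports; i++)
                if (demo_ports[i] == port)
                        return 0;       /* already in the list */
        if (demo_nports == DEMO_MAX_PORTS)
                return -1;              /* table full */
        demo_ports[demo_nports++] = port;
        return 0;
}

/*
 * Same shape as the proposed interface, minus "dev": return the count
 * of ports in use and copy at most portcount of them into ports[].
 */
static int demo_get_ports(int *ports, int portcount)
{
        int i;

        if (portcount <= 0 || demo_nports == 0)
                return 0;
        for (i = 0; ports && i < demo_nports && i < portcount; i++)
                ports[i] = demo_ports[i];
        return demo_nports;
}
```

With this shape, a caller that gets back more than portcount can retry with
a larger array, and a return of 0 would be the cue to drop any filters.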

                                                                +-DLS

* [PATCH v2] net IPv6 : Fix broken IPv6 routing table after loopback down-up
From: Balakumaran Kannan @ 2013-03-28 13:27 UTC (permalink / raw)
  To: yoshfuji, davem, Patrick McHardy, Alexey Kuznetsov, jmorris,
	eric.dumazet
  Cc: Balakumaran.Kannan, maruthi.thotad, netdev, jamshed.a,
	amit.agarwal, takuzo.ohara, aaditya.kumar

The IPv6 routing table becomes broken once we do ifdown, ifup of the
loopback (lo) interface. After down-up, routes to other interfaces' IPv6
addresses through 'lo' are lost.

IPv6 addresses assigned to all interfaces are routed through 'lo' for internal
communication. Once 'lo' is down, those routing entries are removed from the
routing table. But those removed entries are not re-created properly when
'lo' is brought up, so the IPv6 addresses of other interfaces become
unreachable from the same machine. This also breaks communication with other
machines because of NDISC packet processing failure.

This patch fixes the issue by reading all interfaces' IPv6 addresses and
adding them to the IPv6 routing table while bringing up 'lo'.

The patch is prepared against the Linux 3.9-rc4 kernel.

Signed-off-by: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
Signed-off-by: Maruthi Thotad <Maruthi.Thotad@ap.sm.sony.com>
---
This is version-2 of the patch sent earlier.
[Ref: http://marc.info/?l=linux-netdev&m=136437848103355&w=2]

Change Log:
 1. Used 'for_each_netdev_rcu' instead of while loop.
 2. Added required locks.
 3. Skipping IPv6 addresses with 'IFA_F_DADFAILED' or 'IFA_F_TENTATIVE' flags

==Testing==
Before applying the patch:
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
2000::20/128                   ::                         Un   0   1     0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$

After applying the patch:
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
2000::20/128                   ::                         Un   0   1     0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
2000::20/128                   ::                         Un   0   1     0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$
---
--- linux-3.9-rc4/net/ipv6/addrconf.c.orig	2013-03-27 10:40:26.382569527 +0530
+++ linux-3.9-rc4/net/ipv6/addrconf.c	2013-03-28 18:29:00.492241840 +0530
@@ -2529,6 +2529,9 @@ static void sit_add_v4_addrs(struct inet
 static void init_loopback(struct net_device *dev)
 {
 	struct inet6_dev  *idev;
+	struct net_device *sp_dev;
+	struct inet6_ifaddr *sp_ifa;
+	struct rt6_info *sp_rt;

 	/* ::1 */

@@ -2540,6 +2543,33 @@ static void init_loopback(struct net_dev
 	}

 	add_addr(idev, &in6addr_loopback, 128, IFA_HOST);
+
+	/* Add routes to other interface's IPv6 addresses */
+	rcu_read_lock();
+	for_each_netdev_rcu(dev_net(dev), sp_dev) {
+
+		if (!strcmp(sp_dev->name, dev->name))
+			continue;
+
+		idev = ipv6_find_idev(sp_dev);
+		if (NULL == idev)
+			continue;
+
+		read_lock_bh(&idev->lock);
+		list_for_each_entry(sp_ifa, &idev->addr_list, if_list) {
+
+			if (sp_ifa->flags & (IFA_F_DADFAILED | IFA_F_TENTATIVE))
+				continue;
+
+			sp_rt = addrconf_dst_alloc(idev, &sp_ifa->addr, 0);
+
+			/* Failure cases are ignored */
+			if (!IS_ERR(sp_rt))
+				ip6_ins_rt(sp_rt);
+		}
+		read_unlock_bh(&idev->lock);
+	}
+	rcu_read_unlock();
 }

 static void addrconf_add_linklocal(struct inet6_dev *idev, const struct in6_addr *addr)

* Re: [PATCH] net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb'
From: Eric Dumazet @ 2013-03-28 13:36 UTC (permalink / raw)
  To: Shmulik Ladkani; +Cc: David Miller, Ben Greear, netdev, Igor Michailov
In-Reply-To: <1364462006-5814-1-git-send-email-shmulik.ladkani@gmail.com>

On Thu, 2013-03-28 at 11:13 +0200, Shmulik Ladkani wrote:
> 'nf_reset' is called just prior calling 'netif_rx'.
> No need to call it twice.
> 
> Reported-by: Igor Michailov <rgohita@gmail.com>
> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> ---
>  net/core/dev.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2db88df..071f398 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1624,7 +1624,6 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
>  	}
>  
>  	skb_orphan(skb);
> -	nf_reset(skb);
>  
>  	if (unlikely(!is_skb_forwardable(dev, skb))) {
>  		atomic_long_inc(&dev->rx_dropped);

Acked-by: Eric Dumazet <edumazet@google.com>

* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Serge Hallyn @ 2013-03-28 13:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org
In-Reply-To: <87d2uknmnu.fsf@xmission.com>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >> 
> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> >> Stephen Hemminger <stephen@networkplumber.org> writes:
> >> >> 
> >> >> > If you need to do lots of operations the --batch mode will be significantly faster.
> >> >> > One command start and one link map.
> >> >> 
> >> >> The problem in this case as I understand it is lots of independent
> >> >> operations. Now maybe lxc should not shell out to ip and perform the
> >> >> work itself.
> >> >
> >> > fwiw lxc uses netlink to create new veths, and picks random names with
> >> > mktemp() ahead of time.
> >> 
> >> I am puzzled where does the slownes in iproute2 come into play?
> >
> > Benoit originally reported slowness when starting >1500 containers.  I
> > asked him to run a few manual tests to figure out what was taking the
> > time.  Manually creating a large # of veths was an obvious test, and
> > one which showed poorly scaling performance.
> 
> Apparently iproute is involved somehwere as when he tested with a
> patched iproute (as you asked him to) the lxc startup slowdown was
> gone.
> 
> > May well be there are other things slowing down lxc of course.
> 
> The evidence indicates it was iproute being called somewhere...

Benoit can you tell us exactly what test you were running when you saw
the slowdown was gone?

-serge

* Re: Deleting a network namespace
From: David Shwatrz @ 2013-03-28 13:41 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <87y5d7lr73.fsf@xmission.com>

Hello,
I checked, and indeed interfaces backed by physical hardware are moved to
init_net. I wonder how this is done, as netns_delete() contains only the
umount2() and unlink() syscalls (might these syscalls trigger the move to
init_net?). I really could not figure out how this is implemented and where
in the code we differentiate between physical and non-physical devices.

Best,
DS

On Thu, Mar 28, 2013 at 1:05 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Shwatrz <dshwatrz@gmail.com> writes:
>
>> Hello,
>> When assigning a network interface to a network namespace and
>> afterwards deleting the namespace, we will not see the network
>> interface in any other namespace (including the default namespace) anymore:
>>
>> ip netns add ns1
>> ip link set eth0 netns ns1
>> ip netns del ns1
>>
>> This means that in fact we cannot use this interface again (only after
>> rebooting)
>> Am I right on this ?
>
> Interfaces that represent physical hardware are moved to init_net.
> Interfaces that are purely software constructs are deleted.
>
>> Is moving an interface back to the default (init) namespace,
>> when deleting the namespace which contains it, can be considered?
>
> If you aren't seeing that your interface is either a purely software
> construct like the veth or dummy interfaces or something still has a
> reference to your network namespace.
>
> Eric

* Re: [PATCH v2] net IPv6 : Fix broken IPv6 routing table after loopback down-up
From: Eric Dumazet @ 2013-03-28 13:44 UTC (permalink / raw)
  To: Balakumaran Kannan
  Cc: yoshfuji, davem, Patrick McHardy, Alexey Kuznetsov, jmorris,
	Balakumaran.Kannan, maruthi.thotad, netdev, jamshed.a,
	amit.agarwal, takuzo.ohara, aaditya.kumar
In-Reply-To: <CAHPKR9KE4LT_AEzUZ8cuJwBeMEEu0Ppe+jjynFQk7+N-A4YUmQ@mail.gmail.com>

On Thu, 2013-03-28 at 18:57 +0530, Balakumaran Kannan wrote:
> IPv6 Routing table becomes broken once we do ifdown, ifup of the loopback(lo)
> interface. After down-up, routes of other interface's IPv6 addresses through
> 'lo' are lost.
> 

>  	add_addr(idev, &in6addr_loopback, 128, IFA_HOST);
> +
> +	/* Add routes to other interface's IPv6 addresses */
> +	rcu_read_lock();
> +	for_each_netdev_rcu(dev_net(dev), sp_dev) {
> +

We hold RTNL at this point, so why use RCU at all and add potentially
long latencies?

Just use for_each_netdev()

This way, a preemption is still allowed.

Also, I am not sure we need ipv6_find_idev()

__in6_dev_get() should be OK.

* Re: [PATCH v2] net IPv6 : Fix broken IPv6 routing table after loopback down-up
From: Eric Dumazet @ 2013-03-28 13:45 UTC (permalink / raw)
  To: Balakumaran Kannan
  Cc: yoshfuji, davem, Patrick McHardy, Alexey Kuznetsov, jmorris,
	Balakumaran.Kannan, maruthi.thotad, netdev, jamshed.a,
	amit.agarwal, takuzo.ohara, aaditya.kumar
In-Reply-To: <1364478265.15753.42.camel@edumazet-glaptop>

On Thu, 2013-03-28 at 06:44 -0700, Eric Dumazet wrote:

> __in6_dev_get() should be OK.

more exactly : __in_dev_get_rtnl()

* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Benoit Lourdelet @ 2013-03-28 13:42 UTC (permalink / raw)
  To: Serge Hallyn, Eric W. Biederman; +Cc: Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <20130328133652.GA6652@sergelap>

Hello,

My test consists of starting small containers (10 MB of RAM each). Each
container has two physical VLAN interfaces attached.

lxc.network.type = phys
lxc.network.flags = up
lxc.network.link = eth6.3
lxc.network.name = eth2
lxc.network.hwaddr = 00:50:56:a8:03:03
lxc.network.ipv4 = 192.168.1.1/24
lxc.network.type = phys
lxc.network.flags = up
lxc.network.link = eth7.3
lxc.network.name = eth1
lxc.network.ipv4 = 2.2.2.2/24
lxc.network.hwaddr = 00:50:57:b8:00:01



With the initial iproute2, when I reach around 1600 containers, container
creation almost stops. It takes at least 20s per container to start.
With the patched iproute2, I have started 4000 containers at a rate of 1 per
second without problems. I have 8000 VLAN interfaces configured on the host
(2x 4000).


Regards

Benoit

On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:

>Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>> 
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>> >> 
>> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> >> Stephen Hemminger <stephen@networkplumber.org> writes:
>> >> >> 
>> >> >> > If you need to do lots of operations the --batch mode will be significantly faster.
>> >> >> > One command start and one link map.
>> >> >> 
>> >> >> The problem in this case as I understand it is lots of independent
>> >> >> operations. Now maybe lxc should not shell out to ip and perform the
>> >> >> work itself.
>> >> >
>> >> > fwiw lxc uses netlink to create new veths, and picks random names with
>> >> > mktemp() ahead of time.
>> >> 
>> >> I am puzzled where does the slownes in iproute2 come into play?
>> >
>> > Benoit originally reported slowness when starting >1500 containers.  I
>> > asked him to run a few manual tests to figure out what was taking the
>> > time.  Manually creating a large # of veths was an obvious test, and
>> > one which showed poorly scaling performance.
>> 
>> Apparently iproute is involved somehwere as when he tested with a
>> patched iproute (as you asked him to) the lxc startup slowdown was
>> gone.
>> 
>> > May well be there are other things slowing down lxc of course.
>> 
>> The evidence indicates it was iproute being called somewhere...
>
>Benoit can you tell us exactly what test you were running when you saw
>the slowdown was gone?
>
>-serge
>

* Re: [PATCH v2 6/6] sctp: convert sctp_assoc_set_id to use idr_alloc_cyclic
From: Neil Horman @ 2013-03-28 13:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: akpm, linux-kernel, tj, Vlad Yasevich, Sridhar Samudrala,
	David S. Miller, linux-sctp, netdev
In-Reply-To: <1364412578-7462-7-git-send-email-jlayton@redhat.com>

On Wed, Mar 27, 2013 at 03:29:38PM -0400, Jeff Layton wrote:
> (Note: compile-tested only)
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> Cc: Vlad Yasevich <vyasevich@gmail.com>
> Cc: Sridhar Samudrala <sri@us.ibm.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: linux-sctp@vger.kernel.org
> Cc: netdev@vger.kernel.org
> ---
>  net/sctp/associola.c | 16 ++--------------
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
I don't see anything wrong with this patch per-se, but the idr_alloc_cyclic call
isn't integrated with net/net-next or Linus' tree yet.  If we don't gate this
patch on that integration, we'll break the build.
Neil

* Re: Deleting a network namespace
From: Eric W. Biederman @ 2013-03-28 14:00 UTC (permalink / raw)
  To: David Shwatrz; +Cc: netdev
In-Reply-To: <CAJJAcof98k+GDt88MORMz6sS-JFODeuyqYSwP6dk-ZYqFMGTWg@mail.gmail.com>

David Shwatrz <dshwatrz@gmail.com> writes:

> Hello,
> I checked and indeed physical hardware are moved to init_net.
> I wonder how it is done, as in netns_delete() there is only
> umount2() and unlink() syscalls (might these syscalls trigger this
> movement to init_net)?

The mount holds a refcount to the network namespace, the unmount drops
that refcount.

> I really could not figure how this is
> implemented and where in code do we differentiate between physical and
> non physical devices.

When the refcount drops to zero put_net calls __put_net in
net/core/net_namespace.c which wiggles around and arranges
for cleanup_net to be called.

As for what happens to the network devices look at default_device_exit
and default_device_exit_batch in net/core/dev.c

As for the rest, having software based network devices vanish is by
design and I can't think of a single reason why it would make sense to
do anything differently.  Depending on your configuration the initial
network namespace really isn't where you would want network devices to
be moved.   Think about what happens when you run your use case in a lxc
based container for example.
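
A deliberately simplified userspace model of the sequence described above;
it is not kernel code. All names (demo_netns, demo_dev, demo_put,
demo_cleanup) are hypothetical, and the single is_software flag is an
assumption standing in for whatever test default_device_exit really applies.
It only shows the ordering: nothing happens to the devices until the last
reference is dropped, and then hardware-backed devices go back to the
initial namespace while software ones are deleted.

```c
/* Where a device ends up once its namespace has been torn down. */
enum demo_fate { DEMO_STILL_IN_NS, DEMO_MOVED_TO_INIT, DEMO_DELETED };

struct demo_dev {
        const char *name;
        int is_software;        /* e.g. veth or dummy */
        enum demo_fate fate;
};

struct demo_netns {
        int refcount;           /* e.g. one for the mount, one per user */
        struct demo_dev *devs;
        int ndevs;
};

/* Runs once, when the last reference is gone. */
static void demo_cleanup(struct demo_netns *ns)
{
        int i;

        for (i = 0; i < ns->ndevs; i++)
                ns->devs[i].fate = ns->devs[i].is_software ?
                        DEMO_DELETED : DEMO_MOVED_TO_INIT;
}

/* Drop one reference; the unmount is just one such drop. */
static void demo_put(struct demo_netns *ns)
{
        if (--ns->refcount == 0)
                demo_cleanup(ns);
}
```

In this model an interface that seems to have disappeared is either a
software device or sits in a namespace that something still references.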

Eric

* Re: [PATCH v2 6/6] sctp: convert sctp_assoc_set_id to use idr_alloc_cyclic
From: Neil Horman @ 2013-03-28 14:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: akpm, linux-kernel, tj, Vlad Yasevich, Sridhar Samudrala,
	David S. Miller, linux-sctp, netdev
In-Reply-To: <20130328135308.GA14489@neilslaptop.think-freely.org>

On Thu, Mar 28, 2013 at 09:53:08AM -0400, Neil Horman wrote:
> On Wed, Mar 27, 2013 at 03:29:38PM -0400, Jeff Layton wrote:
> > (Note: compile-tested only)
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > Cc: Vlad Yasevich <vyasevich@gmail.com>
> > Cc: Sridhar Samudrala <sri@us.ibm.com>
> > Cc: Neil Horman <nhorman@tuxdriver.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: linux-sctp@vger.kernel.org
> > Cc: netdev@vger.kernel.org
> > ---
> >  net/sctp/associola.c | 16 ++--------------
> >  1 file changed, 2 insertions(+), 14 deletions(-)
> > 
> I don't see anything wrong with this patch per-se, but the idr_alloc_cyclic call
> isn't integrated with net/net-next or Linus' tree yet.  If we don't gate this
> patch on that integration, we'll break the build.
> Neil
> 
Actually, I just noticed that you only sent us 6/6 here. I'm assuming a prior
patch in the series adds the idr_alloc_cyclic code?  If so, I've seen the prior
version.

Acked-by: Neil Horman <nhorman@tuxdriver.com>

> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

* Re: [PATCH v2] net IPv6 : Fix broken IPv6 routing table after loopback down-up
From: Balakumaran Kannan @ 2013-03-28 14:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: yoshfuji, davem, Patrick McHardy, Alexey Kuznetsov, jmorris,
	Balakumaran.Kannan, maruthi.thotad, netdev, jamshed.a,
	amit.agarwal, takuzo.ohara, aaditya.kumar
In-Reply-To: <1364478316.15753.43.camel@edumazet-glaptop>

On Thu, Mar 28, 2013 at 7:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2013-03-28 at 06:44 -0700, Eric Dumazet wrote:
>
>> __in6_dev_get() should be OK.
>
> more exactly : __in_dev_get_rtnl()
>
>
>

Thank you for your comments. I'll update the patch and send it.

* Re: Deleting a network namespace
From: David Shwatrz @ 2013-03-28 14:12 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <87hajvhbe3.fsf@xmission.com>

Hello,
Thanks a lot for the detailed explanation!

> As for the rest having software based network devices vanish is by
> design and I can't think of a single reason why it would make sense to
> do anything differently.
Agreed.

Best,
DS

On Thu, Mar 28, 2013 at 4:00 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Shwatrz <dshwatrz@gmail.com> writes:
>
>> Hello,
>> I checked and indeed physical hardware are moved to init_net.
>> I wonder how it is done, as in netns_delete() there is only
>> umount2() and unlink() syscalls (might these syscalls trigger this
>> movement to init_net)?
>
> The mount holds a refcount to the network namespace, the unmount drops
> that refcount.
>
>> I really could not figure how this is
>> implemented and where in code do we differentiate between physical and
>> non physical devices.
>
> When the refcount drops to zero put_net calls __put_net in
> net/core/net_namespace.c which wiggles around and arranges
> for cleanup_net to be called.
>
> As for what happens to the network devices look at default_device_exit
> and default_device_exit_batch in net/core/dev.c
>
> As for the rest having software based network devices vanish is by
> design and I can't think of a single reason why it would make sense to
> do anything differently.  Depending on your configuration the initial
> network namespace really isn't where you would want network devices to
> be moved.   Think about what happens when you run your use can in a lxc
> based container for example.
>
> Eric

* Re: [PATCH 2/6] audit: replace obsolete NLMSG_* with type safe nlmsg_*
From: Thomas Graf @ 2013-03-28 14:30 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: linux-kernel, netdev, linux-security-module, davem
In-Reply-To: <1364402946-32715-1-git-send-email-honkiko@gmail.com>

On 03/28/13 at 12:49am, Hong Zhiguo wrote:
> Signed-off-by: Hong Zhiguo <honkiko@gmail.com>

Acked-by: Thomas Graf <tgraf@suug.ch>

* Re: [PATCH 1/6] net-next: replace obsolete NLMSG_* with type safe nlmsg_*
From: Thomas Graf @ 2013-03-28 14:32 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: netdev, linux-kernel, davem, stephen
In-Reply-To: <1364402824-32680-1-git-send-email-honkiko@gmail.com>

On 03/28/13 at 12:47am, Hong Zhiguo wrote:
> diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
> index 505b30a..467fb92 100644
> --- a/net/ipv4/udp_diag.c
> +++ b/net/ipv4/udp_diag.c
> @@ -64,9 +64,9 @@ static int udp_dump_one(struct udp_table *tbl, struct sk_buff *in_skb,
>  		goto out;
>  
>  	err = -ENOMEM;
> -	rep = alloc_skb(NLMSG_SPACE((sizeof(struct inet_diag_msg) +
> -				     sizeof(struct inet_diag_meminfo) +
> -				     64)), GFP_KERNEL);
> +	rep = nlmsg_new(sizeof(struct inet_diag_msg) +
> +				     sizeof(struct inet_diag_meminfo) + 64,
> +				     GFP_KERNEL);

This is formatted incorrectly, otherwise the patch looks good.
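
For readers following the conversion, here is a small user-space sketch of
why replacing alloc_skb(NLMSG_SPACE(len), ...) with nlmsg_new(len, ...) should
request the same number of bytes. It assumes, from my reading of the netlink
headers, that nlmsg_new() sizes its buffer with nlmsg_total_size(). The toy_*
names are local stand-ins written for this illustration, not the kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of the netlink header layout, used for size arithmetic only. */
struct toy_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

static_assert(sizeof(struct toy_nlmsghdr) == 16, "unexpected header size");

/* Netlink messages are padded to a 4-byte boundary. */
#define TOY_ALIGNTO		((size_t)4)
#define TOY_ALIGN(len)		(((len) + TOY_ALIGNTO - 1) & ~(TOY_ALIGNTO - 1))
#define TOY_HDRLEN		TOY_ALIGN(sizeof(struct toy_nlmsghdr))

/* Old macro style: header plus payload, rounded up. */
#define TOY_NLMSG_LENGTH(len)	((len) + TOY_HDRLEN)
#define TOY_NLMSG_SPACE(len)	TOY_ALIGN(TOY_NLMSG_LENGTH(len))

/* Newer inline-function style, taking a typed payload size. */
static inline size_t toy_nlmsg_msg_size(size_t payload)
{
	return TOY_HDRLEN + payload;
}

static inline size_t toy_nlmsg_total_size(size_t payload)
{
	return TOY_ALIGN(toy_nlmsg_msg_size(payload));
}
```

Under these assumptions both spellings compute the same size for any payload,
so the patch changes the call style and type checking rather than the amount
allocated.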

^ permalink raw reply

* Re: [PATCH 3/6] selinux: replace obsolete NLMSG_* with type safe nlmsg_*
From: Thomas Graf @ 2013-03-28 14:33 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: linux-kernel, netdev, linux-security-module, davem
In-Reply-To: <1364402975-32747-1-git-send-email-honkiko@gmail.com>

On 03/28/13 at 12:49am, Hong Zhiguo wrote:
> Signed-off-by: Hong Zhiguo <honkiko@gmail.com>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply

* Re: [PATCH 4/6] gdm72xx: replace obsolete NLMSG_* with type safe nlmsg_*
From: Thomas Graf @ 2013-03-28 14:37 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: linux-kernel, netdev, davem
In-Reply-To: <1364403137-32806-1-git-send-email-honkiko@gmail.com>

On 03/28/13 at 12:52am, Hong Zhiguo wrote:
> Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
> ---
>  drivers/staging/gdm72xx/netlink_k.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/staging/gdm72xx/netlink_k.c b/drivers/staging/gdm72xx/netlink_k.c
> index 52c25ba..c1239aa 100644
> --- a/drivers/staging/gdm72xx/netlink_k.c
> +++ b/drivers/staging/gdm72xx/netlink_k.c
> @@ -25,12 +25,12 @@
>  
>  #define ND_MAX_GROUP			30
>  #define ND_IFINDEX_LEN			sizeof(int)
> -#define ND_NLMSG_SPACE(len)		(NLMSG_SPACE(len) + ND_IFINDEX_LEN)
> +#define ND_NLMSG_SPACE(len)		(nlmsg_total_size(len) + ND_IFINDEX_LEN)
>  #define ND_NLMSG_DATA(nlh) \
> -	((void *)((char *)NLMSG_DATA(nlh) + ND_IFINDEX_LEN))
> +	((void *)((char *)nlmsg_data(nlh) + ND_IFINDEX_LEN))
>  #define ND_NLMSG_S_LEN(len)		(len+ND_IFINDEX_LEN)
>  #define ND_NLMSG_R_LEN(nlh)		(nlh->nlmsg_len-ND_IFINDEX_LEN)
> -#define ND_NLMSG_IFIDX(nlh)		NLMSG_DATA(nlh)
> +#define ND_NLMSG_IFIDX(nlh)		nlmsg_data(nlh)
>  #define ND_MAX_MSG_LEN			8096

This is not pretty at all but outside of the context of your patch.

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply

* Re: [PATCH 6/6] connector: replace obsolete NLMSG_* with type safe nlmsg_*
From: Thomas Graf @ 2013-03-28 14:41 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: linux-kernel, zbr, netdev, davem
In-Reply-To: <1364403296-32880-1-git-send-email-honkiko@gmail.com>

On 03/28/13 at 12:54am, Hong Zhiguo wrote:
> Signed-off-by: Hong Zhiguo <honkiko@gmail.com>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply

* Re: [PATCH 5/6] scsi: replace obsolete NLMSG_* with type safe nlmsg_*
From: Thomas Graf @ 2013-03-28 14:45 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: linux-kernel, linux-scsi, netdev, davem
In-Reply-To: <1364403195-32839-1-git-send-email-honkiko@gmail.com>

On 03/28/13 at 12:53am, Hong Zhiguo wrote:
> Signed-off-by: Hong Zhiguo <honkiko@gmail.com>

There are some formatting errors but the Netlink bits themselves
look good.

^ permalink raw reply

* Re: [PATCH v2] net IPv6 : Fix broken IPv6 routing table after loopback down-up
From: Balakumaran Kannan @ 2013-03-28 14:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: yoshfuji, davem, Patrick McHardy, Alexey Kuznetsov, jmorris,
	Balakumaran.Kannan, maruthi.thotad, netdev, jamshed.a,
	amit.agarwal, takuzo.ohara, aaditya.kumar
In-Reply-To: <CAHPKR9L4Y0=cbH7w_iZ20CBsH=nt4UuZwyBvDOMSrMmM_sJt1w@mail.gmail.com>

On Thu, Mar 28, 2013 at 7:34 PM, Balakumaran Kannan
<kumaran.4353@gmail.com> wrote:
> On Thu, Mar 28, 2013 at 7:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thu, 2013-03-28 at 06:44 -0700, Eric Dumazet wrote:
>>
>>> __in6_dev_get() should be OK.
>>
>> more exactly : __in_dev_get_rtnl()
>>
>>
>>
>
> Thank you for your comments. I'll update the patch and send it.

As __in_dev_get_rtnl returns IPv4 specific data (ip_ptr), we have to
use __in6_dev_get to get IPv6 specific data (ip6_ptr).
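
The distinction can be pictured with a stripped-down user-space model: a
device carries separate per-protocol pointers, and each accessor returns only
its own. The toy_* structures and functions below are hypothetical stand-ins
for this illustration. They omit the RCU and RTNL locking rules that the real
helpers depend on.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the per-protocol device state. */
struct toy_in_device {
	int ipv4_marker;
};

struct toy_inet6_dev {
	int ipv6_marker;
};

/* A device keeps one pointer per protocol family. */
struct toy_net_device {
	struct toy_in_device *ip_ptr;	/* IPv4-specific data */
	struct toy_inet6_dev *ip6_ptr;	/* IPv6-specific data */
};

/* Models the IPv4 accessor: it only ever sees ip_ptr. */
static struct toy_in_device *toy_in_dev_get(const struct toy_net_device *dev)
{
	assert(dev != NULL);
	return dev->ip_ptr;
}

/* Models the IPv6 accessor: this is the one an IPv6 routing fix needs. */
static struct toy_inet6_dev *toy_in6_dev_get(const struct toy_net_device *dev)
{
	assert(dev != NULL);
	return dev->ip6_ptr;
}
```

Because the two accessors return different pointer types, handing the IPv4
result to code that expects IPv6 state would not even type-check in this
model.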

^ permalink raw reply
