* [RFC][PATCH] iproute: Faster ip link add, set and delete
@ 2013-03-22 22:23 Eric W. Biederman
2013-03-22 22:27 ` Stephen Hemminger
0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-22 22:23 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, Serge Hallyn, Benoit Lourdelet
ip link add, set, and delete map the interface name to the
interface index by dumping all of the interfaces before performing
their respective commands. Operations that should be constant time
therefore slow down when lots of network interfaces are in use,
resulting in O(N^2) time to work with O(N) devices.
Make the work that iproute does constant time by passing the interface
name to the kernel instead.
In small-scale testing on my system this shows dramatic performance
increases: ip link add goes from 120s to just 11s to add 5000 network
devices, and deleting all of those interfaces again goes from longer
than I cared to wait to just 58s.
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Reported-by: Benoit Lourdelet <blourdel@juniper.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
I think I am bungling the case where people specify an ifindex as ifNNNN,
but does anyone care?
ip/iplink.c | 19 +------------------
1 files changed, 1 insertions(+), 18 deletions(-)
diff --git a/ip/iplink.c b/ip/iplink.c
index ad33611..6dffbf0 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
}
}
- ll_init_map(&rth);
-
if (!(flags & NLM_F_CREATE)) {
if (!dev) {
fprintf(stderr, "Not enough information: \"dev\" "
@@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
exit(-1);
}
- req.i.ifi_index = ll_name_to_index(dev);
- if (req.i.ifi_index == 0) {
- fprintf(stderr, "Cannot find device \"%s\"\n", dev);
- return -1;
- }
+ name = dev;
} else {
/* Allow "ip link add dev" and "ip link add name" */
if (!name)
name = dev;
- if (link) {
- int ifindex;
-
- ifindex = ll_name_to_index(link);
- if (ifindex == 0) {
- fprintf(stderr, "Cannot find device \"%s\"\n",
- link);
- return -1;
- }
- addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4);
- }
}
if (name) {
--
1.7.5.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-22 22:23 [RFC][PATCH] iproute: Faster ip link add, set and delete Eric W. Biederman
@ 2013-03-22 22:27 ` Stephen Hemminger
2013-03-26 11:51 ` Benoit Lourdelet
0 siblings, 1 reply; 32+ messages in thread
From: Stephen Hemminger @ 2013-03-22 22:27 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: netdev, Serge Hallyn, Benoit Lourdelet
The whole ifindex map is a design mistake at this point.
Better off to do a lazy cache or something like that.
On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Because ip link add, set, and delete map the interface name to the
> interface index by dumping all of the interfaces before performing
> their respective commands. Operations that should be constant time
> slow down when lots of network interfaces are in use. Resulting
> in O(N^2) time to work with O(N) devices.
>
> Make the work that iproute does constant time by passing the interface
> name to the kernel instead.
>
> In small scale testing on my system this shows dramatic performance
> increases of ip link add from 120s to just 11s to add 5000 network
> devices. And from longer than I cared to wait to just 58s to delete
> all of those interfaces again.
>
> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
> Reported-by: Benoit Lourdelet <blourdel@juniper.net>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> I think I am bungling the case where people specify an ifindex as ifNNNN
> but does anyone care?
>
> ip/iplink.c | 19 +------------------
> 1 files changed, 1 insertions(+), 18 deletions(-)
>
> diff --git a/ip/iplink.c b/ip/iplink.c
> index ad33611..6dffbf0 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
> }
> }
>
> - ll_init_map(&rth);
> -
> if (!(flags & NLM_F_CREATE)) {
> if (!dev) {
> fprintf(stderr, "Not enough information: \"dev\" "
> @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
> exit(-1);
> }
>
> - req.i.ifi_index = ll_name_to_index(dev);
> - if (req.i.ifi_index == 0) {
> - fprintf(stderr, "Cannot find device \"%s\"\n", dev);
> - return -1;
> - }
> + name = dev;
> } else {
> /* Allow "ip link add dev" and "ip link add name" */
> if (!name)
> name = dev;
>
> - if (link) {
> - int ifindex;
> -
> - ifindex = ll_name_to_index(link);
> - if (ifindex == 0) {
> - fprintf(stderr, "Cannot find device \"%s\"\n",
> - link);
> - return -1;
> - }
> - addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4);
> - }
> }
>
> if (name) {
> --
> 1.7.5.4
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-22 22:27 ` Stephen Hemminger
@ 2013-03-26 11:51 ` Benoit Lourdelet
2013-03-26 12:40 ` Eric W. Biederman
2013-03-26 15:31 ` Eric Dumazet
0 siblings, 2 replies; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-26 11:51 UTC (permalink / raw)
To: Stephen Hemminger, Eric W. Biederman; +Cc: netdev@vger.kernel.org, Serge Hallyn
Hello,
I re-tested with the patch and got the following results on a 32x 2 GHz
core system.

# veth    add    delete
  1000     36        34
  3000    259       137
  4000    462       195
  5000    729       N/A
The script to create is the following:

for i in `seq 1 5000`; do
    sudo ip link add type veth
done

The script to delete:

for d in /sys/class/net/veth*; do
    ip link del `basename $d` 2>/dev/null || true
done
There is a very good improvement in deletion.
iproute2 does not seem to be well multithreaded, as I get the times divided
by a factor of 2 on an 8x 3.2 GHz core system.

I don't know if that is the improvement you expected?

Would the iproute2 redesign you mentioned help improve performance even
further?
As a reference, the iproute2 baseline w/o the patch:

# veth    add    delete
  1000     57        70
  2000    193       250
  3000    435       510
  4000    752       824
  5000   1123      1185
Regards
Benoit
On 22/03/2013 23:27, "Stephen Hemminger" <stephen@networkplumber.org>
wrote:
>The whole ifindex map is a design mistake at this point.
>Better off to do a lazy cache or something like that.
>
>
>On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman
><ebiederm@xmission.com> wrote:
>>
>> Because ip link add, set, and delete map the interface name to the
>> interface index by dumping all of the interfaces before performing
>> their respective commands. Operations that should be constant time
>> slow down when lots of network interfaces are in use. Resulting
>> in O(N^2) time to work with O(N) devices.
>>
>> Make the work that iproute does constant time by passing the interface
>> name to the kernel instead.
>>
>> In small scale testing on my system this shows dramatic performance
>> increases of ip link add from 120s to just 11s to add 5000 network
>> devices. And from longer than I cared to wait to just 58s to delete
>> all of those interfaces again.
>>
>> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
>> Reported-by: Benoit Lourdelet <blourdel@juniper.net>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> I think I am bungling the case where people specify an ifindex as ifNNNN
>> but does anyone care?
>>
>> ip/iplink.c | 19 +------------------
>> 1 files changed, 1 insertions(+), 18 deletions(-)
>>
>> diff --git a/ip/iplink.c b/ip/iplink.c
>> index ad33611..6dffbf0 100644
>> --- a/ip/iplink.c
>> +++ b/ip/iplink.c
>> @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int
>>flags, int argc, char **argv)
>> }
>> }
>>
>> - ll_init_map(&rth);
>> -
>> if (!(flags & NLM_F_CREATE)) {
>> if (!dev) {
>> fprintf(stderr, "Not enough information:
>>\"dev\" "
>> @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int
>>flags, int argc, char **argv)
>> exit(-1);
>> }
>>
>> - req.i.ifi_index = ll_name_to_index(dev);
>> - if (req.i.ifi_index == 0) {
>> - fprintf(stderr, "Cannot find device \"%s\"\n",
>>dev);
>> - return -1;
>> - }
>> + name = dev;
>> } else {
>> /* Allow "ip link add dev" and "ip link add name" */
>> if (!name)
>> name = dev;
>>
>> - if (link) {
>> - int ifindex;
>> -
>> - ifindex = ll_name_to_index(link);
>> - if (ifindex == 0) {
>> - fprintf(stderr, "Cannot find device
>>\"%s\"\n",
>> - link);
>> - return -1;
>> - }
>> - addattr_l(&req.n, sizeof(req), IFLA_LINK,
>>&ifindex, 4);
>> - }
>> }
>>
>> if (name) {
>> --
>> 1.7.5.4
>>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-26 11:51 ` Benoit Lourdelet
@ 2013-03-26 12:40 ` Eric W. Biederman
2013-03-26 14:17 ` Serge Hallyn
2013-03-26 14:33 ` Serge Hallyn
2013-03-26 15:31 ` Eric Dumazet
1 sibling, 2 replies; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-26 12:40 UTC (permalink / raw)
To: Benoit Lourdelet; +Cc: Stephen Hemminger, netdev@vger.kernel.org, Serge Hallyn
Benoit Lourdelet <blourdel@juniper.net> writes:
> Hello,
>
> I re-tested with the patch and got the following results on a 32x 2Ghz
> core system.
>
> # veth add delete
> 1000 36 34
> 3000 259 137
> 4000 462 195
> 5000 729 N/A
>
> The script to create is the following :
> for i in `seq 1 5000`; do
> sudo ip link add type veth
> Done
Which performs horribly, as I mentioned earlier, because you are asking
the kernel to create the names. If you want performance you need to
specify the names of the network devices you are creating.

aka ip link add a$i type veth peer name b$i
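
A minimal sketch of the full creation loop with explicit names on both
ends (the a$i/b$i names are just placeholders; this matches the loop timed
later in this thread):

for i in `seq 1 5000`; do
    ip link add a$i type veth peer name b$i
done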
> The script to delete:
> for d in /sys/class/net/veth*; do
> ip link del `basename $d` 2>/dev/null || true
> Done
>
> There is a very good improvement in deletion.
>
>
>
> iproute2 does not seems to be well multithread as I get time divided by a
> factor of 2 with a 8x 3.2 Ghz core system.
All netlink traffic and all network stack configuration are serialized by
the rtnl_lock in the kernel. This is the slow path in the kernel, not
the fast path.
> I don't know if that is the improvement you expected ?
>
> Would the iproute2 redesign you mentioned help improve performance even
> further ?
Specifying the names would dramatically improve your creation
performance. It should only take you about 10s for 5000 veth pairs.
But you have to specify the names.

Anyway, I have exhausted my time and inclination in this matter. Good
luck with whatever your problem is.
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-26 12:40 ` Eric W. Biederman
@ 2013-03-26 14:17 ` Serge Hallyn
2013-03-26 14:33 ` Serge Hallyn
1 sibling, 0 replies; 32+ messages in thread
From: Serge Hallyn @ 2013-03-26 14:17 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Benoit Lourdelet, Stephen Hemminger, netdev@vger.kernel.org
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Specifing the names would dramatically improve your creation
> performance. It should only take you about 10s for 5000 veth pairs.
> But you have to specify the names.
Thanks, Eric. I'm going to update lxc to always specify names for
the veth pairs, rather than only when they are requested by the
user's configuration file.
-serge
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-26 12:40 ` Eric W. Biederman
2013-03-26 14:17 ` Serge Hallyn
@ 2013-03-26 14:33 ` Serge Hallyn
2013-03-27 13:37 ` Benoit Lourdelet
1 sibling, 1 reply; 32+ messages in thread
From: Serge Hallyn @ 2013-03-26 14:33 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Benoit Lourdelet, Stephen Hemminger, netdev@vger.kernel.org
Actually, lxc is using random names now, so it's ok.
Benoit, can you use the patches from Eric with lxc (or use the script
you were using before but specify names as he said)?
-serge
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-26 11:51 ` Benoit Lourdelet
2013-03-26 12:40 ` Eric W. Biederman
@ 2013-03-26 15:31 ` Eric Dumazet
1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-03-26 15:31 UTC (permalink / raw)
To: Benoit Lourdelet
Cc: Stephen Hemminger, Eric W. Biederman, netdev@vger.kernel.org,
Serge Hallyn
On Tue, 2013-03-26 at 11:51 +0000, Benoit Lourdelet wrote:
> The script to delete:
> for d in /sys/class/net/veth*; do
> ip link del `basename $d` 2>/dev/null || true
> Done
>
> There is a very good improvement in deletion.
I can do better ;)
If you are really doing this kind of thing, you could use:
rmmod veth
Note that the "ip" command supports a batch mode:
ip -batch filename
In this case, the caching is done only once.
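
As a rough sketch of what that looks like for the creation loop above,
feeding all of the commands to a single ip process over stdin ("-" means
read the batch from standard input):

for i in `seq 1 5000`; do echo link add type veth; done | ip -batch -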
Eric, Stephen, one possibility would be to use the cache only in batch
mode.
Anyway, caching is wrong because several users can use the ip command at
the same time.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-26 14:33 ` Serge Hallyn
@ 2013-03-27 13:37 ` Benoit Lourdelet
2013-03-27 15:11 ` Eric W. Biederman
0 siblings, 1 reply; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-27 13:37 UTC (permalink / raw)
To: Serge Hallyn, Eric W. Biederman; +Cc: Stephen Hemminger, netdev@vger.kernel.org
Hello Serge,
I am indeed using Eric's patch with lxc.

It solves the initial problem of slowness when starting around 1600
containers. I am now able to start more than 2000 without new containers
becoming slower and slower to start.
thanks
Benoit
On 26/03/2013 15:33, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:
>Actually, lxc is using random names now, so it's ok.
>
>Benoit, can you use the patches from Eric with lxc (or use the script
>you were using before but specify names as he said)?
>
>-serge
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-27 13:37 ` Benoit Lourdelet
@ 2013-03-27 15:11 ` Eric W. Biederman
2013-03-27 17:47 ` Stephen Hemminger
2013-03-28 20:27 ` Benoit Lourdelet
0 siblings, 2 replies; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-27 15:11 UTC (permalink / raw)
To: Benoit Lourdelet; +Cc: Serge Hallyn, Stephen Hemminger, netdev@vger.kernel.org
Benoit Lourdelet <blourdel@juniper.net> writes:
> Hello Serge,
>
> I am indeed using Eric patch with lxc.
>
> It solves the initial problem of slowness to start around 1600
> containers.
Good, so now we just need a production-ready patch for iproute.
> I am now able to start more than 2000 without having new containers
> slower and slower to start.
May I ask how large a box you are running and how complex your
containers are? I am trying to get a feel for how common it is likely
to be to find people running thousands of containers on a single
machine.
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-27 15:11 ` Eric W. Biederman
@ 2013-03-27 17:47 ` Stephen Hemminger
2013-03-28 0:46 ` Eric W. Biederman
2013-03-28 20:27 ` Benoit Lourdelet
1 sibling, 1 reply; 32+ messages in thread
From: Stephen Hemminger @ 2013-03-27 17:47 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org
If you need to do lots of operations, the --batch mode will be significantly faster:
one command start and one link map.
I have an updated version of the link map hash (index and name). Could you test this
patch, which applies to the latest version in git?
diff --git a/lib/ll_map.c b/lib/ll_map.c
index e9ae129..bf5b0bc 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -12,6 +12,7 @@
#include <stdio.h>
#include <stdlib.h>
+#include <stddef.h>
#include <unistd.h>
#include <syslog.h>
#include <fcntl.h>
@@ -23,9 +24,44 @@
#include "libnetlink.h"
#include "ll_map.h"
-struct ll_cache
+
+struct hlist_head {
+ struct hlist_node *first;
+};
+
+struct hlist_node {
+ struct hlist_node *next, **pprev;
+};
+
+static inline void hlist_del(struct hlist_node *n)
+{
+ struct hlist_node *next = n->next;
+ struct hlist_node **pprev = n->pprev;
+ *pprev = next;
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
- struct ll_cache *idx_next;
+ struct hlist_node *first = h->first;
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ h->first = n;
+ n->pprev = &h->first;
+}
+
+#define hlist_for_each(pos, head) \
+ for (pos = (head)->first; pos ; pos = pos->next)
+
+#define container_of(ptr, type, member) ({ \
+ const typeof( ((type *)0)->member ) *__mptr = (ptr); \
+ (type *)( (char *)__mptr - offsetof(type,member) );})
+
+struct ll_cache {
+ struct hlist_node idx_hash;
+ struct hlist_node name_hash;
unsigned flags;
int index;
unsigned short type;
@@ -33,49 +69,107 @@ struct ll_cache
};
#define IDXMAP_SIZE 1024
-static struct ll_cache *idx_head[IDXMAP_SIZE];
+static struct hlist_head idx_head[IDXMAP_SIZE];
+static struct hlist_head name_head[IDXMAP_SIZE];
-static inline struct ll_cache *idxhead(int idx)
+static struct ll_cache *ll_get_by_index(unsigned index)
{
- return idx_head[idx & (IDXMAP_SIZE - 1)];
+ struct hlist_node *n;
+ unsigned h = index & (IDXMAP_SIZE - 1);
+
+ hlist_for_each(n, &idx_head[h]) {
+ struct ll_cache *im
+ = container_of(n, struct ll_cache, idx_hash);
+ if (im->index == index)
+ return im;
+ }
+
+ return NULL;
+}
+
+static unsigned namehash(const char *str)
+{
+ unsigned hash = 5381;
+
+ while (*str)
+ hash = ((hash << 5) + hash) + *str++; /* hash * 33 + c */
+
+ return hash;
+}
+
+static struct ll_cache *ll_get_by_name(const char *name)
+{
+ struct hlist_node *n;
+ unsigned h = namehash(name) & (IDXMAP_SIZE - 1);
+
+ hlist_for_each(n, &name_head[h]) {
+ struct ll_cache *im
+ = container_of(n, struct ll_cache, name_hash);
+
+ if (strncmp(im->name, name, IFNAMSIZ) == 0)
+ return im;
+ }
+
+ return NULL;
}
int ll_remember_index(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg)
{
- int h;
+ unsigned int h;
+ const char *ifname;
struct ifinfomsg *ifi = NLMSG_DATA(n);
- struct ll_cache *im, **imp;
+ struct ll_cache *im;
struct rtattr *tb[IFLA_MAX+1];
- if (n->nlmsg_type != RTM_NEWLINK)
+ if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
return 0;
if (n->nlmsg_len < NLMSG_LENGTH(sizeof(ifi)))
return -1;
+ im = ll_get_by_index(ifi->ifi_index);
+ if (n->nlmsg_type == RTM_DELLINK) {
+ if (im) {
+ hlist_del(&im->name_hash);
+ hlist_del(&im->idx_hash);
+ free(im);
+ }
+ return 0;
+ }
+
memset(tb, 0, sizeof(tb));
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), IFLA_PAYLOAD(n));
- if (tb[IFLA_IFNAME] == NULL)
+ ifname = rta_getattr_str(tb[IFLA_IFNAME]);
+ if (ifname == NULL)
return 0;
- h = ifi->ifi_index & (IDXMAP_SIZE - 1);
- for (imp = &idx_head[h]; (im=*imp)!=NULL; imp = &im->idx_next)
- if (im->index == ifi->ifi_index)
- break;
-
- if (im == NULL) {
- im = malloc(sizeof(*im));
- if (im == NULL)
- return 0;
- im->idx_next = *imp;
- im->index = ifi->ifi_index;
- *imp = im;
+ if (im) {
+ /* change to existing entry */
+ if (strcmp(im->name, ifname) != 0) {
+ hlist_del(&im->name_hash);
+ h = namehash(ifname) & (IDXMAP_SIZE - 1);
+ hlist_add_head(&im->name_hash, &name_head[h]);
+ }
+
+ im->flags = ifi->ifi_flags;
+ return 0;
}
+ im = malloc(sizeof(*im));
+ if (im == NULL)
+ return 0;
+ im->index = ifi->ifi_index;
+ strcpy(im->name, ifname);
im->type = ifi->ifi_type;
im->flags = ifi->ifi_flags;
- strcpy(im->name, RTA_DATA(tb[IFLA_IFNAME]));
+
+ h = ifi->ifi_index & (IDXMAP_SIZE - 1);
+ hlist_add_head(&im->idx_hash, &idx_head[h]);
+
+ h = namehash(ifname) & (IDXMAP_SIZE - 1);
+ hlist_add_head(&im->name_hash, &name_head[h]);
+
return 0;
}
@@ -86,15 +180,14 @@ const char *ll_idx_n2a(unsigned idx, char *buf)
if (idx == 0)
return "*";
- for (im = idxhead(idx); im; im = im->idx_next)
- if (im->index == idx)
- return im->name;
+ im = ll_get_by_index(idx);
+ if (im)
+ return im->name;
snprintf(buf, IFNAMSIZ, "if%d", idx);
return buf;
}
-
const char *ll_index_to_name(unsigned idx)
{
static char nbuf[IFNAMSIZ];
@@ -108,10 +201,9 @@ int ll_index_to_type(unsigned idx)
if (idx == 0)
return -1;
- for (im = idxhead(idx); im; im = im->idx_next)
- if (im->index == idx)
- return im->type;
- return -1;
+
+ im = ll_get_by_index(idx);
+ return im ? im->type : -1;
}
unsigned ll_index_to_flags(unsigned idx)
@@ -121,35 +213,21 @@ unsigned ll_index_to_flags(unsigned idx)
if (idx == 0)
return 0;
- for (im = idxhead(idx); im; im = im->idx_next)
- if (im->index == idx)
- return im->flags;
- return 0;
+ im = ll_get_by_index(idx);
+ return im ? im->flags : -1;
}
unsigned ll_name_to_index(const char *name)
{
- static char ncache[IFNAMSIZ];
- static int icache;
- struct ll_cache *im;
- int i;
+ const struct ll_cache *im;
unsigned idx;
if (name == NULL)
return 0;
- if (icache && strcmp(name, ncache) == 0)
- return icache;
-
- for (i=0; i<IDXMAP_SIZE; i++) {
- for (im = idx_head[i]; im; im = im->idx_next) {
- if (strcmp(im->name, name) == 0) {
- icache = im->index;
- strcpy(ncache, name);
- return im->index;
- }
- }
- }
+ im = ll_get_by_name(name);
+ if (im)
+ return im->index;
idx = if_nametoindex(name);
if (idx == 0)
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-27 17:47 ` Stephen Hemminger
@ 2013-03-28 0:46 ` Eric W. Biederman
2013-03-28 3:20 ` Serge Hallyn
0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-28 0:46 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org
Stephen Hemminger <stephen@networkplumber.org> writes:
> If you need to do lots of operations the --batch mode will be significantly faster.
> One command start and one link map.
The problem in this case, as I understand it, is lots of independent
operations. Now maybe lxc should not shell out to ip and should perform the
work itself.
> I have an updated version of link map hash (index and name). Could you test this patch
> which applies to latest version in git.
This still dumps all of the interfaces in ll_init_map, causing things to
slow down noticeably.
# with your patch
# time ~/projects/iproute/iproute2/ip/ip link add a4511 type veth peer name b4511
real 0m0.049s
user 0m0.000s
sys 0m0.048s
# With a hack to make ll_map_init a nop.
# time ~/projects/iproute/iproute2/ip/ip link add a4512 type veth peer name b4512
real 0m0.003s
user 0m0.000s
sys 0m0.000s
eric-ThinkPad-X220 6bed4 #
# Without any patches.
# time ~/projects/iproute/iproute2/ip/ip link add a5002 type veth peer name b5002
real 0m0.052s
user 0m0.004s
sys 0m0.044s
So it looks like dumping all of the interfaces is taking 46 milliseconds
longer than otherwise, causing ip to take nearly an order of magnitude
longer to run when there are a lot of interfaces, and causing ip to slow
down with each command.
So the ideal situation is probably just to fill in the ll_map on demand
instead of up front.
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 0:46 ` Eric W. Biederman
@ 2013-03-28 3:20 ` Serge Hallyn
2013-03-28 3:44 ` Eric W. Biederman
0 siblings, 1 reply; 32+ messages in thread
From: Serge Hallyn @ 2013-03-28 3:20 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Stephen Hemminger <stephen@networkplumber.org> writes:
>
> > If you need to do lots of operations the --batch mode will be significantly faster.
> > One command start and one link map.
>
> The problem in this case as I understand it is lots of independent
> operations. Now maybe lxc should not shell out to ip and perform the
> work itself.
fwiw lxc uses netlink to create new veths, and picks random names with
mktemp() ahead of time.
-serge
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 3:20 ` Serge Hallyn
@ 2013-03-28 3:44 ` Eric W. Biederman
2013-03-28 4:28 ` Serge Hallyn
0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-28 3:44 UTC (permalink / raw)
To: Serge Hallyn; +Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org
Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Stephen Hemminger <stephen@networkplumber.org> writes:
>>
>> > If you need to do lots of operations the --batch mode will be significantly faster.
>> > One command start and one link map.
>>
>> The problem in this case as I understand it is lots of independent
>> operations. Now maybe lxc should not shell out to ip and perform the
>> work itself.
>
> fwiw lxc uses netlink to create new veths, and picks random names with
> mktemp() ahead of time.
I am puzzled: where does the slowness in iproute2 come into play?
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 3:44 ` Eric W. Biederman
@ 2013-03-28 4:28 ` Serge Hallyn
2013-03-28 5:00 ` Eric W. Biederman
0 siblings, 1 reply; 32+ messages in thread
From: Serge Hallyn @ 2013-03-28 4:28 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Stephen Hemminger <stephen@networkplumber.org> writes:
> >>
> >> > If you need to do lots of operations the --batch mode will be significantly faster.
> >> > One command start and one link map.
> >>
> >> The problem in this case as I understand it is lots of independent
> >> operations. Now maybe lxc should not shell out to ip and perform the
> >> work itself.
> >
> > fwiw lxc uses netlink to create new veths, and picks random names with
> > mktemp() ahead of time.
>
> I am puzzled where does the slownes in iproute2 come into play?
Benoit originally reported slowness when starting >1500 containers. I
asked him to run a few manual tests to figure out what was taking the
time. Manually creating a large # of veths was an obvious test, and
one which showed poorly scaling performance.
It may well be that there are other things slowing down lxc, of course.
-serge
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 4:28 ` Serge Hallyn
@ 2013-03-28 5:00 ` Eric W. Biederman
2013-03-28 13:36 ` Serge Hallyn
0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-28 5:00 UTC (permalink / raw)
To: Serge Hallyn; +Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org
Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>>
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> Stephen Hemminger <stephen@networkplumber.org> writes:
>> >>
>> >> > If you need to do lots of operations the --batch mode will be significantly faster.
>> >> > One command start and one link map.
>> >>
>> >> The problem in this case as I understand it is lots of independent
>> >> operations. Now maybe lxc should not shell out to ip and perform the
>> >> work itself.
>> >
>> > fwiw lxc uses netlink to create new veths, and picks random names with
>> > mktemp() ahead of time.
>>
>> I am puzzled where does the slownes in iproute2 come into play?
>
> Benoit originally reported slowness when starting >1500 containers. I
> asked him to run a few manual tests to figure out what was taking the
> time. Manually creating a large # of veths was an obvious test, and
> one which showed poorly scaling performance.
Apparently iproute is involved somewhere, as when he tested with a
patched iproute (as you asked him to) the lxc startup slowdown was
gone.
> May well be there are other things slowing down lxc of course.
The evidence indicates it was iproute being called somewhere...
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 5:00 ` Eric W. Biederman
@ 2013-03-28 13:36 ` Serge Hallyn
2013-03-28 13:42 ` Benoit Lourdelet
0 siblings, 1 reply; 32+ messages in thread
From: Serge Hallyn @ 2013-03-28 13:36 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >>
> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> >> Stephen Hemminger <stephen@networkplumber.org> writes:
> >> >>
> >> >> > If you need to do lots of operations the --batch mode will be significantly faster.
> >> >> > One command start and one link map.
> >> >>
> >> >> The problem in this case as I understand it is lots of independent
> >> >> operations. Now maybe lxc should not shell out to ip and perform the
> >> >> work itself.
> >> >
> >> > fwiw lxc uses netlink to create new veths, and picks random names with
> >> > mktemp() ahead of time.
> >>
> >> I am puzzled where does the slownes in iproute2 come into play?
> >
> > Benoit originally reported slowness when starting >1500 containers. I
> > asked him to run a few manual tests to figure out what was taking the
> > time. Manually creating a large # of veths was an obvious test, and
> > one which showed poorly scaling performance.
>
> Apparently iproute is involved somehwere as when he tested with a
> patched iproute (as you asked him to) the lxc startup slowdown was
> gone.
>
> > May well be there are other things slowing down lxc of course.
>
> The evidence indicates it was iproute being called somewhere...
Benoit can you tell us exactly what test you were running when you saw
the slowdown was gone?
-serge
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 13:36 ` Serge Hallyn
@ 2013-03-28 13:42 ` Benoit Lourdelet
2013-03-28 15:04 ` Serge Hallyn
0 siblings, 1 reply; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-28 13:42 UTC (permalink / raw)
To: Serge Hallyn, Eric W. Biederman; +Cc: Stephen Hemminger, netdev@vger.kernel.org
Hello,
My test consists of starting small containers (10MB of RAM each). Each
container has 2 physical VLAN interfaces attached.
lxc.network.type = phys
lxc.network.flags = up
lxc.network.link = eth6.3
lxc.network.name = eth2
lxc.network.hwaddr = 00:50:56:a8:03:03
lxc.network.ipv4 = 192.168.1.1/24
lxc.network.type = phys
lxc.network.flags = up
lxc.network.link = eth7.3
lxc.network.name = eth1
lxc.network.ipv4 = 2.2.2.2/24
lxc.network.hwaddr = 00:50:57:b8:00:01
With the initial iproute2, when I reach around 1600 containers, container
creation almost stops. It takes at least 20s per container to start.
With the patched iproute2, I have started 4000 containers at a rate of 1 per
second w/o problem. I have 8000 vlan interfaces configured on the host (2x
4000).
Regards
Benoit
On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:
>Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>>
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>> >>
>> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> >> Stephen Hemminger <stephen@networkplumber.org> writes:
>> >> >>
>> >> >> > If you need to do lots of operations the --batch mode will be
>>significantly faster.
>> >> >> > One command start and one link map.
>> >> >>
>> >> >> The problem in this case as I understand it is lots of independent
>> >> >> operations. Now maybe lxc should not shell out to ip and perform
>>the
>> >> >> work itself.
>> >> >
>> >> > fwiw lxc uses netlink to create new veths, and picks random names
>>with
>> >> > mktemp() ahead of time.
>> >>
>> >> I am puzzled where does the slownes in iproute2 come into play?
>> >
>> > Benoit originally reported slowness when starting >1500 containers. I
>> > asked him to run a few manual tests to figure out what was taking the
>> > time. Manually creating a large # of veths was an obvious test, and
>> > one which showed poorly scaling performance.
>>
>> Apparently iproute is involved somehwere as when he tested with a
>> patched iproute (as you asked him to) the lxc startup slowdown was
>> gone.
>>
>> > May well be there are other things slowing down lxc of course.
>>
>> The evidence indicates it was iproute being called somewhere...
>
>Benoit can you tell us exactly what test you were running when you saw
>the slowdown was gone?
>
>-serge
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 13:42 ` Benoit Lourdelet
@ 2013-03-28 15:04 ` Serge Hallyn
2013-03-28 15:21 ` Benoit Lourdelet
0 siblings, 1 reply; 32+ messages in thread
From: Serge Hallyn @ 2013-03-28 15:04 UTC (permalink / raw)
To: Benoit Lourdelet
Cc: Eric W. Biederman, Stephen Hemminger, netdev@vger.kernel.org
Quoting Benoit Lourdelet (blourdel@juniper.net):
> Hello,
>
> My test consists in starting small containers (10MB of RAM ) each. Each
> container has 2x physical VLAN interfaces attached.
Which commands were you using to create/start them?
> lxc.network.type = phys
> lxc.network.flags = up
> lxc.network.link = eth6.3
> lxc.network.name = eth2
> lxc.network.hwaddr = 00:50:56:a8:03:03
> lxc.network.ipv4 = 192.168.1.1/24
> lxc.network.type = phys
> lxc.network.flags = up
> lxc.network.link = eth7.3
> lxc.network.name = eth1
> lxc.network.ipv4 = 2.2.2.2/24
> lxc.network.hwaddr = 00:50:57:b8:00:01
>
>
>
> With initial iproute2 , when I reach around 1600 containers, container
> creation almost stops.It takes at least 20s per container to start.
> With patched iproutes2 , I have started 4000 containers at a rate of 1 per
> second w/o problem. I have 8000 clan interfaces configured on the host (2x
> 4000).
>
>
> Regards
>
> Benoit
>
> On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:
>
> >Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >>
> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
> >> >>
> >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> >> >> Stephen Hemminger <stephen@networkplumber.org> writes:
> >> >> >>
> >> >> >> > If you need to do lots of operations the --batch mode will be
> >>significantly faster.
> >> >> >> > One command start and one link map.
> >> >> >>
> >> >> >> The problem in this case as I understand it is lots of independent
> >> >> >> operations. Now maybe lxc should not shell out to ip and perform
> >>the
> >> >> >> work itself.
> >> >> >
> >> >> > fwiw lxc uses netlink to create new veths, and picks random names
> >>with
> >> >> > mktemp() ahead of time.
> >> >>
> >> >> I am puzzled where does the slownes in iproute2 come into play?
> >> >
> >> > Benoit originally reported slowness when starting >1500 containers. I
> >> > asked him to run a few manual tests to figure out what was taking the
> >> > time. Manually creating a large # of veths was an obvious test, and
> >> > one which showed poorly scaling performance.
> >>
> >> Apparently iproute is involved somehwere as when he tested with a
> >> patched iproute (as you asked him to) the lxc startup slowdown was
> >> gone.
> >>
> >> > May well be there are other things slowing down lxc of course.
> >>
> >> The evidence indicates it was iproute being called somewhere...
> >
> >Benoit can you tell us exactly what test you were running when you saw
> >the slowdown was gone?
> >
> >-serge
> >
>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 15:04 ` Serge Hallyn
@ 2013-03-28 15:21 ` Benoit Lourdelet
2013-03-28 22:20 ` Stephen Hemminger
0 siblings, 1 reply; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-28 15:21 UTC (permalink / raw)
To: Serge Hallyn; +Cc: Eric W. Biederman, Stephen Hemminger, netdev@vger.kernel.org
I use, for each container:
lxc-start -n lwb2001 -f /var/lib/lxc/lwb2001/config -d
I created the containers with lxc-ubuntu -n lwb2001
Benoit
On 28/03/2013 16:04, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:
>Quoting Benoit Lourdelet (blourdel@juniper.net):
>> Hello,
>>
>> My test consists in starting small containers (10MB of RAM ) each. Each
>> container has 2x physical VLAN interfaces attached.
>
>Which commands were you using to create/start them?
>
>> lxc.network.type = phys
>> lxc.network.flags = up
>> lxc.network.link = eth6.3
>> lxc.network.name = eth2
>> lxc.network.hwaddr = 00:50:56:a8:03:03
>> lxc.network.ipv4 = 192.168.1.1/24
>> lxc.network.type = phys
>> lxc.network.flags = up
>> lxc.network.link = eth7.3
>> lxc.network.name = eth1
>> lxc.network.ipv4 = 2.2.2.2/24
>> lxc.network.hwaddr = 00:50:57:b8:00:01
>>
>>
>>
>> With initial iproute2 , when I reach around 1600 containers, container
>> creation almost stops.It takes at least 20s per container to start.
>> With patched iproutes2 , I have started 4000 containers at a rate of 1
>>per
>> second w/o problem. I have 8000 clan interfaces configured on the host
>>(2x
>> 4000).
>>
>>
>> Regards
>>
>> Benoit
>>
>> On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote:
>>
>> >Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>> >>
>> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>> >> >>
>> >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> >> >> Stephen Hemminger <stephen@networkplumber.org> writes:
>> >> >> >>
>> >> >> >> > If you need to do lots of operations the --batch mode will be
>> >>significantly faster.
>> >> >> >> > One command start and one link map.
>> >> >> >>
>> >> >> >> The problem in this case as I understand it is lots of
>>independent
>> >> >> >> operations. Now maybe lxc should not shell out to ip and
>>perform
>> >>the
>> >> >> >> work itself.
>> >> >> >
>> >> >> > fwiw lxc uses netlink to create new veths, and picks random
>>names
>> >>with
>> >> >> > mktemp() ahead of time.
>> >> >>
>> >> >> I am puzzled where does the slownes in iproute2 come into play?
>> >> >
>> >> > Benoit originally reported slowness when starting >1500
>>containers. I
>> >> > asked him to run a few manual tests to figure out what was taking
>>the
>> >> > time. Manually creating a large # of veths was an obvious test,
>>and
>> >> > one which showed poorly scaling performance.
>> >>
>> >> Apparently iproute is involved somehwere as when he tested with a
>> >> patched iproute (as you asked him to) the lxc startup slowdown was
>> >> gone.
>> >>
>> >> > May well be there are other things slowing down lxc of course.
>> >>
>> >> The evidence indicates it was iproute being called somewhere...
>> >
>> >Benoit can you tell us exactly what test you were running when you saw
>> >the slowdown was gone?
>> >
>> >-serge
>> >
>>
>>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-27 15:11 ` Eric W. Biederman
2013-03-27 17:47 ` Stephen Hemminger
@ 2013-03-28 20:27 ` Benoit Lourdelet
1 sibling, 0 replies; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-28 20:27 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Serge Hallyn, Stephen Hemminger, netdev@vger.kernel.org
Hello Eric,
I am running simple containers (2 network interfaces, 10MB of RAM, default
routing) and want to test scalability.
Our test platform is a 32x 2 GHz core x86.
Regards
Benoit
On 27/03/2013 16:11, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>Benoit Lourdelet <blourdel@juniper.net> writes:
>
>> Hello Serge,
>>
>> I am indeed using Eric patch with lxc.
>>
>> It solves the initial problem of slowness to start around 1600
>> containers.
>
>Good so now we just need a production ready patch for iproute.
>
>> I am now able to start more than 2000 without having new containers
>> slower and slower to start.
>
>May I ask how large a box you are running and how complex your
>containers are. I am trying to get a feel for how common it is likely
>to be to find people running thousands of containers on a single
>machine.
>
>Eric
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 15:21 ` Benoit Lourdelet
@ 2013-03-28 22:20 ` Stephen Hemminger
2013-03-28 23:52 ` Eric W. Biederman
0 siblings, 1 reply; 32+ messages in thread
From: Stephen Hemminger @ 2013-03-28 22:20 UTC (permalink / raw)
To: Benoit Lourdelet; +Cc: Serge Hallyn, Eric W. Biederman, netdev@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 128 bytes --]
Try the following two patches. It adds a name hash list, and uses Eric's idea
to avoid loading map on add/delete operations.
[-- Attachment #2: 0001-ll_map-add-name-and-index-hash.patch --]
[-- Type: text/x-patch, Size: 8143 bytes --]
>From 0025e5d63d5d1598ab622867834a3bcb9f518f9f Mon Sep 17 00:00:00 2001
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Thu, 28 Mar 2013 14:57:28 -0700
Subject: [PATCH 1/2] ll_map: add name and index hash
Make ll_ functions faster by having a name hash, and allow
for deletion. Also, allow them to work without calling ll_init_map.
---
include/hlist.h | 56 ++++++++++++++++++++
include/ll_map.h | 3 +-
lib/ll_map.c | 155 ++++++++++++++++++++++++++++++++++--------------------
3 files changed, 157 insertions(+), 57 deletions(-)
create mode 100644 include/hlist.h
diff --git a/include/hlist.h b/include/hlist.h
new file mode 100644
index 0000000..4e8de9e
--- /dev/null
+++ b/include/hlist.h
@@ -0,0 +1,56 @@
+#ifndef __HLIST_H__
+#define __HLIST_H__ 1
+/* Hash list stuff from kernel */
+
+#include <stddef.h>
+
+#define container_of(ptr, type, member) ({ \
+ const typeof( ((type *)0)->member ) *__mptr = (ptr); \
+ (type *)( (char *)__mptr - offsetof(type,member) );})
+
+struct hlist_head {
+ struct hlist_node *first;
+};
+
+struct hlist_node {
+ struct hlist_node *next, **pprev;
+};
+
+static inline void hlist_del(struct hlist_node *n)
+{
+ struct hlist_node *next = n->next;
+ struct hlist_node **pprev = n->pprev;
+ *pprev = next;
+ if (next)
+ next->pprev = pprev;
+}
+
+static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
+{
+ struct hlist_node *first = h->first;
+ n->next = first;
+ if (first)
+ first->pprev = &n->next;
+ h->first = n;
+ n->pprev = &h->first;
+}
+
+#define hlist_for_each(pos, head) \
+ for (pos = (head)->first; pos ; pos = pos->next)
+
+
+#define hlist_for_each_safe(pos, n, head) \
+ for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
+ pos = n)
+
+#define hlist_entry_safe(ptr, type, member) \
+ ({ typeof(ptr) ____ptr = (ptr); \
+ ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
+ })
+
+#define hlist_for_each_entry(pos, head, member) \
+ for (pos = hlist_entry_safe((head)->first, typeof(*(pos)), member);\
+ pos; \
+ pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member))
+
+#endif /* __HLIST_H__ */
diff --git a/include/ll_map.h b/include/ll_map.h
index c4d5c6d..f1dda39 100644
--- a/include/ll_map.h
+++ b/include/ll_map.h
@@ -3,7 +3,8 @@
extern int ll_remember_index(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg);
-extern int ll_init_map(struct rtnl_handle *rth);
+
+extern void ll_init_map(struct rtnl_handle *rth);
extern unsigned ll_name_to_index(const char *name);
extern const char *ll_index_to_name(unsigned idx);
extern const char *ll_idx_n2a(unsigned idx, char *buf);
diff --git a/lib/ll_map.c b/lib/ll_map.c
index e9ae129..fd7db55 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -22,10 +22,11 @@
#include "libnetlink.h"
#include "ll_map.h"
+#include "hlist.h"
-struct ll_cache
-{
- struct ll_cache *idx_next;
+struct ll_cache {
+ struct hlist_node idx_hash;
+ struct hlist_node name_hash;
unsigned flags;
int index;
unsigned short type;
@@ -33,49 +34,107 @@ struct ll_cache
};
#define IDXMAP_SIZE 1024
-static struct ll_cache *idx_head[IDXMAP_SIZE];
+static struct hlist_head idx_head[IDXMAP_SIZE];
+static struct hlist_head name_head[IDXMAP_SIZE];
-static inline struct ll_cache *idxhead(int idx)
+static struct ll_cache *ll_get_by_index(unsigned index)
{
- return idx_head[idx & (IDXMAP_SIZE - 1)];
+ struct hlist_node *n;
+ unsigned h = index & (IDXMAP_SIZE - 1);
+
+ hlist_for_each(n, &idx_head[h]) {
+ struct ll_cache *im
+ = container_of(n, struct ll_cache, idx_hash);
+ if (im->index == index)
+ return im;
+ }
+
+ return NULL;
+}
+
+static unsigned namehash(const char *str)
+{
+ unsigned hash = 5381;
+
+ while (*str)
+ hash = ((hash << 5) + hash) + *str++; /* hash * 33 + c */
+
+ return hash;
+}
+
+static struct ll_cache *ll_get_by_name(const char *name)
+{
+ struct hlist_node *n;
+ unsigned h = namehash(name) & (IDXMAP_SIZE - 1);
+
+ hlist_for_each(n, &name_head[h]) {
+ struct ll_cache *im
+ = container_of(n, struct ll_cache, name_hash);
+
+ if (strncmp(im->name, name, IFNAMSIZ) == 0)
+ return im;
+ }
+
+ return NULL;
}
int ll_remember_index(const struct sockaddr_nl *who,
struct nlmsghdr *n, void *arg)
{
- int h;
+ unsigned int h;
+ const char *ifname;
struct ifinfomsg *ifi = NLMSG_DATA(n);
- struct ll_cache *im, **imp;
+ struct ll_cache *im;
struct rtattr *tb[IFLA_MAX+1];
- if (n->nlmsg_type != RTM_NEWLINK)
+ if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK)
return 0;
if (n->nlmsg_len < NLMSG_LENGTH(sizeof(ifi)))
return -1;
+ im = ll_get_by_index(ifi->ifi_index);
+ if (n->nlmsg_type == RTM_DELLINK) {
+ if (im) {
+ hlist_del(&im->name_hash);
+ hlist_del(&im->idx_hash);
+ free(im);
+ }
+ return 0;
+ }
+
memset(tb, 0, sizeof(tb));
parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), IFLA_PAYLOAD(n));
- if (tb[IFLA_IFNAME] == NULL)
+ ifname = rta_getattr_str(tb[IFLA_IFNAME]);
+ if (ifname == NULL)
return 0;
- h = ifi->ifi_index & (IDXMAP_SIZE - 1);
- for (imp = &idx_head[h]; (im=*imp)!=NULL; imp = &im->idx_next)
- if (im->index == ifi->ifi_index)
- break;
-
- if (im == NULL) {
- im = malloc(sizeof(*im));
- if (im == NULL)
- return 0;
- im->idx_next = *imp;
- im->index = ifi->ifi_index;
- *imp = im;
+ if (im) {
+ /* change to existing entry */
+ if (strcmp(im->name, ifname) != 0) {
+ hlist_del(&im->name_hash);
+ h = namehash(ifname) & (IDXMAP_SIZE - 1);
+ hlist_add_head(&im->name_hash, &name_head[h]);
+ }
+
+ im->flags = ifi->ifi_flags;
+ return 0;
}
+ im = malloc(sizeof(*im));
+ if (im == NULL)
+ return 0;
+ im->index = ifi->ifi_index;
+ strcpy(im->name, ifname);
im->type = ifi->ifi_type;
im->flags = ifi->ifi_flags;
- strcpy(im->name, RTA_DATA(tb[IFLA_IFNAME]));
+
+ h = ifi->ifi_index & (IDXMAP_SIZE - 1);
+ hlist_add_head(&im->idx_hash, &idx_head[h]);
+
+ h = namehash(ifname) & (IDXMAP_SIZE - 1);
+ hlist_add_head(&im->name_hash, &name_head[h]);
+
return 0;
}
@@ -86,15 +145,16 @@ const char *ll_idx_n2a(unsigned idx, char *buf)
if (idx == 0)
return "*";
- for (im = idxhead(idx); im; im = im->idx_next)
- if (im->index == idx)
- return im->name;
+ im = ll_get_by_index(idx);
+ if (im)
+ return im->name;
+
+ if (if_indextoname(idx, buf) == NULL)
+ snprintf(buf, IFNAMSIZ, "if%d", idx);
- snprintf(buf, IFNAMSIZ, "if%d", idx);
return buf;
}
-
const char *ll_index_to_name(unsigned idx)
{
static char nbuf[IFNAMSIZ];
@@ -108,10 +168,9 @@ int ll_index_to_type(unsigned idx)
if (idx == 0)
return -1;
- for (im = idxhead(idx); im; im = im->idx_next)
- if (im->index == idx)
- return im->type;
- return -1;
+
+ im = ll_get_by_index(idx);
+ return im ? im->type : -1;
}
unsigned ll_index_to_flags(unsigned idx)
@@ -121,35 +180,21 @@ unsigned ll_index_to_flags(unsigned idx)
if (idx == 0)
return 0;
- for (im = idxhead(idx); im; im = im->idx_next)
- if (im->index == idx)
- return im->flags;
- return 0;
+ im = ll_get_by_index(idx);
+ return im ? im->flags : -1;
}
unsigned ll_name_to_index(const char *name)
{
- static char ncache[IFNAMSIZ];
- static int icache;
- struct ll_cache *im;
- int i;
+ const struct ll_cache *im;
unsigned idx;
if (name == NULL)
return 0;
- if (icache && strcmp(name, ncache) == 0)
- return icache;
-
- for (i=0; i<IDXMAP_SIZE; i++) {
- for (im = idx_head[i]; im; im = im->idx_next) {
- if (strcmp(im->name, name) == 0) {
- icache = im->index;
- strcpy(ncache, name);
- return im->index;
- }
- }
- }
+ im = ll_get_by_name(name);
+ if (im)
+ return im->index;
idx = if_nametoindex(name);
if (idx == 0)
@@ -157,12 +202,12 @@ unsigned ll_name_to_index(const char *name)
return idx;
}
-int ll_init_map(struct rtnl_handle *rth)
+void ll_init_map(struct rtnl_handle *rth)
{
static int initialized;
if (initialized)
- return 0;
+ return;
if (rtnl_wilddump_request(rth, AF_UNSPEC, RTM_GETLINK) < 0) {
perror("Cannot send dump request");
@@ -175,6 +220,4 @@ int ll_init_map(struct rtnl_handle *rth)
}
initialized = 1;
-
- return 0;
}
--
1.7.10.4
[-- Attachment #3: 0002-ip-remove-unnecessary-ll_init_map.patch --]
[-- Type: text/x-patch, Size: 2755 bytes --]
>From f0124b0f0aa0e5b9288114eb8e6ff9b4f8c33ec8 Mon Sep 17 00:00:00 2001
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Thu, 28 Mar 2013 15:17:47 -0700
Subject: [PATCH 2/2] ip: remove unnecessary ll_init_map
Don't call ll_init_map on modify operations
Saves significant overhead with 1000's of devices.
---
ip/ipaddress.c | 2 --
ip/ipaddrlabel.c | 2 --
ip/iplink.c | 2 --
ip/iproute.c | 6 ------
ip/xfrm_monitor.c | 2 --
5 files changed, 14 deletions(-)
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 149df69..5b9a438 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1365,8 +1365,6 @@ static int ipaddr_modify(int cmd, int flags, int argc, char **argv)
if (!scoped && cmd != RTM_DELADDR)
req.ifa.ifa_scope = default_scope(&lcl);
- ll_init_map(&rth);
-
if ((req.ifa.ifa_index = ll_name_to_index(d)) == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", d);
return -1;
diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
index eb6a48c..1789d9c 100644
--- a/ip/ipaddrlabel.c
+++ b/ip/ipaddrlabel.c
@@ -246,8 +246,6 @@ static int ipaddrlabel_flush(int argc, char **argv)
int do_ipaddrlabel(int argc, char **argv)
{
- ll_init_map(&rth);
-
if (argc < 1) {
return ipaddrlabel_list(0, NULL);
} else if (matches(argv[0], "list") == 0 ||
diff --git a/ip/iplink.c b/ip/iplink.c
index 5c7b43c..dc98019 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
}
}
- ll_init_map(&rth);
-
if (!(flags & NLM_F_CREATE)) {
if (!dev) {
fprintf(stderr, "Not enough information: \"dev\" "
diff --git a/ip/iproute.c b/ip/iproute.c
index 2c2a331..adef774 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -970,8 +970,6 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
if (d || nhs_ok) {
int idx;
- ll_init_map(&rth);
-
if (d) {
if ((idx = ll_name_to_index(d)) == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", d);
@@ -1265,8 +1263,6 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action)
if (do_ipv6 == AF_UNSPEC && filter.tb)
do_ipv6 = AF_INET;
- ll_init_map(&rth);
-
if (id || od) {
int idx;
@@ -1452,8 +1448,6 @@ static int iproute_get(int argc, char **argv)
exit(1);
}
- ll_init_map(&rth);
-
if (idev || odev) {
int idx;
diff --git a/ip/xfrm_monitor.c b/ip/xfrm_monitor.c
index bfc48f1..a1f5d53 100644
--- a/ip/xfrm_monitor.c
+++ b/ip/xfrm_monitor.c
@@ -408,8 +408,6 @@ int do_xfrm_monitor(int argc, char **argv)
return rtnl_from_file(fp, xfrm_accept_msg, (void*)stdout);
}
- //ll_init_map(&rth);
-
if (rtnl_open_byproto(&rth, groups, NETLINK_XFRM) < 0)
exit(1);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 22:20 ` Stephen Hemminger
@ 2013-03-28 23:52 ` Eric W. Biederman
2013-03-29 0:13 ` Eric Dumazet
2013-03-30 10:09 ` Benoit Lourdelet
0 siblings, 2 replies; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-28 23:52 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org
Stephen Hemminger <stephen@networkplumber.org> writes:
> Try the following two patches. It adds a name hash list, and uses Eric's idea
> to avoid loading map on add/delete operations.
On my microbenchmark of just creating 5000 veth pairs, this takes 16s
instead of the 13s of my earlier hacks, but that is well down in the
usable range.
Deleting all of those network interfaces one by one takes me 60s.
So on the microbenchmark side this looks like a good improvement and
pretty usable.
I expect Benoit's container startup workload will also reflect this, but
it will be interesting to see the actual result.
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 23:52 ` Eric W. Biederman
@ 2013-03-29 0:13 ` Eric Dumazet
2013-03-29 0:25 ` Eric W. Biederman
2013-03-30 10:09 ` Benoit Lourdelet
1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-03-29 0:13 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote:
> On my microbenchmark of just creating 5000 veth pairs this takes pairs
> 16s instead of 13s of my earlier hacks but that is well down in the
> usable range.
I guess most of the time is taken by sysctl_check_table()
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-29 0:13 ` Eric Dumazet
@ 2013-03-29 0:25 ` Eric W. Biederman
2013-03-29 0:43 ` Eric Dumazet
0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-29 0:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
Eric Dumazet <eric.dumazet@gmail.com> writes:
> On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote:
>
>> On my microbenchmark of just creating 5000 veth pairs this takes pairs
>> 16s instead of 13s of my earlier hacks but that is well down in the
>> usable range.
>
> I guess most of the time is taken by sysctl_check_table()
All of the significant sysctl slowdowns were fixed in 3.4. If you see
something of sysctl show up in a trace I would be happy to talk about
it. On the kernel side, creating N network devices seems to take
N log N time now. Both sysfs and sysctl store directories as
rbtrees, removing their previous bottlenecks.
The loop I timed at 16s was just:
time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
There is plenty of room for inefficiencies in 10000 network devices and
5000 forks+execs.
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-29 0:25 ` Eric W. Biederman
@ 2013-03-29 0:43 ` Eric Dumazet
2013-03-29 1:06 ` Eric W. Biederman
2013-03-29 1:10 ` Eric Dumazet
0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-03-29 0:43 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
On Thu, 2013-03-28 at 17:25 -0700, Eric W. Biederman wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> > On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote:
> >
> >> On my microbenchmark of just creating 5000 veth pairs this takes pairs
> >> 16s instead of 13s of my earlier hacks but that is well down in the
> >> usable range.
> >
> > I guess most of the time is taken by sysctl_check_table()
>
> All of the significant sysctl slowdowns were fixed in 3.4. If you see
> something of sysctl show up in a trace I would be happy to talk about
> it. The kernel side seems to be creating N network devices seems to
> take NlogN time now. Both sysfs and sysctl store directories as
> rbtrees removing their previous bottlenecks.
>
> The loop I timed at 16s was just:
>
> time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
>
> There is plenty of room for inefficiencies in 10000 network devices and
> 5000 forks+execs.
Ah right, the sysctl part is fixed ;)
In batch mode, I can create these veth pairs in 4 seconds
for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i;
done | ip -batch -
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-29 0:43 ` Eric Dumazet
@ 2013-03-29 1:06 ` Eric W. Biederman
2013-03-29 1:10 ` Eric Dumazet
1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-29 1:06 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
Eric Dumazet <eric.dumazet@gmail.com> writes:
> On Thu, 2013-03-28 at 17:25 -0700, Eric W. Biederman wrote:
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>>
>> > On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote:
>> >
>> >> On my microbenchmark of just creating 5000 veth pairs this takes
>> >> 16s instead of the 13s of my earlier hacks, but that is well down in the
>> >> usable range.
>> >
>> > I guess most of the time is taken by sysctl_check_table()
>>
>> All of the significant sysctl slowdowns were fixed in 3.4. If you see
>> anything sysctl-related show up in a trace I would be happy to talk about
>> it. On the kernel side, creating N network devices seems to
>> take N log N time now. Both sysfs and sysctl store directories as
>> rbtrees, removing their previous bottlenecks.
>>
>> The loop I timed at 16s was just:
>>
>> time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
>>
>> There is plenty of room for inefficiencies in 10000 network devices and
>> 5000 forks+execs.
>
> Ah right, the sysctl part is fixed ;)
>
> In batch mode, I can create these veth pairs in 4 seconds
>
> for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i;
> done | ip -batch -
Yes. The interesting story here is that the bottleneck before these
patches was the ll_init_map function of iproute2, which caused
more than an order of magnitude slowdown when starting ip on a system
with lots of network devices.
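For illustration, a minimal standalone sketch of that direction (this is
not the iproute2 patch itself, and the default interface name is just a
placeholder): send RTM_DELLINK with only an IFLA_IFNAME attribute and
leave ifi_index at 0, so the kernel resolves the name and userspace never
has to dump the link table first.

	/* Sketch: delete a link by name via rtnetlink; the kernel does the
	 * name lookup, so no "dump everything" pass is needed in userspace.
	 * Error handling is trimmed; deleting needs CAP_NET_ADMIN. */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <linux/netlink.h>
	#include <linux/rtnetlink.h>

	int main(int argc, char **argv)
	{
		const char *name = argc > 1 ? argv[1] : "a1";	/* placeholder */
		struct {
			struct nlmsghdr  nh;
			struct ifinfomsg ifi;
			char             buf[64];	/* room for IFLA_IFNAME */
		} req;
		struct rtattr *rta;
		int fd;

		memset(&req, 0, sizeof(req));
		req.nh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg));
		req.nh.nlmsg_type  = RTM_DELLINK;
		req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
		req.ifi.ifi_family = AF_UNSPEC;
		/* ifi_index stays 0: the kernel falls back to the name lookup. */

		rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
		rta->rta_type = IFLA_IFNAME;
		rta->rta_len  = RTA_LENGTH(strlen(name) + 1);
		memcpy(RTA_DATA(rta), name, strlen(name) + 1);
		req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_ALIGN(rta->rta_len);

		fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
		if (fd < 0 || send(fd, &req, req.nh.nlmsg_len, 0) < 0) {
			perror("rtnetlink");
			return 1;
		}
		/* A real tool would read back the ACK/error message here. */
		close(fd);
		return 0;
	}

RTM_SETLINK accepts the same name-based lookup, which is what lets the
add/set/delete paths avoid the dump entirely.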
It is still unclear where iproute comes into the picture in the original
problem scenario of creating 2000 containers each with 2 veth pairs.
But apparently it does.
As the fundamental use case here was taking 2000 separate, independent
actions, it turns out to be important for things not to slow down
unreasonably outside of batch mode. So I was explicitly testing the
non-batch-mode performance.
On the flip side it might be interesting to see if we can get batch-mode
deletes to batch in the kernel, so we don't have to wait in
synchronize_rcu_expedited for each of them. Although for the container
case I can just drop the last reference to the network namespace and all
of the network device removals will be batched.
Ultimately, shrug. Except for the previous O(N^2) userspace behavior,
there don't seem to be any practical performance problems with this many
network devices. What is interesting is that this many network devices
is becoming relevant on inexpensive COTS servers, for cases that are
not purely network focused.
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-29 0:43 ` Eric Dumazet
2013-03-29 1:06 ` Eric W. Biederman
@ 2013-03-29 1:10 ` Eric Dumazet
2013-03-29 1:29 ` Eric W. Biederman
1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-03-29 1:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
On Thu, 2013-03-28 at 17:43 -0700, Eric Dumazet wrote:
> In batch mode, I can create these veth pairs in 4 seconds
>
> for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i;
> done | ip -batch -
At rmmod time, 30% of cpu is spent in packet_notifier()
Maybe we can do something about this.
    30.85%  rmmod  [kernel.kallsyms]  [k] packet_notifier
            |
            --- packet_notifier
                notifier_call_chain
                raw_notifier_call_chain
                call_netdevice_notifiers
                rollback_registered_many
                unregister_netdevice_many
                __rtnl_link_unregister
                rtnl_link_unregister
                0xffffffffa0044868
                sys_delete_module
                sysenter_dispatch
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-29 1:10 ` Eric Dumazet
@ 2013-03-29 1:29 ` Eric W. Biederman
2013-03-29 1:38 ` Eric Dumazet
0 siblings, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2013-03-29 1:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
Eric Dumazet <eric.dumazet@gmail.com> writes:
> On Thu, 2013-03-28 at 17:43 -0700, Eric Dumazet wrote:
>
>> In batch mode, I can create these veth pairs in 4 seconds
>>
>> for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i;
>> done | ip -batch -
>
>
> At rmmod time, 30% of cpu is spent in packet_notifier()
>
> Maybe we can do something about this.
An interesting thought. I had a patch I never got around to pushing a
while back that would have had an effect.
It is my observation that the vast majority of packet filters apply not
to the entire machine but to an individual interface. In fact you have
to work pretty hard to get tools like tcpdump to dump all of the
interfaces at once.
So to speed things up for machines that have a lot of these things the
idea was to create per device lists for the filters that only needed to
be run on a single device. In this case it looks like we could
potentially create per device lists of the listening sockets as well.
In general these lists should be short so the search can also be short.
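As a rough toy illustration of that idea (plain userspace C, not kernel
code, and every name in it is made up): keying the listeners by ifindex
turns each unregister notification into a walk of one short bucket
instead of one global list.

	/* Toy model of the per-device-list idea; not kernel code. */
	#include <stdio.h>
	#include <stdlib.h>

	struct listener {
		int              ifindex;	/* device this listener is bound to */
		struct listener *next;
	};

	#define BUCKETS 1024
	static struct listener *per_dev[BUCKETS];	/* hashed by ifindex */

	static void add_listener(int ifindex)
	{
		struct listener *l = malloc(sizeof(*l));

		if (!l)
			abort();
		l->ifindex = ifindex;
		l->next = per_dev[ifindex % BUCKETS];
		per_dev[ifindex % BUCKETS] = l;
	}

	/* What NETDEV_UNREGISTER handling would do: visit only the listeners
	 * bound to the dying device instead of scanning one global list. */
	static void device_unregister(int ifindex)
	{
		struct listener **pp = &per_dev[ifindex % BUCKETS];

		while (*pp) {
			struct listener *l = *pp;

			if (l->ifindex == ifindex) {
				*pp = l->next;
				free(l);
			} else {
				pp = &l->next;
			}
		}
	}

	int main(void)
	{
		int i;

		for (i = 1; i <= 10000; i++)
			add_listener(i);	/* one listener per device */
		for (i = 1; i <= 10000; i++)
			device_unregister(i);	/* scans one short bucket, not 10000 entries */
		puts("done");
		return 0;
	}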
But I am curious: do you actually have a tcpdump or something similar
running on your box that is using AF_PACKET sockets? Perhaps a dhcp
client?
I am a little surprised that your default case has anything on the lists
to trigger any work in the packet_notifier notifier.
>     30.85%  rmmod  [kernel.kallsyms]  [k] packet_notifier
>             |
>             --- packet_notifier
>                 notifier_call_chain
>                 raw_notifier_call_chain
>                 call_netdevice_notifiers
>                 rollback_registered_many
>                 unregister_netdevice_many
>                 __rtnl_link_unregister
>                 rtnl_link_unregister
>                 0xffffffffa0044868
>                 sys_delete_module
>                 sysenter_dispatch
Eric
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-29 1:29 ` Eric W. Biederman
@ 2013-03-29 1:38 ` Eric Dumazet
0 siblings, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-03-29 1:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn,
netdev@vger.kernel.org
On Thu, 2013-03-28 at 18:29 -0700, Eric W. Biederman wrote:
> An interesting thought. I had a patch I never got around to pushing a
> while back that would have had an effect.
>
> It is my observation that the vast majority of packet filters apply not
> to the entire machine but to an individual interface. In fact you have
> to work pretty hard to get tools like tcpdump to dump all of the
> interfaces at once.
>
> So to speed things up for machines that have a lot of these things the
> idea was to create per device lists for the filters that only needed to
> be run on a single device. In this case it looks like we could
> potentially create per device lists of the listening sockets as well.
>
> In general these lists should be short so the search can also be short.
>
> But I am curious: do you actually have a tcpdump or something similar
> running on your box that is using AF_PACKET sockets? Perhaps a dhcp
> client?
>
> I am a little surprised that your default case has anything on the lists
> to trigger any work in the packet_notifier notifier.
Hmm, it might be a local daemon on my lab machine which does a
PACKET_ADD_MEMBERSHIP for each created interface.
So my machine spends time in packet_dev_mclist(), with quadratic
behavior at rmmod.
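For reference, a daemon picks up such a membership with something like the
sketch below (my guess at the scenario; the interface name and the
PACKET_MR_ALLMULTI type are placeholders). Each call adds an entry to the
socket's mclist, which is exactly what packet_dev_mclist() has to revisit
for every device being unregistered.

	/* Sketch: one AF_PACKET membership per interface, as a monitoring
	 * daemon might do.  Needs CAP_NET_RAW to open the socket. */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <arpa/inet.h>
	#include <net/if.h>
	#include <linux/if_ether.h>
	#include <linux/if_packet.h>

	static int add_membership(int fd, const char *ifname)
	{
		struct packet_mreq mreq;

		memset(&mreq, 0, sizeof(mreq));
		mreq.mr_ifindex = if_nametoindex(ifname);
		mreq.mr_type    = PACKET_MR_ALLMULTI;	/* placeholder type */
		if (!mreq.mr_ifindex)
			return -1;
		return setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
				  &mreq, sizeof(mreq));
	}

	int main(void)
	{
		int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

		if (fd < 0) {
			perror("socket");
			return 1;
		}
		/* One call per interface: with thousands of veth devices the
		 * socket's membership list grows accordingly, and rmmod then
		 * walks it once per device it tears down. */
		if (add_membership(fd, "b1") < 0)	/* "b1" is illustrative */
			perror("PACKET_ADD_MEMBERSHIP");
		pause();	/* keep the socket (and its memberships) alive */
		return 0;
	}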
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-28 23:52 ` Eric W. Biederman
2013-03-29 0:13 ` Eric Dumazet
@ 2013-03-30 10:09 ` Benoit Lourdelet
2013-03-30 14:44 ` Eric Dumazet
1 sibling, 1 reply; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-30 10:09 UTC (permalink / raw)
To: Eric W. Biederman, Stephen Hemminger; +Cc: Serge Hallyn, netdev@vger.kernel.org
Hello,
Here are my tests of the latest patches on 3 different platforms, all
running 3.8.5:
Times are in seconds:
8x 3.7Ghz virtual cores
# veth create delete
1000 14 18
2000 39 56
5000 256 161
10000 1200 399
8x 3.2Ghz virtual cores
# veth create delete
1000 19 40
2000 118 66
5000 305 251
32x 2Ghz virtual cores , 2 sockets
# veth create delete
1000 35 86
2000 120 90
5000 724 245
Compared to initial iproute2 performance on this 32 virtual core system :
5000 1143 1185
"perf record" for creation of 5000 veth on the 32 core system :
# captured on: Fri Mar 29 14:03:35 2013
# hostname : ieng-serv06
# os release : 3.8.5
# perf version : 3.8.5
# arch : x86_64
# nrcpus online : 32
# nrcpus avail : 32
# cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 264124548 kB
# cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a ./test3.script
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 =
0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1,
precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42,
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2,
uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 =
20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco
# ========
#
# Samples: 9M of event 'cycles'
# Event count (approx.): 2894480238483
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  .............................
#
    15.17%  sudo             [kernel.kallsyms]  [k] snmp_fold_field
     5.94%  sudo             libc-2.15.so       [.] 0x00000000000802cd
     5.64%  sudo             [kernel.kallsyms]  [k] find_next_bit
     3.21%  init             libnih.so.1.0.0    [.] nih_list_add_after
     2.12%  swapper          [kernel.kallsyms]  [k] intel_idle
     1.94%  init             [kernel.kallsyms]  [k] page_fault
     1.93%  sed              libc-2.15.so       [.] 0x00000000000a1368
     1.93%  sudo             [kernel.kallsyms]  [k] rtnl_fill_ifinfo
     1.92%  sudo             [veth]             [k] veth_get_stats64
     1.78%  sudo             [kernel.kallsyms]  [k] memcpy
     1.53%  ifquery          libc-2.15.so       [.] 0x000000000007f52b
     1.24%  init             libc-2.15.so       [.] 0x000000000008918f
     1.05%  sudo             [kernel.kallsyms]  [k] inet6_fill_ifla6_attrs
     0.98%  init             [kernel.kallsyms]  [k] copy_pte_range
     0.88%  irqbalance       libc-2.15.so       [.] 0x00000000000802cd
     0.85%  sudo             [kernel.kallsyms]  [k] memset
     0.72%  sed              ld-2.15.so         [.] 0x000000000000a226
     0.68%  ifquery          ld-2.15.so         [.] 0x00000000000165a0
     0.64%  init             libnih.so.1.0.0    [.] nih_tree_next_post_full
     0.61%  bridge-network-  libc-2.15.so       [.] 0x0000000000131e2a
     0.59%  init             [kernel.kallsyms]  [k] do_wp_page
     0.59%  ifquery          [kernel.kallsyms]  [k] page_fault
     0.54%  sed              [kernel.kallsyms]  [k] page_fault
Regards
Benoit
On 29/03/2013 00:52, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>Stephen Hemminger <stephen@networkplumber.org> writes:
>
>> Try the following two patches. It adds a name hash list, and uses
>>Eric's idea
>> to avoid loading map on add/delete operations.
>
>On my microbenchmark of just creating 5000 veth pairs this takes
>16s instead of the 13s of my earlier hacks, but that is well down in the
>usable range.
>
>Deleting all of those network interfaces one by one takes me 60s.
>
>So on the microbenchmark side this looks like a good improvement and
>pretty usable.
>
>I expect Benoit's container startup workload will also reflect this, but
>it will be interesting to see the actual result.
>
>Eric
>
>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-30 10:09 ` Benoit Lourdelet
@ 2013-03-30 14:44 ` Eric Dumazet
2013-03-30 16:07 ` Benoit Lourdelet
0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-03-30 14:44 UTC (permalink / raw)
To: Benoit Lourdelet
Cc: Eric W. Biederman, Stephen Hemminger, Serge Hallyn,
netdev@vger.kernel.org
On Sat, 2013-03-30 at 10:09 +0000, Benoit Lourdelet wrote:
> Hello,
>
> Here are my tests of the latest patches on 3 different platforms, all
> running 3.8.5:
>
> Times are in seconds:
>
> 8x 3.7Ghz virtual cores
>
> # veth create delete
> 1000 14 18
> 2000 39 56
> 5000 256 161
> 10000 1200 399
>
>
> 8x 3.2Ghz virtual cores
>
> # veth create delete
>
> 1000 19 40
> 2000 118 66
> 5000 305 251
>
>
>
> 32x 2Ghz virtual cores , 2 sockets
> # veth create delete
> 1000 35 86
>
> 2000 120 90
>
> 5000 724 245
>
>
> Compared to initial iproute2 performance on this 32 virtual core system :
> 5000 1143 1185
>
>
>
> "perf record" for creation of 5000 veth on the 32 core system :
>
> # captured on: Fri Mar 29 14:03:35 2013
> # hostname : ieng-serv06
> # os release : 3.8.5
> # perf version : 3.8.5
> # arch : x86_64
> # nrcpus online : 32
> # nrcpus avail : 32
> # cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> # cpuid : GenuineIntel,6,45,7
> # total memory : 264124548 kB
> # cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a ./test3.script
> # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 =
> 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1,
> precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42,
> # HEADER_CPU_TOPOLOGY info available, use -I to display
> # HEADER_NUMA_TOPOLOGY info available, use -I to display
> # pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2,
> uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 =
> 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco
> # ========
> #
> # Samples: 9M of event 'cycles'
> # Event count (approx.): 2894480238483
> #
> # Overhead  Command          Shared Object      Symbol
> # ........  ...............  .................  .............................
> #
>     15.17%  sudo             [kernel.kallsyms]  [k] snmp_fold_field
>      5.94%  sudo             libc-2.15.so       [.] 0x00000000000802cd
>      5.64%  sudo             [kernel.kallsyms]  [k] find_next_bit
>      3.21%  init             libnih.so.1.0.0    [.] nih_list_add_after
>      2.12%  swapper          [kernel.kallsyms]  [k] intel_idle
>      1.94%  init             [kernel.kallsyms]  [k] page_fault
>      1.93%  sed              libc-2.15.so       [.] 0x00000000000a1368
>      1.93%  sudo             [kernel.kallsyms]  [k] rtnl_fill_ifinfo
>      1.92%  sudo             [veth]             [k] veth_get_stats64
>      1.78%  sudo             [kernel.kallsyms]  [k] memcpy
>      1.53%  ifquery          libc-2.15.so       [.] 0x000000000007f52b
>      1.24%  init             libc-2.15.so       [.] 0x000000000008918f
>      1.05%  sudo             [kernel.kallsyms]  [k] inet6_fill_ifla6_attrs
>      0.98%  init             [kernel.kallsyms]  [k] copy_pte_range
>      0.88%  irqbalance       libc-2.15.so       [.] 0x00000000000802cd
>      0.85%  sudo             [kernel.kallsyms]  [k] memset
>      0.72%  sed              ld-2.15.so         [.] 0x000000000000a226
>      0.68%  ifquery          ld-2.15.so         [.] 0x00000000000165a0
>      0.64%  init             libnih.so.1.0.0    [.] nih_tree_next_post_full
>      0.61%  bridge-network-  libc-2.15.so       [.] 0x0000000000131e2a
>      0.59%  init             [kernel.kallsyms]  [k] do_wp_page
>      0.59%  ifquery          [kernel.kallsyms]  [k] page_fault
>      0.54%  sed              [kernel.kallsyms]  [k] page_fault
>
>
>
>
>
> Regards
>
> Benoit
>
>
>
>
This means lxc-start does the same thing as ip:
it fetches the whole device list.
You could strace it to confirm.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
2013-03-30 14:44 ` Eric Dumazet
@ 2013-03-30 16:07 ` Benoit Lourdelet
0 siblings, 0 replies; 32+ messages in thread
From: Benoit Lourdelet @ 2013-03-30 16:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric W. Biederman, Stephen Hemminger, Serge Hallyn,
netdev@vger.kernel.org
Sorry Eric,
This is not an lxc-start perf report. This is an "ip" report.
Will run an "lxc-start" perf ASAP.
Regards
Benoit
On 30/03/2013 15:44, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>On Sat, 2013-03-30 at 10:09 +0000, Benoit Lourdelet wrote:
>> Hello,
>>
>> Here are my tests of the latest patches on 3 different platforms, all
>> running 3.8.5:
>>
>> Times are in seconds:
>>
>> 8x 3.7Ghz virtual cores
>>
>> # veth create delete
>> 1000 14 18
>> 2000 39 56
>> 5000 256 161
>> 10000 1200 399
>>
>>
>> 8x 3.2Ghz virtual cores
>>
>> # veth create delete
>>
>> 1000 19 40
>> 2000 118 66
>> 5000 305 251
>>
>>
>>
>> 32x 2Ghz virtual cores , 2 sockets
>> # veth create delete
>> 1000 35 86
>>
>> 2000 120 90
>>
>> 5000 724 245
>>
>>
>> Compared to initial iproute2 performance on this 32 virtual core
>>system :
>> 5000 1143 1185
>>
>>
>>
>> "perf record" for creation of 5000 veth on the 32 core system :
>>
>> # captured on: Fri Mar 29 14:03:35 2013
>> # hostname : ieng-serv06
>> # os release : 3.8.5
>> # perf version : 3.8.5
>> # arch : x86_64
>> # nrcpus online : 32
>> # nrcpus avail : 32
>> # cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>> # cpuid : GenuineIntel,6,45,7
>> # total memory : 264124548 kB
>> # cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a
>>./test3.script
>> # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2
>>=
>> 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1,
>> precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42,
>> # HEADER_CPU_TOPOLOGY info available, use -I to display
>> # HEADER_NUMA_TOPOLOGY info available, use -I to display
>> # pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2,
>> uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 =
>> 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco
>> # ========
>> #
>> # Samples: 9M of event 'cycles'
>> # Event count (approx.): 2894480238483
>> #
>> # Overhead  Command          Shared Object      Symbol
>> # ........  ...............  .................  .............................
>> #
>>     15.17%  sudo             [kernel.kallsyms]  [k] snmp_fold_field
>>      5.94%  sudo             libc-2.15.so       [.] 0x00000000000802cd
>>      5.64%  sudo             [kernel.kallsyms]  [k] find_next_bit
>>      3.21%  init             libnih.so.1.0.0    [.] nih_list_add_after
>>      2.12%  swapper          [kernel.kallsyms]  [k] intel_idle
>>      1.94%  init             [kernel.kallsyms]  [k] page_fault
>>      1.93%  sed              libc-2.15.so       [.] 0x00000000000a1368
>>      1.93%  sudo             [kernel.kallsyms]  [k] rtnl_fill_ifinfo
>>      1.92%  sudo             [veth]             [k] veth_get_stats64
>>      1.78%  sudo             [kernel.kallsyms]  [k] memcpy
>>      1.53%  ifquery          libc-2.15.so       [.] 0x000000000007f52b
>>      1.24%  init             libc-2.15.so       [.] 0x000000000008918f
>>      1.05%  sudo             [kernel.kallsyms]  [k] inet6_fill_ifla6_attrs
>>      0.98%  init             [kernel.kallsyms]  [k] copy_pte_range
>>      0.88%  irqbalance       libc-2.15.so       [.] 0x00000000000802cd
>>      0.85%  sudo             [kernel.kallsyms]  [k] memset
>>      0.72%  sed              ld-2.15.so         [.] 0x000000000000a226
>>      0.68%  ifquery          ld-2.15.so         [.] 0x00000000000165a0
>>      0.64%  init             libnih.so.1.0.0    [.] nih_tree_next_post_full
>>      0.61%  bridge-network-  libc-2.15.so       [.] 0x0000000000131e2a
>>      0.59%  init             [kernel.kallsyms]  [k] do_wp_page
>>      0.59%  ifquery          [kernel.kallsyms]  [k] page_fault
>>      0.54%  sed              [kernel.kallsyms]  [k] page_fault
>>
>>
>>
>>
>>
>> Regards
>>
>> Benoit
>>
>>
>>
>>
>
>This means lxc-start does the same thing as ip:
>
>it fetches the whole device list.
>
>You could strace it to confirm.
>
>
>
>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2013-03-30 16:09 UTC | newest]
Thread overview: 32+ messages
-- links below jump to the message on this page --
2013-03-22 22:23 [RFC][PATCH] iproute: Faster ip link add, set and delete Eric W. Biederman
2013-03-22 22:27 ` Stephen Hemminger
2013-03-26 11:51 ` Benoit Lourdelet
2013-03-26 12:40 ` Eric W. Biederman
2013-03-26 14:17 ` Serge Hallyn
2013-03-26 14:33 ` Serge Hallyn
2013-03-27 13:37 ` Benoit Lourdelet
2013-03-27 15:11 ` Eric W. Biederman
2013-03-27 17:47 ` Stephen Hemminger
2013-03-28 0:46 ` Eric W. Biederman
2013-03-28 3:20 ` Serge Hallyn
2013-03-28 3:44 ` Eric W. Biederman
2013-03-28 4:28 ` Serge Hallyn
2013-03-28 5:00 ` Eric W. Biederman
2013-03-28 13:36 ` Serge Hallyn
2013-03-28 13:42 ` Benoit Lourdelet
2013-03-28 15:04 ` Serge Hallyn
2013-03-28 15:21 ` Benoit Lourdelet
2013-03-28 22:20 ` Stephen Hemminger
2013-03-28 23:52 ` Eric W. Biederman
2013-03-29 0:13 ` Eric Dumazet
2013-03-29 0:25 ` Eric W. Biederman
2013-03-29 0:43 ` Eric Dumazet
2013-03-29 1:06 ` Eric W. Biederman
2013-03-29 1:10 ` Eric Dumazet
2013-03-29 1:29 ` Eric W. Biederman
2013-03-29 1:38 ` Eric Dumazet
2013-03-30 10:09 ` Benoit Lourdelet
2013-03-30 14:44 ` Eric Dumazet
2013-03-30 16:07 ` Benoit Lourdelet
2013-03-28 20:27 ` Benoit Lourdelet
2013-03-26 15:31 ` Eric Dumazet