* [RFC][PATCH] iproute: Faster ip link add, set and delete @ 2013-03-22 22:23 Eric W. Biederman 2013-03-22 22:27 ` Stephen Hemminger 0 siblings, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2013-03-22 22:23 UTC (permalink / raw) To: Stephen Hemminger; +Cc: netdev, Serge Hallyn, Benoit Lourdelet Because ip link add, set, and delete map the interface name to the interface index by dumping all of the interfaces before performing their respective commands. Operations that should be constant time slow down when lots of network interfaces are in use. Resulting in O(N^2) time to work with O(N) devices. Make the work that iproute does constant time by passing the interface name to the kernel instead. In small scale testing on my system this shows dramatic performance increases of ip link add from 120s to just 11s to add 5000 network devices. And from longer than I cared to wait to just 58s to delete all of those interfaces again. Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Reported-by: Benoit Lourdelet <blourdel@juniper.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- I think I am bungling the case where people specify an ifindex as ifNNNN but does anyone care? ip/iplink.c | 19 +------------------ 1 files changed, 1 insertions(+), 18 deletions(-) diff --git a/ip/iplink.c b/ip/iplink.c index ad33611..6dffbf0 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) } } - ll_init_map(&rth); - if (!(flags & NLM_F_CREATE)) { if (!dev) { fprintf(stderr, "Not enough information: \"dev\" " @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) exit(-1); } - req.i.ifi_index = ll_name_to_index(dev); - if (req.i.ifi_index == 0) { - fprintf(stderr, "Cannot find device \"%s\"\n", dev); - return -1; - } + name = dev; } else { /* Allow "ip link add dev" and "ip link add name" */ if (!name) name = dev; - if (link) { - int ifindex; - - ifindex = ll_name_to_index(link); - if (ifindex == 0) { - fprintf(stderr, "Cannot find device \"%s\"\n", - link); - return -1; - } - addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4); - } } if (name) { -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 32+ messages in thread
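The mechanism the patch leans on can be shown outside of iproute2: rtnetlink accepts an IFLA_IFNAME attribute with ifi_index left at 0 and resolves the name in the kernel, which is what iplink_modify ends up doing here once name = dev is set (the IFLA_IFNAME attribute is added by the existing code further down in the function). The following is a minimal standalone C sketch of that path, not iproute2 code, with error handling trimmed; it deletes one device by name with no RTM_GETLINK dump and no userspace name-to-index map:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Delete a network device by name: ifi_index is left at 0 and the name is
 * carried in an IFLA_IFNAME attribute, so the kernel does the lookup. */
static int del_link_by_name(const char *name)
{
    struct {
        struct nlmsghdr  nh;
        struct ifinfomsg ifi;
        char             attrbuf[64];
    } req;
    struct rtattr *rta;
    int fd, ret = -1;

    if (RTA_LENGTH(strlen(name) + 1) > sizeof(req.attrbuf))
        return -1;

    fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    memset(&req, 0, sizeof(req));
    req.nh.nlmsg_len   = NLMSG_LENGTH(sizeof(req.ifi));
    req.nh.nlmsg_type  = RTM_DELLINK;
    req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    req.ifi.ifi_family = AF_UNSPEC;

    /* append IFLA_IFNAME so the kernel resolves the device itself */
    rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
    rta->rta_type = IFLA_IFNAME;
    rta->rta_len  = RTA_LENGTH(strlen(name) + 1);
    strcpy(RTA_DATA(rta), name);
    req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_ALIGN(rta->rta_len);

    if (send(fd, &req, req.nh.nlmsg_len, 0) >= 0) {
        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);

        if (n > 0) {
            struct nlmsghdr *nh = (struct nlmsghdr *)buf;

            if (nh->nlmsg_type == NLMSG_ERROR) {
                struct nlmsgerr *err = NLMSG_DATA(nh);
                ret = err->error;   /* 0 on success, -errno otherwise */
            }
        }
    }
    close(fd);
    return ret;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <ifname>\n", argv[0]);
        return 1;
    }
    if (del_link_by_name(argv[1]) != 0) {
        fprintf(stderr, "delete of %s failed\n", argv[1]);
        return 1;
    }
    return 0;
}

Pointed at a throwaway test interface, this is the constant-time request shape that the patch moves ip link set and ip link delete onto.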
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-22 22:23 [RFC][PATCH] iproute: Faster ip link add, set and delete Eric W. Biederman @ 2013-03-22 22:27 ` Stephen Hemminger 2013-03-26 11:51 ` Benoit Lourdelet 0 siblings, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2013-03-22 22:27 UTC (permalink / raw) To: Eric W. Biederman; +Cc: netdev, Serge Hallyn, Benoit Lourdelet The whole ifindex map is a design mistake at this point. Better off to do a lazy cache or something like that. On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > Because ip link add, set, and delete map the interface name to the > interface index by dumping all of the interfaces before performing > their respective commands. Operations that should be constant time > slow down when lots of network interfaces are in use. Resulting > in O(N^2) time to work with O(N) devices. > > Make the work that iproute does constant time by passing the interface > name to the kernel instead. > > In small scale testing on my system this shows dramatic performance > increases of ip link add from 120s to just 11s to add 5000 network > devices. And from longer than I cared to wait to just 58s to delete > all of those interfaces again. > > Cc: Serge Hallyn <serge.hallyn@ubuntu.com> > Reported-by: Benoit Lourdelet <blourdel@juniper.net> > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > > I think I am bungling the case where people specify an ifindex as ifNNNN > but does anyone care? > > ip/iplink.c | 19 +------------------ > 1 files changed, 1 insertions(+), 18 deletions(-) > > diff --git a/ip/iplink.c b/ip/iplink.c > index ad33611..6dffbf0 100644 > --- a/ip/iplink.c > +++ b/ip/iplink.c > @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) > } > } > > - ll_init_map(&rth); > - > if (!(flags & NLM_F_CREATE)) { > if (!dev) { > fprintf(stderr, "Not enough information: \"dev\" " > @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) > exit(-1); > } > > - req.i.ifi_index = ll_name_to_index(dev); > - if (req.i.ifi_index == 0) { > - fprintf(stderr, "Cannot find device \"%s\"\n", dev); > - return -1; > - } > + name = dev; > } else { > /* Allow "ip link add dev" and "ip link add name" */ > if (!name) > name = dev; > > - if (link) { > - int ifindex; > - > - ifindex = ll_name_to_index(link); > - if (ifindex == 0) { > - fprintf(stderr, "Cannot find device \"%s\"\n", > - link); > - return -1; > - } > - addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4); > - } > } > > if (name) { > -- > 1.7.5.4 > ^ permalink raw reply [flat|nested] 32+ messages in thread
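One way to read the "lazy cache" idea: never dump the link table up front, resolve a name with a single if_nametoindex() query the first time it is asked for, and memoize the answer. The sketch below is hypothetical (lazy_name_to_index and friends are made-up names, this is not ll_map.c), but it shows the shape such a cache could take:

/* Sketch of a "lazy" name -> ifindex cache: nothing is dumped up front,
 * entries appear only when a name is first asked for, via one
 * if_nametoindex() query.  Hypothetical code, not iproute2's ll_map.c. */
#include <net/if.h>     /* if_nametoindex(), IF_NAMESIZE */
#include <stdlib.h>
#include <string.h>

struct lazy_ent {
    struct lazy_ent *next;
    unsigned         index;
    char             name[IF_NAMESIZE];
};

#define LAZY_BUCKETS 256

static struct lazy_ent *lazy_hash[LAZY_BUCKETS];

static unsigned lazy_bucket(const char *s)
{
    unsigned h = 5381;                 /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h & (LAZY_BUCKETS - 1);
}

unsigned lazy_name_to_index(const char *name)
{
    unsigned h = lazy_bucket(name);
    struct lazy_ent *e;
    unsigned idx;

    for (e = lazy_hash[h]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->index;

    idx = if_nametoindex(name);        /* one query, only for this name */
    if (idx == 0)
        return 0;                      /* no such device */

    e = malloc(sizeof(*e));
    if (e) {
        strncpy(e->name, name, IF_NAMESIZE - 1);
        e->name[IF_NAMESIZE - 1] = '\0';
        e->index = idx;
        e->next = lazy_hash[h];
        lazy_hash[h] = e;
    }
    return idx;
}

The catch with any long-lived cache, and the reason the later patch in this thread also processes RTM_DELLINK, is staleness: a one-shot ip invocation never notices, but a long-running reader such as ip monitor would.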
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-22 22:27 ` Stephen Hemminger @ 2013-03-26 11:51 ` Benoit Lourdelet 2013-03-26 12:40 ` Eric W. Biederman 2013-03-26 15:31 ` Eric Dumazet 0 siblings, 2 replies; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-26 11:51 UTC (permalink / raw) To: Stephen Hemminger, Eric W. Biederman; +Cc: netdev@vger.kernel.org, Serge Hallyn

Hello,

I re-tested with the patch and got the following results on a 32x 2GHz core system.

# veth    add    delete
1000      36     34
3000      259    137
4000      462    195
5000      729    N/A

The script to create is the following:

for i in `seq 1 5000`; do
    sudo ip link add type veth
done

The script to delete:

for d in /sys/class/net/veth*; do
    ip link del `basename $d` 2>/dev/null || true
done

There is a very good improvement in deletion.

iproute2 does not seem to be well multithreaded, as I get times divided by a factor of 2 with an 8x 3.2 GHz core system.

I don't know if that is the improvement you expected? Would the iproute2 redesign you mentioned help improve performance even further?

As a reference, the iproute2 baseline w/o patch:

# veth    add    delete
1000      57     70
2000      193    250
3000      435    510
4000      752    824
5000      1123   1185

Regards

Benoit

On 22/03/2013 23:27, "Stephen Hemminger" <stephen@networkplumber.org> wrote:

>The whole ifindex map is a design mistake at this point.
>Better off to do a lazy cache or something like that.
>
>
>On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman
><ebiederm@xmission.com> wrote:
>>
>> Because ip link add, set, and delete map the interface name to the
>> interface index by dumping all of the interfaces before performing
>> their respective commands. Operations that should be constant time
>> slow down when lots of network interfaces are in use. Resulting
>> in O(N^2) time to work with O(N) devices.
>>
>> Make the work that iproute does constant time by passing the interface
>> name to the kernel instead.
>>
>> In small scale testing on my system this shows dramatic performance
>> increases of ip link add from 120s to just 11s to add 5000 network
>> devices. And from longer than I cared to wait to just 58s to delete
>> all of those interfaces again.
>>
>> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
>> Reported-by: Benoit Lourdelet <blourdel@juniper.net>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> I think I am bungling the case where people specify an ifindex as ifNNNN
>> but does anyone care?
>> >> ip/iplink.c | 19 +------------------ >> 1 files changed, 1 insertions(+), 18 deletions(-) >> >> diff --git a/ip/iplink.c b/ip/iplink.c >> index ad33611..6dffbf0 100644 >> --- a/ip/iplink.c >> +++ b/ip/iplink.c >> @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int >>flags, int argc, char **argv) >> } >> } >> >> - ll_init_map(&rth); >> - >> if (!(flags & NLM_F_CREATE)) { >> if (!dev) { >> fprintf(stderr, "Not enough information: >>\"dev\" " >> @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int >>flags, int argc, char **argv) >> exit(-1); >> } >> >> - req.i.ifi_index = ll_name_to_index(dev); >> - if (req.i.ifi_index == 0) { >> - fprintf(stderr, "Cannot find device \"%s\"\n", >>dev); >> - return -1; >> - } >> + name = dev; >> } else { >> /* Allow "ip link add dev" and "ip link add name" */ >> if (!name) >> name = dev; >> >> - if (link) { >> - int ifindex; >> - >> - ifindex = ll_name_to_index(link); >> - if (ifindex == 0) { >> - fprintf(stderr, "Cannot find device >>\"%s\"\n", >> - link); >> - return -1; >> - } >> - addattr_l(&req.n, sizeof(req), IFLA_LINK, >>&ifindex, 4); >> - } >> } >> >> if (name) { >> -- >> 1.7.5.4 >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-26 11:51 ` Benoit Lourdelet @ 2013-03-26 12:40 ` Eric W. Biederman 2013-03-26 14:17 ` Serge Hallyn 2013-03-26 14:33 ` Serge Hallyn 2013-03-26 15:31 ` Eric Dumazet 1 sibling, 2 replies; 32+ messages in thread From: Eric W. Biederman @ 2013-03-26 12:40 UTC (permalink / raw) To: Benoit Lourdelet; +Cc: Stephen Hemminger, netdev@vger.kernel.org, Serge Hallyn Benoit Lourdelet <blourdel@juniper.net> writes: > Hello, > > I re-tested with the patch and got the following results on a 32x 2Ghz > core system. > > # veth add delete > 1000 36 34 > 3000 259 137 > 4000 462 195 > 5000 729 N/A > > The script to create is the following : > for i in `seq 1 5000`; do > sudo ip link add type veth > Done Which performs horribly as I mentioned earlier because you are asking the kernel to create the names. If you want performance you need to specify the names of the network devices you are creating. aka ip link add a$i type veth name b$i > The script to delete: > for d in /sys/class/net/veth*; do > ip link del `basename $d` 2>/dev/null || true > Done > > There is a very good improvement in deletion. > > > > iproute2 does not seems to be well multithread as I get time divided by a > factor of 2 with a 8x 3.2 Ghz core system. All netlink traffic and all network stack configuration is serialized by the rtnl_lock in the kernel. This is the slow path in the kernel, not the fast path. > I don¹t know if that is the improvement you expected ? > > Would the iproute2 redesign you mentioned help improve performance even > further ? Specifing the names would dramatically improve your creation performance. It should only take you about 10s for 5000 veth pairs. But you have to specify the names. Anyway I have exhausted my time, and inclination in this matter. Good luck with whatever your problem is. Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-26 12:40 ` Eric W. Biederman @ 2013-03-26 14:17 ` Serge Hallyn 2013-03-26 14:33 ` Serge Hallyn 1 sibling, 0 replies; 32+ messages in thread From: Serge Hallyn @ 2013-03-26 14:17 UTC (permalink / raw) To: Eric W. Biederman Cc: Benoit Lourdelet, Stephen Hemminger, netdev@vger.kernel.org Quoting Eric W. Biederman (ebiederm@xmission.com): > Specifing the names would dramatically improve your creation > performance. It should only take you about 10s for 5000 veth pairs. > But you have to specify the names. Thanks, Eric. I'm going to update lxc to always specify names for the veth pairs, rather than only when they are requested by the user's configuration file. -serge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-26 12:40 ` Eric W. Biederman 2013-03-26 14:17 ` Serge Hallyn @ 2013-03-26 14:33 ` Serge Hallyn 2013-03-27 13:37 ` Benoit Lourdelet 1 sibling, 1 reply; 32+ messages in thread From: Serge Hallyn @ 2013-03-26 14:33 UTC (permalink / raw) To: Eric W. Biederman Cc: Benoit Lourdelet, Stephen Hemminger, netdev@vger.kernel.org Actually, lxc is using random names now, so it's ok. Benoit, can you use the patches from Eric with lxc (or use the script you were using before but specify names as he said)? -serge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-26 14:33 ` Serge Hallyn @ 2013-03-27 13:37 ` Benoit Lourdelet 2013-03-27 15:11 ` Eric W. Biederman 0 siblings, 1 reply; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-27 13:37 UTC (permalink / raw) To: Serge Hallyn, Eric W. Biederman; +Cc: Stephen Hemminger, netdev@vger.kernel.org Hello Serge, I am indeed using Eric patch with lxc. It solves the initial problem of slowness to start around 1600 containers. I am now able to start more than 2000 without having new containers slower and slower to start. thanks Benoit On 26/03/2013 15:33, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote: >Actually, lxc is using random names now, so it's ok. > >Benoit, can you use the patches from Eric with lxc (or use the script >you were using before but specify names as he said)? > >-serge > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-27 13:37 ` Benoit Lourdelet @ 2013-03-27 15:11 ` Eric W. Biederman 2013-03-27 17:47 ` Stephen Hemminger 2013-03-28 20:27 ` Benoit Lourdelet 0 siblings, 2 replies; 32+ messages in thread From: Eric W. Biederman @ 2013-03-27 15:11 UTC (permalink / raw) To: Benoit Lourdelet; +Cc: Serge Hallyn, Stephen Hemminger, netdev@vger.kernel.org

Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello Serge,
>
> I am indeed using Eric patch with lxc.
>
> It solves the initial problem of slowness to start around 1600
> containers.

Good, so now we just need a production-ready patch for iproute.

> I am now able to start more than 2000 without having new containers
> slower and slower to start.

May I ask how large a box you are running and how complex your containers are? I am trying to get a feel for how common it is likely to be to find people running thousands of containers on a single machine.

Eric

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-27 15:11 ` Eric W. Biederman @ 2013-03-27 17:47 ` Stephen Hemminger 2013-03-28 0:46 ` Eric W. Biederman 2013-03-28 20:27 ` Benoit Lourdelet 1 sibling, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2013-03-27 17:47 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org If you need to do lots of operations the --batch mode will be significantly faster. One command start and one link map. I have an updated version of link map hash (index and name). Could you test this patch which applies to latest version in git. diff --git a/lib/ll_map.c b/lib/ll_map.c index e9ae129..bf5b0bc 100644 --- a/lib/ll_map.c +++ b/lib/ll_map.c @@ -12,6 +12,7 @@ #include <stdio.h> #include <stdlib.h> +#include <stddef.h> #include <unistd.h> #include <syslog.h> #include <fcntl.h> @@ -23,9 +24,44 @@ #include "libnetlink.h" #include "ll_map.h" -struct ll_cache + +struct hlist_head { + struct hlist_node *first; +}; + +struct hlist_node { + struct hlist_node *next, **pprev; +}; + +static inline void hlist_del(struct hlist_node *n) +{ + struct hlist_node *next = n->next; + struct hlist_node **pprev = n->pprev; + *pprev = next; + if (next) + next->pprev = pprev; +} + +static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h) { - struct ll_cache *idx_next; + struct hlist_node *first = h->first; + n->next = first; + if (first) + first->pprev = &n->next; + h->first = n; + n->pprev = &h->first; +} + +#define hlist_for_each(pos, head) \ + for (pos = (head)->first; pos ; pos = pos->next) + +#define container_of(ptr, type, member) ({ \ + const typeof( ((type *)0)->member ) *__mptr = (ptr); \ + (type *)( (char *)__mptr - offsetof(type,member) );}) + +struct ll_cache { + struct hlist_node idx_hash; + struct hlist_node name_hash; unsigned flags; int index; unsigned short type; @@ -33,49 +69,107 @@ struct ll_cache }; #define IDXMAP_SIZE 1024 -static struct ll_cache *idx_head[IDXMAP_SIZE]; +static struct hlist_head idx_head[IDXMAP_SIZE]; +static struct hlist_head name_head[IDXMAP_SIZE]; -static inline struct ll_cache *idxhead(int idx) +static struct ll_cache *ll_get_by_index(unsigned index) { - return idx_head[idx & (IDXMAP_SIZE - 1)]; + struct hlist_node *n; + unsigned h = index & (IDXMAP_SIZE - 1); + + hlist_for_each(n, &idx_head[h]) { + struct ll_cache *im + = container_of(n, struct ll_cache, idx_hash); + if (im->index == index) + return im; + } + + return NULL; +} + +static unsigned namehash(const char *str) +{ + unsigned hash = 5381; + + while (*str) + hash = ((hash << 5) + hash) + *str++; /* hash * 33 + c */ + + return hash; +} + +static struct ll_cache *ll_get_by_name(const char *name) +{ + struct hlist_node *n; + unsigned h = namehash(name) & (IDXMAP_SIZE - 1); + + hlist_for_each(n, &name_head[h]) { + struct ll_cache *im + = container_of(n, struct ll_cache, name_hash); + + if (strncmp(im->name, name, IFNAMSIZ) == 0) + return im; + } + + return NULL; } int ll_remember_index(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) { - int h; + unsigned int h; + const char *ifname; struct ifinfomsg *ifi = NLMSG_DATA(n); - struct ll_cache *im, **imp; + struct ll_cache *im; struct rtattr *tb[IFLA_MAX+1]; - if (n->nlmsg_type != RTM_NEWLINK) + if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK) return 0; if (n->nlmsg_len < NLMSG_LENGTH(sizeof(ifi))) return -1; + im = ll_get_by_index(ifi->ifi_index); + if (n->nlmsg_type == RTM_DELLINK) { + if (im) { + 
hlist_del(&im->name_hash); + hlist_del(&im->idx_hash); + free(im); + } + return 0; + } + memset(tb, 0, sizeof(tb)); parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), IFLA_PAYLOAD(n)); - if (tb[IFLA_IFNAME] == NULL) + ifname = rta_getattr_str(tb[IFLA_IFNAME]); + if (ifname == NULL) return 0; - h = ifi->ifi_index & (IDXMAP_SIZE - 1); - for (imp = &idx_head[h]; (im=*imp)!=NULL; imp = &im->idx_next) - if (im->index == ifi->ifi_index) - break; - - if (im == NULL) { - im = malloc(sizeof(*im)); - if (im == NULL) - return 0; - im->idx_next = *imp; - im->index = ifi->ifi_index; - *imp = im; + if (im) { + /* change to existing entry */ + if (strcmp(im->name, ifname) != 0) { + hlist_del(&im->name_hash); + h = namehash(ifname) & (IDXMAP_SIZE - 1); + hlist_add_head(&im->name_hash, &name_head[h]); + } + + im->flags = ifi->ifi_flags; + return 0; } + im = malloc(sizeof(*im)); + if (im == NULL) + return 0; + im->index = ifi->ifi_index; + strcpy(im->name, ifname); im->type = ifi->ifi_type; im->flags = ifi->ifi_flags; - strcpy(im->name, RTA_DATA(tb[IFLA_IFNAME])); + + h = ifi->ifi_index & (IDXMAP_SIZE - 1); + hlist_add_head(&im->idx_hash, &idx_head[h]); + + h = namehash(ifname) & (IDXMAP_SIZE - 1); + hlist_add_head(&im->name_hash, &name_head[h]); + return 0; } @@ -86,15 +180,14 @@ const char *ll_idx_n2a(unsigned idx, char *buf) if (idx == 0) return "*"; - for (im = idxhead(idx); im; im = im->idx_next) - if (im->index == idx) - return im->name; + im = ll_get_by_index(idx); + if (im) + return im->name; snprintf(buf, IFNAMSIZ, "if%d", idx); return buf; } - const char *ll_index_to_name(unsigned idx) { static char nbuf[IFNAMSIZ]; @@ -108,10 +201,9 @@ int ll_index_to_type(unsigned idx) if (idx == 0) return -1; - for (im = idxhead(idx); im; im = im->idx_next) - if (im->index == idx) - return im->type; - return -1; + + im = ll_get_by_index(idx); + return im ? im->type : -1; } unsigned ll_index_to_flags(unsigned idx) @@ -121,35 +213,21 @@ unsigned ll_index_to_flags(unsigned idx) if (idx == 0) return 0; - for (im = idxhead(idx); im; im = im->idx_next) - if (im->index == idx) - return im->flags; - return 0; + im = ll_get_by_index(idx); + return im ? im->flags : -1; } unsigned ll_name_to_index(const char *name) { - static char ncache[IFNAMSIZ]; - static int icache; - struct ll_cache *im; - int i; + const struct ll_cache *im; unsigned idx; if (name == NULL) return 0; - if (icache && strcmp(name, ncache) == 0) - return icache; - - for (i=0; i<IDXMAP_SIZE; i++) { - for (im = idx_head[i]; im; im = im->idx_next) { - if (strcmp(im->name, name) == 0) { - icache = im->index; - strcpy(ncache, name); - return im->index; - } - } - } + im = ll_get_by_name(name); + if (im) + return im->index; idx = if_nametoindex(name); if (idx == 0) ^ permalink raw reply related [flat|nested] 32+ messages in thread
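For readers not used to the kernel-style containers: the patch keeps each ll_cache entry on two intrusive hash chains at once, one keyed by ifindex and one keyed by a djb2 hash of the name, and recovers the entry from either chain with container_of(). The toy program below is a trimmed standalone illustration of that pattern (the hlist helpers here drop the pprev back-pointer that the real patch keeps so entries can also be unlinked); it is not the iproute2 code:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    n->next = h->first;
    h->first = n;
}

struct ll_cache {
    struct hlist_node idx_hash;   /* chained in idx_head[] */
    struct hlist_node name_hash;  /* chained in name_head[] */
    int  index;
    char name[16];
};

#define HASHSZ 1024
static struct hlist_head idx_head[HASHSZ], name_head[HASHSZ];

static unsigned namehash(const char *s)   /* djb2, as in the patch */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static struct ll_cache *get_by_name(const char *name)
{
    struct hlist_node *n;

    for (n = name_head[namehash(name) & (HASHSZ - 1)].first; n; n = n->next) {
        /* step back from the embedded node to the enclosing ll_cache */
        struct ll_cache *c = container_of(n, struct ll_cache, name_hash);
        if (strcmp(c->name, name) == 0)
            return c;
    }
    return NULL;
}

int main(void)
{
    struct ll_cache *c = calloc(1, sizeof(*c));

    c->index = 42;
    strcpy(c->name, "veth42");
    /* the same object hangs off both hash tables */
    hlist_add_head(&c->idx_hash,  &idx_head[c->index & (HASHSZ - 1)]);
    hlist_add_head(&c->name_hash, &name_head[namehash(c->name) & (HASHSZ - 1)]);

    printf("veth42 -> index %d\n", get_by_name("veth42")->index);
    return 0;
}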
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-27 17:47 ` Stephen Hemminger @ 2013-03-28 0:46 ` Eric W. Biederman 2013-03-28 3:20 ` Serge Hallyn 0 siblings, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2013-03-28 0:46 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org

Stephen Hemminger <stephen@networkplumber.org> writes:

> If you need to do lots of operations the --batch mode will be significantly faster.
> One command start and one link map.

The problem in this case as I understand it is lots of independent operations. Now maybe lxc should not shell out to ip and perform the work itself.

> I have an updated version of link map hash (index and name). Could you test this patch
> which applies to latest version in git.

This still dumps all of the interfaces in ll_init_map, causing things to slow down noticeably.

# with your patch
# time ~/projects/iproute/iproute2/ip/ip link add a4511 type veth peer name b4511

real    0m0.049s
user    0m0.000s
sys     0m0.048s

# With a hack to make ll_map_init a nop.
# time ~/projects/iproute/iproute2/ip/ip link add a4512 type veth peer name b4512

real    0m0.003s
user    0m0.000s
sys     0m0.000s
eric-ThinkPad-X220 6bed4 #

# Without any patches.
# time ~/projects/iproute/iproute2/ip/ip link add a5002 type veth peer name b5002

real    0m0.052s
user    0m0.004s
sys     0m0.044s

So it looks like dumping all of the interfaces is taking about 46 milliseconds longer than it otherwise would, causing ip to take nearly an order of magnitude longer to run when there are a lot of interfaces and to slow down with each command.

So the ideal situation is probably just to fill in the ll_map on demand instead of up front.

Eric

^ permalink raw reply [flat|nested] 32+ messages in thread
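The 46 ms gap is the cost of the wildcard RTM_GETLINK dump that ll_init_map() still issues: the kernel answers with one RTM_NEWLINK message per device, so every ip invocation receives, parses and caches thousands of links it will never look at. A rough standalone sketch of that dump, just enough to show where the per-device work comes from (not the iproute2 implementation):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
    /* the classic wildcard link dump: rtgenmsg + NLM_F_DUMP */
    struct {
        struct nlmsghdr nh;
        struct rtgenmsg g;
    } req = {
        .nh = {
            .nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtgenmsg)),
            .nlmsg_type  = RTM_GETLINK,
            .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
            .nlmsg_seq   = 1,
        },
        .g = { .rtgen_family = AF_UNSPEC },
    };
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    char buf[32768];
    int links = 0, done = 0;

    if (fd < 0 || send(fd, &req, req.nh.nlmsg_len, 0) < 0) {
        perror("netlink");
        return 1;
    }

    while (!done) {
        ssize_t len = recv(fd, buf, sizeof(buf), 0);
        struct nlmsghdr *nh;

        if (len <= 0)
            break;
        for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
             nh = NLMSG_NEXT(nh, len)) {
            if (nh->nlmsg_type == NLMSG_DONE) {
                done = 1;
                break;
            }
            if (nh->nlmsg_type == RTM_NEWLINK)
                links++;   /* one of these per device, parsed and cached */
        }
    }
    printf("dump returned %d links\n", links);
    close(fd);
    return 0;
}

Filling the map on demand, as suggested above, replaces this O(number of devices) start-up cost with a single lookup for the one name a command actually uses.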
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 0:46 ` Eric W. Biederman @ 2013-03-28 3:20 ` Serge Hallyn 2013-03-28 3:44 ` Eric W. Biederman 0 siblings, 1 reply; 32+ messages in thread From: Serge Hallyn @ 2013-03-28 3:20 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org Quoting Eric W. Biederman (ebiederm@xmission.com): > Stephen Hemminger <stephen@networkplumber.org> writes: > > > If you need to do lots of operations the --batch mode will be significantly faster. > > One command start and one link map. > > The problem in this case as I understand it is lots of independent > operations. Now maybe lxc should not shell out to ip and perform the > work itself. fwiw lxc uses netlink to create new veths, and picks random names with mktemp() ahead of time. -serge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 3:20 ` Serge Hallyn @ 2013-03-28 3:44 ` Eric W. Biederman 2013-03-28 4:28 ` Serge Hallyn 0 siblings, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2013-03-28 3:44 UTC (permalink / raw) To: Serge Hallyn; +Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org Serge Hallyn <serge.hallyn@ubuntu.com> writes: > Quoting Eric W. Biederman (ebiederm@xmission.com): >> Stephen Hemminger <stephen@networkplumber.org> writes: >> >> > If you need to do lots of operations the --batch mode will be significantly faster. >> > One command start and one link map. >> >> The problem in this case as I understand it is lots of independent >> operations. Now maybe lxc should not shell out to ip and perform the >> work itself. > > fwiw lxc uses netlink to create new veths, and picks random names with > mktemp() ahead of time. I am puzzled where does the slownes in iproute2 come into play? Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 3:44 ` Eric W. Biederman @ 2013-03-28 4:28 ` Serge Hallyn 2013-03-28 5:00 ` Eric W. Biederman 0 siblings, 1 reply; 32+ messages in thread From: Serge Hallyn @ 2013-03-28 4:28 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org Quoting Eric W. Biederman (ebiederm@xmission.com): > Serge Hallyn <serge.hallyn@ubuntu.com> writes: > > > Quoting Eric W. Biederman (ebiederm@xmission.com): > >> Stephen Hemminger <stephen@networkplumber.org> writes: > >> > >> > If you need to do lots of operations the --batch mode will be significantly faster. > >> > One command start and one link map. > >> > >> The problem in this case as I understand it is lots of independent > >> operations. Now maybe lxc should not shell out to ip and perform the > >> work itself. > > > > fwiw lxc uses netlink to create new veths, and picks random names with > > mktemp() ahead of time. > > I am puzzled where does the slownes in iproute2 come into play? Benoit originally reported slowness when starting >1500 containers. I asked him to run a few manual tests to figure out what was taking the time. Manually creating a large # of veths was an obvious test, and one which showed poorly scaling performance. May well be there are other things slowing down lxc of course. -serge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 4:28 ` Serge Hallyn @ 2013-03-28 5:00 ` Eric W. Biederman 2013-03-28 13:36 ` Serge Hallyn 0 siblings, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2013-03-28 5:00 UTC (permalink / raw) To: Serge Hallyn; +Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org Serge Hallyn <serge.hallyn@ubuntu.com> writes: > Quoting Eric W. Biederman (ebiederm@xmission.com): >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com): >> >> Stephen Hemminger <stephen@networkplumber.org> writes: >> >> >> >> > If you need to do lots of operations the --batch mode will be significantly faster. >> >> > One command start and one link map. >> >> >> >> The problem in this case as I understand it is lots of independent >> >> operations. Now maybe lxc should not shell out to ip and perform the >> >> work itself. >> > >> > fwiw lxc uses netlink to create new veths, and picks random names with >> > mktemp() ahead of time. >> >> I am puzzled where does the slownes in iproute2 come into play? > > Benoit originally reported slowness when starting >1500 containers. I > asked him to run a few manual tests to figure out what was taking the > time. Manually creating a large # of veths was an obvious test, and > one which showed poorly scaling performance. Apparently iproute is involved somehwere as when he tested with a patched iproute (as you asked him to) the lxc startup slowdown was gone. > May well be there are other things slowing down lxc of course. The evidence indicates it was iproute being called somewhere... Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 5:00 ` Eric W. Biederman @ 2013-03-28 13:36 ` Serge Hallyn 2013-03-28 13:42 ` Benoit Lourdelet 0 siblings, 1 reply; 32+ messages in thread From: Serge Hallyn @ 2013-03-28 13:36 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, netdev@vger.kernel.org Quoting Eric W. Biederman (ebiederm@xmission.com): > Serge Hallyn <serge.hallyn@ubuntu.com> writes: > > > Quoting Eric W. Biederman (ebiederm@xmission.com): > >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> > >> > Quoting Eric W. Biederman (ebiederm@xmission.com): > >> >> Stephen Hemminger <stephen@networkplumber.org> writes: > >> >> > >> >> > If you need to do lots of operations the --batch mode will be significantly faster. > >> >> > One command start and one link map. > >> >> > >> >> The problem in this case as I understand it is lots of independent > >> >> operations. Now maybe lxc should not shell out to ip and perform the > >> >> work itself. > >> > > >> > fwiw lxc uses netlink to create new veths, and picks random names with > >> > mktemp() ahead of time. > >> > >> I am puzzled where does the slownes in iproute2 come into play? > > > > Benoit originally reported slowness when starting >1500 containers. I > > asked him to run a few manual tests to figure out what was taking the > > time. Manually creating a large # of veths was an obvious test, and > > one which showed poorly scaling performance. > > Apparently iproute is involved somehwere as when he tested with a > patched iproute (as you asked him to) the lxc startup slowdown was > gone. > > > May well be there are other things slowing down lxc of course. > > The evidence indicates it was iproute being called somewhere... Benoit can you tell us exactly what test you were running when you saw the slowdown was gone? -serge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 13:36 ` Serge Hallyn @ 2013-03-28 13:42 ` Benoit Lourdelet 2013-03-28 15:04 ` Serge Hallyn 0 siblings, 1 reply; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-28 13:42 UTC (permalink / raw) To: Serge Hallyn, Eric W. Biederman; +Cc: Stephen Hemminger, netdev@vger.kernel.org Hello, My test consists in starting small containers (10MB of RAM ) each. Each container has 2x physical VLAN interfaces attached. lxc.network.type = phys lxc.network.flags = up lxc.network.link = eth6.3 lxc.network.name = eth2 lxc.network.hwaddr = 00:50:56:a8:03:03 lxc.network.ipv4 = 192.168.1.1/24 lxc.network.type = phys lxc.network.flags = up lxc.network.link = eth7.3 lxc.network.name = eth1 lxc.network.ipv4 = 2.2.2.2/24 lxc.network.hwaddr = 00:50:57:b8:00:01 With initial iproute2 , when I reach around 1600 containers, container creation almost stops.It takes at least 20s per container to start. With patched iproutes2 , I have started 4000 containers at a rate of 1 per second w/o problem. I have 8000 clan interfaces configured on the host (2x 4000). Regards Benoit On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote: >Quoting Eric W. Biederman (ebiederm@xmission.com): >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com): >> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >> >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com): >> >> >> Stephen Hemminger <stephen@networkplumber.org> writes: >> >> >> >> >> >> > If you need to do lots of operations the --batch mode will be >>significantly faster. >> >> >> > One command start and one link map. >> >> >> >> >> >> The problem in this case as I understand it is lots of independent >> >> >> operations. Now maybe lxc should not shell out to ip and perform >>the >> >> >> work itself. >> >> > >> >> > fwiw lxc uses netlink to create new veths, and picks random names >>with >> >> > mktemp() ahead of time. >> >> >> >> I am puzzled where does the slownes in iproute2 come into play? >> > >> > Benoit originally reported slowness when starting >1500 containers. I >> > asked him to run a few manual tests to figure out what was taking the >> > time. Manually creating a large # of veths was an obvious test, and >> > one which showed poorly scaling performance. >> >> Apparently iproute is involved somehwere as when he tested with a >> patched iproute (as you asked him to) the lxc startup slowdown was >> gone. >> >> > May well be there are other things slowing down lxc of course. >> >> The evidence indicates it was iproute being called somewhere... > >Benoit can you tell us exactly what test you were running when you saw >the slowdown was gone? > >-serge > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 13:42 ` Benoit Lourdelet @ 2013-03-28 15:04 ` Serge Hallyn 2013-03-28 15:21 ` Benoit Lourdelet 0 siblings, 1 reply; 32+ messages in thread From: Serge Hallyn @ 2013-03-28 15:04 UTC (permalink / raw) To: Benoit Lourdelet Cc: Eric W. Biederman, Stephen Hemminger, netdev@vger.kernel.org Quoting Benoit Lourdelet (blourdel@juniper.net): > Hello, > > My test consists in starting small containers (10MB of RAM ) each. Each > container has 2x physical VLAN interfaces attached. Which commands were you using to create/start them? > lxc.network.type = phys > lxc.network.flags = up > lxc.network.link = eth6.3 > lxc.network.name = eth2 > lxc.network.hwaddr = 00:50:56:a8:03:03 > lxc.network.ipv4 = 192.168.1.1/24 > lxc.network.type = phys > lxc.network.flags = up > lxc.network.link = eth7.3 > lxc.network.name = eth1 > lxc.network.ipv4 = 2.2.2.2/24 > lxc.network.hwaddr = 00:50:57:b8:00:01 > > > > With initial iproute2 , when I reach around 1600 containers, container > creation almost stops.It takes at least 20s per container to start. > With patched iproutes2 , I have started 4000 containers at a rate of 1 per > second w/o problem. I have 8000 clan interfaces configured on the host (2x > 4000). > > > Regards > > Benoit > > On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote: > > >Quoting Eric W. Biederman (ebiederm@xmission.com): > >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> > >> > Quoting Eric W. Biederman (ebiederm@xmission.com): > >> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> >> > >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com): > >> >> >> Stephen Hemminger <stephen@networkplumber.org> writes: > >> >> >> > >> >> >> > If you need to do lots of operations the --batch mode will be > >>significantly faster. > >> >> >> > One command start and one link map. > >> >> >> > >> >> >> The problem in this case as I understand it is lots of independent > >> >> >> operations. Now maybe lxc should not shell out to ip and perform > >>the > >> >> >> work itself. > >> >> > > >> >> > fwiw lxc uses netlink to create new veths, and picks random names > >>with > >> >> > mktemp() ahead of time. > >> >> > >> >> I am puzzled where does the slownes in iproute2 come into play? > >> > > >> > Benoit originally reported slowness when starting >1500 containers. I > >> > asked him to run a few manual tests to figure out what was taking the > >> > time. Manually creating a large # of veths was an obvious test, and > >> > one which showed poorly scaling performance. > >> > >> Apparently iproute is involved somehwere as when he tested with a > >> patched iproute (as you asked him to) the lxc startup slowdown was > >> gone. > >> > >> > May well be there are other things slowing down lxc of course. > >> > >> The evidence indicates it was iproute being called somewhere... > > > >Benoit can you tell us exactly what test you were running when you saw > >the slowdown was gone? > > > >-serge > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 15:04 ` Serge Hallyn @ 2013-03-28 15:21 ` Benoit Lourdelet 2013-03-28 22:20 ` Stephen Hemminger 0 siblings, 1 reply; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-28 15:21 UTC (permalink / raw) To: Serge Hallyn; +Cc: Eric W. Biederman, Stephen Hemminger, netdev@vger.kernel.org I use, for each container : lxc-start -n lwb2001 -f /var/lib/lxc/lwb2001/config -d I created the containers with lxc-ubuntu -n lwb2001 Benoit On 28/03/2013 16:04, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote: >Quoting Benoit Lourdelet (blourdel@juniper.net): >> Hello, >> >> My test consists in starting small containers (10MB of RAM ) each. Each >> container has 2x physical VLAN interfaces attached. > >Which commands were you using to create/start them? > >> lxc.network.type = phys >> lxc.network.flags = up >> lxc.network.link = eth6.3 >> lxc.network.name = eth2 >> lxc.network.hwaddr = 00:50:56:a8:03:03 >> lxc.network.ipv4 = 192.168.1.1/24 >> lxc.network.type = phys >> lxc.network.flags = up >> lxc.network.link = eth7.3 >> lxc.network.name = eth1 >> lxc.network.ipv4 = 2.2.2.2/24 >> lxc.network.hwaddr = 00:50:57:b8:00:01 >> >> >> >> With initial iproute2 , when I reach around 1600 containers, container >> creation almost stops.It takes at least 20s per container to start. >> With patched iproutes2 , I have started 4000 containers at a rate of 1 >>per >> second w/o problem. I have 8000 clan interfaces configured on the host >>(2x >> 4000). >> >> >> Regards >> >> Benoit >> >> On 28/03/2013 14:36, "Serge Hallyn" <serge.hallyn@ubuntu.com> wrote: >> >> >Quoting Eric W. Biederman (ebiederm@xmission.com): >> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >> >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com): >> >> >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >> >> >> >> >> > Quoting Eric W. Biederman (ebiederm@xmission.com): >> >> >> >> Stephen Hemminger <stephen@networkplumber.org> writes: >> >> >> >> >> >> >> >> > If you need to do lots of operations the --batch mode will be >> >>significantly faster. >> >> >> >> > One command start and one link map. >> >> >> >> >> >> >> >> The problem in this case as I understand it is lots of >>independent >> >> >> >> operations. Now maybe lxc should not shell out to ip and >>perform >> >>the >> >> >> >> work itself. >> >> >> > >> >> >> > fwiw lxc uses netlink to create new veths, and picks random >>names >> >>with >> >> >> > mktemp() ahead of time. >> >> >> >> >> >> I am puzzled where does the slownes in iproute2 come into play? >> >> > >> >> > Benoit originally reported slowness when starting >1500 >>containers. I >> >> > asked him to run a few manual tests to figure out what was taking >>the >> >> > time. Manually creating a large # of veths was an obvious test, >>and >> >> > one which showed poorly scaling performance. >> >> >> >> Apparently iproute is involved somehwere as when he tested with a >> >> patched iproute (as you asked him to) the lxc startup slowdown was >> >> gone. >> >> >> >> > May well be there are other things slowing down lxc of course. >> >> >> >> The evidence indicates it was iproute being called somewhere... >> > >> >Benoit can you tell us exactly what test you were running when you saw >> >the slowdown was gone? >> > >> >-serge >> > >> >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 15:21 ` Benoit Lourdelet @ 2013-03-28 22:20 ` Stephen Hemminger 2013-03-28 23:52 ` Eric W. Biederman 0 siblings, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2013-03-28 22:20 UTC (permalink / raw) To: Benoit Lourdelet; +Cc: Serge Hallyn, Eric W. Biederman, netdev@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 128 bytes --] Try the following two patches. It adds a name hash list, and uses Eric's idea to avoid loading map on add/delete operations. [-- Attachment #2: 0001-ll_map-add-name-and-index-hash.patch --] [-- Type: text/x-patch, Size: 8143 bytes --] >From 0025e5d63d5d1598ab622867834a3bcb9f518f9f Mon Sep 17 00:00:00 2001 From: Stephen Hemminger <stephen@networkplumber.org> Date: Thu, 28 Mar 2013 14:57:28 -0700 Subject: [PATCH 1/2] ll_map: add name and index hash Make ll_ functions faster by having a name hash, and allow for deletion. Also, allow them to work without calling ll_init_map. --- include/hlist.h | 56 ++++++++++++++++++++ include/ll_map.h | 3 +- lib/ll_map.c | 155 ++++++++++++++++++++++++++++++++++-------------------- 3 files changed, 157 insertions(+), 57 deletions(-) create mode 100644 include/hlist.h diff --git a/include/hlist.h b/include/hlist.h new file mode 100644 index 0000000..4e8de9e --- /dev/null +++ b/include/hlist.h @@ -0,0 +1,56 @@ +#ifndef __HLIST_H__ +#define __HLIST_H__ 1 +/* Hash list stuff from kernel */ + +#include <stddef.h> + +#define container_of(ptr, type, member) ({ \ + const typeof( ((type *)0)->member ) *__mptr = (ptr); \ + (type *)( (char *)__mptr - offsetof(type,member) );}) + +struct hlist_head { + struct hlist_node *first; +}; + +struct hlist_node { + struct hlist_node *next, **pprev; +}; + +static inline void hlist_del(struct hlist_node *n) +{ + struct hlist_node *next = n->next; + struct hlist_node **pprev = n->pprev; + *pprev = next; + if (next) + next->pprev = pprev; +} + +static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h) +{ + struct hlist_node *first = h->first; + n->next = first; + if (first) + first->pprev = &n->next; + h->first = n; + n->pprev = &h->first; +} + +#define hlist_for_each(pos, head) \ + for (pos = (head)->first; pos ; pos = pos->next) + + +#define hlist_for_each_safe(pos, n, head) \ + for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \ + pos = n) + +#define hlist_entry_safe(ptr, type, member) \ + ({ typeof(ptr) ____ptr = (ptr); \ + ____ptr ? 
hlist_entry(____ptr, type, member) : NULL; \ + }) + +#define hlist_for_each_entry(pos, head, member) \ + for (pos = hlist_entry_safe((head)->first, typeof(*(pos)), member);\ + pos; \ + pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member)) + +#endif /* __HLIST_H__ */ diff --git a/include/ll_map.h b/include/ll_map.h index c4d5c6d..f1dda39 100644 --- a/include/ll_map.h +++ b/include/ll_map.h @@ -3,7 +3,8 @@ extern int ll_remember_index(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg); -extern int ll_init_map(struct rtnl_handle *rth); + +extern void ll_init_map(struct rtnl_handle *rth); extern unsigned ll_name_to_index(const char *name); extern const char *ll_index_to_name(unsigned idx); extern const char *ll_idx_n2a(unsigned idx, char *buf); diff --git a/lib/ll_map.c b/lib/ll_map.c index e9ae129..fd7db55 100644 --- a/lib/ll_map.c +++ b/lib/ll_map.c @@ -22,10 +22,11 @@ #include "libnetlink.h" #include "ll_map.h" +#include "hlist.h" -struct ll_cache -{ - struct ll_cache *idx_next; +struct ll_cache { + struct hlist_node idx_hash; + struct hlist_node name_hash; unsigned flags; int index; unsigned short type; @@ -33,49 +34,107 @@ struct ll_cache }; #define IDXMAP_SIZE 1024 -static struct ll_cache *idx_head[IDXMAP_SIZE]; +static struct hlist_head idx_head[IDXMAP_SIZE]; +static struct hlist_head name_head[IDXMAP_SIZE]; -static inline struct ll_cache *idxhead(int idx) +static struct ll_cache *ll_get_by_index(unsigned index) { - return idx_head[idx & (IDXMAP_SIZE - 1)]; + struct hlist_node *n; + unsigned h = index & (IDXMAP_SIZE - 1); + + hlist_for_each(n, &idx_head[h]) { + struct ll_cache *im + = container_of(n, struct ll_cache, idx_hash); + if (im->index == index) + return im; + } + + return NULL; +} + +static unsigned namehash(const char *str) +{ + unsigned hash = 5381; + + while (*str) + hash = ((hash << 5) + hash) + *str++; /* hash * 33 + c */ + + return hash; +} + +static struct ll_cache *ll_get_by_name(const char *name) +{ + struct hlist_node *n; + unsigned h = namehash(name) & (IDXMAP_SIZE - 1); + + hlist_for_each(n, &name_head[h]) { + struct ll_cache *im + = container_of(n, struct ll_cache, name_hash); + + if (strncmp(im->name, name, IFNAMSIZ) == 0) + return im; + } + + return NULL; } int ll_remember_index(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) { - int h; + unsigned int h; + const char *ifname; struct ifinfomsg *ifi = NLMSG_DATA(n); - struct ll_cache *im, **imp; + struct ll_cache *im; struct rtattr *tb[IFLA_MAX+1]; - if (n->nlmsg_type != RTM_NEWLINK) + if (n->nlmsg_type != RTM_NEWLINK && n->nlmsg_type != RTM_DELLINK) return 0; if (n->nlmsg_len < NLMSG_LENGTH(sizeof(ifi))) return -1; + im = ll_get_by_index(ifi->ifi_index); + if (n->nlmsg_type == RTM_DELLINK) { + if (im) { + hlist_del(&im->name_hash); + hlist_del(&im->idx_hash); + free(im); + } + return 0; + } + memset(tb, 0, sizeof(tb)); parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), IFLA_PAYLOAD(n)); - if (tb[IFLA_IFNAME] == NULL) + ifname = rta_getattr_str(tb[IFLA_IFNAME]); + if (ifname == NULL) return 0; - h = ifi->ifi_index & (IDXMAP_SIZE - 1); - for (imp = &idx_head[h]; (im=*imp)!=NULL; imp = &im->idx_next) - if (im->index == ifi->ifi_index) - break; - - if (im == NULL) { - im = malloc(sizeof(*im)); - if (im == NULL) - return 0; - im->idx_next = *imp; - im->index = ifi->ifi_index; - *imp = im; + if (im) { + /* change to existing entry */ + if (strcmp(im->name, ifname) != 0) { + hlist_del(&im->name_hash); + h = namehash(ifname) & (IDXMAP_SIZE - 1); + hlist_add_head(&im->name_hash, 
&name_head[h]); + } + + im->flags = ifi->ifi_flags; + return 0; } + im = malloc(sizeof(*im)); + if (im == NULL) + return 0; + im->index = ifi->ifi_index; + strcpy(im->name, ifname); im->type = ifi->ifi_type; im->flags = ifi->ifi_flags; - strcpy(im->name, RTA_DATA(tb[IFLA_IFNAME])); + + h = ifi->ifi_index & (IDXMAP_SIZE - 1); + hlist_add_head(&im->idx_hash, &idx_head[h]); + + h = namehash(ifname) & (IDXMAP_SIZE - 1); + hlist_add_head(&im->name_hash, &name_head[h]); + return 0; } @@ -86,15 +145,16 @@ const char *ll_idx_n2a(unsigned idx, char *buf) if (idx == 0) return "*"; - for (im = idxhead(idx); im; im = im->idx_next) - if (im->index == idx) - return im->name; + im = ll_get_by_index(idx); + if (im) + return im->name; + + if (if_indextoname(idx, buf) == NULL) + snprintf(buf, IFNAMSIZ, "if%d", idx); - snprintf(buf, IFNAMSIZ, "if%d", idx); return buf; } - const char *ll_index_to_name(unsigned idx) { static char nbuf[IFNAMSIZ]; @@ -108,10 +168,9 @@ int ll_index_to_type(unsigned idx) if (idx == 0) return -1; - for (im = idxhead(idx); im; im = im->idx_next) - if (im->index == idx) - return im->type; - return -1; + + im = ll_get_by_index(idx); + return im ? im->type : -1; } unsigned ll_index_to_flags(unsigned idx) @@ -121,35 +180,21 @@ unsigned ll_index_to_flags(unsigned idx) if (idx == 0) return 0; - for (im = idxhead(idx); im; im = im->idx_next) - if (im->index == idx) - return im->flags; - return 0; + im = ll_get_by_index(idx); + return im ? im->flags : -1; } unsigned ll_name_to_index(const char *name) { - static char ncache[IFNAMSIZ]; - static int icache; - struct ll_cache *im; - int i; + const struct ll_cache *im; unsigned idx; if (name == NULL) return 0; - if (icache && strcmp(name, ncache) == 0) - return icache; - - for (i=0; i<IDXMAP_SIZE; i++) { - for (im = idx_head[i]; im; im = im->idx_next) { - if (strcmp(im->name, name) == 0) { - icache = im->index; - strcpy(ncache, name); - return im->index; - } - } - } + im = ll_get_by_name(name); + if (im) + return im->index; idx = if_nametoindex(name); if (idx == 0) @@ -157,12 +202,12 @@ unsigned ll_name_to_index(const char *name) return idx; } -int ll_init_map(struct rtnl_handle *rth) +void ll_init_map(struct rtnl_handle *rth) { static int initialized; if (initialized) - return 0; + return; if (rtnl_wilddump_request(rth, AF_UNSPEC, RTM_GETLINK) < 0) { perror("Cannot send dump request"); @@ -175,6 +220,4 @@ int ll_init_map(struct rtnl_handle *rth) } initialized = 1; - - return 0; } -- 1.7.10.4 [-- Attachment #3: 0002-ip-remove-unnecessary-ll_init_map.patch --] [-- Type: text/x-patch, Size: 2755 bytes --] >From f0124b0f0aa0e5b9288114eb8e6ff9b4f8c33ec8 Mon Sep 17 00:00:00 2001 From: Stephen Hemminger <stephen@networkplumber.org> Date: Thu, 28 Mar 2013 15:17:47 -0700 Subject: [PATCH 2/2] ip: remove unnecessary ll_init_map Don't call ll_init_map on modify operations Saves significant overhead with 1000's of devices. 
--- ip/ipaddress.c | 2 -- ip/ipaddrlabel.c | 2 -- ip/iplink.c | 2 -- ip/iproute.c | 6 ------ ip/xfrm_monitor.c | 2 -- 5 files changed, 14 deletions(-) diff --git a/ip/ipaddress.c b/ip/ipaddress.c index 149df69..5b9a438 100644 --- a/ip/ipaddress.c +++ b/ip/ipaddress.c @@ -1365,8 +1365,6 @@ static int ipaddr_modify(int cmd, int flags, int argc, char **argv) if (!scoped && cmd != RTM_DELADDR) req.ifa.ifa_scope = default_scope(&lcl); - ll_init_map(&rth); - if ((req.ifa.ifa_index = ll_name_to_index(d)) == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", d); return -1; diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c index eb6a48c..1789d9c 100644 --- a/ip/ipaddrlabel.c +++ b/ip/ipaddrlabel.c @@ -246,8 +246,6 @@ static int ipaddrlabel_flush(int argc, char **argv) int do_ipaddrlabel(int argc, char **argv) { - ll_init_map(&rth); - if (argc < 1) { return ipaddrlabel_list(0, NULL); } else if (matches(argv[0], "list") == 0 || diff --git a/ip/iplink.c b/ip/iplink.c index 5c7b43c..dc98019 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) } } - ll_init_map(&rth); - if (!(flags & NLM_F_CREATE)) { if (!dev) { fprintf(stderr, "Not enough information: \"dev\" " diff --git a/ip/iproute.c b/ip/iproute.c index 2c2a331..adef774 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -970,8 +970,6 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv) if (d || nhs_ok) { int idx; - ll_init_map(&rth); - if (d) { if ((idx = ll_name_to_index(d)) == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", d); @@ -1265,8 +1263,6 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) if (do_ipv6 == AF_UNSPEC && filter.tb) do_ipv6 = AF_INET; - ll_init_map(&rth); - if (id || od) { int idx; @@ -1452,8 +1448,6 @@ static int iproute_get(int argc, char **argv) exit(1); } - ll_init_map(&rth); - if (idev || odev) { int idx; diff --git a/ip/xfrm_monitor.c b/ip/xfrm_monitor.c index bfc48f1..a1f5d53 100644 --- a/ip/xfrm_monitor.c +++ b/ip/xfrm_monitor.c @@ -408,8 +408,6 @@ int do_xfrm_monitor(int argc, char **argv) return rtnl_from_file(fp, xfrm_accept_msg, (void*)stdout); } - //ll_init_map(&rth); - if (rtnl_open_byproto(&rth, groups, NETLINK_XFRM) < 0) exit(1); -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 32+ messages in thread
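One consequence of patch 1 worth spelling out: ll_remember_index() now also consumes RTM_DELLINK, so a long-running caller that feeds link notifications into the cache (the ip monitor style of use) no longer accumulates stale entries. The standalone sketch below, which is not from the iproute2 sources, shows the notification stream such a caller sees after subscribing to RTMGRP_LINK; the patched cache would insert or evict its hash entries at the commented point:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
    struct sockaddr_nl addr = {
        .nl_family = AF_NETLINK,
        .nl_groups = RTMGRP_LINK,      /* link add/del/change notifications */
    };
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    char buf[8192];

    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("netlink");
        return 1;
    }

    for (;;) {
        ssize_t len = recv(fd, buf, sizeof(buf), 0);
        struct nlmsghdr *nh;

        if (len <= 0)
            break;
        for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
             nh = NLMSG_NEXT(nh, len)) {
            struct ifinfomsg *ifi = NLMSG_DATA(nh);

            if (nh->nlmsg_type == RTM_NEWLINK)
                printf("new/changed link, ifindex %d\n", ifi->ifi_index);
            else if (nh->nlmsg_type == RTM_DELLINK)
                printf("deleted link, ifindex %d\n", ifi->ifi_index);
            /* a cache like ll_map would insert or evict its entry here */
        }
    }
    close(fd);
    return 0;
}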
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 22:20 ` Stephen Hemminger @ 2013-03-28 23:52 ` Eric W. Biederman 2013-03-29 0:13 ` Eric Dumazet 2013-03-30 10:09 ` Benoit Lourdelet 0 siblings, 2 replies; 32+ messages in thread From: Eric W. Biederman @ 2013-03-28 23:52 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org

Stephen Hemminger <stephen@networkplumber.org> writes:

> Try the following two patches. It adds a name hash list, and uses Eric's idea
> to avoid loading map on add/delete operations.

On my microbenchmark of just creating 5000 veth pairs this takes 16s instead of the 13s with my earlier hacks, but that is well down in the usable range. Deleting all of those network interfaces one by one takes me 60s.

So on the microbenchmark side this looks like a good improvement and pretty usable. I expect Benoit's container startup workload will also reflect this, but it will be interesting to see the actual result.

Eric

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 23:52 ` Eric W. Biederman @ 2013-03-29 0:13 ` Eric Dumazet 2013-03-29 0:25 ` Eric W. Biederman 2013-03-30 10:09 ` Benoit Lourdelet 1 sibling, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2013-03-29 0:13 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote: > On my microbenchmark of just creating 5000 veth pairs this takes pairs > 16s instead of 13s of my earlier hacks but that is well down in the > usable range. I guess most of the time is taken by sysctl_check_table() ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-29 0:13 ` Eric Dumazet @ 2013-03-29 0:25 ` Eric W. Biederman 2013-03-29 0:43 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2013-03-29 0:25 UTC (permalink / raw) To: Eric Dumazet Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org Eric Dumazet <eric.dumazet@gmail.com> writes: > On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote: > >> On my microbenchmark of just creating 5000 veth pairs this takes pairs >> 16s instead of 13s of my earlier hacks but that is well down in the >> usable range. > > I guess most of the time is taken by sysctl_check_table() All of the significant sysctl slowdowns were fixed in 3.4. If you see something of sysctl show up in a trace I would be happy to talk about it. The kernel side seems to be creating N network devices seems to take NlogN time now. Both sysfs and sysctl store directories as rbtrees removing their previous bottlenecks. The loop I timed at 16s was just: time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done There is plenty of room for inefficiencies in 10000 network devices and 5000 forks+execs. Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-29 0:25 ` Eric W. Biederman @ 2013-03-29 0:43 ` Eric Dumazet 2013-03-29 1:06 ` Eric W. Biederman 2013-03-29 1:10 ` Eric Dumazet 0 siblings, 2 replies; 32+ messages in thread From: Eric Dumazet @ 2013-03-29 0:43 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org On Thu, 2013-03-28 at 17:25 -0700, Eric W. Biederman wrote: > Eric Dumazet <eric.dumazet@gmail.com> writes: > > > On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote: > > > >> On my microbenchmark of just creating 5000 veth pairs this takes pairs > >> 16s instead of 13s of my earlier hacks but that is well down in the > >> usable range. > > > > I guess most of the time is taken by sysctl_check_table() > > All of the significant sysctl slowdowns were fixed in 3.4. If you see > something of sysctl show up in a trace I would be happy to talk about > it. The kernel side seems to be creating N network devices seems to > take NlogN time now. Both sysfs and sysctl store directories as > rbtrees removing their previous bottlenecks. > > The loop I timed at 16s was just: > > time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done > > There is plenty of room for inefficiencies in 10000 network devices and > 5000 forks+execs. Ah right, the sysctl part is fixed ;) In batch mode, I can create these veth pairs in 4 seconds for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i; done | ip -batch - ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-29 0:43 ` Eric Dumazet @ 2013-03-29 1:06 ` Eric W. Biederman 0 siblings, 0 replies; 32+ messages in thread From: Eric W. Biederman @ 2013-03-29 1:06 UTC (permalink / raw) To: Eric Dumazet Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Thu, 2013-03-28 at 17:25 -0700, Eric W. Biederman wrote:
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>>
>> > On Thu, 2013-03-28 at 16:52 -0700, Eric W. Biederman wrote:
>> >
>> >> On my microbenchmark of just creating 5000 veth pairs this takes pairs
>> >> 16s instead of 13s of my earlier hacks but that is well down in the
>> >> usable range.
>> >
>> > I guess most of the time is taken by sysctl_check_table()
>>
>> All of the significant sysctl slowdowns were fixed in 3.4. If you see
>> something of sysctl show up in a trace I would be happy to talk about
>> it. The kernel side seems to be creating N network devices seems to
>> take NlogN time now. Both sysfs and sysctl store directories as
>> rbtrees removing their previous bottlenecks.
>>
>> The loop I timed at 16s was just:
>>
>> time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
>>
>> There is plenty of room for inefficiencies in 10000 network devices and
>> 5000 forks+execs.
>
> Ah right, the sysctl part is fixed ;)
>
> In batch mode, I can create these veth pairs in 4 seconds
>
> for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i;
> done | ip -batch -

Yes.

The interesting story here is that the bottleneck before these patches was the ll_init_map function of iproute2, which resulted in over an order of magnitude slowdown when starting iproute on a system with lots of network devices.

It is still unclear where iproute comes into the picture in the original problem scenario of creating 2000 containers each with 2 veth pairs. But apparently it was involved.

As the fundamental use case here was taking 2000 separate independent actions, it turns out to be important for things to not slow down unreasonably outside of batch mode. So I was explicitly testing the non-batch mode performance.

On the flip side it might be interesting to see if we can get batch mode deletes to batch in the kernel, so we don't have to wait for synchronize_rcu_expedited for each of them. Although for the container case I can just drop the last reference to the network namespace and all of the network device removals will batch.

Ultimately shrug. Except in the previous O(N^2) userspace behavior there don't seem to be any practical performance problems with this many network devices. What is interesting is that this many network devices is becoming interesting on inexpensive COTS servers, for cases that are not purely network focused.

Eric

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-29 0:43 ` Eric Dumazet 2013-03-29 1:06 ` Eric W. Biederman @ 2013-03-29 1:10 ` Eric Dumazet 2013-03-29 1:29 ` Eric W. Biederman 1 sibling, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2013-03-29 1:10 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org On Thu, 2013-03-28 at 17:43 -0700, Eric Dumazet wrote: > In batch mode, I can create these veth pairs in 4 seconds > > for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i; > done | ip -batch -

At rmmod time, 30% of cpu is spent in packet_notifier(). Maybe we can do something about this.

    30.85%  rmmod  [kernel.kallsyms]  [k] packet_notifier
            |
            --- packet_notifier
                notifier_call_chain
                raw_notifier_call_chain
                call_netdevice_notifiers
                rollback_registered_many
                unregister_netdevice_many
                __rtnl_link_unregister
                rtnl_link_unregister
                0xffffffffa0044868
                sys_delete_module
                sysenter_dispatch

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-29 1:10 ` Eric Dumazet @ 2013-03-29 1:29 ` Eric W. Biederman 2013-03-29 1:38 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: Eric W. Biederman @ 2013-03-29 1:29 UTC (permalink / raw) To: Eric Dumazet Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org Eric Dumazet <eric.dumazet@gmail.com> writes: > On Thu, 2013-03-28 at 17:43 -0700, Eric Dumazet wrote: > >> In batch mode, I can create these veth pairs in 4 seconds >> >> for i in $(seq 1 5000) ; do echo link add a$i type veth peer name b$i; >> done | ip -batch - > > > At rmmod time, 30% of cpu is spent in packet_notifier() > > Maybe we can do something about this.

An interesting thought. I had a patch I never got around to pushing a while back that would have had an effect.

It is my observation that the vast majority of packet filters apply not to the entire machine but to an individual interface. In fact you have to work pretty hard to get tools like tcpdump to dump all of the interfaces at once.

So, to speed things up on machines that have a lot of these, the idea was to create per-device lists for the filters that only needed to be run on a single device. In this case it looks like we could potentially create per-device lists for the listening sockets as well.

In general these lists should be short, so the search can also be short.

But I am curious: do you actually have a tcpdump or something similar running on your box that is using AF_PACKET sockets? Perhaps a dhcp client?

I am a little surprised that your default case has anything on the lists to trigger any work in the packet_notifier() callback.

> 30.85% rmmod [kernel.kallsyms] [k]
> packet_notifier
> |
> --- packet_notifier
> notifier_call_chain
> raw_notifier_call_chain
> call_netdevice_notifiers
> rollback_registered_many
> unregister_netdevice_many
> __rtnl_link_unregister
> rtnl_link_unregister
> 0xffffffffa0044868
> sys_delete_module
> sysenter_dispatch

Eric

^ permalink raw reply [flat|nested] 32+ messages in thread
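A rough model of the per-device list idea, purely illustrative: the structures and names below are invented for the sketch and are not the kernel's af_packet code. Sockets bound to exactly one interface sit on a list anchored in that device, so an unregister event only walks the sockets that actually care about it.

/* Model of the per-device list idea (illustrative only). */
#include <stdio.h>

struct pkt_sock {
	int bound_ifindex;              /* 0 would mean "any device" */
	struct pkt_sock *next_on_dev;   /* idea: list anchored in the device */
};

struct net_dev {
	int ifindex;
	struct pkt_sock *bound_socks;   /* sockets bound to this device only */
};

/* Teardown cost becomes proportional to the sockets bound to *this*
 * device, instead of walking every packet socket for every
 * NETDEV_UNREGISTER event. */
static void on_unregister(struct net_dev *dev)
{
	struct pkt_sock *po;

	for (po = dev->bound_socks; po; po = po->next_on_dev)
		printf("detach socket bound to ifindex %d\n", po->bound_ifindex);
}

int main(void)
{
	struct pkt_sock so = { .bound_ifindex = 7, .next_on_dev = NULL };
	struct net_dev dev = { .ifindex = 7, .bound_socks = &so };

	on_unregister(&dev);
	return 0;
}

Sockets listening on "any device" would still need a shared list, but the expectation stated above is that such lists stay short.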
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-29 1:29 ` Eric W. Biederman @ 2013-03-29 1:38 ` Eric Dumazet 0 siblings, 0 replies; 32+ messages in thread From: Eric Dumazet @ 2013-03-29 1:38 UTC (permalink / raw) To: Eric W. Biederman Cc: Stephen Hemminger, Benoit Lourdelet, Serge Hallyn, netdev@vger.kernel.org On Thu, 2013-03-28 at 18:29 -0700, Eric W. Biederman wrote: > An interesting thought. I had a patch I never got around to pushing a > while back that would have had an effect. > > It is my observation that the vast majority of packet filters apply not > to the entire machine but to an individual interface. In fact you have > to work pretty hard to get tools like tcpdump to dump all of the > interfaces at once. > > So to speed things up for machines that have a lot of these things the > idea was to create per device lists for the filters that only needed to > be run on a single device. In this case it looks like we could > potentially create per device lists for of the listening sockets as well. > > In general these lists should be short so the search can also be short. > > But I am curious do you actually have a tcpdump or something similar > running on your box that is using AF_PACKET sockets? Perhaps a dhcp > client? > > I am a little surprised that your default case has anything on the lists > to trigger any work in the packet_notifier notifier.

Hmm, it might be a local daemon on my lab machine which does a PACKET_ADD_MEMBERSHIP for each created interface. So my machine spends time in packet_dev_mclist(), with quadratic behavior at rmmod.

^ permalink raw reply [flat|nested] 32+ messages in thread
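For reference, the per-interface call such a daemon would likely be issuing looks roughly like the sketch below (the interface name is only an example, and the socket needs CAP_NET_RAW). Each membership added this way is an mclist entry that packet_notifier() has to rescan for every device that is later unregistered, which is where a quadratic rmmod cost would come from.

/* Sketch of a per-interface PACKET_ADD_MEMBERSHIP, the kind of call a
 * monitoring daemon might issue for every new interface. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <net/if.h>            /* if_nametoindex() */
#include <netinet/in.h>        /* htons() */
#include <linux/if_ether.h>    /* ETH_P_ALL */
#include <linux/if_packet.h>   /* struct packet_mreq, PACKET_* */

int main(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	unsigned int ifindex = if_nametoindex("a1");   /* example name */
	struct packet_mreq mreq;

	if (fd < 0 || ifindex == 0) {
		perror("socket/if_nametoindex");
		return 1;
	}

	memset(&mreq, 0, sizeof(mreq));
	mreq.mr_ifindex = ifindex;
	mreq.mr_type    = PACKET_MR_ALLMULTI;          /* or PACKET_MR_PROMISC */

	/* Adds an entry to this socket's mclist for that one device. */
	if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
		       &mreq, sizeof(mreq)) < 0)
		perror("PACKET_ADD_MEMBERSHIP");
	return 0;
}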
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-28 23:52 ` Eric W. Biederman 2013-03-29 0:13 ` Eric Dumazet @ 2013-03-30 10:09 ` Benoit Lourdelet 2013-03-30 14:44 ` Eric Dumazet 1 sibling, 1 reply; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-30 10:09 UTC (permalink / raw) To: Eric W. Biederman, Stephen Hemminger; +Cc: Serge Hallyn, netdev@vger.kernel.org

Hello,

Here are my tests of the last patches on 3 different platforms, all running 3.8.5. Times are in seconds:

8x 3.7GHz virtual cores

# veth   create   delete
  1000       14       18
  2000       39       56
  5000      256      161
 10000     1200      399

8x 3.2GHz virtual cores

# veth   create   delete
  1000       19       40
  2000      118       66
  5000      305      251

32x 2GHz virtual cores, 2 sockets

# veth   create   delete
  1000       35       86
  2000      120       90
  5000      724      245

Compared to initial iproute2 performance on this 32 virtual core system:

  5000     1143     1185

"perf record" for creation of 5000 veth on the 32 core system:

# captured on: Fri Mar 29 14:03:35 2013
# hostname : ieng-serv06
# os release : 3.8.5
# perf version : 3.8.5
# arch : x86_64
# nrcpus online : 32
# nrcpus avail : 32
# cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 264124548 kB
# cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a ./test3.script
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42,
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco
# ========
#
# Samples: 9M of event 'cycles'
# Event count (approx.): 2894480238483
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ..............................
#
    15.17%  sudo             [kernel.kallsyms]  [k] snmp_fold_field
     5.94%  sudo             libc-2.15.so       [.] 0x00000000000802cd
     5.64%  sudo             [kernel.kallsyms]  [k] find_next_bit
     3.21%  init             libnih.so.1.0.0    [.] nih_list_add_after
     2.12%  swapper          [kernel.kallsyms]  [k] intel_idle
     1.94%  init             [kernel.kallsyms]  [k] page_fault
     1.93%  sed              libc-2.15.so       [.] 0x00000000000a1368
     1.93%  sudo             [kernel.kallsyms]  [k] rtnl_fill_ifinfo
     1.92%  sudo             [veth]             [k] veth_get_stats64
     1.78%  sudo             [kernel.kallsyms]  [k] memcpy
     1.53%  ifquery          libc-2.15.so       [.] 0x000000000007f52b
     1.24%  init             libc-2.15.so       [.] 0x000000000008918f
     1.05%  sudo             [kernel.kallsyms]  [k] inet6_fill_ifla6_attrs
     0.98%  init             [kernel.kallsyms]  [k] copy_pte_range
     0.88%  irqbalance       libc-2.15.so       [.] 0x00000000000802cd
     0.85%  sudo             [kernel.kallsyms]  [k] memset
     0.72%  sed              ld-2.15.so         [.] 0x000000000000a226
     0.68%  ifquery          ld-2.15.so         [.] 0x00000000000165a0
     0.64%  init             libnih.so.1.0.0    [.] nih_tree_next_post_full
     0.61%  bridge-network-  libc-2.15.so       [.] 0x0000000000131e2a
     0.59%  init             [kernel.kallsyms]  [k] do_wp_page
     0.59%  ifquery          [kernel.kallsyms]  [k] page_fault
     0.54%  sed              [kernel.kallsyms]  [k] page_fault

Regards

Benoit

On 29/03/2013 00:52, "Eric W. Biederman" <ebiederm@xmission.com> wrote: >Stephen Hemminger <stephen@networkplumber.org> writes: > >> Try the following two patches. It adds a name hash list, and uses >>Eric's idea >> to avoid loading map on add/delete operations. >On my microbenchmark of just creating 5000 veth pairs this takes pairs >16s instead of 13s of my earlier hacks but that is well down in the >usable range.
> >Deleting all of those network interfaces one by one takes me 60s. > >So on the microbenchmark side this looks like a good improvement and >pretty usable. > >I expect Benoit's container startup workload will also reflect this, but >it will be interesting to see the actual result. > >Eric > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-30 10:09 ` Benoit Lourdelet @ 2013-03-30 14:44 ` Eric Dumazet 2013-03-30 16:07 ` Benoit Lourdelet 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2013-03-30 14:44 UTC (permalink / raw) To: Benoit Lourdelet Cc: Eric W. Biederman, Stephen Hemminger, Serge Hallyn, netdev@vger.kernel.org On Sat, 2013-03-30 at 10:09 +0000, Benoit Lourdelet wrote: > Hello, > > Here are my tests of the last patches on 3 different platforms all > running 3.8.5 : > > Time are in seconds : > > 8x 3.7Ghz virtual cores > > # veth create delete > 1000 14 18 > 2000 39 56 > 5000 256 161 > 10000 1200 399 > > > 8x 3.2Ghz virtual cores > > # veth create delete > > 1000 19 40 > 2000 118 66 > 5000 305 251 > > > > 32x 2Ghz virtual cores , 2 sockets > # veth create delete > 1000 35 86 > > 2000 120 90 > > 5000 724 245 > > > Compared to initial iproute2 performance on this 32 virtual core system : > 5000 1143 1185 > > > > "perf record" for creation of 5000 veth on the 32 core system : > > # captured on: Fri Mar 29 14:03:35 2013 > # hostname : ieng-serv06 > # os release : 3.8.5 > # perf version : 3.8.5 > # arch : x86_64 > # nrcpus online : 32 > # nrcpus avail : 32 > # cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz > # cpuid : GenuineIntel,6,45,7 > # total memory : 264124548 kB > # cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a ./test3.script > # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = > 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, > precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42, > # HEADER_CPU_TOPOLOGY info available, use -I to display > # HEADER_NUMA_TOPOLOGY info available, use -I to display > # pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2, > uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = > 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco > # ======== > # > # Samples: 9M of event 'cycles' > # Event count (approx.): 2894480238483 > # > # Overhead Command Shared Object > Symbol > # ........ ............... ............................. > ............................................... > # > 15.17% sudo [kernel.kallsyms] [k] > snmp_fold_field > 5.94% sudo libc-2.15.so [.] > 0x00000000000802cd > 5.64% sudo [kernel.kallsyms] [k] > find_next_bit > 3.21% init libnih.so.1.0.0 [.] > nih_list_add_after > 2.12% swapper [kernel.kallsyms] [k] intel_idle > > 1.94% init [kernel.kallsyms] [k] page_fault > > 1.93% sed libc-2.15.so [.] > 0x00000000000a1368 > 1.93% sudo [kernel.kallsyms] [k] > rtnl_fill_ifinfo > 1.92% sudo [veth] [k] > veth_get_stats64 > 1.78% sudo [kernel.kallsyms] [k] memcpy > > 1.53% ifquery libc-2.15.so [.] > 0x000000000007f52b > 1.24% init libc-2.15.so [.] > 0x000000000008918f > 1.05% sudo [kernel.kallsyms] [k] > inet6_fill_ifla6_attrs > 0.98% init [kernel.kallsyms] [k] > copy_pte_range > 0.88% irqbalance libc-2.15.so [.] > 0x00000000000802cd > 0.85% sudo [kernel.kallsyms] [k] memset > > 0.72% sed ld-2.15.so [.] > 0x000000000000a226 > 0.68% ifquery ld-2.15.so [.] > 0x00000000000165a0 > 0.64% init libnih.so.1.0.0 [.] > nih_tree_next_post_full > 0.61% bridge-network- libc-2.15.so [.] > 0x0000000000131e2a > 0.59% init [kernel.kallsyms] [k] do_wp_page > > 0.59% ifquery [kernel.kallsyms] [k] page_fault > > 0.54% sed [kernel.kallsyms] [k] page_fault > > > > > > Regards > > Benoit > > > > This means lxc-start does the same thing than ip : It fetches the whole device list. You could strace it to have a confirmation. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-30 14:44 ` Eric Dumazet @ 2013-03-30 16:07 ` Benoit Lourdelet 0 siblings, 0 replies; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-30 16:07 UTC (permalink / raw) To: Eric Dumazet Cc: Eric W. Biederman, Stephen Hemminger, Serge Hallyn, netdev@vger.kernel.org Sorry Eric, This is not an lxc-start perf report.This is an "ip" report". Will run an "lxc-start" perf ASAP. Regards Benoit On 30/03/2013 15:44, "Eric Dumazet" <eric.dumazet@gmail.com> wrote: >On Sat, 2013-03-30 at 10:09 +0000, Benoit Lourdelet wrote: >> Hello, >> >> Here are my tests of the last patches on 3 different platforms all >> running 3.8.5 : >> >> Time are in seconds : >> >> 8x 3.7Ghz virtual cores >> >> # veth create delete >> 1000 14 18 >> 2000 39 56 >> 5000 256 161 >> 10000 1200 399 >> >> >> 8x 3.2Ghz virtual cores >> >> # veth create delete >> >> 1000 19 40 >> 2000 118 66 >> 5000 305 251 >> >> >> >> 32x 2Ghz virtual cores , 2 sockets >> # veth create delete >> 1000 35 86 >> >> 2000 120 90 >> >> 5000 724 245 >> >> >> Compared to initial iproute2 performance on this 32 virtual core >>system : >> 5000 1143 1185 >> >> >> >> "perf record" for creation of 5000 veth on the 32 core system : >> >> # captured on: Fri Mar 29 14:03:35 2013 >> # hostname : ieng-serv06 >> # os release : 3.8.5 >> # perf version : 3.8.5 >> # arch : x86_64 >> # nrcpus online : 32 >> # nrcpus avail : 32 >> # cpudesc : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz >> # cpuid : GenuineIntel,6,45,7 >> # total memory : 264124548 kB >> # cmdline : /usr/src/linux-3.8.5/tools/perf/perf record -a >>./test3.script >> # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 >>= >> 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, >> precise_ip = 0, id = { 36, 37, 38, 39, 40, 41, 42, >> # HEADER_CPU_TOPOLOGY info available, use -I to display >> # HEADER_NUMA_TOPOLOGY info available, use -I to display >> # pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2, >> uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = >> 20, uncore_qpi_0 = 21, uncore_qpi_1 = 22, unco >> # ======== >> # >> # Samples: 9M of event 'cycles' >> # Event count (approx.): 2894480238483 >> # >> # Overhead Command Shared Object >> Symbol >> # ........ ............... ............................. >> ............................................... >> # >> 15.17% sudo [kernel.kallsyms] [k] >> snmp_fold_field >> 5.94% sudo libc-2.15.so [.] >> 0x00000000000802cd >> 5.64% sudo [kernel.kallsyms] [k] >> find_next_bit >> 3.21% init libnih.so.1.0.0 [.] >> nih_list_add_after >> 2.12% swapper [kernel.kallsyms] [k] >>intel_idle >> >> 1.94% init [kernel.kallsyms] [k] >>page_fault >> >> 1.93% sed libc-2.15.so [.] >> 0x00000000000a1368 >> 1.93% sudo [kernel.kallsyms] [k] >> rtnl_fill_ifinfo >> 1.92% sudo [veth] [k] >> veth_get_stats64 >> 1.78% sudo [kernel.kallsyms] [k] memcpy >> >> 1.53% ifquery libc-2.15.so [.] >> 0x000000000007f52b >> 1.24% init libc-2.15.so [.] >> 0x000000000008918f >> 1.05% sudo [kernel.kallsyms] [k] >> inet6_fill_ifla6_attrs >> 0.98% init [kernel.kallsyms] [k] >> copy_pte_range >> 0.88% irqbalance libc-2.15.so [.] >> 0x00000000000802cd >> 0.85% sudo [kernel.kallsyms] [k] memset >> >> 0.72% sed ld-2.15.so [.] >> 0x000000000000a226 >> 0.68% ifquery ld-2.15.so [.] >> 0x00000000000165a0 >> 0.64% init libnih.so.1.0.0 [.] >> nih_tree_next_post_full >> 0.61% bridge-network- libc-2.15.so [.] 
>> 0x0000000000131e2a >> 0.59% init [kernel.kallsyms] [k] >>do_wp_page >> >> 0.59% ifquery [kernel.kallsyms] [k] >>page_fault >> >> 0.54% sed [kernel.kallsyms] [k] >>page_fault >> >> >> >> >> >> Regards >> >> Benoit >> >> >> >> > >This means lxc-start does the same thing than ip : > >It fetches the whole device list. > >You could strace it to have a confirmation. > > > > > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-27 15:11 ` Eric W. Biederman 2013-03-27 17:47 ` Stephen Hemminger @ 2013-03-28 20:27 ` Benoit Lourdelet 1 sibling, 0 replies; 32+ messages in thread From: Benoit Lourdelet @ 2013-03-28 20:27 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Serge Hallyn, Stephen Hemminger, netdev@vger.kernel.org Hello Eric, I am running simple containers (2 network interfaces, 10MB of RAM, default routing) and want to test scalability. Our test platform is an x86 with 32x 2GHz cores. Regards Benoit On 27/03/2013 16:11, "Eric W. Biederman" <ebiederm@xmission.com> wrote: >Benoit Lourdelet <blourdel@juniper.net> writes: >> Hello Serge, >> I am indeed using Eric patch with lxc. >> It solves the initial problem of slowness to start around 1600 >> containers. >Good so now we just need a production ready patch for iproute. >> I am now able to start more than 2000 without having new containers >> slower and slower to start. >May I ask how large a box you are running and how complex your >containers are. I am trying to get a feel for how common it is likely >to be to find people running thousands of containers on a single >machine. >Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete 2013-03-26 11:51 ` Benoit Lourdelet 2013-03-26 12:40 ` Eric W. Biederman @ 2013-03-26 15:31 ` Eric Dumazet 1 sibling, 0 replies; 32+ messages in thread From: Eric Dumazet @ 2013-03-26 15:31 UTC (permalink / raw) To: Benoit Lourdelet Cc: Stephen Hemminger, Eric W. Biederman, netdev@vger.kernel.org, Serge Hallyn On Tue, 2013-03-26 at 11:51 +0000, Benoit Lourdelet wrote:
> The script to delete:
> for d in /sys/class/net/veth*; do
> ip link del `basename $d` 2>/dev/null || true
> Done
>
> There is a very good improvement in deletion.

I can do better ;)

If you are really doing this kind of thing, you could use:

rmmod veth

Note that the "ip" command supports a batch mode:

ip -batch filename

In this case, the caching is done only once.

Eric, Stephen, one possibility would be to use the cache only in batch mode. Anyway, caching is wrong because several users can run the ip command at the same time.

^ permalink raw reply [flat|nested] 32+ messages in thread
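If any cache is kept at all, a lazy one that asks the kernel per name at least avoids the up-front full dump. The toy sketch below is illustrative only, not the iproute2 code; it uses if_nametoindex() as the per-name kernel query.

/* Lazy name -> ifindex lookup: query the kernel per previously unseen
 * name instead of dumping the whole link table at startup. */
#include <stdio.h>
#include <string.h>
#include <net/if.h>            /* if_nametoindex(), IF_NAMESIZE */

#define CACHE_SLOTS 256

struct cache_entry {
	char         name[IF_NAMESIZE];
	unsigned int ifindex;
};

static struct cache_entry cache[CACHE_SLOTS];

static unsigned int hash_name(const char *name)
{
	unsigned int h = 5381;

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h % CACHE_SLOTS;
}

/* One kernel query per unseen name; later lookups hit the cache. */
static unsigned int lazy_name_to_index(const char *name)
{
	struct cache_entry *e = &cache[hash_name(name)];

	if (strcmp(e->name, name) != 0) {          /* miss (or collision) */
		unsigned int idx = if_nametoindex(name);

		if (idx == 0)
			return 0;                  /* no such device */
		snprintf(e->name, sizeof(e->name), "%s", name);
		e->ifindex = idx;
	}
	return e->ifindex;
}

int main(void)
{
	printf("lo -> %u\n", lazy_name_to_index("lo"));
	printf("lo -> %u (cached)\n", lazy_name_to_index("lo"));
	return 0;
}

Even a lazy cache can go stale if another process renames or deletes an interface in the middle of a batch, which is exactly the concern raised above; outside batch mode the simplest answer is to not cache at all and let the kernel resolve the name per request.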