* Re: [Lxc-users] Containers slow to start after 1600
From: Serge Hallyn @ 2013-03-19 20:34 UTC
To: Benoit Lourdelet, linux-kernel, Eric W. Biederman; +Cc: lxc-users

Hi,

Benoit was kind enough to follow up on some scalability issues with
larger (but not huge, imo) numbers of containers.  Running a script to
simply time the creation of veth pairs on a rather large (iiuc) machine,
he got the following numbers.  Each time is for creation of the full
number, not the latest increment - so 1123 seconds to create 5000 veth
pairs:

> >Quoting Benoit Lourdelet (blourdel@juniper.net):
> >> Hello Serge,
> >>
> >> I put together a small table, running your script for various values.
> >>
> >> Times are in seconds.
> >>
> >> Number of veth, time to create, time to delete:
> >>
> >>   500    18    26
> >>  1000    57    70
> >>  2000   193   250
> >>  3000   435   510
> >>  4000   752   824
> >>  5000  1123  1185
>
> Benoit

Ok.  Ran some tests on a tiny cloud instance.  When I simply run 2k tasks
in unshared new network namespaces, it flies by.

#!/bin/sh
rm -f /tmp/timings3
date | tee -a /tmp/timings3
for i in `seq 1 2000`; do
    nsexec -n -- /bin/sleep 1000 &
    if [ $((i % 100)) -eq 0 ]; then
        echo $i | tee -a /tmp/timings3
        date | tee -a /tmp/timings3
    fi
done

(All scripts are run under sudo, and nsexec can be found at
https://code.launchpad.net/~serge-hallyn/+junk/nsexec)

So that isn't an issue.

When I run a script to just time veth pair creations as Benoit did,
creating 2000 veth pairs and timing each batch of 100, the time does
degrade, from 1 second for the first 100 up to 8 seconds for the last
100.

(That script for me is:

#!/bin/sh
rm -f /tmp/timings
for i in `seq 1 2000`; do
    ip link add type veth
    if [ $((i % 100)) -eq 0 ]; then
        echo $i | tee -a /tmp/timings
        date | tee -a /tmp/timings
        ls /sys/class/net > /dev/null
    fi
done
)

But when I actually pass veth instances to those unshared network
namespaces:

#!/bin/sh
rm -f /tmp/timings2
echo 0 | tee -a /tmp/timings2
date | tee -a /tmp/timings2
for i in `seq 1 2000`; do
    nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
    ip link add type veth
    dev2=`ls -d /sys/class/net/veth* | tail -1`
    dev=`basename $dev2`
    pid=`cat /tmp/pid.$i`
    ip link set $dev netns $pid
    if [ $((i % 100)) -eq 0 ]; then
        echo $i | tee -a /tmp/timings2
        date | tee -a /tmp/timings2
    fi
    rm -f /tmp/pid.*
done

it goes from 4 seconds for the first hundred to 16 seconds for the last
hundred - a worse regression than simply creating the veths.  Though I
guess that could be accounted for simply by sysfs actions when a veth is
moved from the old netns to the new?

0
Tue Mar 19 20:15:26 UTC 2013
100
Tue Mar 19 20:15:30 UTC 2013
200
Tue Mar 19 20:15:35 UTC 2013
300
Tue Mar 19 20:15:41 UTC 2013
400
Tue Mar 19 20:15:47 UTC 2013
500
Tue Mar 19 20:15:54 UTC 2013
600
Tue Mar 19 20:16:02 UTC 2013
700
Tue Mar 19 20:16:09 UTC 2013
800
Tue Mar 19 20:16:17 UTC 2013
900
Tue Mar 19 20:16:26 UTC 2013
1000
Tue Mar 19 20:16:35 UTC 2013
1100
Tue Mar 19 20:16:46 UTC 2013
1200
Tue Mar 19 20:16:57 UTC 2013
1300
Tue Mar 19 20:17:08 UTC 2013
1400
Tue Mar 19 20:17:21 UTC 2013
1500
Tue Mar 19 20:17:33 UTC 2013
1600
Tue Mar 19 20:17:46 UTC 2013
1700
Tue Mar 19 20:17:59 UTC 2013
1800
Tue Mar 19 20:18:13 UTC 2013
1900
Tue Mar 19 20:18:29 UTC 2013
2000
Tue Mar 19 20:18:48 UTC 2013

-serge
* Re: [Lxc-users] Containers slow to start after 1600
From: Eric W. Biederman @ 2013-03-20 0:29 UTC
To: Serge Hallyn; +Cc: Benoit Lourdelet, linux-kernel, lxc-users

Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Hi,
>
> Benoit was kind enough to follow up on some scalability issues with
> larger (but not huge, imo) numbers of containers.  Running a script
> to simply time the creation of veth pairs on a rather large (iiuc)
> machine, he got the following numbers (each time is for creation of
> the full number, not the latest increment - so 1123 seconds to create
> 5000 veth pairs)

A kernel version and a profile would be interesting.

At first glance it looks like things are slowing down dramatically as
the run gets longer, which should not happen.

There used to be quadratic issues in proc and sysfs, but those should
have been reduced to O(N log N) as of 3.4 or so.  A comparison to the
dummy device, which is a touch simpler than veth and is more frequently
benchmarked, could also be revealing.

>> >> Number of veth, time to create, time to delete:
>> >>
>> >>   500    18    26
>> >>  1000    57    70
>> >>  2000   193   250
>> >>  3000   435   510
>> >>  4000   752   824
>> >>  5000  1123  1185
>
>> Benoit
>
> Ok.  Ran some tests on a tiny cloud instance.  When I simply run 2k
> tasks in unshared new network namespaces, it flies by.
>
> [...]
>
> So that isn't an issue.
>
> When I run a script to just time veth pair creations as Benoit did,
> creating 2000 veth pairs and timing each batch of 100, the time does
> degrade, from 1 second for the first 100 up to 8 seconds for the last
> 100.
>
> [...]
>
> But when I actually pass veth instances to those unshared network
> namespaces:
>
> [...]
>
> it goes from 4 seconds for the first hundred to 16 seconds for the last
> hundred - a worse regression than simply creating the veths.  Though I
> guess that could be accounted for simply by sysfs actions when a veth
> is moved from the old netns to the new?

And network stack actions.  Creating one end of the veth in the desired
network namespace is likely desirable: "ip link add type veth peer netns ..."

RCU has also played a critical role in the past, depending on what the
network configuration is when devices are torn down.

For device movement and device teardown there is at least one
synchronize_rcu, which at scale can slow things down.  But if the
synchronize_rcu dominates it should be mostly a constant-factor cost, not
something that gets worse with each device creation.

Oh, and to start with I would specify the name of each network device to
create.  Last I looked, coming up with a network device name is an O(N)
operation in the number of device names.

Just to see what I am seeing, in 3.9-rc1 I did:

# time for i in $(seq 1 2000) ; do ip link add a$i type veth peer name b$i; done
real    0m23.607s
user    0m0.656s
sys     0m18.132s

# time for i in $(seq 1 2000) ; do ip link del a$i ; done
real    2m8.038s
user    0m0.964s
sys     0m18.688s

That is tremendously better than you are reporting below for device
creation.  The deletes are still slow because it is hard to batch that
kind of delete; having a bunch of network namespaces exit all at once
would likely be much faster, as the teardowns can be batched and the
synchronize_rcu calls drastically reduced.

What is making you say there is a regression?  A regression compared to
what?

Hmm.

# time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
real    2m11.007s
user    0m3.508s
sys     1m55.452s

Ok, there is most definitely something non-linear about the cost of
creating network devices.

I am happy to comment from previous experience, but I'm not volunteering
to profile and fix this one.

Eric

> [...]
>
> -serge
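Concretely, Eric's two suggestions - explicit device names and creating
the peer directly in the target namespace - would turn the third script
above into something like the following minimal sketch.  It assumes the
same nsexec helper (its -P flag writing the child's pid to a file) and an
iproute2 new enough to accept "netns" in the peer section; the a$i/b$i
names and /tmp/pid.$i paths are purely illustrative:

#!/bin/sh
# Sketch: create each pair with explicit names (avoids the kernel's O(N)
# scan for a free "vethN" name) and put the peer directly into the new
# namespace (avoids a separate "ip link set ... netns" move and the extra
# sysfs/netlink work it implies).
for i in `seq 1 2000`; do
    nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
    pid=`cat /tmp/pid.$i`
    ip link add a$i type veth peer name b$i netns $pid
    rm -f /tmp/pid.$i
done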
* Re: [Lxc-users] Containers slow to start after 1600
From: Benoit Lourdelet @ 2013-03-20 20:09 UTC
To: Eric W. Biederman, Serge Hallyn; +Cc: linux-kernel@vger.kernel.org, lxc-users

Hello,

The measurement has been done with kernel 3.8.2.

Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64 x86_64
x86_64 GNU/Linux

What information would you like to see on the kernel ?

Regards

Benoit

On 20/03/2013 01:29, "Eric W. Biederman" <ebiederm@xmission.com> wrote:

>Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
>> Benoit was kind enough to follow up on some scalability issues with
>> larger (but not huge, imo) numbers of containers. [...]
>
>A kernel version and a profile would be interesting.
>
>At first glance it looks like things are slowing down dramatically as
>the run gets longer, which should not happen.
>
>There used to be quadratic issues in proc and sysfs, but those should
>have been reduced to O(N log N) as of 3.4 or so.  A comparison to the
>dummy device, which is a touch simpler than veth and is more frequently
>benchmarked, could also be revealing.
>
>[...]
>
>I am happy to comment from previous experience, but I'm not volunteering
>to profile and fix this one.
>
>Eric
* Re: [Lxc-users] Containers slow to start after 1600
From: Eric W. Biederman @ 2013-03-20 20:38 UTC
To: Benoit Lourdelet; +Cc: Serge Hallyn, linux-kernel@vger.kernel.org, lxc-users

Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello,
>
> The measurement has been done with kernel 3.8.2.
>
> Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64
> x86_64 x86_64 GNU/Linux

Two different kernel versions?

> What information would you like to see on the kernel ?

The question is where the kernel is spending its time, so profiling
information should help us see that.  Something like:

$ cat > test-script.sh << 'EOF'
#!/bin/bash
for i in $(seq 1 2000) ; do
    ip link add a$i type veth peer name b$i
done
EOF

$ perf record -a -g test-script.sh
$ perf report

I don't do anywhere near enough work with perf to remember what good
options are.

You definitely don't want to time anything where you are doing something
silly like asking ip link add to generate device names, which is O(N^2)
when you create one device at a time.

And of course there is the interesting discrepancy: why can I add 5000
veth pairs in 120 seconds when it takes you 1123 seconds?  Do you have a
very slow cpu in your test environment, or was your test asking the
kernel to generate names?

Once we know where the kernel is spending its time we can look to see if
there is anything that is easy to fix, and where to point you.

Both my timing and yours indicate that there is something taking O(N^2)
time in there.  So it would at least be interesting to see what that
something is.

Eric
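For what it's worth, a fuller version of that profiling run might look
like the sketch below.  It assumes a reasonably recent perf (the -g and
--stdio options) and uses the dummy device as the simpler baseline
suggested earlier in the thread; the names and counts are illustrative:

#!/bin/bash
# Sketch: profile veth creation with call graphs, then time the simpler
# dummy device as a baseline for comparison.
cat > test-script.sh << 'EOF'
#!/bin/bash
for i in $(seq 1 2000) ; do
    ip link add a$i type veth peer name b$i
done
EOF
chmod +x test-script.sh

sudo perf record -a -g -- ./test-script.sh   # -g records call chains
sudo perf report --stdio | head -40          # top symbols across the run

# Baseline: dummy devices skip most of the veth- and netns-specific work.
time sudo bash -c 'for i in $(seq 1 2000) ; do ip link add d$i type dummy ; done'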
* Re: [Lxc-users] Containers slow to start after 1600
From: Benoit Lourdelet @ 2013-03-22 17:05 UTC
To: Eric W. Biederman; +Cc: Serge Hallyn, linux-kernel@vger.kernel.org, lxc-users

Hello,

I tried multiple kernels with similar results.

I ran the following on 3.8.2:

$ cat > test-script.sh << 'EOF'
#!/bin/bash
for i in $(seq 1 2000) ; do
    ip link add a$i type veth peer name b$i
done
EOF

$ perf record -a test-script.sh
$ perf report

# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, id = { 33
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 2
# ========
#
# Samples: 2M of event 'cycles'
# Event count (approx.): 638240250735
#
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  ..............................
#
    11.13%  ip          [kernel.kallsyms]  [k] snmp_fold_field
     4.24%  ip          [kernel.kallsyms]  [k] find_next_bit
     2.51%  swapper     [kernel.kallsyms]  [k] intel_idle
     1.71%  init        libnih.so.1.0.0    [.] nih_list_add_after
     1.35%  ip          [kernel.kallsyms]  [k] memcpy
     1.28%  ip          [xt_conntrack]     [k] 0x0000000000005296
     1.26%  ip          [kernel.kallsyms]  [k] rtnl_fill_ifinfo
     1.25%  sed         ld-2.15.so         [.] 0x0000000000015972
     1.13%  ifquery     ld-2.15.so         [.] 0x0000000000008bdb
     1.10%  init        [kernel.kallsyms]  [k] page_fault
     1.00%  ifquery     [kernel.kallsyms]  [k] page_fault
     0.97%  init        libc-2.15.so       [.] 0x0000000000131e42
     0.94%  sed         [kernel.kallsyms]  [k] page_fault
     0.75%  ip          [kernel.kallsyms]  [k] inet6_fill_ifla6_attrs
     0.67%  ip          [kernel.kallsyms]  [k] memset
     0.66%  init        [kernel.kallsyms]  [k] copy_pte_range
     0.64%  init        init               [.] 0x0000000000012433
     0.58%  sed         libc-2.15.so       [.] 0x000000000008149a
     0.48%  init        libnih.so.1.0.0    [.] nih_tree_next_post_full

If I increase the number of links created, the time spent in
snmp_fold_field can even increase:

    12.04%  ip          [kernel.kallsyms]  [k] snmp_fold_field
    12.03%  sudo        [kernel.kallsyms]  [k] snmp_fold_field
     8.56%  sudo        libc-2.15.so       [.] 0x000000000009198b
     4.39%  sudo        [kernel.kallsyms]  [k] find_next_bit
     4.36%  ip          [kernel.kallsyms]  [k] find_next_bit
     3.17%  irqbalance  libc-2.15.so       [.] 0x000000000003d298
     2.01%  init        libnih.so.1.0.0    [.] nih_list_add_after
     1.63%  ip          [kernel.kallsyms]  [k] rtnl_fill_ifinfo

Regards

Benoit

On 20/03/2013 21:38, "Eric W. Biederman" <ebiederm@xmission.com> wrote:

>Benoit Lourdelet <blourdel@juniper.net> writes:
>
>> The measurement has been done with kernel 3.8.2.
>>
>> Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64
>> x86_64 x86_64 GNU/Linux
>
>Two different kernel versions?
>
>> What information would you like to see on the kernel ?
>
>The question is where the kernel is spending its time, so profiling
>information should help us see that.
>
>[...]
>
>Both my timing and yours indicate that there is something taking O(N^2)
>time in there.  So it would at least be interesting to see what that
>something is.
>
>Eric
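One way to narrow down what is driving those snmp_fold_field samples is
to repeat the run with call-graph recording and inspect that symbol's
call chains - a sketch only, since the exact perf flags vary by version
and the report.txt name is just illustrative:

sudo perf record -a -g -- ./test-script.sh
sudo perf report --stdio > report.txt
grep -B 2 -A 15 snmp_fold_field report.txt   # show the recorded call chains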
* Re: [Lxc-users] Containers slow to start after 1600
From: Eric W. Biederman @ 2013-03-22 21:31 UTC
To: Benoit Lourdelet; +Cc: Serge Hallyn, linux-kernel@vger.kernel.org, lxc-users

Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello,
>
> I tried multiple kernels with similar results.
>
> I ran the following on 3.8.2:
>
> $ perf record -a test-script.sh
> $ perf report
>
> # Overhead  Command  Shared Object      Symbol
> # ........  .......  .................  ...........................
>
>    11.13%  ip       [kernel.kallsyms]  [k] snmp_fold_field
>     4.24%  ip       [kernel.kallsyms]  [k] find_next_bit
>     2.51%  swapper  [kernel.kallsyms]  [k] intel_idle
>     1.71%  init     libnih.so.1.0.0    [.] nih_list_add_after
>     1.35%  ip       [kernel.kallsyms]  [k] memcpy
>     1.28%  ip       [xt_conntrack]     [k] 0x0000000000005296
>     1.26%  ip       [kernel.kallsyms]  [k] rtnl_fill_ifinfo
>
> [...]
>
> If I increase the number of links created, the time spent in
> snmp_fold_field can even increase:
>
>    12.04%  ip       [kernel.kallsyms]  [k] snmp_fold_field
>    12.03%  sudo     [kernel.kallsyms]  [k] snmp_fold_field
>
> [...]

Interesting.  And snmp_fold_field is an interesting hint.

It looks like something is listing all of your interfaces, probably
between interface creations.  And that will definitely take this process
from O(N log N) to O(N^2).  Although I find the find_next_bit call also
concerning.

snmp_fold_field is used when reading /proc/net/snmp, /proc/net/netstat,
and when running ip link to show all of the interfaces.

I suspect find_next_bit is coming out of the per-cpu allocator, and it
might be partially responsible.  But that doesn't feel right.

Regardless, from this trace it looks like the performance problem on
creation is that something in userspace is calling /sbin/ip to list all
of the interfaces, and that is causing the interface creation to slow
down.

Oh, I see what is happening.  In iproute:

  do_iplink
    iplink_modify
      ll_init_map

So the limitation at this point is in iproute, and not in the kernel.
And with the following patch the time to create 5000 devices drops from
120 to 11 seconds for me.

So it looks like the kernel is fine; it is whatever userspace tools are
of interest that are the bottleneck.  Although a quicker way to do a
name-to-ifindex mapping might be interesting.

Eric

diff --git a/ip/iplink.c b/ip/iplink.c
index ad33611..58af369 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 		}
 	}
 
-	ll_init_map(&rth);
-
 	if (!(flags & NLM_F_CREATE)) {
 		if (!dev) {
 			fprintf(stderr, "Not enough information: \"dev\" "
@@ -542,6 +540,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 			exit(-1);
 		}
 
+		ll_init_map(&rth);
 		req.i.ifi_index = ll_name_to_index(dev);
 		if (req.i.ifi_index == 0) {
 			fprintf(stderr, "Cannot find device \"%s\"\n", dev);
@@ -555,6 +554,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	if (link) {
 		int ifindex;
 
+		ll_init_map(&rth);
 		ifindex = ll_name_to_index(link);
 		if (ifindex == 0) {
 			fprintf(stderr, "Cannot find device \"%s\"\n",
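A quick way to check that result on another system is to time each batch
of creations and compare a stock ip binary against one rebuilt with the
patch above - a minimal sketch, with the counts and device names purely
illustrative:

#!/bin/sh
# Sketch: print the wall-clock seconds each batch of 100 creations takes;
# a per-batch time that keeps growing is the O(N^2) symptom.
IP=${IP:-ip}          # point IP at a patched ip binary to compare
prev=`date +%s`
for i in `seq 1 2000`; do
    $IP link add a$i type veth peer name b$i
    if [ $((i % 100)) -eq 0 ]; then
        now=`date +%s`
        echo "$i $((now - prev))s"
        prev=$now
    fi
done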
Thread overview: 6+ messages
[not found] <20130319182829.GA15451@sergelap>
[not found] ` <CD6E73D8.71FE%blourdel@juniper.net>
2013-03-19 20:34 ` [Lxc-users] Containers slow to start after 1600 Serge Hallyn
2013-03-20 0:29 ` Eric W. Biederman
2013-03-20 20:09 ` Benoit Lourdelet
2013-03-20 20:38 ` Eric W. Biederman
2013-03-22 17:05 ` Benoit Lourdelet
2013-03-22 21:31 ` Eric W. Biederman