* Re: [Lxc-users] Containers slow to start after 1600
From: Serge Hallyn @ 2013-03-19 20:34 UTC
To: Benoit Lourdelet, linux-kernel, Eric W. Biederman; +Cc: lxc-users

Hi,

Benoit was kind enough to follow up on some scalability issues with
larger (but not huge, imo) numbers of containers.  Running a script to
simply time the creation of veth pairs on a rather large (iiuc) machine,
he got the following numbers.  Each time is for creation of the full
number, not the latest increment - so 1123 seconds to create 5000 veth
pairs:

> >Quoting Benoit Lourdelet (blourdel@juniper.net):
> >> Hello Serge,
> >>
> >> I put together a small table, running your script for various values.
> >>
> >> Times are in seconds.
> >>
> >> Number of veth, time to create, time to delete:
> >>
> >>   500    18    26
> >>  1000    57    70
> >>  2000   193   250
> >>  3000   435   510
> >>  4000   752   824
> >>  5000  1123  1185
>
> Benoit

Ok.  Ran some tests on a tiny cloud instance.  When I simply run 2k tasks
in unshared new network namespaces, it flies by.

#!/bin/sh
rm -f /tmp/timings3
date | tee -a /tmp/timings3
for i in `seq 1 2000`; do
    nsexec -n -- /bin/sleep 1000 &
    if [ $((i % 100)) -eq 0 ]; then
        echo $i | tee -a /tmp/timings3
        date | tee -a /tmp/timings3
    fi
done

(All scripts are run under sudo, and nsexec can be found at
https://code.launchpad.net/~serge-hallyn/+junk/nsexec)

So that isn't an issue.

When I run a script to just time veth pair creations as Benoit did,
creating 2000 veth pairs and timing each batch of 100, the time does
degrade, from 1 second for the first 100 up to 8 seconds for the last
100.

(That script for me is:

#!/bin/sh
rm -f /tmp/timings
for i in `seq 1 2000`; do
    ip link add type veth
    if [ $((i % 100)) -eq 0 ]; then
        echo $i | tee -a /tmp/timings
        date | tee -a /tmp/timings
        ls /sys/class/net > /dev/null
    fi
done
)

But when I actually pass veth instances to those unshared network
namespaces:

#!/bin/sh
rm -f /tmp/timings2
echo 0 | tee -a /tmp/timings2
date | tee -a /tmp/timings2
for i in `seq 1 2000`; do
    nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
    ip link add type veth
    dev2=`ls -d /sys/class/net/veth* | tail -1`
    dev=`basename $dev2`
    pid=`cat /tmp/pid.$i`
    ip link set $dev netns $pid
    if [ $((i % 100)) -eq 0 ]; then
        echo $i | tee -a /tmp/timings2
        date | tee -a /tmp/timings2
    fi
    rm -f /tmp/pid.*
done

it goes from 4 seconds for the first hundred to 16 seconds for the last
hundred - a worse regression than simply creating the veths.  Though I
guess that could be accounted for simply by sysfs actions when a veth is
moved from the old netns to the new?

0
Tue Mar 19 20:15:26 UTC 2013
100
Tue Mar 19 20:15:30 UTC 2013
200
Tue Mar 19 20:15:35 UTC 2013
300
Tue Mar 19 20:15:41 UTC 2013
400
Tue Mar 19 20:15:47 UTC 2013
500
Tue Mar 19 20:15:54 UTC 2013
600
Tue Mar 19 20:16:02 UTC 2013
700
Tue Mar 19 20:16:09 UTC 2013
800
Tue Mar 19 20:16:17 UTC 2013
900
Tue Mar 19 20:16:26 UTC 2013
1000
Tue Mar 19 20:16:35 UTC 2013
1100
Tue Mar 19 20:16:46 UTC 2013
1200
Tue Mar 19 20:16:57 UTC 2013
1300
Tue Mar 19 20:17:08 UTC 2013
1400
Tue Mar 19 20:17:21 UTC 2013
1500
Tue Mar 19 20:17:33 UTC 2013
1600
Tue Mar 19 20:17:46 UTC 2013
1700
Tue Mar 19 20:17:59 UTC 2013
1800
Tue Mar 19 20:18:13 UTC 2013
1900
Tue Mar 19 20:18:29 UTC 2013
2000
Tue Mar 19 20:18:48 UTC 2013

-serge
* Re: [Lxc-users] Containers slow to start after 1600
From: Eric W. Biederman @ 2013-03-20 0:29 UTC
To: Serge Hallyn; +Cc: Benoit Lourdelet, linux-kernel, lxc-users

Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Hi,
>
> Benoit was kind enough to follow up on some scalability issues with
> larger (but not huge, imo) numbers of containers.  Running a script
> to simply time the creation of veth pairs on a rather large (iiuc)
> machine, he got the following numbers (each time is for creation of
> the full number, not the latest increment - so 1123 seconds to create
> 5000 veth pairs)

A kernel version and a profile would be interesting.

At first glance it looks like things are slowing down dramatically as
the run gets longer, which should not happen.

There used to be quadratic issues in proc and sysfs, but those should
have been reduced to O(N log N) as of 3.4 or so.  A comparison to the
dummy device, which is a touch simpler than veth and is more frequently
benchmarked, could also be revealing.

>> >> Number of veth, time to create, time to delete:
>> >>
>> >>   500    18    26
>> >>  1000    57    70
>> >>  2000   193   250
>> >>  3000   435   510
>> >>  4000   752   824
>> >>  5000  1123  1185
>
>> Benoit
>
> Ok.  Ran some tests on a tiny cloud instance.  When I simply run 2k
> tasks in unshared new network namespaces, it flies by.
>
> [...]
>
> So that isn't an issue.
>
> When I run a script to just time veth pair creations as Benoit did,
> creating 2000 veth pairs and timing each batch of 100, the time does
> degrade, from 1 second for the first 100 up to 8 seconds for the last
> 100.
>
> [...]
>
> But when I actually pass veth instances to those unshared network
> namespaces:
>
> [...]
>
> it goes from 4 seconds for the first hundred to 16 seconds for the last
> hundred - a worse regression than simply creating the veths.  Though I
> guess that could be accounted for simply by sysfs actions when a veth
> is moved from the old netns to the new?

And network stack actions.  Creating one end of the veth in the desired
network namespace is likely desirable: "ip link add type veth peer netns ..."

RCU has also played a critical role in the past, depending on what the
network configuration is when devices are torn down.

For device movement and device teardown there is at least one
synchronize_rcu, which at scale can slow things down.  But if the
synchronize_rcu dominates it should be mostly a constant-factor cost, not
something that gets worse with each device creation.

Oh, and to start with I would specify the name of each network device to
create.  Last I looked, coming up with a network device name is an O(N)
operation in the number of device names.

Just to see what I am seeing, in 3.9-rc1 I did:

# time for i in $(seq 1 2000) ; do ip link add a$i type veth peer name b$i; done
real    0m23.607s
user    0m0.656s
sys     0m18.132s

# time for i in $(seq 1 2000) ; do ip link del a$i ; done
real    2m8.038s
user    0m0.964s
sys     0m18.688s

That is tremendously better than you are reporting below for device
creation.  The deletes are still slow because it is hard to batch that
kind of delete; having a bunch of network namespaces exit all at once
would likely be much faster, as the teardowns can be batched and the
synchronize_rcu calls drastically reduced.

What is making you say there is a regression?  A regression compared to
what?

Hmm.

# time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
real    2m11.007s
user    0m3.508s
sys     1m55.452s

Ok, there is most definitely something non-linear about the cost of
creating network devices.

I am happy to comment from previous experience, but I'm not volunteering
to profile and fix this one.

Eric

> [...]
>
> -serge
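Concretely, Eric's two suggestions - explicit device names and creating
the peer directly in the target namespace - would turn the third script
above into something like the following minimal sketch.  It assumes the
same nsexec helper (its -P flag writing the child's pid to a file) and an
iproute2 new enough to accept "netns" in the peer section; the a$i/b$i
names and /tmp/pid.$i paths are purely illustrative:

#!/bin/sh
# Sketch: create each pair with explicit names (avoids the kernel's O(N)
# scan for a free "vethN" name) and put the peer directly into the new
# namespace (avoids a separate "ip link set ... netns" move and the extra
# sysfs/netlink work it implies).
for i in `seq 1 2000`; do
    nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
    pid=`cat /tmp/pid.$i`
    ip link add a$i type veth peer name b$i netns $pid
    rm -f /tmp/pid.$i
done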
* Re: [Lxc-users] Containers slow to start after 1600
From: Benoit Lourdelet @ 2013-03-20 20:09 UTC
To: Eric W. Biederman, Serge Hallyn; +Cc: linux-kernel@vger.kernel.org, lxc-users

Hello,

The measurement has been done with kernel 3.8.2.

Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64 x86_64
x86_64 GNU/Linux

What information would you like to see on the kernel ?

Regards

Benoit

On 20/03/2013 01:29, "Eric W. Biederman" <ebiederm@xmission.com> wrote:

>Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
>> Benoit was kind enough to follow up on some scalability issues with
>> larger (but not huge, imo) numbers of containers. [...]
>
>A kernel version and a profile would be interesting.
>
>At first glance it looks like things are slowing down dramatically as
>the run gets longer, which should not happen.
>
>There used to be quadratic issues in proc and sysfs, but those should
>have been reduced to O(N log N) as of 3.4 or so.  A comparison to the
>dummy device, which is a touch simpler than veth and is more frequently
>benchmarked, could also be revealing.
>
>[...]
>
>I am happy to comment from previous experience, but I'm not volunteering
>to profile and fix this one.
>
>Eric
* Re: [Lxc-users] Containers slow to start after 1600
From: Eric W. Biederman @ 2013-03-20 20:38 UTC
To: Benoit Lourdelet; +Cc: Serge Hallyn, linux-kernel@vger.kernel.org, lxc-users

Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello,
>
> The measurement has been done with kernel 3.8.2.
>
> Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64
> x86_64 x86_64 GNU/Linux

Two different kernel versions?

> What information would you like to see on the kernel ?

The question is where the kernel is spending its time, so profiling
information should help us see that.  Something like:

$ cat > test-script.sh << 'EOF'
#!/bin/bash
for i in $(seq 1 2000) ; do
    ip link add a$i type veth peer name b$i
done
EOF

$ perf record -a -g test-script.sh
$ perf report

I don't do anywhere near enough work with perf to remember what good
options are.

You definitely don't want to time anything where you are doing something
silly like asking ip link add to generate device names, which is O(N^2)
when you create one device at a time.

And of course there is the interesting discrepancy: why can I add 5000
veth pairs in 120 seconds when it takes you 1123 seconds?  Do you have a
very slow cpu in your test environment, or was your test asking the
kernel to generate names?

Once we know where the kernel is spending its time we can look to see if
there is anything that is easy to fix, and where to point you.

Both my timing and yours indicate that there is something taking O(N^2)
time in there.  So it would at least be interesting to see what that
something is.

Eric
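For what it's worth, a fuller version of that profiling run might look
like the sketch below.  It assumes a reasonably recent perf (the -g and
--stdio options) and uses the dummy device as the simpler baseline
suggested earlier in the thread; the names and counts are illustrative:

#!/bin/bash
# Sketch: profile veth creation with call graphs, then time the simpler
# dummy device as a baseline for comparison.
cat > test-script.sh << 'EOF'
#!/bin/bash
for i in $(seq 1 2000) ; do
    ip link add a$i type veth peer name b$i
done
EOF
chmod +x test-script.sh

sudo perf record -a -g -- ./test-script.sh   # -g records call chains
sudo perf report --stdio | head -40          # top symbols across the run

# Baseline: dummy devices skip most of the veth- and netns-specific work.
time sudo bash -c 'for i in $(seq 1 2000) ; do ip link add d$i type dummy ; done'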
* Re: [Lxc-users] Containers slow to start after 1600
From: Benoit Lourdelet @ 2013-03-22 17:05 UTC
To: Eric W. Biederman; +Cc: Serge Hallyn, linux-kernel@vger.kernel.org, lxc-users

Hello,

I tried multiple kernels with similar results.

I ran the following on 3.8.2:

$ cat > test-script.sh << 'EOF'
#!/bin/bash
for i in $(seq 1 2000) ; do
    ip link add a$i type veth peer name b$i
done
EOF

$ perf record -a test-script.sh
$ perf report

# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, id = { 33
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, uncore_pcu = 15, tracepoint = 2, uncore_imc_0 = 17, uncore_imc_1 = 18, uncore_imc_2 = 19, uncore_imc_3 = 20, uncore_qpi_0 = 2
# ========
#
# Samples: 2M of event 'cycles'
# Event count (approx.): 638240250735
#
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  ..............................
#
    11.13%  ip          [kernel.kallsyms]  [k] snmp_fold_field
     4.24%  ip          [kernel.kallsyms]  [k] find_next_bit
     2.51%  swapper     [kernel.kallsyms]  [k] intel_idle
     1.71%  init        libnih.so.1.0.0    [.] nih_list_add_after
     1.35%  ip          [kernel.kallsyms]  [k] memcpy
     1.28%  ip          [xt_conntrack]     [k] 0x0000000000005296
     1.26%  ip          [kernel.kallsyms]  [k] rtnl_fill_ifinfo
     1.25%  sed         ld-2.15.so         [.] 0x0000000000015972
     1.13%  ifquery     ld-2.15.so         [.] 0x0000000000008bdb
     1.10%  init        [kernel.kallsyms]  [k] page_fault
     1.00%  ifquery     [kernel.kallsyms]  [k] page_fault
     0.97%  init        libc-2.15.so       [.] 0x0000000000131e42
     0.94%  sed         [kernel.kallsyms]  [k] page_fault
     0.75%  ip          [kernel.kallsyms]  [k] inet6_fill_ifla6_attrs
     0.67%  ip          [kernel.kallsyms]  [k] memset
     0.66%  init        [kernel.kallsyms]  [k] copy_pte_range
     0.64%  init        init               [.] 0x0000000000012433
     0.58%  sed         libc-2.15.so       [.] 0x000000000008149a
     0.48%  init        libnih.so.1.0.0    [.] nih_tree_next_post_full

If I increase the number of links created, the time spent in
snmp_fold_field can even increase:

    12.04%  ip          [kernel.kallsyms]  [k] snmp_fold_field
    12.03%  sudo        [kernel.kallsyms]  [k] snmp_fold_field
     8.56%  sudo        libc-2.15.so       [.] 0x000000000009198b
     4.39%  sudo        [kernel.kallsyms]  [k] find_next_bit
     4.36%  ip          [kernel.kallsyms]  [k] find_next_bit
     3.17%  irqbalance  libc-2.15.so       [.] 0x000000000003d298
     2.01%  init        libnih.so.1.0.0    [.] nih_list_add_after
     1.63%  ip          [kernel.kallsyms]  [k] rtnl_fill_ifinfo

Regards

Benoit

On 20/03/2013 21:38, "Eric W. Biederman" <ebiederm@xmission.com> wrote:

>Benoit Lourdelet <blourdel@juniper.net> writes:
>
>> The measurement has been done with kernel 3.8.2.
>>
>> Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64
>> x86_64 x86_64 GNU/Linux
>
>Two different kernel versions?
>
>> What information would you like to see on the kernel ?
>
>The question is where the kernel is spending its time, so profiling
>information should help us see that.
>
>[...]
>
>Both my timing and yours indicate that there is something taking O(N^2)
>time in there.  So it would at least be interesting to see what that
>something is.
>
>Eric
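One way to narrow down what is driving those snmp_fold_field samples is
to repeat the run with call-graph recording and inspect that symbol's
call chains - a sketch only, since the exact perf flags vary by version
and the report.txt name is just illustrative:

sudo perf record -a -g -- ./test-script.sh
sudo perf report --stdio > report.txt
grep -B 2 -A 15 snmp_fold_field report.txt   # show the recorded call chains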
* Re: [Lxc-users] Containers slow to start after 1600
From: Eric W. Biederman @ 2013-03-22 21:31 UTC
To: Benoit Lourdelet; +Cc: Serge Hallyn, linux-kernel@vger.kernel.org, lxc-users

Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello,
>
> I tried multiple kernels with similar results.
>
> I ran the following on 3.8.2:
>
> $ perf record -a test-script.sh
> $ perf report
>
> # Overhead  Command  Shared Object      Symbol
> # ........  .......  .................  ...........................
>
>    11.13%  ip       [kernel.kallsyms]  [k] snmp_fold_field
>     4.24%  ip       [kernel.kallsyms]  [k] find_next_bit
>     2.51%  swapper  [kernel.kallsyms]  [k] intel_idle
>     1.71%  init     libnih.so.1.0.0    [.] nih_list_add_after
>     1.35%  ip       [kernel.kallsyms]  [k] memcpy
>     1.28%  ip       [xt_conntrack]     [k] 0x0000000000005296
>     1.26%  ip       [kernel.kallsyms]  [k] rtnl_fill_ifinfo
>
> [...]
>
> If I increase the number of links created, the time spent in
> snmp_fold_field can even increase:
>
>    12.04%  ip       [kernel.kallsyms]  [k] snmp_fold_field
>    12.03%  sudo     [kernel.kallsyms]  [k] snmp_fold_field
>
> [...]

Interesting.  And snmp_fold_field is an interesting hint.

It looks like something is listing all of your interfaces, probably
between interface creations.  And that will definitely take this process
from O(N log N) to O(N^2).  Although I find the find_next_bit call also
concerning.

snmp_fold_field is used when reading /proc/net/snmp, /proc/net/netstat,
and when running ip link to show all of the interfaces.

I suspect find_next_bit is coming out of the per-cpu allocator, and it
might be partially responsible.  But that doesn't feel right.

Regardless, from this trace it looks like the performance problem on
creation is that something in userspace is calling /sbin/ip to list all
of the interfaces, and that is causing the interface creation to slow
down.

Oh, I see what is happening.  In iproute:

  do_iplink
    iplink_modify
      ll_init_map

So the limitation at this point is in iproute, and not in the kernel.
And with the following patch the time to create 5000 devices drops from
120 to 11 seconds for me.

So it looks like the kernel is fine; it is whatever userspace tools are
of interest that are the bottleneck.  Although a quicker way to do a
name-to-ifindex mapping might be interesting.

Eric

diff --git a/ip/iplink.c b/ip/iplink.c
index ad33611..58af369 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 		}
 	}
 
-	ll_init_map(&rth);
-
 	if (!(flags & NLM_F_CREATE)) {
 		if (!dev) {
 			fprintf(stderr, "Not enough information: \"dev\" "
@@ -542,6 +540,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 			exit(-1);
 		}
 
+		ll_init_map(&rth);
 		req.i.ifi_index = ll_name_to_index(dev);
 		if (req.i.ifi_index == 0) {
 			fprintf(stderr, "Cannot find device \"%s\"\n", dev);
@@ -555,6 +554,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	if (link) {
 		int ifindex;
 
+		ll_init_map(&rth);
 		ifindex = ll_name_to_index(link);
 		if (ifindex == 0) {
 			fprintf(stderr, "Cannot find device \"%s\"\n",
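A quick way to check that result on another system is to time each batch
of creations and compare a stock ip binary against one rebuilt with the
patch above - a minimal sketch, with the counts and device names purely
illustrative:

#!/bin/sh
# Sketch: print the wall-clock seconds each batch of 100 creations takes;
# a per-batch time that keeps growing is the O(N^2) symptom.
IP=${IP:-ip}          # point IP at a patched ip binary to compare
prev=`date +%s`
for i in `seq 1 2000`; do
    $IP link add a$i type veth peer name b$i
    if [ $((i % 100)) -eq 0 ]; then
        now=`date +%s`
        echo "$i $((now - prev))s"
        prev=$now
    fi
done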
Thread overview: 6+ messages
[not found] <20130319182829.GA15451@sergelap>
[not found] ` <CD6E73D8.71FE%blourdel@juniper.net>
2013-03-19 20:34 ` [Lxc-users] Containers slow to start after 1600 Serge Hallyn
2013-03-20 0:29 ` Eric W. Biederman
2013-03-20 20:09 ` Benoit Lourdelet
2013-03-20 20:38 ` Eric W. Biederman
2013-03-22 17:05 ` Benoit Lourdelet
2013-03-22 21:31 ` Eric W. Biederman