From: Benoit Lourdelet <blourdel@juniper.net>
To: "Eric W. Biederman" <ebiederm@xmission.com>,
Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
lxc-users <lxc-users@lists.sourceforge.net>
Subject: Re: [Lxc-users] Containers slow to start after 1600
Date: Wed, 20 Mar 2013 20:09:29 +0000
Message-ID: <CD6F35B6.7221%blourdel@juniper.net>
In-Reply-To: <87txo6ewxf.fsf@xmission.com>
Hello,
The measurement has been done with kernel 3.8.2.
Linux ieng-serv06 3.7.9 #3 SMP Wed Feb 27 02:38:58 PST 2013 x86_64 x86_64 x86_64 GNU/Linux
What information about the kernel would you like to see?
Regards
Benoit
On 20/03/2013 01:29, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>Serge Hallyn <serge.hallyn@ubuntu.com> writes:
>
>> Hi,
>>
>> Benoit was kind enough to follow up on some scalability issues with
>> larger (but not huge imo) numbers of containers. Running a script
>> to simply time the creation of veth pairs on a rather large (iiuc)
>> machine, he got the following numbers (time is for creation of the
>> full number, not latest increment - so 1123 seconds to create 5000
>> veth pairs)
>
>A kernel version and a profile would be interesting.
>
>At first glance it looks like things are dramatically slowing down as
>things get longer which should not happen.
>
>There used to be quadratic issues in proc and sysfs that should have
>been reduced to O(NlogN) as of 3.4 or so. A comparison to the dummy
>device which is a touch simpler than veth and is more frequently
>benchmarked could also be revealing.
>
>>> >Quoting Benoit Lourdelet (blourdel@juniper.net):
>>> >> Hello Serge,
>>> >>
>>> >> I put together a small table, running your script for various values:
>>> >>
>>> >> Times are in seconds.
>>> >>
>>> >> Number of veth   Time to create   Time to delete
>>> >>            500               18               26
>>> >>           1000               57               70
>>> >>           2000              193              250
>>> >>           3000              435              510
>>> >>           4000              752              824
>>> >>           5000             1123             1185
>>
>>>
>>> Benoit
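One way to read the table above is to divide each total by the device count, which gives an average cost per veth pair. The following is a minimal sketch, assuming the rows are fed in as plain "count create delete" triples; the function name per_pair_ms is made up for illustration and is not part of anyone's script in this thread.

```shell
#!/bin/sh
# per_pair_ms (hypothetical helper): read "count create_s delete_s" rows on
# stdin and print the average create and delete cost per veth pair in ms.
per_pair_ms() {
    awk 'NF == 3 { printf "%d %.1f %.1f\n", $1, $2 * 1000 / $1, $3 * 1000 / $1 }'
}

# Rows copied from the table above (seconds).
per_pair_ms <<'EOF'
500 18 26
1000 57 70
2000 193 250
3000 435 510
4000 752 824
5000 1123 1185
EOF
```

On these numbers the average create cost climbs from about 36 ms per pair at 500 pairs to about 225 ms per pair at 5000, so the total time grows much faster than linearly.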
>>
>> Ok. Ran some tests on a tiny cloud instance. When I simply run 2k tasks
>> in unshared new network namespaces, it flies by.
>>
>> #!/bin/sh
>> rm -f /tmp/timings3
>> date | tee -a /tmp/timings3
>> for i in `seq 1 2000`; do
>>     nsexec -n -- /bin/sleep 1000 &
>>     if [ $((i % 100)) -eq 0 ]; then
>>         echo $i | tee -a /tmp/timings3
>>         date | tee -a /tmp/timings3
>>     fi
>> done
>>
>> (all scripts run under sudo, and nsexec can be found at
>> https://code.launchpad.net/~serge-hallyn/+junk/nsexec)
>>
>> So that isn't an issue.
>>
>> When I run a script to just time veth pair creations like Benoit ran,
>> creating 2000 veth pairs and timing the results for each 100, the time
>> does degrade, from 1 second for the first 100 up to 8 seconds for the
>> last 100.
>>
>> (that script for me is:
>>
>> #!/bin/sh
>> rm -f /tmp/timings
>> for i in `seq 1 2000`; do
>>     ip link add type veth
>>     if [ $((i % 100)) -eq 0 ]; then
>>         echo $i | tee -a /tmp/timings
>>         date | tee -a /tmp/timings
>>         ls /sys/class/net > /dev/null
>>     fi
>> done
>> )
>>
>> But when I actually pass veth instances to those unshared network
>> namespaces:
>>
>> #!/bin/sh
>> rm -f /tmp/timings2
>> echo 0 | tee -a /tmp/timings2
>> date | tee -a /tmp/timings2
>> for i in `seq 1 2000`; do
>>     nsexec -n -P /tmp/pid.$i -- /bin/sleep 1000 &
>>     ip link add type veth
>>     dev2=`ls -d /sys/class/net/veth* | tail -1`
>>     dev=`basename $dev2`
>>     pid=`cat /tmp/pid.$i`
>>     ip link set $dev netns $pid
>>     if [ $((i % 100)) -eq 0 ]; then
>>         echo $i | tee -a /tmp/timings2
>>         date | tee -a /tmp/timings2
>>     fi
>>     rm -f /tmp/pid.*
>> done
>>
>> it goes from 4 seconds for the first hundred to 16 seconds for
>> the last hundred - a worse regression than simply creating the
>> veths. Though I guess that could be accounted for simply by
>> sysfs actions when a veth is moved from the old netns to the
>> new?
>
>And network stack actions. Creating one end of the veth in the desired
>network namespace is likely desirable. "ip link add type veth peer netns ..."
>
>RCU has also played a critical role in the past, as has the state of the
>network configuration when devices are torn down.
>
>For device movement and device teardown there is at least one
>synchronize_rcu, which at scale can slow things down. But if the
>synchronize_rcu dominates, it should be mostly a constant-factor cost, not
>something that gets worse with each device creation.
>
>Oh, and to start with I would specify the name of each network device to
>create. Last I looked, coming up with a network device name is an O(N)
>operation in the number of device names.
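The two suggestions above (explicit device names, and creating the peer directly in the target namespace) could be combined roughly as follows. This is a dry-run sketch only: the helper veth_add_cmd and the a$i/b$i naming are illustrative assumptions, and it prints the command rather than running it, since the "peer ... netns" form should be checked against the installed iproute2 before relying on it.

```shell
#!/bin/sh
# veth_add_cmd (hypothetical helper): print an "ip link add" command that
# names both ends of the veth pair explicitly, so the kernel does not have
# to search for a free vethN name.
#   $1  index used to build the names a$1 and b$1
#   $2  optional PID whose network namespace should receive the peer
veth_add_cmd() {
    if [ -n "$2" ]; then
        printf 'ip link add a%s type veth peer name b%s netns %s\n' "$1" "$1" "$2"
    else
        printf 'ip link add a%s type veth peer name b%s\n' "$1" "$1"
    fi
}

# Dry run: show the commands for the first three pairs.
for i in 1 2 3; do
    veth_add_cmd "$i"
done
```

Piping the output to sh as root would actually create the devices. Passing a PID as the second argument places the peer in that task's namespace at creation time, which avoids the separate "ip link set ... netns" move.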
>
>Just to see what I am seeing in 3.9-rc1 I did:
>
># time for i in $(seq 1 2000) ; do ip link add a$i type veth peer name b$i; done
>real 0m23.607s
>user 0m0.656s
>sys 0m18.132s
>
># time for i in $(seq 1 2000) ; do ip link del a$i ; done
>real 2m8.038s
>user 0m0.964s
>sys 0m18.688s
>
>Which is tremendously better than you are reporting below for device
>creation.
>Now the deletes are still slow because it is hard to batch that kind of
>delete. Having a bunch of network namespaces exit all at once would
>likely be much faster, as the deletes can be batched and the
>synchronize_rcu calls drastically reduced.
>
>What is making you say there is a regression? A regression compared to
>what?
>
>Hmm.
>
># time for i in $(seq 1 5000) ; do ip link add a$i type veth peer name b$i; done
>real 2m11.007s
>user 0m3.508s
>sys 1m55.452s
>
>Ok there is most definitely something non-linear about the cost of
>creating network devices.
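To put a rough number on that non-linearity, the two wall-clock figures above (2000 pairs in 0m23.607s, 5000 pairs in 2m11.007s) can be fitted to t ~ N^k. This is only a two-point estimate, so treat it as a sketch; the function name scaling_exponent is made up for illustration.

```shell
#!/bin/sh
# scaling_exponent (hypothetical helper): estimate k in t ~ N^k from two
# measurements, using k = log(t2/t1) / log(n2/n1).
#   $1 n1, $2 t1 (seconds), $3 n2, $4 t2 (seconds)
scaling_exponent() {
    awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" \
        'BEGIN { printf "%.2f\n", log(t2 / t1) / log(n2 / n1) }'
}

# 0m23.607s for 2000 pairs, 2m11.007s (131.007 s) for 5000 pairs.
scaling_exponent 2000 23.607 5000 131.007   # prints 1.87
```

An exponent of 1.0 would mean linear scaling. A value close to 2 is consistent with a per-device cost that itself grows with the number of existing devices.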
>
>I am happy to comment from previous experience but I'm not volunteering
>to profile and fix this one.
>
>Eric
>
>
>> 0
>> Tue Mar 19 20:15:26 UTC 2013
>> 100
>> Tue Mar 19 20:15:30 UTC 2013
>> 200
>> Tue Mar 19 20:15:35 UTC 2013
>> 300
>> Tue Mar 19 20:15:41 UTC 2013
>> 400
>> Tue Mar 19 20:15:47 UTC 2013
>> 500
>> Tue Mar 19 20:15:54 UTC 2013
>> 600
>> Tue Mar 19 20:16:02 UTC 2013
>> 700
>> Tue Mar 19 20:16:09 UTC 2013
>> 800
>> Tue Mar 19 20:16:17 UTC 2013
>> 900
>> Tue Mar 19 20:16:26 UTC 2013
>> 1000
>> Tue Mar 19 20:16:35 UTC 2013
>> 1100
>> Tue Mar 19 20:16:46 UTC 2013
>> 1200
>> Tue Mar 19 20:16:57 UTC 2013
>> 1300
>> Tue Mar 19 20:17:08 UTC 2013
>> 1400
>> Tue Mar 19 20:17:21 UTC 2013
>> 1500
>> Tue Mar 19 20:17:33 UTC 2013
>> 1600
>> Tue Mar 19 20:17:46 UTC 2013
>> 1700
>> Tue Mar 19 20:17:59 UTC 2013
>> 1800
>> Tue Mar 19 20:18:13 UTC 2013
>> 1900
>> Tue Mar 19 20:18:29 UTC 2013
>> 2000
>> Tue Mar 19 20:18:48 UTC 2013
>>
>> -serge
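The timing log above alternates a counter line with a date(1) line, so the seconds spent on each batch of 100 can be pulled out mechanically. A minimal sketch, assuming the log is in exactly that format (unquoted, as written to /tmp/timings2) and that the run does not cross midnight; the function name interval_seconds is made up for illustration.

```shell
#!/bin/sh
# interval_seconds (hypothetical helper): read a log of alternating counter
# and date(1) lines on stdin and print "<counter> <seconds since previous
# timestamp>" for every timestamp after the first.
interval_seconds() {
    awk '
        NF == 1 { count = $1; next }      # bare counter line, e.g. "100"
        NF >= 4 {                         # date line; field 4 is HH:MM:SS
            split($4, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (have_prev) print count, now - prev
            prev = now
            have_prev = 1
        }'
}

# Usage (not run here): interval_seconds < /tmp/timings2
```

Fed the first four lines of the log above, this prints "100 4", matching the 4 seconds quoted for the first hundred.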