* [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
@ 2009-10-17 22:18 Benjamin LaHaise
  2009-10-18  4:26 ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-17 22:18 UTC
To: netdev

Hi folks,

Below is a patch that changes the interaction between netdev_wait_allrefs()
and dev_put() to replace an msleep(250) with a wait_event() on the final
dev_put().  This change reduces the time spent sleeping during an
unregister_netdev(), causing the system to go from <1% CPU time to something
more CPU bound (50+% in a test vm).  This increases the speed of a bulk
unregister_netdev() from 4 interfaces per second to over 500 per second on
my test system.  The requirement comes from handling thousands of L2TP
sessions where a tunnel flap results in all interfaces being torn down at
one time.

Note that there is still more work to be done in this area.

		-ben

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 812a5f3..e20d4a4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1514,10 +1514,7 @@ extern void netdev_run_todo(void);
  *
  * Release reference to device to allow it to be freed.
  */
-static inline void dev_put(struct net_device *dev)
-{
-	atomic_dec(&dev->refcnt);
-}
+void dev_put(struct net_device *dev);
 
 /**
  *	dev_hold - get reference to device
diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..155217f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4945,6 +4945,16 @@ out:
 }
 EXPORT_SYMBOL(register_netdev);
 
+DECLARE_WAIT_QUEUE_HEAD(netdev_refcnt_wait);
+
+void dev_put(struct net_device *dev)
+{
+	if (atomic_dec_and_test(&dev->refcnt))
+		wake_up(&netdev_refcnt_wait);
+}
+EXPORT_SYMBOL(dev_put);
+
+
 /*
  * netdev_wait_allrefs - wait until all references are gone.
  *
@@ -4984,7 +4994,8 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		wait_event_timeout(netdev_refcnt_wait,
+				   !atomic_read(&dev->refcnt), HZ/4);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Eric Dumazet @ 2009-10-18  4:26 UTC
To: Benjamin LaHaise; +Cc: netdev

Benjamin LaHaise wrote:
> Hi folks,
>
> Below is a patch that changes the interaction between netdev_wait_allrefs()
> and dev_put() to replace an msleep(250) with a wait_event() on the final
> dev_put().  This change reduces the time spent sleeping during an
> unregister_netdev(), causing the system to go from <1% CPU time to something
> more CPU bound (50+% in a test vm).  This increases the speed of a bulk
> unregister_netdev() from 4 interfaces per second to over 500 per second on
> my test system.  The requirement comes from handling thousands of L2TP
> sessions where a tunnel flap results in all interfaces being torn down at
> one time.
>
> Note that there is still more work to be done in this area.
>
> 		-ben
>
> +DECLARE_WAIT_QUEUE_HEAD(netdev_refcnt_wait);
> +
> +void dev_put(struct net_device *dev)
> +{
> +	if (atomic_dec_and_test(&dev->refcnt))
> +		wake_up(&netdev_refcnt_wait);
> +}
> +EXPORT_SYMBOL(dev_put);
> +

Unfortunately this slows down the fast path by an order of magnitude.

atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
now that we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
is not that cheap and, more importantly, forbids a per_cpu conversion.

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Benjamin LaHaise @ 2009-10-18 16:13 UTC
To: Eric Dumazet; +Cc: netdev

On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
> Unfortunately this slows down the fast path by an order of magnitude.
>
> atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
> now that we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
> is not that cheap and, more importantly, forbids a per_cpu conversion.

dev_put() is not a fast path by any means.  atomic_dec_and_test() costs
the same as atomic_dec() on any modern CPU -- the cost is in the cacheline
bouncing and serialisation both require.  The case of the device count
becoming 0 is quite rare -- any device with a route on it will never hit
a reference count of 0.

		-ben

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Eric Dumazet @ 2009-10-18 17:51 UTC
To: Benjamin LaHaise; +Cc: netdev

Benjamin LaHaise wrote:
> On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
>> Unfortunately this slows down the fast path by an order of magnitude.
>>
>> atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
>> now that we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
>> is not that cheap and, more importantly, forbids a per_cpu conversion.
>
> dev_put() is not a fast path by any means.  atomic_dec_and_test() costs
> the same as atomic_dec() on any modern CPU -- the cost is in the cacheline
> bouncing and serialisation both require.  The case of the device count
> becoming 0 is quite rare -- any device with a route on it will never hit
> a reference count of 0.

You forgot af_packet sendmsg() users, and heavy routers where the route cache
is stressed or disabled. I know several of them; they even added mmap TX
support to get better performance. They will be disappointed by your patch.

atomic_dec_and_test() is definitely more expensive, because of the strong
barrier semantics and the added test after the decrement.
Whether refcnt is close to zero or not has no impact, even on 2-year-old CPUs.

Machines hardly ever had to dismantle a netdevice in a normal lifetime, so maybe
we were lazy with this insane msleep(250). This came from old Linux times,
when CPUs were soooo slow and programmers soooo lazy :)

The msleep(250) should be tuned first. Then if it is really necessary
to dismantle 100,000 netdevices per second, we might have to think a bit more.

Just try msleep(1 or 2), it should work quite well.

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Benjamin LaHaise @ 2009-10-18 18:21 UTC
To: Eric Dumazet; +Cc: netdev

On Sun, Oct 18, 2009 at 07:51:56PM +0200, Eric Dumazet wrote:
> You forgot af_packet sendmsg() users, and heavy routers where the route cache
> is stressed or disabled. I know several of them; they even added mmap TX
> support to get better performance. They will be disappointed by your patch.

If that's a problem, the cacheline overhead is a more serious issue.
AF_PACKET should really keep the reference on the device between syscalls.
Do you have a benchmark in mind that would show the overhead?

> atomic_dec_and_test() is definitely more expensive, because of the strong
> barrier semantics and the added test after the decrement.
> Whether refcnt is close to zero or not has no impact, even on 2-year-old CPUs.

At least on x86, the atomic_dec_and_test() cost is pretty much identical to
atomic_dec().  If this really is a performance bottleneck, people should be
complaining about the cache miss overhead and lock overhead, which will dwarf
the atomic_dec_and_test() cost vs atomic_dec().  Granted, I'm not saying that
it isn't an issue on other architectures, but for x86 the lock prefix is
what's expensive, not whether the flags are checked after doing the operation.

If your complaint is about uninlining dev_put(), I'm indifferent to keeping
it inline or out of line and can change the patch to suit.

> Machines hardly ever had to dismantle a netdevice in a normal lifetime, so maybe
> we were lazy with this insane msleep(250). This came from old Linux times,
> when CPUs were soooo slow and programmers soooo lazy :)

It's only now that machines can actually route one or more 10Gbps links that
it really becomes an issue.  I've been hacking around it for some time, but
fixing it properly is starting to be a real requirement.

> The msleep(250) should be tuned first. Then if it is really necessary
> to dismantle 100,000 netdevices per second, we might have to think a bit more.
>
> Just try msleep(1 or 2), it should work quite well.

My goal is tearing down 100,000 interfaces in a few seconds, which really is
necessary.  Right now we're running about 40,000 interfaces on a not yet
saturated 10Gbps link.  Going to dual 10Gbps links means pushing more than
100,000 subscriber interfaces, and it looks like a modern dual socket system
can handle that.

A bigger concern is rtnl_lock().  It is a huge impediment to scaling up
interface creation/deletion on multicore systems.  That's going to be a
lot more invasive to fix, though.

		-ben

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Eric Dumazet @ 2009-10-18 19:36 UTC
To: Benjamin LaHaise; +Cc: netdev

Benjamin LaHaise wrote:
>
> My goal is tearing down 100,000 interfaces in a few seconds, which really is
> necessary.  Right now we're running about 40,000 interfaces on a not yet
> saturated 10Gbps link.  Going to dual 10Gbps links means pushing more than
> 100,000 subscriber interfaces, and it looks like a modern dual socket system
> can handle that.
>
> A bigger concern is rtnl_lock().  It is a huge impediment to scaling up
> interface creation/deletion on multicore systems.  That's going to be a
> lot more invasive to fix, though.

Don't forget synchronize_net() too (two calls per rollback_registered()).

You need something to dismantle XXXXX interfaces at once, instead of
serializing one by one, because in three years you'll want to dismantle
1,000,000 interfaces in one second.

Maybe defining an asynchronous unregister_netdev() function...

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
From: Octavian Purdila @ 2009-10-21 12:39 UTC
To: Benjamin LaHaise; +Cc: Eric Dumazet, netdev, Cosmin Ratiu

On Sunday 18 October 2009 21:21:44 you wrote:

> > The msleep(250) should be tuned first. Then if it is really necessary
> > to dismantle 100,000 netdevices per second, we might have to think a bit
> > more.
> > Just try msleep(1 or 2), it should work quite well.
>
> My goal is tearing down 100,000 interfaces in a few seconds, which really
> is necessary.  Right now we're running about 40,000 interfaces on a not
> yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
> than 100,000 subscriber interfaces, and it looks like a modern dual socket
> system can handle that.
>

I would also like to see this patch in; we are running into scalability issues
with creating/deleting lots of interfaces as well.

Thanks,
tavi

* [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 15:40 UTC
To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu

Octavian Purdila wrote:
> On Sunday 18 October 2009 21:21:44 you wrote:
>>> The msleep(250) should be tuned first. Then if it is really necessary
>>> to dismantle 100,000 netdevices per second, we might have to think a bit
>>> more.
>>> Just try msleep(1 or 2), it should work quite well.
>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>> is necessary.  Right now we're running about 40,000 interfaces on a not
>> yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>> than 100,000 subscriber interfaces, and it looks like a modern dual socket
>> system can handle that.
>>
>
> I would also like to see this patch in; we are running into scalability issues
> with creating/deleting lots of interfaces as well.

Ben's patch only addresses interface deletion, and only one part of the
problem, maybe the most visible one for the current kernel.

Adding lots of interfaces only needs several threads running concurrently.

Before applying/examining his patch I suggest identifying all dev_put() spots
that can be deleted and replaced by something more scalable. I began this job
but others can help me.

RTNL and rcu grace periods are going to hurt anyway, so you probably need
to use many tasks to be able to delete lots of interfaces in parallel.

netdev_run_todo() should also use a better algorithm to allow parallelism.

The following patch doesn't slow down dev_put() users, and the real scalability
problems will surface and might be addressed.

[PATCH] net: allow netdev_wait_allrefs() to run faster

netdev_wait_allrefs() waits until all references to a device vanish.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users report that no more than 4 devices can be dismantled per second;
this is a pretty serious problem for extreme setups.

Most likely, references only wait for a rcu grace period that should come
fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fca2e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		schedule_timeout_uninterruptible(1);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 16:09 UTC
To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu

Eric Dumazet wrote:
> Octavian Purdila wrote:
>> On Sunday 18 October 2009 21:21:44 you wrote:
>>>> The msleep(250) should be tuned first. Then if it is really necessary
>>>> to dismantle 100,000 netdevices per second, we might have to think a bit
>>>> more.
>>>> Just try msleep(1 or 2), it should work quite well.
>>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>> is necessary.  Right now we're running about 40,000 interfaces on a not
>>> yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>> than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>> system can handle that.
>>>
>> I would also like to see this patch in; we are running into scalability issues
>> with creating/deleting lots of interfaces as well.
>
> Ben's patch only addresses interface deletion, and only one part of the
> problem, maybe the most visible one for the current kernel.
>
> Adding lots of interfaces only needs several threads running concurrently.
>
> Before applying/examining his patch I suggest identifying all dev_put() spots
> that can be deleted and replaced by something more scalable. I began this job
> but others can help me.
>
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
>
> netdev_run_todo() should also use a better algorithm to allow parallelism.
>
> The following patch doesn't slow down dev_put() users, and the real scalability
> problems will surface and might be addressed.
>

Here are typical timings (on the current kernel, but in the following example
netdev_wait_allrefs() doesn't wait at all, because my netdevice has no refs):

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real	0m0.001s
user	0m0.000s
sys	0m0.001s
# time ip link set mv161 up

real	0m0.001s
user	0m0.000s
sys	0m0.001s
# time ip link set mv161 down

real	0m0.021s
user	0m0.000s
sys	0m0.001s
# time ip link del mv161

real	0m0.022s
user	0m0.000s
sys	0m0.001s
# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real	0m0.001s
user	0m0.001s
sys	0m0.001s
# time ip link set mv161 up

real	0m0.001s
user	0m0.000s
sys	0m0.001s
# time ip link del mv161

real	0m0.036s
user	0m0.000s
sys	0m0.001s

A 22 ms (or 36 ms) delay is also problematic if you want to dismantle
1,000,000 netdevices at once.

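To put the figures from this part of the thread side by side, here is a small back-of-the-envelope sketch in C. It assumes strictly serial teardown with one fixed delay per device, which is a simplification, and the helper names `max_deletes_per_sec()` and `serial_teardown_secs()` are hypothetical.

```c
/*
 * Upper bound on deletions per second when every deletion sleeps
 * sleep_ms at least once (the msleep(250) case gives 4 per second).
 */
static double max_deletes_per_sec(double sleep_ms)
{
	return 1000.0 / sleep_ms;
}

/* Wall-clock seconds to delete n devices one by one at per_dev_ms each. */
static double serial_teardown_secs(long n, double per_dev_ms)
{
	return (double)n * per_dev_ms / 1000.0;
}
```

Under that assumption, `serial_teardown_secs(1000000, 22.0)` gives 22,000 seconds (a little over six hours) and the 36 ms case gives 36,000 seconds (ten hours), while 100,000 devices at 250 ms each would take 25,000 seconds. That is why the per-device latency still matters even once the 250 ms sleep is gone.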
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Benjamin LaHaise @ 2009-10-21 16:51 UTC
To: Eric Dumazet; +Cc: Octavian Purdila, netdev, Cosmin Ratiu

On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Ben's patch only addresses interface deletion, and only one part of the
> problem, maybe the most visible one for the current kernel.

The first part I've been tackling has been the overhead in procfs, sysctl
and sysfs.  I've got patches for some of the issues, hacks for others, and
should have something to post in a few days.  Getting rid of those overheads
is enough to get to decent interface creation times for the first 20 or 30k
interfaces.

On the interface deletion side of things, within the network code, fib_hash
has a few linear scans that really start hurting.  trie is a bit better,
but I haven't started digging too deeply into its flush/remove overhead yet,
aside from noticing that trie has a linear scan if
CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.
fn_trie_flush() is currently the worst offender during deletion.

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric Dumazet @ 2009-10-21 19:54 UTC
To: Benjamin LaHaise; +Cc: Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise wrote:
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben's patch only addresses interface deletion, and only one part of the
>> problem, maybe the most visible one for the current kernel.
>
> The first part I've been tackling has been the overhead in procfs, sysctl
> and sysfs.  I've got patches for some of the issues, hacks for others, and
> should have something to post in a few days.  Getting rid of those overheads
> is enough to get to decent interface creation times for the first 20 or 30k
> interfaces.
>
> On the interface deletion side of things, within the network code, fib_hash
> has a few linear scans that really start hurting.  trie is a bit better,
> but I haven't started digging too deeply into its flush/remove overhead yet,
> aside from noticing that trie has a linear scan if
> CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.
> fn_trie_flush() is currently the worst offender during deletion.

Well, there are many things to change...

# ip -o link | wc -l
13097
# time ip -o link show mv22248
13045: mv22248@eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN \ link/ether 00:1e:0b:8e:c8:08 brd ff:ff:ff:ff:ff:ff

real	0m0.840s
user	0m0.473s
sys	0m0.368s

Almost one second to get the link status of one particular interface :(

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric W. Biederman @ 2009-10-29 23:07 UTC
To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben's patch only addresses interface deletion, and only one part of the
>> problem, maybe the most visible one for the current kernel.
>
> The first part I've been tackling has been the overhead in procfs, sysctl
> and sysfs.

Could you keep me in the loop with that?  I have some pending cleanups for
all of those pieces of code and may be able to help/advise/review.

Eric

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Benjamin LaHaise @ 2009-10-29 23:38 UTC
To: Eric W. Biederman; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
> Could you keep me in the loop with that?  I have some pending cleanups for
> all of those pieces of code and may be able to help/advise/review.

Here are the sysfs scaling improvements.  I have to break them up, as there
are 3 separate changes in this patch: 1. use an rbtree for name lookup in
sysfs, 2. keep track of the number of directories for the purpose of
generating the link count, as otherwise too much cpu time is spent in
sysfs_count_nlink when new entries are added, and 3. when adding a new
sysfs_dirent, walk the list backwards when linking it in, as higher
numbered inodes tend to be at the end of the list, not the beginning.

		-ben

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5fad489..38ad7c8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -43,10 +43,18 @@ static DEFINE_IDA(sysfs_ino_ida);
 static void sysfs_link_sibling(struct sysfs_dirent *sd)
 {
 	struct sysfs_dirent *parent_sd = sd->s_parent;
-	struct sysfs_dirent **pos;
+	struct sysfs_dirent **pos, *prev = NULL;
+	struct rb_node **new, *parent;
 
 	BUG_ON(sd->s_sibling);
 
+	if (parent_sd->s_dir.children_tail &&
+	    parent_sd->s_dir.children_tail->s_ino < sd->s_ino) {
+		prev = parent_sd->s_dir.children_tail;
+		pos = &prev->s_sibling;
+		goto got_it;
+	}
+
 	/* Store directory entries in order by ino.  This allows
 	 * readdir to properly restart without having to add a
 	 * cursor into the s_dir.children list.
@@ -54,9 +62,36 @@ static void sysfs_link_sibling(struct sysfs_dirent *sd)
 	for (pos = &parent_sd->s_dir.children; *pos; pos = &(*pos)->s_sibling) {
 		if (sd->s_ino < (*pos)->s_ino)
 			break;
+		prev = *pos;
 	}
+got_it:
+	if (prev == parent_sd->s_dir.children_tail)
+		parent_sd->s_dir.children_tail = sd;
 	sd->s_sibling = *pos;
+	sd->s_sibling_prev = prev;
 	*pos = sd;
+	parent_sd->s_nr_children_dir += (sysfs_type(sd) == SYSFS_DIR);
+
+	// rb tree insert
+	new = &(parent_sd->s_dir.child_rb_root.rb_node);
+	parent = NULL;
+
+	while (*new) {
+		struct sysfs_dirent *this =
+			container_of(*new, struct sysfs_dirent, s_rb_node);
+		int result = strcmp(sd->s_name, this->s_name);
+
+		parent = *new;
+		if (result < 0)
+			new = &((*new)->rb_left);
+		else if (result > 0)
+			new = &((*new)->rb_right);
+		else
+			BUG();
+	}
+
+	rb_link_node(&sd->s_rb_node, parent, new);
+	rb_insert_color(&sd->s_rb_node, &parent_sd->s_dir.child_rb_root);
 }
 
 /**
@@ -71,16 +106,22 @@ static void sysfs_link_sibling(struct sysfs_dirent *sd)
  */
 static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
 {
-	struct sysfs_dirent **pos;
+	struct sysfs_dirent **pos, *prev = NULL;
 
-	for (pos = &sd->s_parent->s_dir.children; *pos;
-	     pos = &(*pos)->s_sibling) {
-		if (*pos == sd) {
-			*pos = sd->s_sibling;
-			sd->s_sibling = NULL;
-			break;
-		}
-	}
+	prev = sd->s_sibling_prev;
+	if (prev)
+		pos = &prev->s_sibling;
+	else
+		pos = &sd->s_parent->s_dir.children;
+	if (sd == sd->s_parent->s_dir.children_tail)
+		sd->s_parent->s_dir.children_tail = prev;
+	*pos = sd->s_sibling;
+	if (sd->s_sibling)
+		sd->s_sibling->s_sibling_prev = prev;
+
+	sd->s_parent->s_nr_children_dir -= (sysfs_type(sd) == SYSFS_DIR);
+	sd->s_sibling_prev = NULL;
+	rb_erase(&sd->s_rb_node, &sd->s_parent->s_dir.child_rb_root);
 }
 
 /**
@@ -331,6 +372,9 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 	sd->s_mode = mode;
 	sd->s_flags = type;
 
+	if (type == SYSFS_DIR)
+		sd->s_dir.child_rb_root = RB_ROOT;
+
 	return sd;
 
  err_out2:
@@ -630,11 +674,20 @@ void sysfs_addrm_finish(struct sysfs_addrm_cxt *acxt)
 struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
 				       const unsigned char *name)
 {
-	struct sysfs_dirent *sd;
-
-	for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
-		if (!strcmp(sd->s_name, name))
-			return sd;
+	struct rb_node *node = parent_sd->s_dir.child_rb_root.rb_node;
+
+	while (node) {
+		struct sysfs_dirent *data =
+			container_of(node, struct sysfs_dirent, s_rb_node);
+		int result;
+		result = strcmp(name, data->s_name);
+		if (result < 0)
+			node = node->rb_left;
+		else if (result > 0)
+			node = node->rb_right;
+		else
+			return data;
+	}
 	return NULL;
 }
 
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index e28cecf..ff6e960 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -191,14 +191,7 @@ static struct lock_class_key sysfs_inode_imutex_key;
 
 static int sysfs_count_nlink(struct sysfs_dirent *sd)
 {
-	struct sysfs_dirent *child;
-	int nr = 0;
-
-	for (child = sd->s_dir.children; child; child = child->s_sibling)
-		if (sysfs_type(child) == SYSFS_DIR)
-			nr++;
-
-	return nr + 2;
+	return sd->s_nr_children_dir + 2;
 }
 
 static void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode)
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index af4c4e7..22fd1bc 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -9,6 +9,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/rbtree.h>
 
 struct sysfs_open_dirent;
 
@@ -16,7 +17,10 @@ struct sysfs_open_dirent;
 struct sysfs_elem_dir {
 	struct kobject		*kobj;
 	/* children list starts here and goes through sd->s_sibling */
+
 	struct sysfs_dirent	*children;
+	struct sysfs_dirent	*children_tail;
+	struct rb_root		child_rb_root;
 };
 
 struct sysfs_elem_symlink {
@@ -52,6 +56,8 @@ struct sysfs_dirent {
 	atomic_t		s_active;
 	struct sysfs_dirent	*s_parent;
 	struct sysfs_dirent	*s_sibling;
+	struct sysfs_dirent	*s_sibling_prev;
+	struct rb_node		s_rb_node;
 	const char		*s_name;
 
 	union {
@@ -65,6 +71,8 @@ struct sysfs_dirent {
 	ino_t			s_ino;
 	umode_t			s_mode;
 	struct sysfs_inode_attrs *s_iattr;
+
+	int			s_nr_children_dir;
 };
 
 #define SD_DEACTIVATED_BIAS		INT_MIN

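Two of the three ideas in the patch above (the tail fast path with a back pointer, and a maintained directory counter for the link count) can be tried out in a small userspace sketch. Everything here is a simplified stand-in: `struct entry`, `struct dir`, `dir_link()`, `dir_unlink()` and `dir_nlink()` are hypothetical names, the rbtree name index is left out, and this sketch also updates the successor's back pointer on every insert so that middle insertions stay consistent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for sysfs_dirent / sysfs_elem_dir. */
struct entry {
	unsigned long ino;
	bool is_dir;
	struct entry *next;	/* plays the role of s_sibling */
	struct entry *prev;	/* plays the role of s_sibling_prev */
};

struct dir {
	struct entry *head;	/* children */
	struct entry *tail;	/* children_tail */
	int nr_dirs;		/* s_nr_children_dir */
};

/* Insert in ino order; O(1) when e->ino is larger than the current tail. */
static void dir_link(struct dir *d, struct entry *e)
{
	struct entry *prev = NULL, **pos;

	assert(e->next == NULL && e->prev == NULL);

	if (d->tail && d->tail->ino < e->ino) {
		prev = d->tail;		/* fast path: append at the end */
		pos = &prev->next;
	} else {
		for (pos = &d->head; *pos; pos = &(*pos)->next) {
			if (e->ino < (*pos)->ino)
				break;
			prev = *pos;
		}
	}
	e->next = *pos;
	e->prev = prev;
	if (e->next)
		e->next->prev = e;	/* keep the successor's back pointer valid */
	else
		d->tail = e;
	*pos = e;
	d->nr_dirs += e->is_dir;
}

/* Unlink in O(1) using the back pointer instead of scanning from head. */
static void dir_unlink(struct dir *d, struct entry *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		d->head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		d->tail = e->prev;
	d->nr_dirs -= e->is_dir;
	e->next = e->prev = NULL;
}

/* Link count in the style of sysfs_count_nlink(): subdirectories plus 2. */
static int dir_nlink(const struct dir *d)
{
	return d->nr_dirs + 2;
}
```

Since new inode numbers usually grow, most inserts take the append path, and `dir_nlink()` no longer depends on the number of children. The trade-off is one extra pointer per entry plus one counter, which is the memory cost raised later in the thread.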
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Eric W. Biederman @ 2009-10-30  1:45 UTC
To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
>> Could you keep me in the loop with that?  I have some pending cleanups for
>> all of those pieces of code and may be able to help/advise/review.
>
> Here are the sysfs scaling improvements.  I have to break them up, as there
> are 3 separate changes in this patch: 1. use an rbtree for name lookup in
> sysfs, 2. keep track of the number of directories for the purpose of
> generating the link count, as otherwise too much cpu time is spent in
> sysfs_count_nlink when new entries are added, and 3. when adding a new
> sysfs_dirent, walk the list backwards when linking it in, as higher
> numbered inodes tend to be at the end of the list, not the beginning.

The reason for the existence of sysfs_dirent is that as things grow larger
we want to keep the amount of RAM consumed down.  So we don't pin
everything in the dcache.  So we try and keep the amount of memory
consumed down.

So I would like to see how much we can pare down.

For dealing with seeks in the middle of readdir I expect the best way
to do that is to be inspired by htrees in extNfs and return a hash of
the filename as our position, and keep the filename list sorted by
that hash.  Since we are optimizing for size we don't need to store
that hash.  Then we can turn that list into some flavor of sorted
binary tree.

I'm surprised sysfs_count_nlink shows up, as it is not directly on the
add or remove path.  I think the answer there is to change s_flags
into a set of bitfields and make link_count one of them, perhaps
16 bits long.  If we ever overflow our bitfield we can just set link
count to 0, and userspace (aka find) will know it can't optimize
based on link count.

I was expecting someone to run into problems with the linear directory
of sysfs someday.

Eric

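The bitfield idea in the message above can be sketched as follows. This is one possible reading, not code from the thread: the struct layout, the field widths other than the 16-bit count, the sentinel `NR_DIRS_UNKNOWN`, and the helper names are all assumptions made for illustration. The sketch makes the overflow state sticky, so a later removal cannot bring back a count that is no longer trustworthy.

```c
/* Sentinel meaning "the counter overflowed; the link count is unknown". */
#define NR_DIRS_UNKNOWN 0xFFFFu

/* Hypothetical packing of a type field and a 16-bit subdirectory count. */
struct packed_flags {
	unsigned int type    : 8;
	unsigned int nr_dirs : 16;
};

static void note_subdir_added(struct packed_flags *f)
{
	/* Incrementing 0xFFFE yields the sentinel, which then stays put. */
	if (f->nr_dirs != NR_DIRS_UNKNOWN)
		f->nr_dirs++;
}

static void note_subdir_removed(struct packed_flags *f)
{
	if (f->nr_dirs != NR_DIRS_UNKNOWN && f->nr_dirs > 0)
		f->nr_dirs--;
}

/* Returns 0 after an overflow so tools like find stop trusting the count. */
static unsigned int reported_nlink(const struct packed_flags *f)
{
	if (f->nr_dirs == NR_DIRS_UNKNOWN)
		return 0;
	return f->nr_dirs + 2u;
}
```

Compared with a separate `int` counter per entry, this keeps the structure size unchanged at the cost of giving up an exact link count for directories with more than 65,534 subdirectories.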
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
From: Benjamin LaHaise @ 2009-10-30 14:35 UTC
To: Eric W. Biederman; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
> The reason for the existence of sysfs_dirent is that as things grow larger
> we want to keep the amount of RAM consumed down.  So we don't pin
> everything in the dcache.  So we try and keep the amount of memory
> consumed down.

I'm aware of that, but for users running into this sort of scaling issue,
the amount of RAM required is a non-issue (30,000 interfaces require about
1GB of RAM at present), making the question more one of how to avoid the
overhead for users who don't require it.  I'd prefer a config option.  The
only way I can really see saving memory usage is to somehow tie sysfs dirent
lookups into the network stack's own tables for looking up device entries.
The network stack already has to cope with this kind of scaling, and that
would save the RAM.

> So I would like to see how much we can pare down.

> For dealing with seeks in the middle of readdir I expect the best way
> to do that is to be inspired by htrees in extNfs and return a hash of
> the filename as our position, and keep the filename list sorted by
> that hash.  Since we are optimizing for size we don't need to store
> that hash.  Then we can turn that list into some flavor of sorted
> binary tree.

readdir() generally isn't an issue at present.

> I'm surprised sysfs_count_nlink shows up, as it is not directly on the
> add or remove path.  I think the answer there is to change s_flags
> into a set of bitfields and make link_count one of them, perhaps
> 16 bits long.  If we ever overflow our bitfield we can just set link
> count to 0, and userspace (aka find) will know it can't optimize
> based on link count.

It shows up because of the bits of userspace (udev) touching the directory
from things like the hotplug code path.

> I was expecting someone to run into problems with the linear directory
> of sysfs someday.

Alas, sysfs isn't the only offender.

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-30 14:35 ` Benjamin LaHaise @ 2009-10-30 14:43 ` Eric Dumazet 2009-10-30 23:25 ` Eric W. Biederman 1 sibling, 0 replies; 41+ messages in thread From: Eric Dumazet @ 2009-10-30 14:43 UTC (permalink / raw) To: Benjamin LaHaise Cc: Eric W. Biederman, Octavian Purdila, netdev, Cosmin Ratiu Benjamin LaHaise wrote: > On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote: >> I was expecting someone to run into problems with the linear directory >> of sysfs someday. > > Alas, sysfs isn't the only offender. > > -ben In my tests, the sysfs lookup by name is the big offender. I believe you should post your rb_tree patch ASAP ;) Then we can go further. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-30 14:35 ` Benjamin LaHaise 2009-10-30 14:43 ` Eric Dumazet @ 2009-10-30 23:25 ` Eric W. Biederman 2009-10-30 23:53 ` Benjamin LaHaise 1 sibling, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2009-10-30 23:25 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu Benjamin LaHaise <bcrl@lhnet.ca> writes: > On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote: >> The reason for the existence of sysfs_dirent is as things grow larger >> we want to keep the amount of RAM consumed down. So we don't pin >> everything in the dcache. So we try and keep the amount of memory >> consumed down. > > I'm aware of that, but for users running into this sort of scaling issue, > the amount of RAM required is a non-issue (30,000 interfaces require about > 1GB of RAM at present), making the question more one of how to avoid the > overhead for users who don't require it. I'd prefer a config option. The > only way I can really see saving memory usage is to somehow tie sysfs dirent > lookups into the network stack's own tables for looking up device entries. > The network stack already has to cope with this kind of scaling, and that > would save the RAM. There is that. I'm trying to figure out how to add the improvements without making sysfs_dirent larger. Which I think that is doable. >> So I would like to see how much we can par down. > >> For dealing with seeks in the middle of readdir I expect the best way >> to do that is to be inspired by htrees in extNfs and return a hash of >> the filename as our position, and keep the filename list sorted by >> that hash. Since we are optimizing for size we don't need to store >> that hash. Then we can turn that list into a some flavor of sorted >> binary tree. > > readdir() generally isn't an issue at present. 
Supporting seekdir into the middle of a directory is the entire reason I keep the entries sorted by inode. If we sort by a hash of the name. We can use the hash to support directory position in readdir and seekdir. And we can completely remove the linear list when the rb_tree is introduced. >> I'm surprised sysfs_count_nlink shows up, as it is not directly on the >> add or remove path. I think the answer there is to change s_flags >> into a set of bitfields and make link_count one of them, perhaps >> 16bits long. If we ever overflow our bitfield we can just set link >> count to 0, and userspace (aka find) will know it can't optimized >> based on link count. > > It shows up because of the bits of userspace (udev) touching the directory > from things like the hotplug code path. I realized after sending the message that s_mode in sysfs_dirent is a real size offense. It is a 16bit field packed in between two longs. So in practice it is possible to move the s_mode up next to s_flags and add a s_nlink after it both unsigned short and get a cheap sysfs_nlink. >> I was expecting someone to run into problems with the linear directory >> of sysfs someday. > > Alas, sysfs isn't the only offender. Agreed. Sysfs is probably the easiest to untangle. Since I'm not quite ready to post my patches. I will briefly mention what I have in my queue and hopefully get things posted. I have changes to make it so that sysfs never has to go from the sysfs_dirent to the sysfs inode. I have changes to sys_sysctl() so that it becomes a filesystem lookup under /proc/sys. Which ultimately makes the code easier to maintain and debug. Now back to getting things forward ported and ready to post. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
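The hash-as-directory-position idea discussed above can be illustrated in userspace. This is only a minimal sketch under stated assumptions: FNV-1a is used as the hash because the thread does not name one, a sorted array stands in for the proposed sorted binary tree, and `struct entry`, `name_hash`, `dir_sort`, and `dir_seek` are hypothetical names rather than sysfs code. The hash is recomputed on demand instead of being stored, matching the remark that the structure is being optimized for size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a directory entry; not the kernel's sysfs_dirent. */
struct entry {
    const char *name;
};

/* FNV-1a, chosen only as an example hash (assumption, not from the thread). */
static uint32_t name_hash(const char *s)
{
    uint32_t h = 2166136261u;

    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Order entries by hash of the name; fall back to the name on collisions. */
static int cmp_entry(const void *a, const void *b)
{
    const struct entry *ea = a, *eb = b;
    uint32_t ha = name_hash(ea->name), hb = name_hash(eb->name);

    if (ha != hb)
        return ha < hb ? -1 : 1;
    return strcmp(ea->name, eb->name);
}

static void dir_sort(struct entry *dir, size_t n)
{
    qsort(dir, n, sizeof(*dir), cmp_entry);
}

/*
 * seekdir-style lookup: return the index of the first entry whose hash is
 * >= pos, or n if there is none. Binary search, so O(log n) per seek.
 */
static size_t dir_seek(const struct entry *dir, size_t n, uint32_t pos)
{
    size_t lo = 0, hi = n;

    assert(dir != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (name_hash(dir[mid].name) < pos)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

A reader would report `name_hash()` of the next entry as its position and later resume with `dir_seek()`. Because the position is derived from the name rather than from an array index, it stays meaningful when entries are added or removed between calls. Two names with the same hash share a position, which a real implementation would have to handle; the sketch does not.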
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-30 23:25 ` Eric W. Biederman @ 2009-10-30 23:53 ` Benjamin LaHaise 2009-10-31 0:37 ` Eric W. Biederman 0 siblings, 1 reply; 41+ messages in thread From: Benjamin LaHaise @ 2009-10-30 23:53 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu On Fri, Oct 30, 2009 at 04:25:52PM -0700, Eric W. Biederman wrote: > I realized after sending the message that s_mode in sysfs_dirent is a > real size offense. It is a 16bit field packed in between two longs. > So in practice it is possible to move the s_mode up next to s_flags > and add a s_nlink after it both unsigned short and get a cheap sysfs_nlink. That doesn't work -- the number of directory entries can easily exceed 65535. Current mid-range hardware is good enough to terminate 100,000 network interfaces on a single host. > Since I'm not quite ready to post my patches. I will briefly > mention what I have in my queue and hopefully get things posted. > > I have changes to make it so that sysfs never has to go from > the sysfs_dirent to the sysfs inode. Ah, interesting. > I have changes to sys_sysctl() so that it becomes a filesystem lookup > under /proc/sys. Which ultimately makes the code easier to maintain > and debug. That sounds like a much saner approach, but has the wrinkle that procfs can be configured out. > Now back to getting things forward ported and ready to post. I'm looking forward to those changes. I've been ignoring procfs for the time being by disabling the per-interface entries in the network stack, but there is some desire to be able to enable rp_filter on a per-interface radius config at runtime. rp_filter has to be disabled across the board on my access routers, as there are several places where asymmetric routing is used for performance reasons. -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-30 23:53 ` Benjamin LaHaise @ 2009-10-31 0:37 ` Eric W. Biederman 0 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2009-10-31 0:37 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu Benjamin LaHaise <bcrl@lhnet.ca> writes: > On Fri, Oct 30, 2009 at 04:25:52PM -0700, Eric W. Biederman wrote: >> I realized after sending the message that s_mode in sysfs_dirent is a >> real size offense. It is a 16bit field packed in between two longs. >> So in practice it is possible to move the s_mode up next to s_flags >> and add a s_nlink after it both unsigned short and get a cheap sysfs_nlink. > > That doesn't work -- the number of directory entries can easily exceed 65535. > Current mid range hardware is good enough to terminate 100,000 network > interfaces on a single host. On overflow you nlink becomes zero and you leave it there. That is how ondisk filesystems handle that case on directories, and find etc knows how to deal. >> Since I'm not quite ready to post my patches. I will briefly >> mention what I have in my queue and hopefully get things posted. >> >> I have changes to make it so that sysfs never has to go from >> the sysfs_dirent to the sysfs inode. > > Ah, interesting. I have to cleanup sysfs before I merge changes for supporting multiple network namespaces. >> I have changes to sys_sysctl() so that it becomes a filesystem lookup >> under /proc/sys. Which ultimately makes the code easier to maintain >> and debug. > > That sounds like a much saner approach, but has the wrinkle that procfs can > be configured out. So I will add the dependency. There are very few serious users of sys_sysctl, and all of them have been getting a deprecated interface warning every time they use it for the last several years. >> Now back to getting things forward ported and ready to post. > > I'm looking forward to those changes. 
I've been ignoring procfs for the > time being by disabling the per-interface entries in the network stack, > but there is some desire to be able to enable rp_filter on a per-interface > radius config at runtime. rp_filter has to be disabled across the board > on my access routers, as there are several places where asymmetric routing > is used for performance reasons. Just out of curiosity, does the loose rp_filter mode work for you? Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
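The "on overflow nlink becomes zero and stays there" behaviour described in this exchange can be sketched as a small saturating counter. This is a userspace illustration only: `struct dirent_sketch` and both helpers are hypothetical, the field order merely follows the description of placing a 16-bit `s_nlink` next to `s_mode` and `s_flags`, and treating 0 as "count unknown" relies on the assumption that a live directory always has a link count of at least 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, loosely following the fields named in the thread. */
struct dirent_sketch {
    unsigned int s_flags;
    uint16_t s_mode;   /* 16-bit mode, moved up next to s_flags */
    uint16_t s_nlink;  /* uses space that would otherwise be padding */
};

/* Count one more subdirectory; on overflow, park the counter at 0. */
static void nlink_inc(struct dirent_sketch *sd)
{
    assert(sd != NULL);
    if (sd->s_nlink == 0)
        return;                 /* already overflowed: stay at 0 */
    if (sd->s_nlink == UINT16_MAX)
        sd->s_nlink = 0;        /* overflow: 0 means "count unknown" */
    else
        sd->s_nlink++;
}

/* Count one fewer subdirectory; an overflowed counter is left alone. */
static void nlink_dec(struct dirent_sketch *sd)
{
    assert(sd != NULL);
    if (sd->s_nlink == 0)
        return;                 /* the true count is no longer tracked */
    sd->s_nlink--;
}
```

With this scheme a directory holding more than 65535 subdirectories simply reports a link count of 0, which is the case the thread says tools such as find already cope with by not using the link count as an optimization.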
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-29 23:38 ` Benjamin LaHaise 2009-10-30 1:45 ` Eric W. Biederman @ 2010-08-09 17:23 ` Ben Greear 2010-08-09 17:34 ` Benjamin LaHaise 1 sibling, 1 reply; 41+ messages in thread From: Ben Greear @ 2010-08-09 17:23 UTC (permalink / raw) To: Benjamin LaHaise Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu On 10/29/2009 04:38 PM, Benjamin LaHaise wrote: > On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote: >> Could you keep me in the loop with that. I have some pending cleanups for >> all of those pieces of code and may be able to help/advice/review. > > Here are the sysfs scaling improvements. I have to break them up, as there > are 3 separate changes in this patch: 1. use an rbtree for name lookup in > sysfs, 2. keep track of the number of directories for the purpose of > generating the link count, as otherwise too much cpu time is spent in > sysfs_count_nlink when new entries are added, and 3. when adding a new > sysfs_dirent, walk the list backwards when linking it in, as higher > numbered inodes tend to be at the end of the list, not the beginning. I was just comparing my out-of-tree patch set to .35, and it appears little or none of the patches discussed in this thread are in the upstream kernel yet. Specifically, there is still that msleep(250) in netdev_wait_allrefs Is anyone still trying to get the improvements needed for adding/deleting lots of interfaces into the kernel? Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 17:23 ` Ben Greear @ 2010-08-09 17:34 ` Benjamin LaHaise 2010-08-09 17:44 ` Ben Greear 2010-08-09 19:59 ` Eric W. Biederman 0 siblings, 2 replies; 41+ messages in thread From: Benjamin LaHaise @ 2010-08-09 17:34 UTC (permalink / raw) To: Ben Greear Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu Hello Ben, On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote: > I was just comparing my out-of-tree patch set to .35, and it appears > little or none of the patches discussed in this thread are in the > upstream kernel yet. I was waiting on Eric's sysfs changes for namespaces to settle down, but ended up getting busy on other things. I guess now is a good time to pick this back up and try to merge my changes for improving interface scaling. I'll send out a new version of the patches sometime in the next couple of days. I'm also about to make a new Babylon release as well, I just need to write some more documentation. :-/ Btw, one thing I noticed but haven't been able to come up with a fix for yet is that iptables has scaling issues with lots of interfaces. Specifically, we had to start adding one iptables rule per interface for smtp filtering (not all subscribers are permitted to send smtp directly out to the net, so it has to be per-interface). It seems that those all get dumped into a giant list. What I'd like to do is to be able to attach rules directly to the interface, but I haven't really had the time to do a mergable set of changes for that. Thoughts anyone? -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 17:34 ` Benjamin LaHaise @ 2010-08-09 17:44 ` Ben Greear 2010-08-09 17:48 ` Benjamin LaHaise 2010-08-09 19:59 ` Eric W. Biederman 1 sibling, 1 reply; 41+ messages in thread From: Ben Greear @ 2010-08-09 17:44 UTC (permalink / raw) To: Benjamin LaHaise Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu On 08/09/2010 10:34 AM, Benjamin LaHaise wrote: > Hello Ben, > > On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote: >> I was just comparing my out-of-tree patch set to .35, and it appears >> little or none of the patches discussed in this thread are in the >> upstream kernel yet. > > I was waiting on Eric's sysfs changes for namespaces to settle down, but > ended up getting busy on other things. I guess now is a good time to pick > this back up and try to merge my changes for improving interface scaling. > I'll send out a new version of the patches sometime in the next couple of > days. I'm also about to make a new Babylon release as well, I just need > to write some more documentation. :-/ > > Btw, one thing I noticed but haven't been able to come up with a fix for > yet is that iptables has scaling issues with lots of interfaces. > Specifically, we had to start adding one iptables rule per interface for smtp > filtering (not all subscribers are permitted to send smtp directly out to > the net, so it has to be per-interface). It seems that those all get > dumped into a giant list. What I'd like to do is to be able to attach rules > directly to the interface, but I haven't really had the time to do a mergable > set of changes for that. Thoughts anyone? We also have a few rules per interface, and notice that it takes around 10ms per rule when we are removing them, even when using batching in 'ip': This is on a high-end core i7, otherwise lightly loaded. Total IPv4 rule listings: 2097 Cleaning 2094 rules with ip -batch... 
time -p ip -4 -force -batch /tmp/crr_batch_cmds_4.txt real 17.81 user 0.05 sys 0.00 Patrick had an idea, but I don't think he had time to look at it further: "It's probably the synchronize_rcu() in fib_nl_delrule() and the route flushing happening after rule removal." Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 17:44 ` Ben Greear @ 2010-08-09 17:48 ` Benjamin LaHaise 2010-08-09 18:03 ` Ben Greear 0 siblings, 1 reply; 41+ messages in thread From: Benjamin LaHaise @ 2010-08-09 17:48 UTC (permalink / raw) To: Ben Greear Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu On Mon, Aug 09, 2010 at 10:44:14AM -0700, Ben Greear wrote: > We also have a few rules per interface, and notice that it takes around 10ms > per rule when we are removing them, even when using batching in 'ip': ... > Patrick thought had an idea, but I don't think he had time to > look at it further: > > "Its probably the synchronize_rcu() in fib_nl_delrule() and > the route flushing happening after rule removal." Yes, that would be a problem, but the issue is deeper than that -- if I'm not mistaken it's on the packet processing path that iptables doesn't scale for 100k interfaces with 1 rule per interface. It's been a while since I ran the tests, but I don't think it's changed much. -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 17:48 ` Benjamin LaHaise @ 2010-08-09 18:03 ` Ben Greear 0 siblings, 0 replies; 41+ messages in thread From: Ben Greear @ 2010-08-09 18:03 UTC (permalink / raw) To: Benjamin LaHaise Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu On 08/09/2010 10:48 AM, Benjamin LaHaise wrote: > On Mon, Aug 09, 2010 at 10:44:14AM -0700, Ben Greear wrote: >> We also have a few rules per interface, and notice that it takes around 10ms >> per rule when we are removing them, even when using batching in 'ip': > ... >> Patrick thought had an idea, but I don't think he had time to >> look at it further: >> >> "Its probably the synchronize_rcu() in fib_nl_delrule() and >> the route flushing happening after rule removal." > > Yes, that would be a problem, but the issue is deeper than that -- if I'm > not mistaken it's on the packet processing path that iptables doesn't scale > for 100k interfaces with 1 rule per interface. It's been a while since I > ran the tests, but I don't think it's changed much. It would be nice to tie the rules based on 'iif' to a specific interface. Seems it should give near constant time lookup for rules if we only have a few per interface.... Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 41+ messages in thread
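The suggestion of tying 'iif' rules to a specific interface can be sketched as a table indexed by interface index, so that a lookup only walks the rules attached to that one interface instead of one global list. This is a generic userspace illustration, not the kernel's fib rules or iptables code; `struct rule`, `struct rule_table`, `SKETCH_MAX_IFINDEX`, and the helper functions are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_IFINDEX 4096  /* arbitrary bound chosen for the sketch */

/* Hypothetical rule: matches an input interface and carries an action id. */
struct rule {
    int iif;
    int action;
    struct rule *next;
};

/* One singly linked list of rules per interface index. */
struct rule_table {
    struct rule *per_if[SKETCH_MAX_IFINDEX];
};

/* Attach a rule to the list of its own interface (push at the head). */
static void rule_add(struct rule_table *t, struct rule *r)
{
    assert(r->iif >= 0 && r->iif < SKETCH_MAX_IFINDEX);
    r->next = t->per_if[r->iif];
    t->per_if[r->iif] = r;
}

/* First rule attached to iif, or NULL if there is none. */
static const struct rule *rule_first(const struct rule_table *t, int iif)
{
    if (iif < 0 || iif >= SKETCH_MAX_IFINDEX)
        return NULL;
    return t->per_if[iif];
}

/* Number of rules a lookup for iif has to walk. */
static size_t rules_visited(const struct rule_table *t, int iif)
{
    size_t n = 0;

    for (const struct rule *r = rule_first(t, iif); r; r = r->next)
        n++;
    return n;
}
```

With one rule on each of 1000 interfaces, a lookup here visits a single rule, whereas a global list would be scanned through up to 1000 entries. A fixed-size array is used only to keep the sketch short; a real table would need to cope with sparse and much larger interface indexes.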
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 17:34 ` Benjamin LaHaise 2010-08-09 17:44 ` Ben Greear @ 2010-08-09 19:59 ` Eric W. Biederman 2010-08-09 21:03 ` Benjamin LaHaise 1 sibling, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2010-08-09 19:59 UTC (permalink / raw) To: Benjamin LaHaise Cc: Ben Greear, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu Benjamin LaHaise <bcrl@lhnet.ca> writes: > Hello Ben, > > On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote: >> I was just comparing my out-of-tree patch set to .35, and it appears >> little or none of the patches discussed in this thread are in the >> upstream kernel yet. The network device deletion batching code has gone in, which is a big help, as have some dev_put deletions, so we hit that 250ms delay less often. > I was waiting on Eric's sysfs changes for namespaces to settle down, but > ended up getting busy on other things. I guess now is a good time to pick > this back up and try to merge my changes for improving interface scaling. > I'll send out a new version of the patches sometime in the next couple of > days. I'm also about to make a new Babylon release as well, I just need > to write some more documentation. :-/ sysfs feature wise has now settled down, and the regressions have all been stamped out so now should be a good time to work on scaling. I still have some preliminary patches in my tree, that I will dig up as time goes by. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 19:59 ` Eric W. Biederman @ 2010-08-09 21:03 ` Benjamin LaHaise 2010-08-09 21:17 ` Eric W. Biederman 0 siblings, 1 reply; 41+ messages in thread From: Benjamin LaHaise @ 2010-08-09 21:03 UTC (permalink / raw) To: Eric W. Biederman Cc: Ben Greear, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu On Mon, Aug 09, 2010 at 12:59:14PM -0700, Eric W. Biederman wrote: > The network device deletion batching code has gone in, which is > a big help, as have some dev_put deletions, so we hit that 250ms > delay less often. I'll see how much that helps. Odds are I'm going to have to move the device deletion into a separate thread. That should give me a natural boundary to queue up deletions at, which should fix the tunnel-flap and partial tunnel-flap cases I'm worried about. At some point I have to figure out how to get my API needs met by the in-kernel L2TP code, but that's a worry for another day. > sysfs feature wise has now settled down, and the regressions have all > been stamped out so now should be a good time to work on scaling. > > I still have some preliminary patches in my tree, that I will dig up > as time goes by. I should have some time this evening to run a few tests, and hopefully can post some results. -ben ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2010-08-09 21:03 ` Benjamin LaHaise @ 2010-08-09 21:17 ` Eric W. Biederman 0 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2010-08-09 21:17 UTC (permalink / raw) To: Benjamin LaHaise Cc: Ben Greear, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu Benjamin LaHaise <bcrl@lhnet.ca> writes: > On Mon, Aug 09, 2010 at 12:59:14PM -0700, Eric W. Biederman wrote: >> The network device deletion batching code has gone in, which is >> a big help, as have some dev_put deletions, so we hit that 250ms >> delay less often. > > I'll see how much that helps. Odds are I'm going to have to move the > device deletion into a separate thread. That should give me a natural > boundary to queue up deletions at, which should fix the tunnel-flap and > partial tunnel-flap cases I'm worried about. At some point I have to > figure out how to get my API needs met by the in-kernel L2TP code, but > that's a worry for another day. In case it is useful, if you delete a network namespace in general all of the network device deletions can be batched. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-21 15:40 ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet 2009-10-21 16:09 ` Eric Dumazet 2009-10-21 16:51 ` Benjamin LaHaise @ 2009-10-21 16:55 ` Octavian Purdila 2009-10-23 21:13 ` Paul E. McKenney 3 siblings, 0 replies; 41+ messages in thread From: Octavian Purdila @ 2009-10-21 16:55 UTC (permalink / raw) To: Eric Dumazet; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu On Wednesday 21 October 2009 18:40:07 you wrote: > > > > I would also like to see this patch in, we are running into scalability > > issues with creating/deleting lots of interfaces as well. > > Ben patch only address interface deletion, and one part of the problem, > maybe the more visible one for the current kernel. > > Adding lots of interfaces only needs several threads to run concurently. > > Before applying/examining his patch I suggest identifying all dev_put() > spots than can be deleted and replaced by something more scalable. I began > this job but others can help me. > Yes, I agree with you, there are multiple places which needs to be touched to allow for better scaling with regard to the number of interfaces. We do have patches that addresses some of these issues, but unfortunately they are based on 2.6.7 and some of them are quite ugly hacks :) However, we are in the process of switching to 2.6.31 so I hope we will be able to contribute on this effort. > RTNL and rcu grace periods are going to hurt anyway, so you probably need > to use many tasks to be able to delete lots of interfaces in parallel. > Hmm, how would multiple tasks help here? Isn't the RTNL mutex global? > netdev_run_todo() should also use a better algorithm to allow parallelism. > > Following patch doesnt slow down dev_put() users and real scalability > problems will surface and might be addressed. > > [PATCH] net: allow netdev_wait_allrefs() to run faster > Thanks, I am going to test it on our platform and send back the results. 
tavi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-21 15:40 ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet ` (2 preceding siblings ...) 2009-10-21 16:55 ` Octavian Purdila @ 2009-10-23 21:13 ` Paul E. McKenney 2009-10-24 4:35 ` Eric Dumazet 3 siblings, 1 reply; 41+ messages in thread From: Paul E. McKenney @ 2009-10-23 21:13 UTC (permalink / raw) To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote: > Octavian Purdila a écrit : > > On Sunday 18 October 2009 21:21:44 you wrote: > >>> The msleep(250) should be tuned first. Then if this is really necessary > >>> to dismantle 100.000 netdevices per second, we might have to think a bit > >>> more. > >>> Just try msleep(1 or 2), it should work quite well. > >> My goal is tearing down 100,000 interfaces in a few seconds, which really > >> is necessary. Right now we're running about 40,000 interfaces on a not > >> yet saturated 10Gbps link. Going to dual 10Gbps links means pushing more > >> than 100,000 subscriber interfaces, and it looks like a modern dual socket > >> system can handle that. > >> > > > > I would also like to see this patch in, we are running into scalability issues > > with creating/deleting lots of interfaces as well. > > Ben patch only address interface deletion, and one part of the problem, > maybe the more visible one for the current kernel. > > Adding lots of interfaces only needs several threads to run concurently. > > Before applying/examining his patch I suggest identifying all dev_put() spots than > can be deleted and replaced by something more scalable. I began this job > but others can help me. > > RTNL and rcu grace periods are going to hurt anyway, so you probably need > to use many tasks to be able to delete lots of interfaces in parallel. > > netdev_run_todo() should also use a better algorithm to allow parallelism. 
> > Following patch doesnt slow down dev_put() users and real scalability > problems will surface and might be addressed. > > [PATCH] net: allow netdev_wait_allrefs() to run faster > > netdev_wait_allrefs() waits that all references to a device vanishes. > > It currently uses a _very_ pessimistic 250 ms delay between each probe. > Some users report that no more than 4 devices can be dismantled per second, > this is a pretty serious problem for extreme setups. > > Most likely, references only wait for a rcu grace period that should come > fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery. Is this a place where synchronize_rcu_expedited() is appropriate? (It went in to 2.6.32-rc1.) Thanx, Paul > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > --- > net/core/dev.c | 2 +- > 1 files changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/dev.c b/net/core/dev.c > index 28b0b9e..fca2e4a 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev) > rebroadcast_time = jiffies; > } > > - msleep(250); > + schedule_timeout_uninterruptible(1); > > if (time_after(jiffies, warning_time + 10 * HZ)) { > printk(KERN_EMERG "unregister_netdevice: " > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 41+ messages in thread
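The "no more than 4 devices per second" figure follows directly from the polling interval, and the same arithmetic shows what a one-jiffy sleep would allow. The helpers below are a back-of-the-envelope sketch, not kernel code: they assume every unregister sleeps exactly once, that unregisters run strictly one after another, and that HZ takes one of the common values 100, 250, or 1000. The function names are hypothetical.

```c
#include <assert.h>

/* Upper bound on serial unregisters per second if each one sleeps once. */
static unsigned long max_unregs_per_sec(unsigned long sleep_ms)
{
    assert(sleep_ms > 0);
    return 1000UL / sleep_ms;
}

/* Length of one jiffy in milliseconds for a given HZ (assumes HZ <= 1000). */
static unsigned long jiffy_ms(unsigned long hz)
{
    assert(hz > 0 && hz <= 1000);
    return 1000UL / hz;
}

/* Total seconds spent sleeping while tearing down n interfaces serially. */
static unsigned long teardown_seconds(unsigned long n, unsigned long sleep_ms)
{
    return n * sleep_ms / 1000UL;
}
```

Under these assumptions msleep(250) caps the rate at 1000 / 250 = 4 interfaces per second, so 100,000 interfaces would spend 25,000 seconds (nearly seven hours) sleeping. A one-jiffy sleep raises the cap to 1000 per second at HZ=1000 and to 100 per second at HZ=100. Real numbers also depend on how long references actually linger and on RTNL contention, which this arithmetic ignores.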
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-23 21:13 ` Paul E. McKenney @ 2009-10-24 4:35 ` Eric Dumazet 2009-10-24 5:49 ` Paul E. McKenney 2009-10-24 20:22 ` Stephen Hemminger 0 siblings, 2 replies; 41+ messages in thread From: Eric Dumazet @ 2009-10-24 4:35 UTC (permalink / raw) To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu Paul E. McKenney a écrit : > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote: >> [PATCH] net: allow netdev_wait_allrefs() to run faster >> >> netdev_wait_allrefs() waits that all references to a device vanishes. >> >> It currently uses a _very_ pessimistic 250 ms delay between each probe. >> Some users report that no more than 4 devices can be dismantled per second, >> this is a pretty serious problem for extreme setups. >> >> Most likely, references only wait for a rcu grace period that should come >> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery. > > Is this a place where synchronize_rcu_expedited() is appropriate? > (It went in to 2.6.32-rc1.) > Thanks for the tip Paul I believe netdev_wait_allrefs() is not a perfect candidate, because synchronize_sched_expedited() seems really expensive. Maybe we could call it once only, if we had to call 1 times the jiffie delay ? 
diff --git a/net/core/dev.c b/net/core/dev.c index fa88dcd..9b04b9a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev); static void netdev_wait_allrefs(struct net_device *dev) { unsigned long rebroadcast_time, warning_time; + unsigned int count = 0; rebroadcast_time = warning_time = jiffies; while (atomic_read(&dev->refcnt) != 0) { @@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev) rebroadcast_time = jiffies; } - msleep(250); + if (count++ == 1) + synchronize_rcu_expedited(); + else + schedule_timeout_uninterruptible(1); if (time_after(jiffies, warning_time + 10 * HZ)) { printk(KERN_EMERG "unregister_netdevice: " ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster 2009-10-24 4:35 ` Eric Dumazet @ 2009-10-24 5:49 ` Paul E. McKenney 2009-10-24 8:49 ` Eric Dumazet 2009-10-24 20:22 ` Stephen Hemminger 1 sibling, 1 reply; 41+ messages in thread From: Paul E. McKenney @ 2009-10-24 5:49 UTC (permalink / raw) To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote: > Paul E. McKenney a écrit : > > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote: > >> [PATCH] net: allow netdev_wait_allrefs() to run faster > >> > >> netdev_wait_allrefs() waits that all references to a device vanishes. > >> > >> It currently uses a _very_ pessimistic 250 ms delay between each probe. > >> Some users report that no more than 4 devices can be dismantled per second, > >> this is a pretty serious problem for extreme setups. > >> > >> Most likely, references only wait for a rcu grace period that should come > >> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery. > > > > Is this a place where synchronize_rcu_expedited() is appropriate? > > (It went in to 2.6.32-rc1.) > > Thanks for the tip Paul > > I believe netdev_wait_allrefs() is not a perfect candidate, because > synchronize_sched_expedited() seems really expensive. It does indeed keep the CPUs quite busy for a bit. ;-) > Maybe we could call it once only, if we had to call 1 times > the jiffie delay ? This could be a very useful approach! However, please keep in mind that although synchronize_rcu_expedited() forces a grace period, it does nothing to speed the invocation of other RCU callbacks. In short, synchronize_rcu_expedited() is a faster version of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu() or call_rcu() invocations. The reason I point this out is that it looks to me that the code below is waiting for some other task which is in turn waiting on a grace period. 
But I don't know this code, so could easily be confused. Thanx, paul > diff --git a/net/core/dev.c b/net/core/dev.c > index fa88dcd..9b04b9a 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev); > static void netdev_wait_allrefs(struct net_device *dev) > { > unsigned long rebroadcast_time, warning_time; > + unsigned int count = 0; > > rebroadcast_time = warning_time = jiffies; > while (atomic_read(&dev->refcnt) != 0) { > @@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev) > rebroadcast_time = jiffies; > } > > - msleep(250); > + if (count++ == 1) > + synchronize_rcu_expedited(); > + else > + schedule_timeout_uninterruptible(1); > > if (time_after(jiffies, warning_time + 10 * HZ)) { > printk(KERN_EMERG "unregister_netdevice: " > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 5:49 ` Paul E. McKenney
@ 2009-10-24 8:49 ` Eric Dumazet
  2009-10-24 13:52 ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-10-24 8:49 UTC (permalink / raw)
  To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

Paul E. McKenney wrote:
> On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
>
>> Maybe we could call it once only, if we had to call 1 times
>> the jiffie delay ?
>
> This could be a very useful approach!
>
> However, please keep in mind that although synchronize_rcu_expedited()
> forces a grace period, it does nothing to speed the invocation of other
> RCU callbacks. In short, synchronize_rcu_expedited() is a faster version
> of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
> or call_rcu() invocations.
>
> The reason I point this out is that it looks to me that the code below is
> waiting for some other task which is in turn waiting on a grace period.
> But I don't know this code, so could easily be confused.
>

Normally, we need a synchronize_rcu() call, but I feel it's a bit more than
really needed here.

On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms:

messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.580259] synchronize_net() 4045596 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.588262] synchronize_net() 7769327 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.625014] synchronize_net() 4772052 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.633008] synchronize_net() 7773896 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.669260] synchronize_net() 3958141 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.677259] synchronize_net() 7755817 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.712011] synchronize_net() 2502544 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.720011] synchronize_net() 7767748 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.754259] synchronize_net() 2087946 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.762258] synchronize_net() 7738054 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.796011] synchronize_net() 3392760 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.808025] synchronize_net() 11814619 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.848010] synchronize_net() 8970220 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.856015] synchronize_net() 7800782 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.893008] synchronize_net() 6650174 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.897012] synchronize_net() 3744808 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.940202] synchronize_net() 8354366 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.952137] synchronize_net() 11693215 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.985010] synchronize_net() 2355970 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.989009] synchronize_net() 3771419 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.028137] synchronize_net() 7661195 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.036152] synchronize_net() 7800056 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.083135] synchronize_net() 6774026 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.089145] synchronize_net() 5727189 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.130385] synchronize_net() 10133932 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.134399] synchronize_net() 3773058 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.170136] synchronize_net() 4479194 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.178138] synchronize_net() 7710466 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.217198] synchronize_net() 4323437 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.226206] synchronize_net() 8723108 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.268013] synchronize_net() 6221155 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.280007] synchronize_net() 11719297 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.324008] synchronize_net() 11654511 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.332009] synchronize_net() 7744182 ns

^ permalink raw reply [flat|nested] 41+ messages in thread
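The "between 2 and 12 ms" reading of the log can be checked with a few lines of user-space C. This is a minimal sketch: `summarize()`, `struct latency_summary` and `sample_ns` are names made up for illustration, and only five of the logged synchronize_net() durations are copied in (they include the smallest and the largest entry of the log above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of a set of latency samples, all in nanoseconds. */
struct latency_summary {
    uint64_t min_ns;
    uint64_t max_ns;
    uint64_t mean_ns;  /* integer mean, rounded down */
};

/* Five synchronize_net() durations copied from the log above. */
static const uint64_t sample_ns[] = {
    4045596, 7769327, 2087946, 11814619, 7744182,
};

/* Single pass over the samples: track min and max, accumulate the sum. */
static struct latency_summary summarize(const uint64_t *ns, size_t n)
{
    struct latency_summary s;
    uint64_t sum = 0;
    size_t i;

    assert(n > 0);  /* an empty set has no min or max */
    s.min_ns = s.max_ns = ns[0];
    for (i = 0; i < n; i++) {
        if (ns[i] < s.min_ns)
            s.min_ns = ns[i];
        if (ns[i] > s.max_ns)
            s.max_ns = ns[i];
        sum += ns[i];
    }
    s.mean_ns = sum / n;
    return s;
}
```

Calling `summarize(sample_ns, 5)` gives a minimum of about 2.09 ms and a maximum of about 11.81 ms, which matches the range quoted in the message.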
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 8:49 ` Eric Dumazet
@ 2009-10-24 13:52 ` Paul E. McKenney
  2009-10-24 14:24 ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-24 13:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> > On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
> >
> >> Maybe we could call it once only, if we had to call 1 times
> >> the jiffie delay ?
> >
> > This could be a very useful approach!
> >
> > However, please keep in mind that although synchronize_rcu_expedited()
> > forces a grace period, it does nothing to speed the invocation of other
> > RCU callbacks. In short, synchronize_rcu_expedited() is a faster version
> > of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
> > or call_rcu() invocations.
> >
> > The reason I point this out is that it looks to me that the code below is
> > waiting for some other task which is in turn waiting on a grace period.
> > But I don't know this code, so could easily be confused.
> >
>
> Normally, we need a synchronize_rcu() call, but I feel it's a bit more than
> really needed here.
>
> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms

That sounds like the right range, depending on what else is happening
on the machine at the time.

The synchronize_rcu_expedited() primitive would run in the 10s-100s
of microseconds. It involves a pair of wakeups and a pair of context
switches on each CPU.
							Thanx, Paul

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 13:52 ` Paul E. McKenney
@ 2009-10-24 14:24 ` Eric Dumazet
  2009-10-24 14:46 ` Paul E. McKenney
  2009-10-24 23:49 ` Octavian Purdila
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-24 14:24 UTC (permalink / raw)
  To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

Paul E. McKenney wrote:
> On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
>>
>> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
>
> That sounds like the right range, depending on what else is happening
> on the machine at the time.
>
> The synchronize_rcu_expedited() primitive would run in the 10s-100s
> of microseconds. It involves a pair of wakeups and a pair of context
> switches on each CPU.
>

Hmm... I'll run some experiments Monday and post results, but it seems very
promising.

Do you think the "on_each_cpu(flush_backlog, dev, 1);"
we perform right before calling netdev_wait_allrefs() could be changed
somehow to speed up rcu callbacks ? Maybe we could avoid sending IPI twice to
cpus ?

Thanks

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 14:24 ` Eric Dumazet
@ 2009-10-24 14:46 ` Paul E. McKenney
  2009-10-24 23:49 ` Octavian Purdila
  1 sibling, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-24 14:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, Oct 24, 2009 at 04:24:27PM +0200, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> > On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> >>
> >> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> >
> > That sounds like the right range, depending on what else is happening
> > on the machine at the time.
> >
> > The synchronize_rcu_expedited() primitive would run in the 10s-100s
> > of microseconds. It involves a pair of wakeups and a pair of context
> > switches on each CPU.
>
> Hmm... I'll run some experiments Monday and post results, but it seems very
> promising.

I should hasten to add that synchronize_rcu_expedited() goes fast for
TREE_RCU but not yet for TREE_PREEMPT_RCU (where it maps safely but
slowly to synchronize_rcu()).

> Do you think the "on_each_cpu(flush_backlog, dev, 1);"
> we perform right before calling netdev_wait_allrefs() could be changed
> somehow to speed up rcu callbacks ? Maybe we could avoid sending IPI twice to
> cpus ?

This is an interesting possibility, and might fit in with some of the
changes that I am thinking about to reduce OS jitter for the heavy-duty
numerical-computing guys.

In the meantime, you could try doing the following from flush_backlog():

	local_irq_save(flags);
	rcu_check_callbacks(smp_processor_id(), 0);
	local_irq_restore(flags);

This would emulate a much-faster HZ value, but only for RCU. This works
better in TREE_RCU than it does in TREE_PREEMPT_RCU at the moment (on
my todo list!). In older kernels, this should also work for CLASSIC_RCU.
Of course, in TINY_RCU, synchronize_rcu() is a no-op anyway. ;-)

And just to be clear, synchronize_rcu_expedited() currently just does
wakeups, not explicit IPIs.

							Thanx, Paul

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 14:24 ` Eric Dumazet
  2009-10-24 14:46 ` Paul E. McKenney
@ 2009-10-24 23:49 ` Octavian Purdila
  2009-10-25 4:47 ` Paul E. McKenney
  2009-10-25 8:35 ` Eric Dumazet
  1 sibling, 2 replies; 41+ messages in thread
From: Octavian Purdila @ 2009-10-24 23:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

On Saturday 24 October 2009 17:24:27 you wrote:
> Paul E. McKenney wrote:
> > On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> >> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> >
> > That sounds like the right range, depending on what else is happening
> > on the machine at the time.
> >
> > The synchronize_rcu_expedited() primitive would run in the 10s-100s
> > of microseconds. It involves a pair of wakeups and a pair of context
> > switches on each CPU.
>
> Hmm... I'll run some experiments Monday and post results, but it seems
> very promising.
>

Got some time today and did some experiments myself. The test is deleting 1000
dummy interfaces (interface status down, no IP/IPv6 addresses assigned) on a
UP non-preempt ppc750 @ 800 MHz system.

1. Ben's patch:

real    0m 3.42s
user    0m 0.00s
sys     0m 0.00s

2. Eric's schedule_timeout_uninterruptible(1);

real    0m 3.00s
user    0m 0.00s
sys     0m 0.00s

3. Simple synchronize_rcu_expedited()

This doesn't seem to work well with the UP non-preempt case since
synchronize_rcu_expedited() is a no-op in this case - turning
netdev_wait_allrefs() into a while(1) loop.

tavi

^ permalink raw reply [flat|nested] 41+ messages in thread
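The timings above translate into teardown rates with simple arithmetic. The sketch below is illustrative only: `max_rate_per_second()` and `measured_rate()` are hypothetical helper names, and the one-jiffy bound assumes HZ=1000 (1 ms per jiffy), which is confirmed for this platform later in the thread.

```c
#include <assert.h>

/*
 * Upper bound on devices torn down per second when each teardown
 * sleeps at least once for poll_ms milliseconds.
 */
static double max_rate_per_second(double poll_ms)
{
    assert(poll_ms > 0.0);
    return 1000.0 / poll_ms;
}

/* Measured throughput: interfaces deleted per second of wall-clock time. */
static double measured_rate(unsigned int interfaces, double real_seconds)
{
    assert(real_seconds > 0.0);
    return interfaces / real_seconds;
}
```

Under these assumptions, msleep(250) caps the rate at 4 devices per second, the figure reported at the start of the thread, while a one-jiffy sleep raises the cap to 1000 per second. The two measurements above work out to roughly 292 interfaces per second for 3.42 s and 333 per second for 3.00 s, so both runs sit well below the one-jiffy cap, which suggests (without proving) that other per-device costs dominate at that point.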
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 23:49 ` Octavian Purdila
@ 2009-10-25 4:47 ` Paul E. McKenney
  2009-10-25 8:35 ` Eric Dumazet
  1 sibling, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-25 4:47 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Eric Dumazet, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sun, Oct 25, 2009 at 02:49:00AM +0300, Octavian Purdila wrote:
> On Saturday 24 October 2009 17:24:27 you wrote:
> > Paul E. McKenney wrote:
> > > On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> > >> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> > >
> > > That sounds like the right range, depending on what else is happening
> > > on the machine at the time.
> > >
> > > The synchronize_rcu_expedited() primitive would run in the 10s-100s
> > > of microseconds. It involves a pair of wakeups and a pair of context
> > > switches on each CPU.
> >
> > Hmm... I'll run some experiments Monday and post results, but it seems
> > very promising.
> >
>
> Got some time today and did some experiments myself. The test is deleting 1000
> dummy interfaces (interface status down, no IP/IPv6 addresses assigned) on a
> UP non-preempt ppc750 @ 800 MHz system.
>
> 1. Ben's patch:
>
> real    0m 3.42s
> user    0m 0.00s
> sys     0m 0.00s
>
> 2. Eric's schedule_timeout_uninterruptible(1);
>
> real    0m 3.00s
> user    0m 0.00s
> sys     0m 0.00s
>
> 3. Simple synchronize_rcu_expedited()
>
> This doesn't seem to work well with the UP non-preempt case since
> synchronize_rcu_expedited() is a no-op in this case - turning
> netdev_wait_allrefs() into a while(1) loop.

Indeed -- but then again, in the UP case, synchronize_rcu() itself is
pretty much a no-op. So if your main target is UP, you should be able
to have seriously fast RCU updates. (I know, I know, you want SMP to
run fast as well...)

							Thanx, Paul

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 23:49 ` Octavian Purdila
  2009-10-25 4:47 ` Paul E. McKenney
@ 2009-10-25 8:35 ` Eric Dumazet
  2009-10-25 15:19 ` Octavian Purdila
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-10-25 8:35 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

Octavian Purdila wrote:
>
> Got some time today and did some experiments myself. The test is deleting 1000
> dummy interfaces (interface status down, no IP/IPv6 addresses assigned) on a
> UP non-preempt ppc750 @ 800 MHz system.
>
> 1. Ben's patch:
>
> real    0m 3.42s
> user    0m 0.00s
> sys     0m 0.00s
>
> 2. Eric's schedule_timeout_uninterruptible(1);
>
> real    0m 3.00s
> user    0m 0.00s
> sys     0m 0.00s
>
> 3. Simple synchronize_rcu_expedited()
>
> This doesn't seem to work well with the UP non-preempt case since
> synchronize_rcu_expedited() is a no-op in this case - turning
> netdev_wait_allrefs() into a while(1) loop.
>

Thanks for these numbers. I presume the HZ value is 1000 on this platform ?

Could you give us your scripts so that we can use the same "benchmark" ?

BTW, I found I could not use IPV6 with many devices on x86_32, because of the
huge per_cpu allocations (on IPV6, each device has percpu SNMP counters)

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-25 8:35 ` Eric Dumazet
@ 2009-10-25 15:19 ` Octavian Purdila
  2009-10-25 19:28 ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Octavian Purdila @ 2009-10-25 15:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

[-- Attachment #1: Type: Text/Plain, Size: 1468 bytes --]

On Sunday 25 October 2009 10:35:10 you wrote:

> > Got some time today and did some experiments myself. The test is deleting
> > 1000 dummy interfaces (interface status down, no IP/IPv6 addresses
> > assigned) on a UP non-preempt ppc750 @ 800 MHz system.
> >
> > 1. Ben's patch:
> >
> > real    0m 3.42s
> > user    0m 0.00s
> > sys     0m 0.00s
> >
> > 2. Eric's schedule_timeout_uninterruptible(1);
> >
> > real    0m 3.00s
> > user    0m 0.00s
> > sys     0m 0.00s
> >
> > 3. Simple synchronize_rcu_expedited()
> >
> > This doesn't seem to work well with the UP non-preempt case since
> > synchronize_rcu_expedited() is a no-op in this case - turning
> > netdev_wait_allrefs() into a while(1) loop.
>
> Thanks for these numbers. I presume the HZ value is 1000 on this platform ?
>

Yes. I've attached the full config to this email as well.

> Could you give us your scripts so that we can use the same "benchmark" ?
>

Sure, I've attached the hack module code I've used.

For creating interfaces:

echo 1000 > /proc/sys/net/ndst/add

For deleting interfaces:

echo start_ifindex stop_ifindex > /proc/sys/net/ndst/del

Some more information:

- on our old and optimized kernel I am getting 0.4s for creating 128000
interfaces and 0.57s for deleting them

- the 2.6.31 kernel on which I got the 3s numbers does have some patches to
speed up interface creation and deletion (removal of per device sysctl and
dev_snmp6 entries)

I'll start posting the patches we have as RFC.
Thanks,
tavi

[-- Attachment #2: ndst.c --]
[-- Type: text/x-csrc, Size: 4494 bytes --]

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/if_arp.h>
#include <linux/inetdevice.h>
#include <linux/rtnetlink.h>
#include <linux/ip.h>
#include <net/route.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/version.h>
#include <net/ip.h>
#include <net/flow.h>
#include <net/ipv6.h>
#include <linux/netfilter_ipv6.h>
#include <net/ip6_route.h>
#include <net/addrconf.h>
#include <linux/version.h>

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,24)
#define __INIT_NET(x) x
#else
#include <net/net_namespace.h>
#define __INIT_NET(x) &init_net,x
#endif

#ifndef CONFIG_IXIA_CONSOLE
static inline void netdev_set_header_ops(struct net_device *dev,
					 const struct header_ops *hops)
{
	dev->header_ops = hops;
}

static inline void netdev_set_ops(struct net_device *dev,
				  const struct net_device_ops *ops)
{
	dev->netdev_ops = ops;
}
#endif

static struct net_device_ops ndst_ops = {
};

int ndst_add(int n)
{
	int err, i;
	struct net_device * dev = NULL;
	char name[IFNAMSIZ];

	for(i = 0; i < n; i++) {
		/* temporary hack until we fix __dev_alloc_name - it is O(n) ! */
		rtnl_lock();
		do {
			static unsigned counter = 1;
			snprintf(name, IFNAMSIZ, "ixtest%d", counter++);
		} while(__dev_get_by_name(__INIT_NET(name)));
		rtnl_unlock();

		dev = alloc_netdev(0, name, ether_setup);
		if (dev == NULL) {
			err = -ENOMEM;
			goto err;
		}

		netdev_set_ops(dev, &ndst_ops);
		netdev_set_header_ops(dev, NULL);

		err = register_netdev(dev);
		if (err)
			goto err;
	}

	return 0;

	// Error handling.
err:
	if (dev)
		free_netdev(dev);
	module_put(THIS_MODULE);
	printk(KERN_ERR "%s: failed to register netdev: %d\n", __func__, err);
	return err;
}

int ndst_del(int start, int stop)
{
	struct net_device *dev;
	int i;

	for(i = start; i <= stop; i++) {
		rtnl_lock();
		dev = __dev_get_by_index(__INIT_NET(i));
		if (!dev) {
			rtnl_unlock();
			return -EINVAL;
		}
		unregister_netdevice(dev);
		rtnl_unlock();
		free_netdev(dev);
	}

	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
static int proc_do_add(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp)
#else
static int proc_do_add(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp, loff_t *ppos)
#endif
{
	int ret = 0;
	uint32_t data;

	ctl->data = &data;

	if (write) {
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		ret = proc_dointvec(ctl, write, filp, buffer, lenp);
#else
		ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
#endif
		ndst_add(data);
	}

	return ret;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
static int proc_do_del(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp)
#else
static int proc_do_del(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp, loff_t *ppos)
#endif
{
	int ret = 0;
	uint32_t data[2];

	ctl->data = data;

	if (write) {
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		ret = proc_dointvec(ctl, write, filp, buffer, lenp);
#else
		ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
#endif
		ndst_del(data[0], data[1]);
	}

	return ret;
}

static ctl_table ndst_sysctl_table[] = {
	{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		.ctl_name = 1,
#endif
		.procname = "add",
		.maxlen = sizeof(uint32_t),
		.mode = 0200,
		.proc_handler = &proc_do_add,
	},
	{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		.ctl_name = 2,
#endif
		.procname = "del",
		.maxlen = 2*sizeof(uint32_t),
		.mode = 0200,
		.proc_handler = &proc_do_del,
	},
	{}
};

static ctl_table ndst_sysctl_net_table[] = {
	{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		.ctl_name = 1024,
#endif
		.procname = "ndst",
		.data = NULL,
		.maxlen = 0,
		.mode = 0555,
		.child = ndst_sysctl_table
	},
	{}
};

static ctl_table ndst_sysctl_root[] = {
	{
		.ctl_name = CTL_NET,
		.procname = "net",
		.data = NULL,
		.maxlen = 0,
		.mode = 0555,
		.child = ndst_sysctl_net_table
	},
	{}
};

static struct ctl_table_header *ndst_sysctl_hdr;

int ndst_init(void)
{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
	ndst_sysctl_hdr = register_sysctl_table(ndst_sysctl_root, 0);
#else
	ndst_sysctl_hdr = register_sysctl_table(ndst_sysctl_root);
#endif
	if (ndst_sysctl_hdr == NULL)
		return -EINVAL;

	return 0;
}

void ndst_cleanup(void)
{
	unregister_sysctl_table(ndst_sysctl_hdr);
}

module_init(ndst_init);
module_exit(ndst_cleanup);

[-- Attachment #3: .config --]
[-- Type: text/plain, Size: 22178 bytes --]

# # Automatically generated make config: don't edit # Linux kernel version: 2.6.31 # Sat Oct 24 20:54:34 2009 # # CONFIG_PPC64 is not set # # Processor support # CONFIG_PPC_BOOK3S_32=y # CONFIG_PPC_85xx is not set # CONFIG_PPC_8xx is not set # CONFIG_40x is not set # CONFIG_44x is not set # CONFIG_E200 is not set CONFIG_PPC_BOOK3S=y CONFIG_6xx=y CONFIG_PPC_FPU=y # CONFIG_ALTIVEC is not set CONFIG_PPC_STD_MMU=y CONFIG_PPC_STD_MMU_32=y # CONFIG_PPC_MM_SLICES is not set CONFIG_PPC_HAVE_PMU_SUPPORT=y # CONFIG_SMP is not set CONFIG_PPC32=y CONFIG_WORD_SIZE=32 # CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set CONFIG_MMU=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y # CONFIG_HAVE_SETUP_PER_CPU_AREA is not set CONFIG_IRQ_PER_CPU=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_ARCH_HAS_ILOG2_U32=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_FIND_NEXT_BIT=y # CONFIG_ARCH_NO_VIRT_TO_BUS is not set CONFIG_PPC=y
CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_NVRAM=y CONFIG_SCHED_OMIT_FRAME_POINTER=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_PPC_OF=y CONFIG_OF=y # CONFIG_PPC_UDBG_16550 is not set # CONFIG_GENERIC_TBSYNC is not set CONFIG_AUDIT_ARCH=y CONFIG_GENERIC_BUG=y CONFIG_DTC=y # CONFIG_DEFAULT_UIMAGE is not set # CONFIG_PPC_DCR_NATIVE is not set # CONFIG_PPC_DCR_MMIO is not set CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_CONSTRUCTORS=y # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y # CONFIG_POSIX_MQUEUE is not set # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_AUDIT is not set # # RCU Subsystem # CONFIG_CLASSIC_RCU=y # CONFIG_TREE_RCU is not set # CONFIG_PREEMPT_RCU is not set # CONFIG_TREE_RCU_TRACE is not set # CONFIG_PREEMPT_RCU_TRACE is not set # CONFIG_IKCONFIG is not set CONFIG_LOG_BUF_SHIFT=17 # CONFIG_GROUP_SCHED is not set # CONFIG_CGROUPS is not set # CONFIG_RELAY is not set # CONFIG_NAMESPACES is not set CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_RD_GZIP=y # CONFIG_RD_BZIP2 is not set # CONFIG_RD_LZMA is not set # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y CONFIG_ANON_INODES=y CONFIG_EMBEDDED=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set # CONFIG_HOTPLUG is not set CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y # CONFIG_SIGNALFD is not set CONFIG_TIMERFD=y # CONFIG_EVENTFD is not set # CONFIG_SHMEM is not set # CONFIG_AIO is not set CONFIG_HAVE_PERF_COUNTERS=y # # Performance Counters # # CONFIG_PERF_COUNTERS is not set CONFIG_VM_EVENT_COUNTERS=y # CONFIG_PCI_QUIRKS is not set # CONFIG_STRIP_ASM_SYMS is not set CONFIG_COMPAT_BRK=y CONFIG_SLAB=y # CONFIG_SLUB is not set # CONFIG_SLOB is not set 
CONFIG_PROFILING=y CONFIG_TRACEPOINTS=y CONFIG_MARKERS=y CONFIG_OPROFILE=y CONFIG_HAVE_OPROFILE=y # CONFIG_KPROBES is not set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y CONFIG_HAVE_IOREMAP_PROT=y CONFIG_HAVE_KPROBES=y CONFIG_HAVE_KRETPROBES=y CONFIG_HAVE_ARCH_TRACEHOOK=y # # GCOV-based kernel profiling # # CONFIG_GCOV_KERNEL is not set # CONFIG_SLOW_WORK is not set # CONFIG_HAVE_GENERIC_DMA_COHERENT is not set CONFIG_SLABINFO=y CONFIG_RT_MUTEXES=y CONFIG_BASE_SMALL=0 CONFIG_MODULES=y # CONFIG_MODULE_FORCE_LOAD is not set CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set # CONFIG_BLOCK is not set # CONFIG_FREEZER is not set # # Platform support # # CONFIG_PPC_CHRP is not set # CONFIG_MPC5121_ADS is not set # CONFIG_MPC5121_GENERIC is not set # CONFIG_PPC_MPC52xx is not set # CONFIG_PPC_PMAC is not set # CONFIG_PPC_CELL is not set # CONFIG_PPC_CELL_NATIVE is not set # CONFIG_PPC_82xx is not set # CONFIG_PQ2ADS is not set # CONFIG_PPC_83xx is not set # CONFIG_PPC_86xx is not set # CONFIG_EMBEDDED6xx is not set # CONFIG_AMIGAONE is not set CONFIG_PPC_IXIA=y # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set # CONFIG_IPIC is not set # CONFIG_MPIC is not set # CONFIG_MPIC_WEIRD is not set # CONFIG_PPC_I8259 is not set # CONFIG_PPC_RTAS is not set # CONFIG_MMIO_NVRAM is not set # CONFIG_PPC_MPC106 is not set # CONFIG_PPC_970_NAP is not set # CONFIG_PPC_INDIRECT_IO is not set # CONFIG_GENERIC_IOMAP is not set # CONFIG_CPU_FREQ is not set # CONFIG_TAU is not set # CONFIG_FSL_ULI1575 is not set # CONFIG_SIMPLE_GPIO is not set # # Kernel options # # CONFIG_HIGHMEM is not set CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_GENERIC_CLOCKEVENTS_BUILD=y # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 CONFIG_SCHED_HRTICK=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set 
CONFIG_BINFMT_ELF=y # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set # CONFIG_HAVE_AOUT is not set # CONFIG_BINFMT_MISC is not set # CONFIG_IOMMU_HELPER is not set # CONFIG_SWIOTLB is not set CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_ARCH_HAS_WALK_MEMORY=y CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y # CONFIG_KEXEC is not set # CONFIG_CRASH_DUMP is not set CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_PAGEFLAGS_EXTENDED=y CONFIG_SPLIT_PTLOCK_CPUS=4 # CONFIG_MIGRATION is not set # CONFIG_PHYS_ADDR_T_64BIT is not set CONFIG_ZONE_DMA_FLAG=1 CONFIG_VIRT_TO_BUS=y CONFIG_HAVE_MLOCK=y CONFIG_HAVE_MLOCKED_PAGE_BIT=y CONFIG_DEFAULT_MMAP_MIN_ADDR=4096 CONFIG_PPC_4K_PAGES=y # CONFIG_PPC_16K_PAGES is not set # CONFIG_PPC_64K_PAGES is not set # CONFIG_PPC_256K_PAGES is not set CONFIG_FORCE_MAX_ZONEORDER=11 # CONFIG_PROC_DEVICETREE is not set CONFIG_CMDLINE_BOOL=y CONFIG_CMDLINE="console=ttyS0 rootfstype=ramfs powersave=off" CONFIG_EXTRA_TARGETS="" # CONFIG_PM is not set # CONFIG_SECCOMP is not set CONFIG_ISA_DMA_API=y # # Bus options # CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y # CONFIG_PPC_INDIRECT_PCI is not set CONFIG_PCI=y CONFIG_PCI_DOMAINS=y CONFIG_PCI_SYSCALL=y # CONFIG_PCIEPORTBUS is not set CONFIG_ARCH_SUPPORTS_MSI=y # CONFIG_PCI_MSI is not set # CONFIG_PCI_LEGACY is not set # CONFIG_PCI_DEBUG is not set # CONFIG_PCI_STUB is not set # CONFIG_PCI_IOV is not set # CONFIG_HAS_RAPIDIO is not set # # Advanced setup # CONFIG_ADVANCED_OPTIONS=y CONFIG_LOWMEM_SIZE_BOOL=y CONFIG_LOWMEM_SIZE=0x70000000 CONFIG_PAGE_OFFSET_BOOL=y CONFIG_PAGE_OFFSET=0x80000000 CONFIG_KERNEL_START_BOOL=y CONFIG_KERNEL_START=0x80000000 CONFIG_PHYSICAL_START=0x00000000 CONFIG_TASK_SIZE_BOOL=y CONFIG_TASK_SIZE=0x70000000 CONFIG_NET=y # # Networking options # # CONFIG_NET_SYSCTL_DEV is not set CONFIG_PACKET=y 
CONFIG_PACKET_MMAP=y CONFIG_UNIX=y CONFIG_XFRM=y CONFIG_XFRM_USER=m # CONFIG_XFRM_SUB_POLICY is not set # CONFIG_XFRM_MIGRATE is not set # CONFIG_XFRM_STATISTICS is not set CONFIG_XFRM_IPCOMP=m CONFIG_NET_KEY=m # CONFIG_NET_KEY_MIGRATE is not set CONFIG_INET=y CONFIG_IP_MULTICAST=y CONFIG_IP_ADVANCED_ROUTER=y CONFIG_ASK_IP_FIB_HASH=y # CONFIG_IP_FIB_TRIE is not set CONFIG_IP_FIB_HASH=y CONFIG_IP_MULTIPLE_TABLES=y CONFIG_IP_IXIA_ROUTING=y # CONFIG_IP_ROUTE_MULTIPATH is not set # CONFIG_IP_ROUTE_VERBOSE is not set # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_IP_MROUTE is not set # CONFIG_ARPD is not set # CONFIG_SYN_COOKIES is not set CONFIG_INET_AH=m CONFIG_INET_ESP=m CONFIG_INET_IPCOMP=m CONFIG_INET_XFRM_TUNNEL=m CONFIG_INET_TUNNEL=y CONFIG_INET_XFRM_MODE_TRANSPORT=y CONFIG_INET_XFRM_MODE_TUNNEL=y CONFIG_INET_XFRM_MODE_BEET=y # CONFIG_INET_LRO is not set CONFIG_INET_DIAG=y CONFIG_INET_TCP_DIAG=y # CONFIG_TCP_CONG_ADVANCED is not set CONFIG_TCP_CONG_CUBIC=y CONFIG_DEFAULT_TCP_CONG="cubic" # CONFIG_TCP_MD5SIG is not set CONFIG_IPV6=y # CONFIG_IPV6_PRIVACY is not set # CONFIG_IPV6_ROUTER_PREF is not set # CONFIG_IPV6_OPTIMISTIC_DAD is not set CONFIG_INET6_AH=m CONFIG_INET6_ESP=m # CONFIG_INET6_IPCOMP is not set # CONFIG_IPV6_MIP6 is not set # CONFIG_INET6_XFRM_TUNNEL is not set # CONFIG_INET6_TUNNEL is not set CONFIG_INET6_XFRM_MODE_TRANSPORT=y CONFIG_INET6_XFRM_MODE_TUNNEL=y CONFIG_INET6_XFRM_MODE_BEET=y # CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set CONFIG_IPV6_SIT=y CONFIG_IPV6_NDISC_NODETYPE=y # CONFIG_IPV6_TUNNEL is not set # CONFIG_IPV6_MULTIPLE_TABLES is not set # CONFIG_IPV6_MROUTE is not set # CONFIG_NETWORK_SECMARK is not set CONFIG_NETFILTER=y # CONFIG_NETFILTER_DEBUG is not set # CONFIG_NETFILTER_ADVANCED is not set # # Core Netfilter Configuration # # CONFIG_NETFILTER_NETLINK_LOG is not set # CONFIG_NF_CONNTRACK is not set CONFIG_NETFILTER_XTABLES=m # CONFIG_NETFILTER_XT_TARGET_MARK is not set # 
CONFIG_NETFILTER_XT_TARGET_NFLOG is not set # CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set # CONFIG_NETFILTER_XT_MATCH_MARK is not set # CONFIG_NETFILTER_XT_MATCH_POLICY is not set # CONFIG_IP_VS is not set # # IP: Netfilter Configuration # # CONFIG_NF_DEFRAG_IPV4 is not set CONFIG_IP_NF_IPTABLES=m CONFIG_IP_NF_FILTER=m # CONFIG_IP_NF_TARGET_REJECT is not set # CONFIG_IP_NF_TARGET_LOG is not set # CONFIG_IP_NF_TARGET_ULOG is not set # CONFIG_IP_NF_MANGLE is not set # # IPv6: Netfilter Configuration # CONFIG_IP6_NF_IPTABLES=m # CONFIG_IP6_NF_MATCH_IPV6HEADER is not set # CONFIG_IP6_NF_TARGET_LOG is not set CONFIG_IP6_NF_FILTER=m # CONFIG_IP6_NF_TARGET_REJECT is not set # CONFIG_IP6_NF_MANGLE is not set # CONFIG_IP_DCCP is not set # CONFIG_IP_SCTP is not set # CONFIG_TIPC is not set # CONFIG_ATM is not set # CONFIG_BRIDGE is not set # CONFIG_NET_DSA is not set # CONFIG_VLAN_8021Q is not set # CONFIG_DECNET is not set CONFIG_LLC=y CONFIG_LLC2=y # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # CONFIG_PHONET is not set # CONFIG_IEEE802154 is not set CONFIG_NET_SCHED=y # # Queueing/Scheduling # # CONFIG_NET_SCH_CBQ is not set # CONFIG_NET_SCH_HTB is not set # CONFIG_NET_SCH_HFSC is not set # CONFIG_NET_SCH_PRIO is not set # CONFIG_NET_SCH_MULTIQ is not set # CONFIG_NET_SCH_RED is not set # CONFIG_NET_SCH_SFQ is not set # CONFIG_NET_SCH_TEQL is not set CONFIG_NET_SCH_TBF=m # CONFIG_NET_SCH_GRED is not set # CONFIG_NET_SCH_DSMARK is not set # CONFIG_NET_SCH_NETEM is not set # CONFIG_NET_SCH_DRR is not set CONFIG_NET_SCH_INGRESS=m # # Classification # CONFIG_NET_CLS=y # CONFIG_NET_CLS_BASIC is not set CONFIG_NET_CLS_TCINDEX=m CONFIG_NET_CLS_ROUTE4=m CONFIG_NET_CLS_ROUTE=y CONFIG_NET_CLS_FW=m CONFIG_NET_CLS_U32=m # CONFIG_CLS_U32_PERF is not set # CONFIG_CLS_U32_MARK is not set # CONFIG_NET_CLS_RSVP is not set # CONFIG_NET_CLS_RSVP6 is not set # CONFIG_NET_CLS_FLOW 
is not set # CONFIG_NET_EMATCH is not set CONFIG_NET_CLS_ACT=y CONFIG_NET_ACT_POLICE=y # CONFIG_NET_ACT_GACT is not set # CONFIG_NET_ACT_MIRRED is not set # CONFIG_NET_ACT_IPT is not set # CONFIG_NET_ACT_NAT is not set # CONFIG_NET_ACT_PEDIT is not set # CONFIG_NET_ACT_SIMP is not set # CONFIG_NET_ACT_SKBEDIT is not set # CONFIG_NET_CLS_IND is not set CONFIG_NET_SCH_FIFO=y # CONFIG_DCB is not set # # Network testing # # CONFIG_NET_PKTGEN is not set # CONFIG_NET_DROP_MONITOR is not set CONFIG_NET_TXTIMESTAMP=y # CONFIG_HAMRADIO is not set # CONFIG_CAN is not set # CONFIG_IRDA is not set # CONFIG_BT is not set # CONFIG_AF_RXRPC is not set CONFIG_FIB_RULES=y # CONFIG_WIRELESS is not set # CONFIG_WIMAX is not set # CONFIG_RFKILL is not set # CONFIG_NET_9P is not set # # Device Drivers # # # Generic Driver Options # CONFIG_STANDALONE=y CONFIG_PREVENT_FIRMWARE_BUILD=y # CONFIG_DEBUG_DRIVER is not set # CONFIG_DEBUG_DEVRES is not set # CONFIG_SYS_HYPERVISOR is not set # CONFIG_CONNECTOR is not set # CONFIG_MTD is not set CONFIG_OF_DEVICE=y # CONFIG_PARPORT is not set # CONFIG_MISC_DEVICES is not set CONFIG_HAVE_IDE=y # # SCSI device support # # CONFIG_SCSI_DMA is not set # CONFIG_SCSI_NETLINK is not set # CONFIG_FUSION is not set # # IEEE 1394 (FireWire) support # # # You can enable one or both FireWire driver stacks. # # # See the help texts for more information. 
# # CONFIG_FIREWIRE is not set # CONFIG_IEEE1394 is not set # CONFIG_I2O is not set # CONFIG_MACINTOSH_DRIVERS is not set CONFIG_NETDEVICES=y # CONFIG_IFB is not set # CONFIG_DUMMY is not set # CONFIG_BONDING is not set # CONFIG_MACVLAN is not set # CONFIG_EQUALIZER is not set # CONFIG_TUN is not set # CONFIG_VETH is not set # CONFIG_ARCNET is not set # CONFIG_NET_ETHERNET is not set # CONFIG_NETDEV_1000 is not set # CONFIG_NETDEV_10000 is not set # CONFIG_TR is not set # # Wireless LAN # # CONFIG_WLAN_PRE80211 is not set # CONFIG_WLAN_80211 is not set # # Enable WiMAX (Networking options) to see the WiMAX drivers # # CONFIG_WAN is not set # CONFIG_FDDI is not set # CONFIG_HIPPI is not set # CONFIG_PPP is not set # CONFIG_SLIP is not set # CONFIG_NETCONSOLE is not set # CONFIG_NETPOLL is not set # CONFIG_NET_POLL_CONTROLLER is not set # CONFIG_ISDN is not set # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # CONFIG_INPUT_FF_MEMLESS is not set # CONFIG_INPUT_POLLDEV is not set # # Userland interfaces # # CONFIG_INPUT_MOUSEDEV is not set # CONFIG_INPUT_JOYDEV is not set # CONFIG_INPUT_EVDEV is not set # CONFIG_INPUT_EVBUG is not set # # Input Device Drivers # # CONFIG_INPUT_KEYBOARD is not set # CONFIG_INPUT_MOUSE is not set # CONFIG_INPUT_JOYSTICK is not set # CONFIG_INPUT_TABLET is not set # CONFIG_INPUT_TOUCHSCREEN is not set # CONFIG_INPUT_MISC is not set # # Hardware I/O ports # # CONFIG_SERIO is not set # CONFIG_GAMEPORT is not set # # Character devices # # CONFIG_VT is not set CONFIG_DEVKMEM=y # CONFIG_SERIAL_NONSTANDARD is not set # CONFIG_NOZOMI is not set # # Serial drivers # # CONFIG_SERIAL_8250 is not set # # Non-8250 serial port support # # CONFIG_SERIAL_UARTLITE is not set # CONFIG_SERIAL_JSM is not set CONFIG_UNIX98_PTYS=y # CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set # CONFIG_LEGACY_PTYS is not set # CONFIG_HVC_UDBG is not set # CONFIG_IPMI_HANDLER is not set CONFIG_HW_RANDOM=m # CONFIG_HW_RANDOM_TIMERIOMEM is not set # 
CONFIG_NVRAM is not set # CONFIG_GEN_RTC is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # CONFIG_TCG_TPM is not set CONFIG_DEVPORT=y CONFIG_IXIA_CONSOLE=y # CONFIG_I2C is not set # CONFIG_SPI is not set # # PPS support # # CONFIG_PPS is not set CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y # CONFIG_GPIOLIB is not set # CONFIG_W1 is not set # CONFIG_POWER_SUPPLY is not set # CONFIG_HWMON is not set # CONFIG_THERMAL is not set # CONFIG_THERMAL_HWMON is not set # CONFIG_WATCHDOG is not set CONFIG_SSB_POSSIBLE=y # # Sonics Silicon Backplane # # CONFIG_SSB is not set # # Multifunction device drivers # # CONFIG_MFD_CORE is not set # CONFIG_MFD_SM501 is not set # CONFIG_HTC_PASIC3 is not set # CONFIG_MFD_TMIO is not set # CONFIG_REGULATOR is not set # CONFIG_MEDIA_SUPPORT is not set # # Graphics support # # CONFIG_AGP is not set # CONFIG_DRM is not set # CONFIG_VGASTATE is not set # CONFIG_VIDEO_OUTPUT_CONTROL is not set # CONFIG_FB is not set # CONFIG_BACKLIGHT_LCD_SUPPORT is not set # # Display device support # # CONFIG_DISPLAY_SUPPORT is not set # CONFIG_SOUND is not set # CONFIG_HID_SUPPORT is not set # CONFIG_USB_SUPPORT is not set # CONFIG_UWB is not set # CONFIG_MMC is not set # CONFIG_MEMSTICK is not set # CONFIG_NEW_LEDS is not set # CONFIG_ACCESSIBILITY is not set # CONFIG_INFINIBAND is not set # CONFIG_EDAC is not set # CONFIG_RTC_CLASS is not set # CONFIG_DMADEVICES is not set # CONFIG_AUXDISPLAY is not set # CONFIG_UIO is not set # # TI VLYNQ # # CONFIG_STAGING is not set # # File systems # CONFIG_FILE_LOCKING=y CONFIG_FSNOTIFY=y # CONFIG_DNOTIFY is not set # CONFIG_INOTIFY is not set CONFIG_INOTIFY_USER=y # CONFIG_QUOTA is not set # CONFIG_AUTOFS_FS is not set # CONFIG_AUTOFS4_FS is not set # CONFIG_FUSE_FS is not set # # Caches # # CONFIG_FSCACHE is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y # CONFIG_PROC_PAGE_MONITOR is not set # CONFIG_SYSFS is not set CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not 
set # CONFIG_HUGETLB_PAGE is not set # CONFIG_MISC_FILESYSTEMS is not set CONFIG_NETWORK_FILESYSTEMS=y CONFIG_NFS_FS=y CONFIG_NFS_V3=y # CONFIG_NFS_V3_ACL is not set # CONFIG_NFS_V4 is not set # CONFIG_NFSD is not set CONFIG_LOCKD=y CONFIG_LOCKD_V4=y CONFIG_NFS_COMMON=y CONFIG_SUNRPC=y # CONFIG_RPCSEC_GSS_KRB5 is not set # CONFIG_RPCSEC_GSS_SPKM3 is not set # CONFIG_SMB_FS is not set # CONFIG_CIFS is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # CONFIG_NLS is not set CONFIG_BINARY_PRINTF=y # # Library routines # CONFIG_GENERIC_FIND_LAST_BIT=y # CONFIG_CRC_CCITT is not set # CONFIG_CRC16 is not set # CONFIG_CRC_T10DIF is not set # CONFIG_CRC_ITU_T is not set # CONFIG_CRC32 is not set # CONFIG_CRC7 is not set # CONFIG_LIBCRC32C is not set CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y CONFIG_DECOMPRESS_GZIP=y CONFIG_HAS_IOMEM=y CONFIG_HAS_IOPORT=y CONFIG_HAS_DMA=y CONFIG_HAVE_LMB=y CONFIG_NLATTR=y CONFIG_GENERIC_ATOMIC64=y # # Kernel hacking # CONFIG_PRINTK_TIME=y CONFIG_ENABLE_WARN_DEPRECATED=y CONFIG_ENABLE_MUST_CHECK=y CONFIG_FRAME_WARN=1024 # CONFIG_MAGIC_SYSRQ is not set # CONFIG_UNUSED_SYMBOLS is not set CONFIG_DEBUG_FS=y # CONFIG_HEADERS_CHECK is not set CONFIG_DEBUG_KERNEL=y # CONFIG_DEBUG_SHIRQ is not set # CONFIG_DETECT_SOFTLOCKUP is not set # CONFIG_DETECT_HUNG_TASK is not set # CONFIG_SCHED_DEBUG is not set # CONFIG_SCHEDSTATS is not set # CONFIG_TIMER_STATS is not set # CONFIG_DEBUG_OBJECTS is not set # CONFIG_DEBUG_SLAB is not set # CONFIG_DEBUG_RT_MUTEXES is not set # CONFIG_RT_MUTEX_TESTER is not set # CONFIG_DEBUG_SPINLOCK is not set # CONFIG_DEBUG_MUTEXES is not set # CONFIG_DEBUG_LOCK_ALLOC is not set # CONFIG_PROVE_LOCKING is not set # CONFIG_LOCK_STAT is not set # CONFIG_DEBUG_SPINLOCK_SLEEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set CONFIG_STACKTRACE=y # CONFIG_DEBUG_KOBJECT is not set CONFIG_DEBUG_BUGVERBOSE=y CONFIG_DEBUG_INFO=y # CONFIG_DEBUG_VM is not set # CONFIG_DEBUG_WRITECOUNT is 
not set # CONFIG_DEBUG_MEMORY_INIT is not set # CONFIG_DEBUG_LIST is not set # CONFIG_DEBUG_SG is not set # CONFIG_DEBUG_NOTIFIERS is not set # CONFIG_RCU_TORTURE_TEST is not set # CONFIG_RCU_CPU_STALL_DETECTOR is not set # CONFIG_BACKTRACE_SELF_TEST is not set # CONFIG_FAULT_INJECTION is not set # CONFIG_LATENCYTOP is not set CONFIG_SYSCTL_SYSCALL_CHECK=y # CONFIG_DEBUG_PAGEALLOC is not set CONFIG_NOP_TRACER=y CONFIG_HAVE_FUNCTION_TRACER=y CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y CONFIG_HAVE_DYNAMIC_FTRACE=y CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y CONFIG_RING_BUFFER=y CONFIG_EVENT_TRACING=y CONFIG_CONTEXT_SWITCH_TRACER=y CONFIG_TRACING=y CONFIG_TRACING_SUPPORT=y # CONFIG_FTRACE is not set # CONFIG_DYNAMIC_DEBUG is not set # CONFIG_SAMPLES is not set CONFIG_HAVE_ARCH_KGDB=y # CONFIG_KGDB is not set # CONFIG_KMEMCHECK is not set # CONFIG_PPC_DISABLE_WERROR is not set CONFIG_PPC_WERROR=y CONFIG_PRINT_STACK_DEPTH=64 # CONFIG_DEBUG_STACKOVERFLOW is not set # CONFIG_DEBUG_STACK_USAGE is not set # CONFIG_PPC_EMULATED_STATS is not set # CONFIG_CODE_PATCHING_SELFTEST is not set # CONFIG_FTR_FIXUP_SELFTEST is not set # CONFIG_MSI_BITMAP_SELFTEST is not set # CONFIG_XMON is not set # CONFIG_IRQSTACKS is not set # CONFIG_VIRQ_DEBUG is not set # CONFIG_BDI_SWITCH is not set # CONFIG_BOOTX_TEXT is not set # CONFIG_PPC_EARLY_DEBUG is not set # # Security options # # CONFIG_KEYS is not set # CONFIG_SECURITYFS is not set # CONFIG_SECURITY_FILE_CAPABILITIES is not set CONFIG_CRYPTO=y # # Crypto core or helper # # CONFIG_CRYPTO_FIPS is not set CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_ALGAPI2=y CONFIG_CRYPTO_AEAD=m CONFIG_CRYPTO_AEAD2=y CONFIG_CRYPTO_BLKCIPHER=y CONFIG_CRYPTO_BLKCIPHER2=y CONFIG_CRYPTO_HASH=y CONFIG_CRYPTO_HASH2=y CONFIG_CRYPTO_RNG2=y CONFIG_CRYPTO_PCOMP=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_MANAGER2=y # CONFIG_CRYPTO_GF128MUL is not set # CONFIG_CRYPTO_NULL is not set CONFIG_CRYPTO_WORKQUEUE=y # CONFIG_CRYPTO_CRYPTD is not set CONFIG_CRYPTO_AUTHENC=m # CONFIG_CRYPTO_TEST is not 
set # # Authenticated Encryption with Associated Data # # CONFIG_CRYPTO_CCM is not set # CONFIG_CRYPTO_GCM is not set # CONFIG_CRYPTO_SEQIV is not set # # Block modes # CONFIG_CRYPTO_CBC=y # CONFIG_CRYPTO_CTR is not set # CONFIG_CRYPTO_CTS is not set # CONFIG_CRYPTO_ECB is not set # CONFIG_CRYPTO_LRW is not set # CONFIG_CRYPTO_PCBC is not set # CONFIG_CRYPTO_XTS is not set # # Hash modes # CONFIG_CRYPTO_HMAC=y # CONFIG_CRYPTO_XCBC is not set # # Digest # # CONFIG_CRYPTO_CRC32C is not set # CONFIG_CRYPTO_MD4 is not set CONFIG_CRYPTO_MD5=y # CONFIG_CRYPTO_MICHAEL_MIC is not set # CONFIG_CRYPTO_RMD128 is not set # CONFIG_CRYPTO_RMD160 is not set # CONFIG_CRYPTO_RMD256 is not set # CONFIG_CRYPTO_RMD320 is not set CONFIG_CRYPTO_SHA1=y # CONFIG_CRYPTO_SHA256 is not set # CONFIG_CRYPTO_SHA512 is not set # CONFIG_CRYPTO_TGR192 is not set # CONFIG_CRYPTO_WP512 is not set # # Ciphers # # CONFIG_CRYPTO_AES is not set # CONFIG_CRYPTO_ANUBIS is not set # CONFIG_CRYPTO_ARC4 is not set # CONFIG_CRYPTO_BLOWFISH is not set # CONFIG_CRYPTO_CAMELLIA is not set # CONFIG_CRYPTO_CAST5 is not set # CONFIG_CRYPTO_CAST6 is not set CONFIG_CRYPTO_DES=y # CONFIG_CRYPTO_FCRYPT is not set # CONFIG_CRYPTO_KHAZAD is not set # CONFIG_CRYPTO_SALSA20 is not set # CONFIG_CRYPTO_SEED is not set # CONFIG_CRYPTO_SERPENT is not set # CONFIG_CRYPTO_TEA is not set # CONFIG_CRYPTO_TWOFISH is not set # # Compression # CONFIG_CRYPTO_DEFLATE=y # CONFIG_CRYPTO_ZLIB is not set # CONFIG_CRYPTO_LZO is not set # # Random Number Generation # # CONFIG_CRYPTO_ANSI_CPRNG is not set # CONFIG_CRYPTO_HW is not set # CONFIG_PPC_CLOCK is not set # CONFIG_VIRTUALIZATION is not set ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-25 15:19 ` Octavian Purdila
@ 2009-10-25 19:28 ` Eric Dumazet
  0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-25 19:28 UTC
To: Octavian Purdila; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

Octavian Purdila wrote:
> On Sunday 25 October 2009 10:35:10 you wrote:
>>> Got some time today and did some experiments myself. The test is deleting
>>> 1000 dummy interfaces (interface status down, no IP/IPv6 addresses
>>> assigned) on a UP non-preempt ppc750 @ 800 MHz system.
>>>
>>> 1. Ben's patch:
>>>
>>> real 0m 3.42s
>>> user 0m 0.00s
>>> sys  0m 0.00s
>>>
>>> 2. Eric's schedule_timeout_uninterruptible(1);
>>>
>>> real 0m 3.00s
>>> user 0m 0.00s
>>> sys  0m 0.00s
>>>
>>> 3. Simple synchronize_rcu_expedited()
>>>
>>> This doesn't seem to work well with the UP non-preempt case since
>>> synchronize_rcu_expedited() is a noop in this case - turning
>>> netdev_wait_allrefs() into a while(1) loop.
>> Thanks for these numbers. I presume the HZ value is 1000 on this platform?
>>
>
> Yes. I've attached the full config to this email as well.
>
>> Could you give us your scripts so that we can use the same "benchmark"?
>>
>
> Sure, I've attached the hack module code I've used.
>
> For creating interfaces: echo 1000 > /proc/sys/net/ndst/add
> For deleting interfaces: echo start_ifindex stop_ifindex > /proc/sys/net/ndst/del
>
> Some more information:
>
> - on our old and optimized kernel I am getting 0.4s for creating 128000
> interfaces and 0.57s for deleting them
>
> - the 2.6.31 kernel on which I got the 3s numbers does have some patches to
> speed up interface creation and deletion (removal of per device sysctl and
> dev_snmp6 entries)
>
> I'll start posting the patches we have as RFC.
>

OK thanks, I thought you were using the dummy module.

$ time insmod drivers/net/dummy.ko numdummies=100

real 0m2.493s
user 0m0.001s
sys  0m0.021s

$ time rmmod dummy

real 0m1.610s
user 0m0.000s
sys  0m0.001s

$ time insmod drivers/net/dummy.ko numdummies=200

real 0m10.118s
user 0m0.000s
sys  0m0.015s

$ time rmmod dummy

real 0m3.218s
user 0m0.000s
sys  0m0.001s

$ time insmod drivers/net/dummy.ko numdummies=300

real 0m22.564s
user 0m0.000s
sys  0m0.034s

$ time rmmod dummy

real 0m4.755s
user 0m0.000s
sys  0m0.006s

$ perf record -f insmod drivers/net/dummy.ko numdummies=300
$ perf report
# Samples: 898
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
    41.65%  insmod   [kernel]       [k] __register_sysctl_paths
    22.83%  insmod   [kernel]       [k] strcmp
     5.46%  insmod   [kernel]       [k] pcpu_alloc
     2.23%  insmod   [kernel]       [k] sysfs_find_dirent
     1.56%  insmod   [kernel]       [k] __sysfs_add_one
     1.11%  insmod   [kernel]       [k] pcpu_alloc_area
     1.11%  insmod   [kernel]       [k] _spin_lock
     1.00%  insmod   [kernel]       [k] kmemdup
     1.00%  insmod   [kernel]       [k] kmem_cache_alloc
     0.67%  insmod   [kernel]       [k] find_symbol_in_section
     0.67%  insmod   [kernel]       [k] find_next_zero_bit
     0.67%  insmod   [kernel]       [k] idr_get_empty_slot
     0.67%  insmod   [kernel]       [k] mutex_lock
     0.67%  insmod   [kernel]       [k] mutex_unlock
     0.56%  insmod   [kernel]       [k] vunmap_page_range
     0.56%  insmod   [kernel]       [k] __slab_alloc
* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24  4:35 ` Eric Dumazet
  2009-10-24  5:49 ` Paul E. McKenney
@ 2009-10-24 20:22 ` Stephen Hemminger
  1 sibling, 0 replies; 41+ messages in thread
From: Stephen Hemminger @ 2009-10-24 20:22 UTC
To: Eric Dumazet
Cc: paulmck, Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, 24 Oct 2009 06:35:53 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Paul E. McKenney wrote:
> > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> >> [PATCH] net: allow netdev_wait_allrefs() to run faster
> >>
> >> netdev_wait_allrefs() waits until all references to a device vanish.
> >>
> >> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> >> Some users report that no more than 4 devices can be dismantled per second,
> >> this is a pretty serious problem for extreme setups.
> >>
> >> Most likely, references only wait for a rcu grace period that should come
> >> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
> >
> > Is this a place where synchronize_rcu_expedited() is appropriate?
> > (It went in to 2.6.32-rc1.)
> >
>
> Thanks for the tip Paul
>
> I believe netdev_wait_allrefs() is not a perfect candidate, because
> synchronize_sched_expedited() seems really expensive.
>
> Maybe we could call it only once, if we already had to go through
> the one-jiffy delay one time?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa88dcd..9b04b9a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
>  static void netdev_wait_allrefs(struct net_device *dev)
>  {
>  	unsigned long rebroadcast_time, warning_time;
> +	unsigned int count = 0;
>
>  	rebroadcast_time = warning_time = jiffies;
>  	while (atomic_read(&dev->refcnt) != 0) {
> @@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  			rebroadcast_time = jiffies;
>  		}
>
> -		msleep(250);
> +		if (count++ == 1)
> +			synchronize_rcu_expedited();
> +		else
> +			schedule_timeout_uninterruptible(1);
>
>  		if (time_after(jiffies, warning_time + 10 * HZ)) {
>  			printk(KERN_EMERG "unregister_netdevice: "

Actually, anything that requires more than one pass through the loop is
broken. Devices and protocols should be cleaning up on the first notifier.

The worst offender seems to be the dst cache gc code.
end of thread, other threads:[~2010-08-09 21:17 UTC | newest]

Thread overview: 41+ messages
2009-10-17 22:18 [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second Benjamin LaHaise
2009-10-18  4:26 ` Eric Dumazet
2009-10-18 16:13 ` Benjamin LaHaise
2009-10-18 17:51 ` Eric Dumazet
2009-10-18 18:21 ` Benjamin LaHaise
2009-10-18 19:36 ` Eric Dumazet
2009-10-21 12:39 ` Octavian Purdila
2009-10-21 15:40 ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
2009-10-21 16:09 ` Eric Dumazet
2009-10-21 16:51 ` Benjamin LaHaise
2009-10-21 19:54 ` Eric Dumazet
2009-10-29 23:07 ` Eric W. Biederman
2009-10-29 23:38 ` Benjamin LaHaise
2009-10-30  1:45 ` Eric W. Biederman
2009-10-30 14:35 ` Benjamin LaHaise
2009-10-30 14:43 ` Eric Dumazet
2009-10-30 23:25 ` Eric W. Biederman
2009-10-30 23:53 ` Benjamin LaHaise
2009-10-31  0:37 ` Eric W. Biederman
2010-08-09 17:23 ` Ben Greear
2010-08-09 17:34 ` Benjamin LaHaise
2010-08-09 17:44 ` Ben Greear
2010-08-09 17:48 ` Benjamin LaHaise
2010-08-09 18:03 ` Ben Greear
2010-08-09 19:59 ` Eric W. Biederman
2010-08-09 21:03 ` Benjamin LaHaise
2010-08-09 21:17 ` Eric W. Biederman
2009-10-21 16:55 ` Octavian Purdila
2009-10-23 21:13 ` Paul E. McKenney
2009-10-24  4:35 ` Eric Dumazet
2009-10-24  5:49 ` Paul E. McKenney
2009-10-24  8:49 ` Eric Dumazet
2009-10-24 13:52 ` Paul E. McKenney
2009-10-24 14:24 ` Eric Dumazet
2009-10-24 14:46 ` Paul E. McKenney
2009-10-24 23:49 ` Octavian Purdila
2009-10-25  4:47 ` Paul E. McKenney
2009-10-25  8:35 ` Eric Dumazet
2009-10-25 15:19 ` Octavian Purdila
2009-10-25 19:28 ` Eric Dumazet
2009-10-24 20:22 ` Stephen Hemminger