[PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second

netdev.vger.kernel.org archive mirror

* [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
@ 2009-10-17 22:18 Benjamin LaHaise
  2009-10-18  4:26 ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-17 22:18 UTC (permalink / raw)
  To: netdev

Hi folks,

Below is a patch that changes the interaction between netdev_wait_allrefs() 
and dev_put() to replace an msleep(250) with a wait_event() on the final 
dev_put().  This change reduces the time spent sleeping during an 
unregister_netdev(), causing the system to go from <1% CPU time to something 
more CPU bound (50+% in a test vm).  This increases the speed of a bulk 
unregister_netdev() from 4 interfaces per second to over 500 per second on 
my test system.  The requirement comes from handling thousands of L2TP 
sessions where a tunnel flap results in all interfaces being torn down at 
one time.

Note that there is still more work to be done in this area.

		-ben

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 812a5f3..e20d4a4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1514,10 +1514,7 @@ extern void netdev_run_todo(void);
  *
  * Release reference to device to allow it to be freed.
  */
-static inline void dev_put(struct net_device *dev)
-{
-	atomic_dec(&dev->refcnt);
-}
+void dev_put(struct net_device *dev);
 
 /**
  *	dev_hold - get reference to device
diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..155217f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4945,6 +4945,16 @@ out:
 }
 EXPORT_SYMBOL(register_netdev);
 
+DECLARE_WAIT_QUEUE_HEAD(netdev_refcnt_wait);
+
+void dev_put(struct net_device *dev)
+{
+	if (atomic_dec_and_test(&dev->refcnt))
+		wake_up(&netdev_refcnt_wait);
+}
+EXPORT_SYMBOL(dev_put);
+
+
 /*
  * netdev_wait_allrefs - wait until all references are gone.
  *
@@ -4984,7 +4994,8 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		wait_event_timeout(netdev_refcnt_wait,
+				   !atomic_read(&dev->refcnt), HZ/4);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "

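Outside the kernel, the mechanism works like this: the patch pairs a wakeup on the final put with a timed sleep on the waiting side. Below is a rough userspace analogue, assuming POSIX threads. fake_dev, fake_dev_put(), fake_wait_allrefs() and demo_wait_ms() are hypothetical names, and pthread_cond_timedwait() only stands in for wait_event_timeout(). This is not the kernel implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-in for a net_device and its reference count. */
struct fake_dev {
	atomic_int refcnt;
};

/* Plays the role of netdev_refcnt_wait: one global wait queue. */
static pthread_mutex_t refcnt_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t refcnt_wait = PTHREAD_COND_INITIALIZER;

/* Like the patched dev_put(): only the final put wakes the waiter. */
static void fake_dev_put(struct fake_dev *dev)
{
	if (atomic_fetch_sub(&dev->refcnt, 1) == 1) {
		/* Taking the lock means the broadcast cannot slip in between
		 * the waiter's refcnt check and its sleep. */
		pthread_mutex_lock(&refcnt_lock);
		pthread_cond_broadcast(&refcnt_wait);
		pthread_mutex_unlock(&refcnt_lock);
	}
}

/* Like the patched wait loop: sleep until refcnt reaches zero, but never
 * longer than 250 ms (the HZ/4 timeout) per pass. */
static void fake_wait_allrefs(struct fake_dev *dev)
{
	pthread_mutex_lock(&refcnt_lock);
	while (atomic_load(&dev->refcnt) != 0) {
		struct timespec deadline;

		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_nsec += 250L * 1000000L;
		if (deadline.tv_nsec >= 1000000000L) {
			deadline.tv_sec += 1;
			deadline.tv_nsec -= 1000000000L;
		}
		/* A timeout just means "check again", as in the kernel loop. */
		pthread_cond_timedwait(&refcnt_wait, &refcnt_lock, &deadline);
	}
	pthread_mutex_unlock(&refcnt_lock);
}

/* Drops the last reference after 5 ms, like a late dev_put(). */
static void *drop_ref_later(void *arg)
{
	struct timespec delay = { 0, 5L * 1000000L };

	nanosleep(&delay, NULL);
	fake_dev_put(arg);
	return NULL;
}

/* Returns how long the waiter was blocked, in milliseconds. */
static double demo_wait_ms(void)
{
	struct fake_dev dev;
	struct timespec start, end;
	pthread_t t;

	atomic_init(&dev.refcnt, 1);
	clock_gettime(CLOCK_MONOTONIC, &start);
	pthread_create(&t, NULL, drop_ref_later, &dev);
	fake_wait_allrefs(&dev);
	clock_gettime(CLOCK_MONOTONIC, &end);
	pthread_join(t, NULL);
	return (double)(end.tv_sec - start.tv_sec) * 1e3 +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

With the wakeup in place, the waiter returns a few milliseconds after the last reference is dropped instead of sleeping a full 250 ms. The timeout remains only as an upper bound per pass, as in the patch.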

* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
  2009-10-17 22:18 [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second Benjamin LaHaise
@ 2009-10-18  4:26 ` Eric Dumazet
  2009-10-18 16:13   ` Benjamin LaHaise
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-10-18  4:26 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: netdev

Benjamin LaHaise wrote:
> Hi folks,
> 
> Below is a patch that changes the interaction between netdev_wait_allrefs() 
> and dev_put() to replace an msleep(250) with a wait_event() on the final 
> dev_put().  This change reduces the time spent sleeping during an 
> unregister_netdev(), causing the system to go from <1% CPU time to something 
> more CPU bound (50+% in a test vm).  This increases the speed of a bulk 
> unregister_netdev() from 4 interfaces per second to over 500 per second on 
> my test system.  The requirement comes from handling thousands of L2TP 
> sessions where a tunnel flap results in all interfaces being torn down at 
> one time.
> 
> Note that there is still more work to be done in this area.
> 
> 		-ben
>  

> +DECLARE_WAIT_QUEUE_HEAD(netdev_refcnt_wait);
> +
> +void dev_put(struct net_device *dev)
> +{
> +	if (atomic_dec_and_test(&dev->refcnt))
> +		wake_up(&netdev_refcnt_wait);
> +}
> +EXPORT_SYMBOL(dev_put);
> +


Unfortunately this slows down the fast path by an order of magnitude.

atomic_dec() is pretty cheap (and could eventually use a per_cpu counter,
now that we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
is not that cheap and, more importantly, forbids a per_cpu conversion.
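A toy counter shows the per-CPU idea. Each CPU adjusts only its own slot, and the real count exists only as the sum of all slots. This is a hypothetical, single-threaded sketch: NR_FAKE_CPUS, pcpu_hold(), pcpu_put() and pcpu_sum() are made-up names. A real implementation would use per-CPU storage from the kernel allocator.

```c
#include <assert.h>

#define NR_FAKE_CPUS 4 /* hypothetical CPU count for the sketch */

/* Each CPU touches only its own slot, so hold/put share no cacheline. */
struct pcpu_ref {
	long count[NR_FAKE_CPUS];
};

static void pcpu_hold(struct pcpu_ref *r, int cpu)
{
	r->count[cpu]++;
}

static void pcpu_put(struct pcpu_ref *r, int cpu)
{
	r->count[cpu]--;
}

/* The true count is only known by summing every slot. A single slot can
 * go negative (held on one CPU, released on another), so no individual
 * decrement can tell whether it dropped the last reference: that is the
 * "dec_and_test" question a per-CPU scheme cannot answer cheaply. */
static long pcpu_sum(const struct pcpu_ref *r)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
		sum += r->count[cpu];
	return sum;
}
```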



* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
  2009-10-18  4:26 ` Eric Dumazet
@ 2009-10-18 16:13   ` Benjamin LaHaise
  2009-10-18 17:51     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-18 16:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
> Unfortunately this slows down the fast path by an order of magnitude.
> 
> atomic_dec() is pretty cheap (and could eventually use a per_cpu counter,
> now that we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
> is not that cheap and, more importantly, forbids a per_cpu conversion.

dev_put() is not a fast path by any means.  atomic_dec_and_test() costs 
the same as atomic_dec() on any modern CPU -- the cost is in the cacheline 
bouncing and serialisation both require.  The case of the device count 
becoming 0 is quite rare -- any device with a route on it will never hit 
a reference count of 0.

		-ben


* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
  2009-10-18 16:13   ` Benjamin LaHaise
@ 2009-10-18 17:51     ` Eric Dumazet
  2009-10-18 18:21       ` Benjamin LaHaise
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-10-18 17:51 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: netdev

Benjamin LaHaise wrote:
> On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
>> Unfortunately this slows down the fast path by an order of magnitude.
>>
>> atomic_dec() is pretty cheap (and could eventually use a per_cpu counter,
>> now that we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
>> is not that cheap and, more importantly, forbids a per_cpu conversion.
> 
> dev_put() is not a fast path by any means.  atomic_dec_and_test() costs 
> the same as atomic_dec() on any modern CPU -- the cost is in the cacheline 
> bouncing and serialisation both require.  The case of the device count 
> becoming 0 is quite rare -- any device with a route on it will never hit 
> a reference count of 0.

You forgot af_packet sendmsg() users, and heavy routers where the route cache
is stressed or disabled. I know several of them; they even added mmap TX
support to get better performance. They will be disappointed by your patch.

atomic_dec_and_test() is definitely more expensive, because of its strong barrier
semantics and the added test after the decrement.
Whether refcnt is close to zero or not makes no difference, even on 2-year-old CPUs.

Machines hardly ever had to dismantle a netdevice in their normal lifetime, so maybe
we were lazy with this insane msleep(250). This came from old Linux times,
when CPUs were soooo slow and programmers soooo lazy :)

The msleep(250) should be tuned first. Then if this is really necessary
to dismantle 100.000 netdevices per second, we might have to think a bit more.

Just try msleep(1 or 2), it should work quite well.


* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
  2009-10-18 17:51     ` Eric Dumazet
@ 2009-10-18 18:21       ` Benjamin LaHaise
  2009-10-18 19:36         ` Eric Dumazet
  2009-10-21 12:39         ` Octavian Purdila
  0 siblings, 2 replies; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-18 18:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Sun, Oct 18, 2009 at 07:51:56PM +0200, Eric Dumazet wrote:
> You forgot af_packet sendmsg() users, and heavy routers where the route cache
> is stressed or disabled. I know several of them; they even added mmap TX
> support to get better performance. They will be disappointed by your patch.

If that's a problem, the cacheline overhead is a more serious issue.  
AF_PACKET should really keep the reference on the device between syscalls.  
Do you have a benchmark in mind that would show the overhead?

> atomic_dec_and_test() is definitely more expensive, because of its strong barrier
> semantics and the added test after the decrement.
> Whether refcnt is close to zero or not makes no difference, even on 2-year-old CPUs.

At least on x86, the atomic_dec_and_test() cost is pretty much identical to 
atomic_dec().  If this really is a performance bottleneck, people should be 
complaining about the cache miss overhead and lock overhead which will dwarf 
the atomic_dec_and_test() cost vs atomic_dec().  Granted, I'm not saying 
that it isn't an issue on other architectures, but for x86 the lock prefix 
is what's expensive, not checking the flags or not after doing the operation.

If your complaint is about uninlining dev_put(), I'm indifferent to keeping 
it inline or out of line and can change the patch to suit.

> Machines hardly ever had to dismantle a netdevice in their normal lifetime, so maybe
> we were lazy with this insane msleep(250). This came from old Linux times,
> when CPUs were soooo slow and programmers soooo lazy :)

It's only now that machines can actually route one or more 10Gbps links 
that it really becomes an issue.  I've been hacking around it for some 
time, but fixing it properly is starting to be a real requirement.

> The msleep(250) should be tuned first. Then if this is really necessary
> to dismantle 100.000 netdevices per second, we might have to think a bit more.
> 
> Just try msleep(1 or 2), it should work quite well.

My goal is tearing down 100,000 interfaces in a few seconds, which really is 
necessary.  Right now we're running about 40,000 interfaces on a not yet 
saturated 10Gbps link.  Going to dual 10Gbps links means pushing more than 
100,000 subscriber interfaces, and it looks like a modern dual socket system 
can handle that.

A bigger concern is rtnl_lock().  It is a huge impediment to scaling up 
interface creation/deletion on multicore systems.  That's going to be a 
lot more invasive to fix, though.

		-ben


* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
  2009-10-18 18:21       ` Benjamin LaHaise
@ 2009-10-18 19:36         ` Eric Dumazet
  2009-10-21 12:39         ` Octavian Purdila
  1 sibling, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-18 19:36 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: netdev

Benjamin LaHaise wrote:
> 
> My goal is tearing down 100,000 interfaces in a few seconds, which really is 
> necessary.  Right now we're running about 40,000 interfaces on a not yet 
> saturated 10Gbps link.  Going to dual 10Gbps links means pushing more than 
> 100,000 subscriber interfaces, and it looks like a modern dual socket system 
> can handle that.
> 
> A bigger concern is rtnl_lock().  It is a huge impediment to scaling up 
> interface creation/deletion on multicore systems.  That's going to be a 
> lot more invasive to fix, though.

Don't forget synchronize_net() too (two calls per rollback_registered()).

You need something to dismantle XXXXX interfaces at once, instead
of serializing them one by one, because in three years you'll want to dismantle
1.000.000 interfaces in one second.

Maybe defining an asynchronous unregister_netdev() function...
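One way to read this suggestion: unlink a whole batch of devices first and let them share the expensive waits, instead of paying synchronize_net() twice per device. The sketch below only counts grace-period waits. All names are hypothetical, and it models no real kernel API.

```c
#include <assert.h>

/* Hypothetical cost counters for the sketch. */
struct teardown_stats {
	unsigned long devices;
	unsigned long grace_periods; /* synchronize_net()-style waits */
};

static void fake_synchronize_net(struct teardown_stats *s)
{
	s->grace_periods++; /* in the kernel, each of these can block */
}

/* Serial teardown: every device pays its own two grace-period waits. */
static void unregister_serial(struct teardown_stats *s, unsigned long n)
{
	for (unsigned long i = 0; i < n; i++) {
		s->devices++;
		fake_synchronize_net(s);
		fake_synchronize_net(s);
	}
}

/* Batched teardown: unlink all n devices, then wait twice for all of them. */
static void unregister_batch(struct teardown_stats *s, unsigned long n)
{
	if (n == 0)
		return;
	s->devices += n;
	fake_synchronize_net(s);
	fake_synchronize_net(s);
}
```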





* Re: [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second
  2009-10-18 18:21       ` Benjamin LaHaise
  2009-10-18 19:36         ` Eric Dumazet
@ 2009-10-21 12:39         ` Octavian Purdila
  2009-10-21 15:40           ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: Octavian Purdila @ 2009-10-21 12:39 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, netdev, Cosmin Ratiu

On Sunday 18 October 2009 21:21:44 you wrote:
> > The msleep(250) should be tuned first. Then if this is really necessary
> > to dismantle 100.000 netdevices per second, we might have to think a bit
> > more. 
> > Just try msleep(1 or 2), it should work quite well.
> 
> My goal is tearing down 100,000 interfaces in a few seconds, which really
>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>  system can handle that.
> 

I would also like to see this patch in; we are running into scalability issues 
with creating/deleting lots of interfaces as well.

Thanks,
tavi


* [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 12:39         ` Octavian Purdila
@ 2009-10-21 15:40           ` Eric Dumazet
  2009-10-21 16:09             ` Eric Dumazet
                               ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-21 15:40 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu

Octavian Purdila wrote:
> On Sunday 18 October 2009 21:21:44 you wrote:
>>> The msleep(250) should be tuned first. Then if this is really necessary
>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>> more. 
>>> Just try msleep(1 or 2), it should work quite well.
>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>  system can handle that.
>>
> 
> I would also like to see this patch in; we are running into scalability issues 
> with creating/deleting lots of interfaces as well.

Ben's patch only addresses interface deletion, and only one part of the problem,
maybe the most visible one for the current kernel.

Adding lots of interfaces only needs several threads to run concurrently.

Before applying/examining his patch, I suggest identifying all dev_put() spots that
can be deleted and replaced by something more scalable. I began this job,
but others can help me.

RTNL and rcu grace periods are going to hurt anyway, so you probably need
to use many tasks to be able to delete lots of interfaces in parallel.

netdev_run_todo() should also use a better algorithm to allow parallelism.

The following patch doesn't slow down dev_put() users, and the real scalability
problems will surface and might be addressed.

[PATCH] net: allow netdev_wait_allrefs() to run faster

netdev_wait_allrefs() waits until all references to a device vanish.

It currently uses a _very_ pessimistic 250 ms delay between each probe.
Some users report that no more than 4 devices can be dismantled per second;
this is a pretty serious problem for extreme setups.

Most likely, references are only waiting for an RCU grace period that should come
fast, so use schedule_timeout_uninterruptible(1) to allow faster recovery.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fca2e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}

-		msleep(250);
+		schedule_timeout_uninterruptible(1);

 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "

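A back-of-the-envelope bound shows why the polling interval matters. If each serial unregister sleeps through exactly one polling interval before it sees refcnt reach zero, the interval alone caps the rate. This assumes one sleep per device and ignores every other cost.

```c
#include <assert.h>

/* Upper bound on serial unregisters per second, assuming each one sleeps
 * through exactly one polling interval of poll_ms milliseconds. */
static double max_unregisters_per_sec(double poll_ms)
{
	return 1000.0 / poll_ms;
}

/* Length of one jiffy in milliseconds for a given CONFIG_HZ value. */
static double jiffy_ms(int hz)
{
	return 1000.0 / hz;
}
```

msleep(250) gives Ben's 4 interfaces per second. One jiffy is 1000/HZ ms, so the same bound becomes 100, 250 or 1000 per second for HZ=100, 250 or 1000. schedule_timeout_uninterruptible(1) may even return sooner, since it wakes at the next timer tick.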

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 15:40           ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
@ 2009-10-21 16:09             ` Eric Dumazet
  2009-10-21 16:51             ` Benjamin LaHaise
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-21 16:09 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu

Eric Dumazet wrote:
> Octavian Purdila wrote:
>> On Sunday 18 October 2009 21:21:44 you wrote:
>>>> The msleep(250) should be tuned first. Then if this is really necessary
>>>> to dismantle 100.000 netdevices per second, we might have to think a bit
>>>> more. 
>>>> Just try msleep(1 or 2), it should work quite well.
>>> My goal is tearing down 100,000 interfaces in a few seconds, which really
>>>  is  necessary.  Right now we're running about 40,000 interfaces on a not
>>>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
>>>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
>>>  system can handle that.
>>>
>> I would also like to see this patch in; we are running into scalability issues 
>> with creating/deleting lots of interfaces as well.
> 
> Ben's patch only addresses interface deletion, and only one part of the problem,
> maybe the most visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurrently.
> 
> Before applying/examining his patch, I suggest identifying all dev_put() spots that
> can be deleted and replaced by something more scalable. I began this job,
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> The following patch doesn't slow down dev_put() users, and the real scalability
> problems will surface and might be addressed.
> 

Here are typical timings (on the current kernel; in the following example,
netdev_wait_allrefs() doesn't wait at all because my netdevice has no refs):

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link set mv161 down

real    0m0.021s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.022s
user    0m0.000s
sys     0m0.001s

# time ip link add link eth3 address 00:1E:0B:8F:D0:D6 mv161 type macvlan

real    0m0.001s
user    0m0.001s
sys     0m0.001s
# time ip link set mv161 up

real    0m0.001s
user    0m0.000s
sys     0m0.001s
# time ip link del mv161

real    0m0.036s
user    0m0.000s
sys     0m0.001s

A 22 ms (or 36 ms) delay per device is also problematic if you want to dismantle 1.000.000 netdevices at once.
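The timings above make the scale of the problem easy to check, assuming devices are torn down strictly one after another at a fixed per-device cost:

```c
#include <assert.h>

/* Wall-clock seconds to tear down n devices strictly one after another,
 * given a fixed per-device latency in milliseconds. */
static double serial_teardown_seconds(double n, double per_dev_ms)
{
	return n * per_dev_ms / 1000.0;
}

static double seconds_to_hours(double s)
{
	return s / 3600.0;
}
```

At 22 ms per device, 1,000,000 devices take 22,000 s, a little over 6 hours. At 36 ms, they take exactly 10 hours.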


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 15:40           ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
  2009-10-21 16:09             ` Eric Dumazet
@ 2009-10-21 16:51             ` Benjamin LaHaise
  2009-10-21 19:54               ` Eric Dumazet
  2009-10-29 23:07               ` Eric W. Biederman
  2009-10-21 16:55             ` Octavian Purdila
  2009-10-23 21:13             ` Paul E. McKenney
  3 siblings, 2 replies; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-21 16:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, netdev, Cosmin Ratiu

On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Ben's patch only addresses interface deletion, and only one part of the problem,
> maybe the most visible one for the current kernel.

The first part I've been tackling has been the overhead in procfs, sysctl 
and sysfs.  I've got patches for some of the issues, hacks for others, and 
should have something to post in a few days.  Getting rid of those overheads 
is enough to get to decent interface creation times for the first 20 or 30k 
interfaces.

On the interface deletion side of things, within the network code, fib_hash 
has a few linear scans that really start hurting.  trie is a bit better, 
but I haven't started digging too deeply into its flush/remove overhead yet, 
aside from noticing that trie has a linear scan if 
CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
fn_trie_flush() is currently the worst offender during deletion.
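A hash size of 1 is easy to picture: a chained hash table with a single bucket is just a linked list, so every lookup or removal scans the entries. The table below is a generic, hypothetical sketch keyed by a non-negative integer such as an ifindex. It is not fib_hash or fib_trie code. table_probe_cost() simply counts how many entries a lookup examines.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int key; /* assumed non-negative, e.g. an ifindex */
	struct node *next;
};

struct table {
	struct node **buckets;
	size_t nbuckets;
};

static size_t bucket_of(const struct table *t, int key)
{
	return (size_t)key % t->nbuckets;
}

/* Push onto the front of the key's chain. */
static void table_insert(struct table *t, struct node *n)
{
	size_t b = bucket_of(t, n->key);

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/* Number of entries examined before key is found (or the chain ends). */
static size_t table_probe_cost(const struct table *t, int key)
{
	size_t cost = 0;

	for (const struct node *n = t->buckets[bucket_of(t, key)]; n; n = n->next) {
		cost++;
		if (n->key == key)
			break;
	}
	return cost;
}
```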

		-ben


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 16:51             ` Benjamin LaHaise
@ 2009-10-21 19:54               ` Eric Dumazet
  2009-10-29 23:07               ` Eric W. Biederman
  1 sibling, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-21 19:54 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise wrote:
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben's patch only addresses interface deletion, and only one part of the problem,
>> maybe the most visible one for the current kernel.
> 
> The first part I've been tackling has been the overhead in procfs, sysctl 
> and sysfs.  I've got patches for some of the issues, hacks for others, and 
> should have something to post in a few days.  Getting rid of those overheads 
> is enough to get to decent interface creation times for the first 20 or 30k 
> interfaces.
> 
> On the interface deletion side of things, within the network code, fib_hash 
> has a few linear scans that really start hurting.  trie is a bit better, 
> but I haven't started digging too deeply into its flush/remove overhead yet, 
> aside from noticing that trie has a linear scan if 
> CONFIG_IP_ROUTE_MULTIPATH is set since it sets the hash size to 1.  
> fn_trie_flush() is currently the worst offender during deletion.

Well, there are many things to change...

# ip -o link | wc -l
13097
# time ip -o link show mv22248
13045: mv22248@eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN \    link/ether 00:1e:0b:8e:c8:08 brd ff:ff:ff:ff:ff:ff

real    0m0.840s
user    0m0.473s
sys     0m0.368s

almost one second to get the link status of one particular interface :(


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 16:51             ` Benjamin LaHaise
  2009-10-21 19:54               ` Eric Dumazet
@ 2009-10-29 23:07               ` Eric W. Biederman
  2009-10-29 23:38                 ` Benjamin LaHaise
  1 sibling, 1 reply; 41+ messages in thread
From: Eric W. Biederman @ 2009-10-29 23:07 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> Ben's patch only addresses interface deletion, and only one part of the problem,
>> maybe the most visible one for the current kernel.
>
> The first part I've been tackling has been the overhead in procfs, sysctl 
> and sysfs.  

Could you keep me in the loop with that?  I have some pending cleanups for
all of those pieces of code and may be able to help/advise/review.

Eric


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-29 23:07               ` Eric W. Biederman
@ 2009-10-29 23:38                 ` Benjamin LaHaise
  2009-10-30  1:45                   ` Eric W. Biederman
  2010-08-09 17:23                   ` Ben Greear
  0 siblings, 2 replies; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-29 23:38 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
> Could you keep me in the loop with that?  I have some pending cleanups for
> all of those pieces of code and may be able to help/advise/review.

Here are the sysfs scaling improvements.  I have to break them up, as there 
are 3 separate changes in this patch: 1. use an rbtree for name lookup in 
sysfs, 2. keep track of the number of directories for the purpose of 
generating the link count, as otherwise too much cpu time is spent in 
sysfs_count_nlink when new entries are added, and 3. when adding a new 
sysfs_dirent, walk the list backwards when linking it in, as higher 
numbered inodes tend to be at the end of the list, not the beginning.
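Changes 2 and 3 can be illustrated with a trimmed-down, hypothetical directory structure. struct dent and struct dir are made-up names, not the sysfs types. The children stay sorted by inode number, and a tail pointer lets the common "new entry has the highest ino" case skip the scan. A back pointer makes removal O(1), and a running subdirectory count replaces the walk in the link-count calculation. Inserting in the middle must also update the back pointer of the entry that follows, which dir_link() does. The rbtree name index (change 1) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dent {
	unsigned long ino;
	bool is_dir;
	struct dent *next, *prev;
};

struct dir {
	struct dent *head, *tail;
	int nr_subdirs; /* maintained incrementally (change 2) */
};

/* Insert keeping the list sorted by ino; try the tail first (change 3). */
static void dir_link(struct dir *d, struct dent *e)
{
	struct dent *prev = NULL, *next = d->head;

	if (d->tail && d->tail->ino < e->ino) {
		prev = d->tail;
		next = NULL;
	} else {
		while (next && next->ino < e->ino) {
			prev = next;
			next = next->next;
		}
	}
	e->prev = prev;
	e->next = next;
	if (prev)
		prev->next = e;
	else
		d->head = e;
	if (next)
		next->prev = e; /* keep the follower's back pointer in sync */
	else
		d->tail = e;
	d->nr_subdirs += e->is_dir;
}

/* O(1) removal thanks to the back pointer; no list walk needed. */
static void dir_unlink(struct dir *d, struct dent *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		d->head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		d->tail = e->prev;
	e->next = e->prev = NULL;
	d->nr_subdirs -= e->is_dir;
}

/* Link count without walking the children. */
static int dir_nlink(const struct dir *d)
{
	return d->nr_subdirs + 2;
}
```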

		-ben


diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5fad489..38ad7c8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -43,10 +43,18 @@ static DEFINE_IDA(sysfs_ino_ida);
 static void sysfs_link_sibling(struct sysfs_dirent *sd)
 {
 	struct sysfs_dirent *parent_sd = sd->s_parent;
-	struct sysfs_dirent **pos;
+	struct sysfs_dirent **pos, *prev = NULL;
+	struct rb_node **new, *parent;
 
 	BUG_ON(sd->s_sibling);
 
+	if (parent_sd->s_dir.children_tail &&
+	    parent_sd->s_dir.children_tail->s_ino < sd->s_ino) {
+		prev = parent_sd->s_dir.children_tail;
+		pos = &prev->s_sibling;
+		goto got_it;
+	}
+
 	/* Store directory entries in order by ino.  This allows
 	 * readdir to properly restart without having to add a
 	 * cursor into the s_dir.children list.
@@ -54,9 +62,36 @@ static void sysfs_link_sibling(struct sysfs_dirent *sd)
 	for (pos = &parent_sd->s_dir.children; *pos; pos = &(*pos)->s_sibling) {
 		if (sd->s_ino < (*pos)->s_ino)
 			break;
+		prev = *pos;
 	}
+got_it:
+	if (prev == parent_sd->s_dir.children_tail)
+		parent_sd->s_dir.children_tail = sd;
 	sd->s_sibling = *pos;
+	sd->s_sibling_prev = prev;
 	*pos = sd;
+	parent_sd->s_nr_children_dir += (sysfs_type(sd) == SYSFS_DIR);
+
+	/* rb tree insert */
+	new = &(parent_sd->s_dir.child_rb_root.rb_node);
+	parent = NULL;
+
+	while (*new) {
+		struct sysfs_dirent *this =
+			container_of(*new, struct sysfs_dirent, s_rb_node);
+		int result = strcmp(sd->s_name, this->s_name);
+
+		parent = *new;
+		if (result < 0)
+			new = &((*new)->rb_left);
+		else if (result > 0)
+			new = &((*new)->rb_right);
+		else
+			BUG();
+	}
+
+	rb_link_node(&sd->s_rb_node, parent, new);
+	rb_insert_color(&sd->s_rb_node, &parent_sd->s_dir.child_rb_root);
 }
 
 /**
@@ -71,16 +106,22 @@ static void sysfs_link_sibling(struct sysfs_dirent *sd)
  */
 static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
 {
-	struct sysfs_dirent **pos;
+	struct sysfs_dirent **pos, *prev = NULL;
 
-	for (pos = &sd->s_parent->s_dir.children; *pos;
-	     pos = &(*pos)->s_sibling) {
-		if (*pos == sd) {
-			*pos = sd->s_sibling;
-			sd->s_sibling = NULL;
-			break;
-		}
-	}
+	prev = sd->s_sibling_prev;
+	if (prev)
+		pos = &prev->s_sibling;
+	else
+		pos = &sd->s_parent->s_dir.children;
+	if (sd == sd->s_parent->s_dir.children_tail)
+		sd->s_parent->s_dir.children_tail = prev;
+	*pos = sd->s_sibling;
+	if (sd->s_sibling)
+		sd->s_sibling->s_sibling_prev = prev;
+
+	sd->s_parent->s_nr_children_dir -= (sysfs_type(sd) == SYSFS_DIR);
+	sd->s_sibling_prev = NULL;
+	rb_erase(&sd->s_rb_node, &sd->s_parent->s_dir.child_rb_root);
 }
 
 /**
@@ -331,6 +372,9 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
 	sd->s_mode = mode;
 	sd->s_flags = type;
 
+	if (type == SYSFS_DIR)
+		sd->s_dir.child_rb_root = RB_ROOT;
+
 	return sd;
 
  err_out2:
@@ -630,11 +674,20 @@ void sysfs_addrm_finish(struct sysfs_addrm_cxt *acxt)
 struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
 				       const unsigned char *name)
 {
-	struct sysfs_dirent *sd;
-
-	for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
-		if (!strcmp(sd->s_name, name))
-			return sd;
+	struct rb_node *node = parent_sd->s_dir.child_rb_root.rb_node;
+
+	while (node) {
+		struct sysfs_dirent *data =
+			container_of(node, struct sysfs_dirent, s_rb_node);
+		int result;
+		result = strcmp(name, data->s_name);
+		if (result < 0)
+			node = node->rb_left;
+		else if (result > 0)
+			node = node->rb_right;
+		else
+			return data;
+	}
 	return NULL;
 }
 
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index e28cecf..ff6e960 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -191,14 +191,7 @@ static struct lock_class_key sysfs_inode_imutex_key;
 
 static int sysfs_count_nlink(struct sysfs_dirent *sd)
 {
-	struct sysfs_dirent *child;
-	int nr = 0;
-
-	for (child = sd->s_dir.children; child; child = child->s_sibling)
-		if (sysfs_type(child) == SYSFS_DIR)
-			nr++;
-
-	return nr + 2;
+	return sd->s_nr_children_dir + 2;
 }
 
 static void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode)
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index af4c4e7..22fd1bc 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -9,6 +9,7 @@
  */
 
 #include <linux/fs.h>
+#include <linux/rbtree.h>
 
 struct sysfs_open_dirent;
 
@@ -16,7 +17,10 @@ struct sysfs_open_dirent;
 struct sysfs_elem_dir {
 	struct kobject		*kobj;
 	/* children list starts here and goes through sd->s_sibling */
+
 	struct sysfs_dirent	*children;
+	struct sysfs_dirent	*children_tail;
+	struct rb_root		child_rb_root;
 };
 
 struct sysfs_elem_symlink {
@@ -52,6 +56,8 @@ struct sysfs_dirent {
 	atomic_t		s_active;
 	struct sysfs_dirent	*s_parent;
 	struct sysfs_dirent	*s_sibling;
+	struct sysfs_dirent	*s_sibling_prev;
+	struct rb_node		s_rb_node;
 	const char		*s_name;
 
 	union {
@@ -65,6 +71,8 @@ struct sysfs_dirent {
 	ino_t			s_ino;
 	umode_t			s_mode;
 	struct sysfs_inode_attrs *s_iattr;
+
+	int			s_nr_children_dir;
 };
 
 #define SD_DEACTIVATED_BIAS		INT_MIN


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-29 23:38                 ` Benjamin LaHaise
@ 2009-10-30  1:45                   ` Eric W. Biederman
  2009-10-30 14:35                     ` Benjamin LaHaise
  2010-08-09 17:23                   ` Ben Greear
  1 sibling, 1 reply; 41+ messages in thread
From: Eric W. Biederman @ 2009-10-30  1:45 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
>> Could you keep me in the loop with that?  I have some pending cleanups for
>> all of those pieces of code and may be able to help/advise/review.
>
> Here are the sysfs scaling improvements.  I have to break them up, as there 
> are 3 separate changes in this patch: 1. use an rbtree for name lookup in 
> sysfs, 2. keep track of the number of directories for the purpose of 
> generating the link count, as otherwise too much cpu time is spent in 
> sysfs_count_nlink when new entries are added, and 3. when adding a new 
> sysfs_dirent, walk the list backwards when linking it in, as higher 
> numbered inodes tend to be at the end of the list, not the beginning.

The reason for the existence of sysfs_dirent is that, as things grow larger,
we want to keep the amount of RAM consumed down, so we don't pin
everything in the dcache.

So I would like to see how much we can pare down.

For dealing with seeks in the middle of readdir I expect the best way
to do that is to be inspired by htrees in extNfs and return a hash of
the filename as our position, and keep the filename list sorted by
that hash.  Since we are optimizing for size we don't need to store
that hash.  Then we can turn that list into some flavor of sorted
binary tree.
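Here is a minimal sketch of the hash-as-position idea, assuming FNV-1a as an arbitrary stand-in for whatever hash sysfs would use. Entries are kept sorted by the hash of their name, and the hash is recomputed rather than stored. A resumed readdir continues at the first entry whose hash exceeds the last position returned. The helper names are hypothetical, and hash collisions are ignored here, though a real implementation must handle them. The linear resume scan is the part a sorted tree would replace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* FNV-1a, chosen only as a simple stand-in hash. */
static uint32_t name_hash(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

static int by_hash(const void *a, const void *b)
{
	uint32_t ha = name_hash(*(const char *const *)a);
	uint32_t hb = name_hash(*(const char *const *)b);

	return (ha > hb) - (ha < hb);
}

/* Keep the directory sorted by name hash; the hash itself is not stored. */
static void sort_by_hash(const char **names, size_t n)
{
	qsort(names, n, sizeof(*names), by_hash);
}

/* Resume readdir after position last_pos (a hash previously returned). */
static size_t resume_index(const char **names, size_t n, uint32_t last_pos)
{
	size_t i = 0;

	while (i < n && name_hash(names[i]) <= last_pos)
		i++;
	return i;
}
```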

I'm surprised sysfs_count_nlink shows up, as it is not directly on the
add or remove path.  I think the answer there is to change s_flags
into a set of bitfields and make link_count one of them, perhaps
16 bits long.  If we ever overflow our bitfield we can just set link
count to 0, and userspace (aka find) will know it can't optimize
based on link count.
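The bitfield suggestion could look roughly like this. Only the 16-bit width and the "set it to 0 on overflow" rule come from the message. The field names, the other widths and the helpers are hypothetical.

```c
#include <assert.h>

/* Hypothetical packing of type, flags and link count into one word. */
struct packed_flags {
	unsigned int type : 8;
	unsigned int other_flags : 8;
	unsigned int link_count : 16; /* 0 means "unknown, don't optimize" */
};

#define LINK_COUNT_MAX 0xffffu

/* Count one more subdirectory; on overflow, pin the count at 0 for good. */
static void link_count_inc(struct packed_flags *f)
{
	if (f->link_count == 0) /* already saturated: stays unknown */
		return;
	if (f->link_count == LINK_COUNT_MAX)
		f->link_count = 0;
	else
		f->link_count++;
}

/* Count one subdirectory fewer; 0 stays unknown, 2 is a directory's floor. */
static void link_count_dec(struct packed_flags *f)
{
	if (f->link_count > 2)
		f->link_count--;
}
```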

I was expecting someone to run into problems with the linear directory
of sysfs someday.

Eric


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-30  1:45                   ` Eric W. Biederman
@ 2009-10-30 14:35                     ` Benjamin LaHaise
  2009-10-30 14:43                       ` Eric Dumazet
  2009-10-30 23:25                       ` Eric W. Biederman
  0 siblings, 2 replies; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-30 14:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
> The reason sysfs_dirent exists is that, as things grow larger, we want
> to keep the amount of RAM consumed down, so we don't pin everything in
> the dcache.

I'm aware of that, but for users running into this sort of scaling issue, 
the amount of RAM required is a non-issue (30,000 interfaces require about 
1GB of RAM at present), making the question more one of how to avoid the 
overhead for users who don't require it.  I'd prefer a config option.  The 
only way I can really see saving memory usage is to somehow tie sysfs dirent 
lookups into the network stack's own tables for looking up device entries.  
The network stack already has to cope with this kind of scaling, and that 
would save the RAM.

> So I would like to see how much we can pare down.

> For dealing with seeks in the middle of readdir, I expect the best
> approach is to take inspiration from the htrees in the extN filesystems:
> return a hash of the filename as our position, and keep the filename
> list sorted by that hash.  Since we are optimizing for size, we don't
> need to store that hash.  Then we can turn that list into some flavor
> of sorted binary tree.

readdir() generally isn't an issue at present.

> I'm surprised sysfs_count_nlink shows up, as it is not directly on the
> add or remove path.  I think the answer there is to change s_flags
> into a set of bitfields and make link_count one of them, perhaps
> 16 bits long.  If we ever overflow our bitfield, we can just set the link
> count to 0, and userspace (e.g. find) will know it can't optimize
> based on link count.

It shows up because of the bits of userspace (udev) touching the directory 
from things like the hotplug code path.

> I was expecting someone to run into problems with the linear directory
> of sysfs someday.

Alas, sysfs isn't the only offender.

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-30 14:35                     ` Benjamin LaHaise
@ 2009-10-30 14:43                       ` Eric Dumazet
  2009-10-30 23:25                       ` Eric W. Biederman
  1 sibling, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-30 14:43 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Eric W. Biederman, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise a écrit :
> On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
>> I was expecting someone to run into problems with the linear directory
>> of sysfs someday.
> 
> Alas, sysfs isn't the only offender.
> 
> 		-ben

In my tests, sysfs lookup by name is the big offender.  I believe you should
post your rb_tree patch ASAP ;)  Then we can go further.
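
For readers following the rbtree idea, here is a minimal userspace sketch of replacing a linear scan of sibling entries with a balanced-tree lookup by name. It is an illustration under stated assumptions, not the sysfs patch: POSIX tsearch()/tfind() from <search.h> stand in for the kernel rbtree API (glibc implements them as a red-black tree, although POSIX does not mandate a particular balancing scheme), and struct name_entry, name_add() and name_find() are hypothetical names.

```c
#define _XOPEN_SOURCE 700 /* expose tsearch(), tfind() and strdup() */
#include <assert.h>
#include <search.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, stripped-down directory entry: a name and an inode number. */
struct name_entry {
	const char *name;
	unsigned long ino;
};

/* Root of the name index; tsearch() allocates and rebalances the nodes. */
static void *name_root;

static int cmp_by_name(const void *a, const void *b)
{
	const struct name_entry *x = a, *y = b;

	return strcmp(x->name, y->name);
}

/* Add a name.  Returns the entry now stored under that name (the old one
 * if the name already existed), or NULL on allocation failure. */
static struct name_entry *name_add(const char *name, unsigned long ino)
{
	struct name_entry *e = malloc(sizeof(*e));
	struct name_entry **slot;
	char *copy = strdup(name);

	if (!e || !copy) {
		free(e);
		free(copy);
		return NULL;
	}
	e->name = copy;
	e->ino = ino;

	slot = tsearch(e, &name_root, cmp_by_name);
	if (!slot || *slot != e) {
		/* Out of memory, or a duplicate name: keep the existing entry. */
		free(copy);
		free(e);
		return slot ? *slot : NULL;
	}
	return e;
}

/* O(log n) lookup by name instead of a linear walk of the sibling list. */
static struct name_entry *name_find(const char *name)
{
	struct name_entry key = { .name = name };
	struct name_entry **slot;

	assert(name != NULL);
	slot = tfind(&key, &name_root, cmp_by_name);
	return slot ? *slot : NULL;
}
```

Each lookup then costs O(log n) string comparisons instead of O(n), which is what matters once a directory such as /sys/class/net holds tens of thousands of entries.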




* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-30 14:35                     ` Benjamin LaHaise
  2009-10-30 14:43                       ` Eric Dumazet
@ 2009-10-30 23:25                       ` Eric W. Biederman
  2009-10-30 23:53                         ` Benjamin LaHaise
  1 sibling, 1 reply; 41+ messages in thread
From: Eric W. Biederman @ 2009-10-30 23:25 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Thu, Oct 29, 2009 at 06:45:32PM -0700, Eric W. Biederman wrote:
>> The reason sysfs_dirent exists is that, as things grow larger, we want
>> to keep the amount of RAM consumed down, so we don't pin everything in
>> the dcache.
>
> I'm aware of that, but for users running into this sort of scaling issue, 
> the amount of RAM required is a non-issue (30,000 interfaces require about 
> 1GB of RAM at present), making the question more one of how to avoid the 
> overhead for users who don't require it.  I'd prefer a config option.  The 
> only way I can really see saving memory usage is to somehow tie sysfs dirent 
> lookups into the network stack's own tables for looking up device entries.  
> The network stack already has to cope with this kind of scaling, and that 
> would save the RAM.

There is that.  I'm trying to figure out how to add the improvements
without making sysfs_dirent larger, which I think is doable.

>> So I would like to see how much we can pare down.
>
>> For dealing with seeks in the middle of readdir, I expect the best
>> approach is to take inspiration from the htrees in the extN filesystems:
>> return a hash of the filename as our position, and keep the filename
>> list sorted by that hash.  Since we are optimizing for size, we don't
>> need to store that hash.  Then we can turn that list into some flavor
>> of sorted binary tree.
>
> readdir() generally isn't an issue at present.

Supporting seekdir into the middle of a directory is the entire reason
I keep the entries sorted by inode.  If we sort by a hash of the name
instead, we can use the hash as the directory position for readdir and
seekdir, and we can completely remove the linear list when the rb_tree
is introduced.
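
Eric's proposal can be sketched in userspace as follows. This is a minimal sketch under stated assumptions, not his code: entries are ordered by a hash of their name, the hash doubles as the readdir/seekdir position, and seeking becomes a binary search. FNV-1a is swapped in purely as an illustrative hash. The hash is cached in each entry for readability even though Eric notes it need not be stored. struct hentry, sort_by_hash() and seek_to() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a: an illustrative stand-in, not a hash anyone in the thread chose. */
static uint32_t name_hash(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

/* Hypothetical directory entry; the hash is cached here only for clarity. */
struct hentry {
	uint32_t hash;
	const char *name;
};

static int cmp_hentry(const void *a, const void *b)
{
	const struct hentry *x = a, *y = b;

	if (x->hash != y->hash)
		return x->hash < y->hash ? -1 : 1;
	return strcmp(x->name, y->name); /* deterministic order among collisions */
}

/* Order entries by name hash so the hash can serve as the readdir offset. */
static void sort_by_hash(struct hentry *v, size_t n)
{
	assert(v != NULL);
	for (size_t i = 0; i < n; i++)
		v[i].hash = name_hash(v[i].name);
	qsort(v, n, sizeof(*v), cmp_hentry);
}

/* seekdir(pos): binary search for the first entry whose hash is >= pos.
 * Entries that share a hash also share a position (collision caveat). */
static size_t seek_to(const struct hentry *v, size_t n, uint32_t pos)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (v[mid].hash < pos)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}
```

One caveat the sketch leaves open: names whose hashes collide share a position, so a real implementation needs some rule for resuming in the middle of a run of equal hashes.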

>> I'm surprised sysfs_count_nlink shows up, as it is not directly on the
>> add or remove path.  I think the answer there is to change s_flags
>> into a set of bitfields and make link_count one of them, perhaps
>> 16 bits long.  If we ever overflow our bitfield, we can just set the link
>> count to 0, and userspace (e.g. find) will know it can't optimize
>> based on link count.
>
> It shows up because of the bits of userspace (udev) touching the directory 
> from things like the hotplug code path.

I realized after sending the message that s_mode in sysfs_dirent is a
real size offense.  It is a 16-bit field sitting between two longs,
so in practice it is possible to move s_mode up next to s_flags, add
an s_nlink after it (both unsigned short), and get a cheap sysfs_nlink.
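
The padding argument can be checked with a small example. The two layouts below are hypothetical and simplified; they are not the real struct sysfs_dirent. They only show that a 2-byte field between two 8-byte fields wastes 6 bytes of padding on a typical LP64 target, and that moving it next to a 4-byte s_flags leaves exactly enough room for an unsigned short s_nlink.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified layouts; NOT the real struct sysfs_dirent. */
struct dirent_before {
	unsigned int   s_flags;
	unsigned long  s_ino;
	unsigned short s_mode;   /* 2 bytes wedged between two 8-byte fields */
	void          *s_iattr;
};

struct dirent_after {
	unsigned int   s_flags;
	unsigned short s_mode;   /* moved up next to s_flags ...           */
	unsigned short s_nlink;  /* ... so the new counter fills the hole  */
	unsigned long  s_ino;
	void          *s_iattr;
};

/* Portable claim only: adding s_nlink this way must not grow the struct. */
static_assert(sizeof(struct dirent_after) <= sizeof(struct dirent_before),
	      "adding s_nlink must not grow the struct");

static void print_layouts(void)
{
	printf("before: %zu bytes, after: %zu bytes\n",
	       sizeof(struct dirent_before), sizeof(struct dirent_after));
}
```

On x86-64, for instance, natural alignment gives 32 bytes for the first layout and 24 for the second, so the new counter costs nothing; the static_assert only encodes the weaker, portable claim that the struct does not grow.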

>> I was expecting someone to run into problems with the linear directory
>> of sysfs someday.
>
> Alas, sysfs isn't the only offender.

Agreed. Sysfs is probably the easiest to untangle.

Since I'm not quite ready to post my patches, I will briefly
mention what I have in my queue and hopefully get things posted.

I have changes to make it so that sysfs never has to go from
the sysfs_dirent to the sysfs inode.  

I have changes to sys_sysctl() so that it becomes a filesystem lookup
under /proc/sys, which ultimately makes the code easier to maintain
and debug.

Now back to getting things forward ported and ready to post.

Eric

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-30 23:25                       ` Eric W. Biederman
@ 2009-10-30 23:53                         ` Benjamin LaHaise
  2009-10-31  0:37                           ` Eric W. Biederman
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin LaHaise @ 2009-10-30 23:53 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

On Fri, Oct 30, 2009 at 04:25:52PM -0700, Eric W. Biederman wrote:
> I realized after sending the message that s_mode in sysfs_dirent is a
> real size offense.  It is a 16-bit field sitting between two longs,
> so in practice it is possible to move s_mode up next to s_flags, add
> an s_nlink after it (both unsigned short), and get a cheap sysfs_nlink.

That doesn't work -- the number of directory entries can easily exceed 65535.  
Current mid range hardware is good enough to terminate 100,000 network 
interfaces on a single host.

> Since I'm not quite ready to post my patches, I will briefly
> mention what I have in my queue and hopefully get things posted.
> 
> I have changes to make it so that sysfs never has to go from
> the sysfs_dirent to the sysfs inode.  

Ah, interesting.

> I have changes to sys_sysctl() so that it becomes a filesystem lookup
> under /proc/sys, which ultimately makes the code easier to maintain
> and debug.

That sounds like a much saner approach, but has the wrinkle that procfs can 
be configured out.

> Now back to getting things forward ported and ready to post.

I'm looking forward to those changes.  I've been ignoring procfs for the
time being by disabling the per-interface entries in the network stack,
but there is some desire to be able to enable rp_filter per interface,
driven by the RADIUS config, at runtime.  rp_filter has to be disabled
across the board on my access routers, as there are several places where
asymmetric routing is used for performance reasons.

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-30 23:53                         ` Benjamin LaHaise
@ 2009-10-31  0:37                           ` Eric W. Biederman
  0 siblings, 0 replies; 41+ messages in thread
From: Eric W. Biederman @ 2009-10-31  0:37 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Fri, Oct 30, 2009 at 04:25:52PM -0700, Eric W. Biederman wrote:
>> I realized after sending the message that s_mode in sysfs_dirent is a
>> real size offense.  It is a 16-bit field sitting between two longs,
>> so in practice it is possible to move s_mode up next to s_flags, add
>> an s_nlink after it (both unsigned short), and get a cheap sysfs_nlink.
>
> That doesn't work -- the number of directory entries can easily exceed 65535.  
> Current mid range hardware is good enough to terminate 100,000 network 
> interfaces on a single host.

On overflow, your nlink becomes zero and you leave it there.  That is how
on-disk filesystems handle that case for directories, and find etc.
know how to deal with it.
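
A minimal sketch of the saturating link count described above, assuming a hypothetical 16-bit per-directory counter that starts at 2 and where 0 means "too many to count". The type and function names are illustrative only and do not come from sysfs.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical 16-bit per-directory link count; 0 means "unknown". */
typedef unsigned short nlink16;

/* A subdirectory was added: count it, or saturate to 0 on overflow. */
static void nlink_inc(nlink16 *n)
{
	if (*n == 0)
		return;          /* already saturated: stay "unknown" */
	if (*n == USHRT_MAX)
		*n = 0;          /* would overflow: give up on an exact count */
	else
		(*n)++;
}

/* A subdirectory was removed; an unknown count stays unknown. */
static void nlink_dec(nlink16 *n)
{
	if (*n == 0)
		return;
	assert(*n > 2);          /* a live directory never drops below 2 */
	(*n)--;
}
```

Once more than 65533 subdirectories have been added to a fresh directory, the count drops to 0 and stays there, which is how the scheme trades accuracy for a fixed 16-bit field rather than widening it.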

>> Since I'm not quite ready to post my patches, I will briefly
>> mention what I have in my queue and hopefully get things posted.
>> 
>> I have changes to make it so that sysfs never has to go from
>> the sysfs_dirent to the sysfs inode.  
>
> Ah, interesting.

I have to cleanup sysfs before I merge changes for supporting
multiple network namespaces.

>> I have changes to sys_sysctl() so that it becomes a filesystem lookup
>> under /proc/sys, which ultimately makes the code easier to maintain
>> and debug.
>
> That sounds like a much saner approach, but has the wrinkle that procfs can 
> be configured out.

So I will add the dependency.  There are very few serious users of sys_sysctl,
and all of them have been getting a deprecated interface warning every
time they use it for the last several years.

>> Now back to getting things forward ported and ready to post.
>
> I'm looking forward to those changes.  I've been ignoring procfs for the
> time being by disabling the per-interface entries in the network stack,
> but there is some desire to be able to enable rp_filter per interface,
> driven by the RADIUS config, at runtime.  rp_filter has to be disabled
> across the board on my access routers, as there are several places where
> asymmetric routing is used for performance reasons.

Just out of curiosity, does the loose rp_filter mode work for you?

Eric

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-29 23:38                 ` Benjamin LaHaise
  2009-10-30  1:45                   ` Eric W. Biederman
@ 2010-08-09 17:23                   ` Ben Greear
  2010-08-09 17:34                     ` Benjamin LaHaise
  1 sibling, 1 reply; 41+ messages in thread
From: Ben Greear @ 2010-08-09 17:23 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev,
	Cosmin Ratiu

On 10/29/2009 04:38 PM, Benjamin LaHaise wrote:
> On Thu, Oct 29, 2009 at 04:07:18PM -0700, Eric W. Biederman wrote:
>> Could you keep me in the loop with that.  I have some pending cleanups for
>> all of those pieces of code and may be able to help/advice/review.
>
> Here are the sysfs scaling improvements.  I have to break them up, as there
> are 3 separate changes in this patch: 1. use an rbtree for name lookup in
> sysfs, 2. keep track of the number of directories for the purpose of
> generating the link count, as otherwise too much cpu time is spent in
> sysfs_count_nlink when new entries are added, and 3. when adding a new
> sysfs_dirent, walk the list backwards when linking it in, as higher
> numbered inodes tend to be at the end of the list, not the beginning.

I was just comparing my out-of-tree patch set to .35, and it appears
that few, if any, of the patches discussed in this thread are in the
upstream kernel yet.

Specifically, there is still that msleep(250) in
netdev_wait_allrefs().

Is anyone still trying to get the improvements needed for adding/deleting
lots of interfaces into the kernel?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 17:23                   ` Ben Greear
@ 2010-08-09 17:34                     ` Benjamin LaHaise
  2010-08-09 17:44                       ` Ben Greear
  2010-08-09 19:59                       ` Eric W. Biederman
  0 siblings, 2 replies; 41+ messages in thread
From: Benjamin LaHaise @ 2010-08-09 17:34 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev,
	Cosmin Ratiu

Hello Ben,

On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote:
> I was just comparing my out-of-tree patch set to .35, and it appears
> that few, if any, of the patches discussed in this thread are in the
> upstream kernel yet.

I was waiting on Eric's sysfs changes for namespaces to settle down, but 
ended up getting busy on other things.  I guess now is a good time to pick 
this back up and try to merge my changes for improving interface scaling.  
I'll send out a new version of the patches sometime in the next couple of 
days.  I'm also about to make a new Babylon release as well, I just need 
to write some more documentation. :-/

Btw, one thing I noticed but haven't been able to come up with a fix for
yet is that iptables has scaling issues with lots of interfaces.
Specifically, we had to start adding one iptables rule per interface for SMTP
filtering (not all subscribers are permitted to send SMTP directly out to
the net, so it has to be per-interface).  It seems that those all get
dumped into a giant list.  What I'd like to do is to be able to attach rules
directly to the interface, but I haven't really had the time to do a mergeable
set of changes for that.  Thoughts anyone?

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 17:34                     ` Benjamin LaHaise
@ 2010-08-09 17:44                       ` Ben Greear
  2010-08-09 17:48                         ` Benjamin LaHaise
  2010-08-09 19:59                       ` Eric W. Biederman
  1 sibling, 1 reply; 41+ messages in thread
From: Ben Greear @ 2010-08-09 17:44 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev,
	Cosmin Ratiu

On 08/09/2010 10:34 AM, Benjamin LaHaise wrote:
> Hello Ben,
>
> On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote:
>> I was just comparing my out-of-tree patch set to .35, and it appears
>> that few, if any, of the patches discussed in this thread are in the
>> upstream kernel yet.
>
> I was waiting on Eric's sysfs changes for namespaces to settle down, but
> ended up getting busy on other things.  I guess now is a good time to pick
> this back up and try to merge my changes for improving interface scaling.
> I'll send out a new version of the patches sometime in the next couple of
> days.  I'm also about to make a new Babylon release as well, I just need
> to write some more documentation. :-/
>
> Btw, one thing I noticed but haven't been able to come up with a fix for
> yet is that iptables has scaling issues with lots of interfaces.
> Specifically, we had to start adding one iptables rule per interface for SMTP
> filtering (not all subscribers are permitted to send SMTP directly out to
> the net, so it has to be per-interface).  It seems that those all get
> dumped into a giant list.  What I'd like to do is to be able to attach rules
> directly to the interface, but I haven't really had the time to do a mergeable
> set of changes for that.  Thoughts anyone?

We also have a few rules per interface, and we notice that it takes around 10ms
per rule when we are removing them, even when using batching in 'ip':

This is on a high-end Core i7, otherwise lightly loaded.

Total IPv4 rule listings: 2097
Cleaning 2094 rules with ip -batch...
time -p ip -4 -force -batch /tmp/crr_batch_cmds_4.txt
real 17.81
user 0.05
sys 0.00
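
Working from the numbers above (simple arithmetic, no new measurements), the batch costs roughly 8.5 ms per rule, in line with the "around 10ms" estimate:

$$\frac{17.81\ \text{s}}{2094\ \text{rules}} \approx 8.5\ \text{ms per rule}$$

Since user and sys time are both close to zero, nearly all of that wall-clock time appears to be spent blocked rather than computing, which would be consistent with a per-rule wait such as the synchronize_rcu() Patrick suspects.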


Patrick had an idea, but I don't think he had time to
look at it further:

"It's probably the synchronize_rcu() in fib_nl_delrule() and
the route flushing happening after rule removal."


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 17:44                       ` Ben Greear
@ 2010-08-09 17:48                         ` Benjamin LaHaise
  2010-08-09 18:03                           ` Ben Greear
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin LaHaise @ 2010-08-09 17:48 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev,
	Cosmin Ratiu

On Mon, Aug 09, 2010 at 10:44:14AM -0700, Ben Greear wrote:
> We also have a few rules per interface, and we notice that it takes around 10ms
> per rule when we are removing them, even when using batching in 'ip':
...
> Patrick had an idea, but I don't think he had time to
> look at it further:
>
> "It's probably the synchronize_rcu() in fib_nl_delrule() and
> the route flushing happening after rule removal."

Yes, that would be a problem, but the issue is deeper than that -- if I'm 
not mistaken it's on the packet processing path that iptables doesn't scale 
for 100k interfaces with 1 rule per interface.  It's been a while since I 
ran the tests, but I don't think it's changed much.

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 17:48                         ` Benjamin LaHaise
@ 2010-08-09 18:03                           ` Ben Greear
  0 siblings, 0 replies; 41+ messages in thread
From: Ben Greear @ 2010-08-09 18:03 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Eric W. Biederman, Eric Dumazet, Octavian Purdila, netdev,
	Cosmin Ratiu

On 08/09/2010 10:48 AM, Benjamin LaHaise wrote:
> On Mon, Aug 09, 2010 at 10:44:14AM -0700, Ben Greear wrote:
>> We also have a few rules per interface, and we notice that it takes around 10ms
>> per rule when we are removing them, even when using batching in 'ip':
> ...
>> Patrick had an idea, but I don't think he had time to
>> look at it further:
>>
>> "It's probably the synchronize_rcu() in fib_nl_delrule() and
>> the route flushing happening after rule removal."
>
> Yes, that would be a problem, but the issue is deeper than that -- if I'm
> not mistaken it's on the packet processing path that iptables doesn't scale
> for 100k interfaces with 1 rule per interface.  It's been a while since I
> ran the tests, but I don't think it's changed much.

It would be nice to tie the rules based on 'iif' to a specific
interface.  It seems that should give near-constant-time rule lookup
if we only have a few rules per interface...
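
The idea of binding 'iif' rules to their interface can be sketched as bucketing rules by input ifindex, so a lookup walks only the handful of rules attached to that interface. This is a hypothetical data structure for illustration only. It is not how iptables or the fib rules code is organized, and struct rule, struct rule_table and table_match() are invented names. It also assumes ifindex values are small and dense enough to index an array; a hash table keyed by ifindex would be the natural choice otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical rule: match packets arriving on one interface by dest port. */
struct rule {
	int iif;               /* input ifindex the rule is bound to */
	unsigned short dport;  /* destination port to match, e.g. 25 for SMTP */
	int verdict;           /* e.g. 0 = accept, 1 = drop */
	struct rule *next;     /* next rule bound to the same interface */
};

/* Rules bucketed by input ifindex instead of one global list. */
struct rule_table {
	size_t nr_ifaces;
	struct rule **by_iif;  /* by_iif[ifindex] -> short, ordered list */
};

static int table_init(struct rule_table *t, size_t nr_ifaces)
{
	t->nr_ifaces = nr_ifaces;
	t->by_iif = calloc(nr_ifaces, sizeof(*t->by_iif));
	return t->by_iif ? 0 : -1;
}

/* Append a rule to its interface's list, preserving insertion order. */
static void table_add(struct rule_table *t, struct rule *r)
{
	struct rule **pp;

	assert(r->iif >= 0 && (size_t)r->iif < t->nr_ifaces);
	for (pp = &t->by_iif[r->iif]; *pp; pp = &(*pp)->next)
		;
	r->next = NULL;
	*pp = r;
}

/* Cost is O(rules bound to iif), independent of the total number of rules.
 * Returns the first matching verdict, or default_verdict if nothing matches. */
static int table_match(const struct rule_table *t, int iif,
		       unsigned short dport, int default_verdict)
{
	const struct rule *r;

	if (iif < 0 || (size_t)iif >= t->nr_ifaces)
		return default_verdict;
	for (r = t->by_iif[iif]; r; r = r->next)
		if (r->dport == dport)
			return r->verdict;
	return default_verdict;
}
```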

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 17:34                     ` Benjamin LaHaise
  2010-08-09 17:44                       ` Ben Greear
@ 2010-08-09 19:59                       ` Eric W. Biederman
  2010-08-09 21:03                         ` Benjamin LaHaise
  1 sibling, 1 reply; 41+ messages in thread
From: Eric W. Biederman @ 2010-08-09 19:59 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ben Greear, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> Hello Ben,
>
> On Mon, Aug 09, 2010 at 10:23:37AM -0700, Ben Greear wrote:
>> I was just comparing my out-of-tree patch set to .35, and it appears
>> that few, if any, of the patches discussed in this thread are in the
>> upstream kernel yet.

The network device deletion batching code has gone in, which is
a big help, as have some dev_put deletions, so we hit that 250ms
delay less often.  

> I was waiting on Eric's sysfs changes for namespaces to settle down, but 
> ended up getting busy on other things.  I guess now is a good time to pick 
> this back up and try to merge my changes for improving interface scaling.  
> I'll send out a new version of the patches sometime in the next couple of 
> days.  I'm also about to make a new Babylon release as well, I just need 
> to write some more documentation. :-/

Feature-wise, sysfs has now settled down and the regressions have all
been stamped out, so now should be a good time to work on scaling.

I still have some preliminary patches in my tree that I will dig up
as time goes by.

Eric

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 19:59                       ` Eric W. Biederman
@ 2010-08-09 21:03                         ` Benjamin LaHaise
  2010-08-09 21:17                           ` Eric W. Biederman
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin LaHaise @ 2010-08-09 21:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Greear, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

On Mon, Aug 09, 2010 at 12:59:14PM -0700, Eric W. Biederman wrote:
> The network device deletion batching code has gone in, which is
> a big help, as have some dev_put deletions, so we hit that 250ms
> delay less often.  

I'll see how much that helps.  Odds are I'm going to have to move the 
device deletion into a separate thread.  That should give me a natural 
boundary to queue up deletions at, which should fix the tunnel-flap and 
partial tunnel-flap cases I'm worried about.  At some point I have to 
figure out how to get my API needs met by the in-kernel L2TP code, but 
that's a worry for another day.
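
The batching a separate deletion thread would provide comes down to paying the expensive wait once per batch instead of once per device. The sketch below models that with a counter in place of the real wait (an RCU grace period or refcount drain). It is single-threaded so the result is deterministic, and struct del_queue, queue_delete() and drain_batch() are hypothetical names. In a real setup a worker thread would call drain_batch() on whatever had accumulated since its last run; the kernel's own batching entry point, unregister_netdevice_many(), is conceptually similar.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pending-deletion queue, filled by e.g. a tunnel-flap handler. */
#define MAX_PENDING 4096

struct del_queue {
	int ifindex[MAX_PENDING];
	size_t len;
	unsigned long sync_waits;  /* how many expensive waits were paid */
};

static int queue_delete(struct del_queue *q, int ifindex)
{
	assert(ifindex > 0);
	if (q->len == MAX_PENDING)
		return -1;         /* full: the caller should drain first */
	q->ifindex[q->len++] = ifindex;
	return 0;
}

/* Stand-in for one grace period / refcount drain covering a whole batch. */
static void wait_for_quiescence(struct del_queue *q)
{
	q->sync_waits++;
}

/* Tear down everything queued so far, paying the wait once per batch
 * rather than once per device.  Returns the number of devices removed. */
static size_t drain_batch(struct del_queue *q)
{
	size_t n = q->len;

	if (n == 0)
		return 0;
	/* Phase 1: unlink every queued device (cheap, per device); elided. */
	wait_for_quiescence(q);    /* Phase 2: a single wait for the batch. */
	/* Phase 3: free every queued device (cheap, per device); elided. */
	q->len = 0;
	return n;
}
```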

> Feature-wise, sysfs has now settled down and the regressions have all
> been stamped out, so now should be a good time to work on scaling.
> 
> I still have some preliminary patches in my tree that I will dig up
> as time goes by.

I should have some time this evening to run a few tests, and hopefully can 
post some results.

		-ben

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2010-08-09 21:03                         ` Benjamin LaHaise
@ 2010-08-09 21:17                           ` Eric W. Biederman
  0 siblings, 0 replies; 41+ messages in thread
From: Eric W. Biederman @ 2010-08-09 21:17 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ben Greear, Eric Dumazet, Octavian Purdila, netdev, Cosmin Ratiu

Benjamin LaHaise <bcrl@lhnet.ca> writes:

> On Mon, Aug 09, 2010 at 12:59:14PM -0700, Eric W. Biederman wrote:
>> The network device deletion batching code has gone in, which is
>> a big help, as have some dev_put deletions, so we hit that 250ms
>> delay less often.  
>
> I'll see how much that helps.  Odds are I'm going to have to move the 
> device deletion into a separate thread.  That should give me a natural 
> boundary to queue up deletions at, which should fix the tunnel-flap and 
> partial tunnel-flap cases I'm worried about.  At some point I have to 
> figure out how to get my API needs met by the in-kernel L2TP code, but 
> that's a worry for another day.

In case it is useful: when you delete a network namespace, all of its
network device deletions can generally be batched.

Eric

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 15:40           ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
  2009-10-21 16:09             ` Eric Dumazet
  2009-10-21 16:51             ` Benjamin LaHaise
@ 2009-10-21 16:55             ` Octavian Purdila
  2009-10-23 21:13             ` Paul E. McKenney
  3 siblings, 0 replies; 41+ messages in thread
From: Octavian Purdila @ 2009-10-21 16:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Benjamin LaHaise, netdev, Cosmin Ratiu

On Wednesday 21 October 2009 18:40:07 you wrote:

> >
> > I would also like to see this patch in, we are running into scalability
> > issues with creating/deleting lots of interfaces as well.
> 
> Ben's patch only addresses interface deletion, and only one part of the
> problem, maybe the most visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurrently.
> 
> Before applying/examining his patch, I suggest identifying all dev_put()
> spots that can be deleted and replaced by something more scalable.  I began
> this job, but others can help me.
> 

Yes, I agree with you: there are multiple places that need to be touched to
allow for better scaling with regard to the number of interfaces.  We do have
patches that address some of these issues, but unfortunately they are based
on 2.6.7 and some of them are quite ugly hacks :)

However, we are in the process of switching to 2.6.31 so I hope we will be 
able to contribute on this effort.

> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 

Hmm, how would multiple tasks help here? Isn't the RTNL mutex global?

> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> The following patch doesn't slow down dev_put() users, and the real
> scalability problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 

Thanks, I am going to test it on our platform and send back the results.

tavi

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-21 15:40           ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
                               ` (2 preceding siblings ...)
  2009-10-21 16:55             ` Octavian Purdila
@ 2009-10-23 21:13             ` Paul E. McKenney
  2009-10-24  4:35               ` Eric Dumazet
  3 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-23 21:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> Octavian Purdila a écrit :
> > On Sunday 18 October 2009 21:21:44 you wrote:
> >>> The msleep(250) should be tuned first. Then if it is really necessary
> >>> to dismantle 100,000 netdevices per second, we might have to think a bit
> >>> more.
> >>> Just try msleep(1 or 2); it should work quite well.
> >> My goal is tearing down 100,000 interfaces in a few seconds, which really
> >>  is  necessary.  Right now we're running about 40,000 interfaces on a not
> >>  yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
> >>  than 100,000 subscriber interfaces, and it looks like a modern dual socket
> >>  system can handle that.
> >>
> > 
> > I would also like to see this patch in, we are running into scalability issues 
> > with creating/deleting lots of interfaces as well.
> 
> Ben's patch only addresses interface deletion, and only one part of the
> problem, maybe the most visible one for the current kernel.
> 
> Adding lots of interfaces only needs several threads to run concurrently.
> 
> Before applying/examining his patch, I suggest identifying all dev_put() spots that
> can be deleted and replaced by something more scalable.  I began this job,
> but others can help me.
> 
> RTNL and rcu grace periods are going to hurt anyway, so you probably need
> to use many tasks to be able to delete lots of interfaces in parallel.
> 
> netdev_run_todo() should also use a better algorithm to allow parallelism.
> 
> The following patch doesn't slow down dev_put() users, and the real
> scalability problems will surface and might be addressed.
> 
> [PATCH] net: allow netdev_wait_allrefs() to run faster
> 
> netdev_wait_allrefs() waits for all references to a device to vanish.
>
> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> Some users report that no more than 4 devices can be dismantled per second;
> this is a pretty serious problem for extreme setups.
>
> Most likely, references only wait for an RCU grace period that should come
> fast, so use schedule_timeout_uninterruptible(1) to allow faster recovery.

Is this a place where synchronize_rcu_expedited() is appropriate?
(It went in to 2.6.32-rc1.)

							Thanx, Paul

> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/core/dev.c |    2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 28b0b9e..fca2e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4983,7 +4983,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  			rebroadcast_time = jiffies;
>  		}
> 
> -		msleep(250);
> +		schedule_timeout_uninterruptible(1);
> 
>  		if (time_after(jiffies, warning_time + 10 * HZ)) {
>  			printk(KERN_EMERG "unregister_netdevice: "

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-23 21:13             ` Paul E. McKenney
@ 2009-10-24  4:35               ` Eric Dumazet
  2009-10-24  5:49                 ` Paul E. McKenney
  2009-10-24 20:22                 ` Stephen Hemminger
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-24  4:35 UTC (permalink / raw)
  To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

Paul E. McKenney a écrit :
> On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
>> [PATCH] net: allow netdev_wait_allrefs() to run faster
>>
>> netdev_wait_allrefs() waits for all references to a device to vanish.
>>
>> It currently uses a _very_ pessimistic 250 ms delay between each probe.
>> Some users report that no more than 4 devices can be dismantled per second;
>> this is a pretty serious problem for extreme setups.
>>
>> Most likely, references only wait for an RCU grace period that should come
>> fast, so use schedule_timeout_uninterruptible(1) to allow faster recovery.
> 
> Is this a place where synchronize_rcu_expedited() is appropriate?
> (It went in to 2.6.32-rc1.)
> 

Thanks for the tip, Paul.

I believe netdev_wait_allrefs() is not a perfect candidate, because 
synchronize_sched_expedited() seems really expensive.

Maybe we could call it only once, after we have already had to wait
one jiffy?

diff --git a/net/core/dev.c b/net/core/dev.c
index fa88dcd..9b04b9a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
 static void netdev_wait_allrefs(struct net_device *dev)
 {
 	unsigned long rebroadcast_time, warning_time;
+	unsigned int count = 0;
 
 	rebroadcast_time = warning_time = jiffies;
 	while (atomic_read(&dev->refcnt) != 0) {
@@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		if (count++ == 1)
+			synchronize_rcu_expedited();
+		else
+			schedule_timeout_uninterruptible(1);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "
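
A small userspace model of the count++ == 1 test in the patch above, for readers checking when the expedited call fires. Because of the post-increment, the comparison sees 0 on the first pass and 1 on the second, so synchronize_rcu_expedited() would be called exactly once, on the second pass, after one short sleep has not been enough; every other pass sleeps. The enum and function names are hypothetical and none of this is kernel code.

```c
/* Hypothetical labels for the two waits used in the patch above. */
enum wait_action {
	WAIT_ONE_JIFFY,		/* stands for schedule_timeout_uninterruptible(1) */
	WAIT_EXPEDITED_GP,	/* stands for synchronize_rcu_expedited() */
};

/*
 * Userspace model of the loop body: *count starts at 0 and is
 * incremented on every pass, exactly like "count++ == 1" in the patch.
 */
static enum wait_action next_wait_action(unsigned int *count)
{
	if ((*count)++ == 1)
		return WAIT_EXPEDITED_GP;
	return WAIT_ONE_JIFFY;
}
```

Starting from count = 0, four calls yield: one jiffy, expedited, one jiffy, one jiffy.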


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24  4:35               ` Eric Dumazet
@ 2009-10-24  5:49                 ` Paul E. McKenney
  2009-10-24  8:49                   ` Eric Dumazet
  2009-10-24 20:22                 ` Stephen Hemminger
  1 sibling, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-24  5:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> >> [PATCH] net: allow netdev_wait_allrefs() to run faster
> >>
> >> netdev_wait_allrefs() waits until all references to a device vanish.
> >>
> >> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> >> Some users report that no more than 4 devices can be dismantled per second;
> >> this is a pretty serious problem for extreme setups.
> >>
> >> Most likely, references only wait for an RCU grace period, which should come
> >> fast, so use a schedule_timeout_uninterruptible(1) to allow faster recovery.
> > 
> > Is this a place where synchronize_rcu_expedited() is appropriate?
> > (It went in to 2.6.32-rc1.)
> 
> Thanks for the tip Paul
> 
> I believe netdev_wait_allrefs() is not a perfect candidate, because 
> synchronize_sched_expedited() seems really expensive.

It does indeed keep the CPUs quite busy for a bit.  ;-)

> Maybe we could call it only once, after we have already waited one
> jiffy?

This could be a very useful approach!

However, please keep in mind that although synchronize_rcu_expedited()
forces a grace period, it does nothing to speed the invocation of other
RCU callbacks.  In short, synchronize_rcu_expedited() is a faster version
of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
or call_rcu() invocations.

The reason I point this out is that it looks to me that the code below is
waiting for some other task which is in turn waiting on a grace period.
But I don't know this code, so could easily be confused.

						Thanx, paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24  5:49                 ` Paul E. McKenney
@ 2009-10-24  8:49                   ` Eric Dumazet
  2009-10-24 13:52                     ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-10-24  8:49 UTC (permalink / raw)
  To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

Paul E. McKenney wrote:
> On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
> 
>> Maybe we could call it only once, after we have already waited one
>> jiffy?
> 
> This could be a very useful approach!
> 
> However, please keep in mind that although synchronize_rcu_expedited()
> forces a grace period, it does nothing to speed the invocation of other
> RCU callbacks.  In short, synchronize_rcu_expedited() is a faster version
> of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
> or call_rcu() invocations.
> 
> The reason I point this out is that it looks to me that the code below is
> waiting for some other task which is in turn waiting on a grace period.
> But I don't know this code, so could easily be confused.
> 

Normally we would need a synchronize_rcu() call, but I feel it's a bit more
than is really needed here.

On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms (measured
through synchronize_net(), which wraps it):


messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.580259] synchronize_net() 4045596 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.588262] synchronize_net() 7769327 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.625014] synchronize_net() 4772052 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.633008] synchronize_net() 7773896 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.669260] synchronize_net() 3958141 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.677259] synchronize_net() 7755817 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.712011] synchronize_net() 2502544 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.720011] synchronize_net() 7767748 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.754259] synchronize_net() 2087946 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.762258] synchronize_net() 7738054 ns
messages:Oct 21 19:13:14 svivoipvnx001-00 kernel: [ 2515.796011] synchronize_net() 3392760 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.808025] synchronize_net() 11814619 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.848010] synchronize_net() 8970220 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.856015] synchronize_net() 7800782 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.893008] synchronize_net() 6650174 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.897012] synchronize_net() 3744808 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.940202] synchronize_net() 8354366 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.952137] synchronize_net() 11693215 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.985010] synchronize_net() 2355970 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2515.989009] synchronize_net() 3771419 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.028137] synchronize_net() 7661195 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.036152] synchronize_net() 7800056 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.083135] synchronize_net() 6774026 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.089145] synchronize_net() 5727189 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.130385] synchronize_net() 10133932 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.134399] synchronize_net() 3773058 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.170136] synchronize_net() 4479194 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.178138] synchronize_net() 7710466 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.217198] synchronize_net() 4323437 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.226206] synchronize_net() 8723108 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.268013] synchronize_net() 6221155 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.280007] synchronize_net() 11719297 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.324008] synchronize_net() 11654511 ns
messages:Oct 21 19:13:15 svivoipvnx001-00 kernel: [ 2516.332009] synchronize_net() 7744182 ns
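
The spread in this log can be checked mechanically. The sketch below copies the 34 logged synchronize_net() durations into an array and computes their minimum, maximum and mean; the array, struct and function names are made up for illustration.

```c
#include <stddef.h>

/* Durations in ns, copied from the synchronize_net() log above. */
static const unsigned long sync_net_ns[] = {
	4045596, 7769327, 4772052, 7773896, 3958141, 7755817,
	2502544, 7767748, 2087946, 7738054, 3392760, 11814619,
	8970220, 7800782, 6650174, 3744808, 8354366, 11693215,
	2355970, 3771419, 7661195, 7800056, 6774026, 5727189,
	10133932, 3773058, 4479194, 7710466, 4323437, 8723108,
	6221155, 11719297, 11654511, 7744182,
};

struct ns_summary {
	unsigned long min, max, mean;
};

/* Summarize n > 0 samples; the sum uses 64 bits to avoid overflow. */
static struct ns_summary summarize_ns(const unsigned long *v, size_t n)
{
	struct ns_summary s = { v[0], v[0], 0 };
	unsigned long long sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (v[i] < s.min)
			s.min = v[i];
		if (v[i] > s.max)
			s.max = v[i];
		sum += v[i];
	}
	s.mean = (unsigned long)(sum / n);
	return s;
}
```

On these samples the minimum is 2087946 ns and the maximum 11814619 ns, which matches the 2 to 12 ms range stated above.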


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24  8:49                   ` Eric Dumazet
@ 2009-10-24 13:52                     ` Paul E. McKenney
  2009-10-24 14:24                       ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-24 13:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> > On Sat, Oct 24, 2009 at 06:35:53AM +0200, Eric Dumazet wrote:
> > 
> >> Maybe we could call it only once, after we have already waited one
> >> jiffy?
> > 
> > This could be a very useful approach!
> > 
> > However, please keep in mind that although synchronize_rcu_expedited()
> > forces a grace period, it does nothing to speed the invocation of other
> > RCU callbacks.  In short, synchronize_rcu_expedited() is a faster version
> > of synchronize_rcu(), but doesn't necessarily help other synchronize_rcu()
> > or call_rcu() invocations.
> > 
> > The reason I point this out is that it looks to me that the code below is
> > waiting for some other task which is in turn waiting on a grace period.
> > But I don't know this code, so could easily be confused.
> > 
> 
> Normally we would need a synchronize_rcu() call, but I feel it's a bit more
> than is really needed here.
> 
> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms

That sounds like the right range, depending on what else is happening
on the machine at the time.

The synchronize_rcu_expedited() primitive would run in the 10s-100s
of microseconds.  It involves a pair of wakeups and a pair of context
switches on each CPU.
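
To put these durations next to the original 250 ms sleep: if dismantling each interface costs at least one wait of a given length, that length alone caps the teardown rate. The sketch below does only that arithmetic, using figures from this thread (250 ms from the changelog, one jiffy at HZ=1000, 2 to 12 ms for synchronize_rcu(), and roughly 100 microseconds as the upper end for the expedited variant). The helper names are hypothetical, schedule_timeout_uninterruptible(1) may return before a full jiffy has elapsed, and real teardown has other costs that lower these bounds further.

```c
/* Length of one jiffy in microseconds for a given HZ (hz > 0). */
static unsigned long jiffy_us(unsigned long hz)
{
	return 1000000UL / hz;
}

/*
 * Upper bound on interfaces dismantled per second when each one costs
 * at least one wait of wait_us microseconds (wait_us > 0).
 */
static unsigned long max_teardowns_per_sec(unsigned long wait_us)
{
	return 1000000UL / wait_us;
}
```

For example, a 250000 microsecond wait gives 4 per second, matching the reported ceiling, while even a 12 ms grace period still allows about 83.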

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 13:52                     ` Paul E. McKenney
@ 2009-10-24 14:24                       ` Eric Dumazet
  2009-10-24 14:46                         ` Paul E. McKenney
  2009-10-24 23:49                         ` Octavian Purdila
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-24 14:24 UTC (permalink / raw)
  To: paulmck; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

Paul E. McKenney wrote:
> On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
>>
>> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> 
> That sounds like the right range, depending on what else is happening
> on the machine at the time.
> 
> The synchronize_rcu_expedited() primitive would run in the 10s-100s
> of microseconds.  It involves a pair of wakeups and a pair of context
> switches on each CPU.
> 

Hmm... I'll run some experiments on Monday and post the results, but it seems
very promising.

Do you think the "on_each_cpu(flush_backlog, dev, 1);" call we perform right
before calling netdev_wait_allrefs() could somehow be changed to speed up RCU
callbacks? Maybe we could avoid sending IPIs to the CPUs twice?

Thanks


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 14:24                       ` Eric Dumazet
@ 2009-10-24 14:46                         ` Paul E. McKenney
  2009-10-24 23:49                         ` Octavian Purdila
  1 sibling, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-24 14:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, Oct 24, 2009 at 04:24:27PM +0200, Eric Dumazet wrote:
> Paul E. McKenney wrote:
> > On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> >>
> >> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> > 
> > That sounds like the right range, depending on what else is happening
> > on the machine at the time.
> > 
> > The synchronize_rcu_expedited() primitive would run in the 10s-100s
> > of microseconds.  It involves a pair of wakeups and a pair of context
> > switches on each CPU.
> 
> Hmm... I'll run some experiments on Monday and post the results, but it seems
> very promising.

I should hasten to add that synchronize_rcu_expedited() goes fast for
TREE_RCU but not yet for TREE_PREEMPT_RCU (where it maps safely but
slowly to synchronize_rcu()).

> Do you think the "on_each_cpu(flush_backlog, dev, 1);" call we perform right
> before calling netdev_wait_allrefs() could somehow be changed to speed up RCU
> callbacks? Maybe we could avoid sending IPIs to the CPUs twice?

This is an interesting possibility, and might fit in with some of the
changes that I am thinking about to reduce OS jitter for the heavy-duty
numerical-computing guys.

In the meantime, you could try doing the following from flush_backlog():

	local_irq_save(flags);
	rcu_check_callbacks(smp_processor_id(), 0);
	local_irq_restore(flags);

This would emulate a much-faster HZ value, but only for RCU.  This works
better in TREE_RCU than it does in TREE_PREEMPT_RCU at the moment (on my
todo list!).  In older kernels, this should also work for CLASSIC_RCU.
Of course, in TINY_RCU, synchronize_rcu() is a no-op anyway.  ;-)

And just to be clear, synchronize_rcu_expedited() currently just does
wakeups, not explicit IPIs.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 14:24                       ` Eric Dumazet
  2009-10-24 14:46                         ` Paul E. McKenney
@ 2009-10-24 23:49                         ` Octavian Purdila
  2009-10-25  4:47                           ` Paul E. McKenney
  2009-10-25  8:35                           ` Eric Dumazet
  1 sibling, 2 replies; 41+ messages in thread
From: Octavian Purdila @ 2009-10-24 23:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

On Saturday 24 October 2009 17:24:27 you wrote:
> Paul E. McKenney wrote:
> > On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> >> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> >
> > That sounds like the right range, depending on what else is happening
> > on the machine at the time.
> >
> > The synchronize_rcu_expedited() primitive would run in the 10s-100s
> > of microseconds.  It involves a pair of wakeups and a pair of context
> > switches on each CPU.
> 
> Hmm... I'll run some experiments on Monday and post the results, but it seems
> very promising.
> 

Got some time today and did some experiments myself. The test deletes 1000
dummy interfaces (interface status down, no IP/IPv6 addresses assigned) on a
UP non-preempt ppc750 @ 800 MHz system.

1. Ben's patch:

real    0m 3.42s
user    0m 0.00s
sys     0m 0.00s

2. Eric's schedule_timeout_uninterruptible(1);

real    0m 3.00s
user    0m 0.00s
sys     0m 0.00s

3. Simple synchronize_rcu_expedited()

This doesn't seem to work well in the UP non-preempt case, since
synchronize_rcu_expedited() is a no-op there, turning
netdev_wait_allrefs() into a while(1) loop.
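
A toy simulation of why a no-op wait is a problem in this loop (hypothetical code, not netdev_wait_allrefs() itself): simulated time only advances when the waiter actually waits, so with a zero-length wait the release point is never reached and the loop ends only at an artificial probe cap. On a real UP non-preempt kernel it is presumably worse than this model, since a task that never sleeps also keeps the grace period that would drop the last reference from completing.

```c
/*
 * Toy model: the last reference goes away once simulated time reaches
 * release_us. Each probe that still sees a reference waits wait_us of
 * simulated time. Returns the number of probes needed, or -1 if
 * max_probes is exhausted (i.e. the real loop would keep spinning).
 */
static long probes_until_released(unsigned long release_us,
				  unsigned long wait_us, long max_probes)
{
	unsigned long now_us = 0;
	long probe;

	for (probe = 1; probe <= max_probes; probe++) {
		if (now_us >= release_us)
			return probe;		/* refcount observed as zero */
		now_us += wait_us;		/* a zero-length wait freezes time */
	}
	return -1;
}
```

With the release at 5000 microseconds, one-millisecond waits need 6 probes and a 250 ms wait needs 2, but a zero-length wait exhausts any cap.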

tavi





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 23:49                         ` Octavian Purdila
@ 2009-10-25  4:47                           ` Paul E. McKenney
  2009-10-25  8:35                           ` Eric Dumazet
  1 sibling, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2009-10-25  4:47 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Eric Dumazet, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sun, Oct 25, 2009 at 02:49:00AM +0300, Octavian Purdila wrote:
> On Saturday 24 October 2009 17:24:27 you wrote:
> > Paul E. McKenney wrote:
> > > On Sat, Oct 24, 2009 at 10:49:55AM +0200, Eric Dumazet wrote:
> > >> On my dev machine, a synchronize_rcu() lasts between 2 and 12 ms
> > >
> > > That sounds like the right range, depending on what else is happening
> > > on the machine at the time.
> > >
> > > The synchronize_rcu_expedited() primitive would run in the 10s-100s
> > > of microseconds.  It involves a pair of wakeups and a pair of context
> > > switches on each CPU.
> > 
> > Hmm... I'll run some experiments on Monday and post the results, but it seems
> > very promising.
> > 
> 
> Got some time today and did some experiments myself. The test deletes 1000
> dummy interfaces (interface status down, no IP/IPv6 addresses assigned) on a
> UP non-preempt ppc750 @ 800 MHz system.
> 
> 1. Ben's patch:
> 
> real    0m 3.42s
> user    0m 0.00s
> sys     0m 0.00s
> 
> 2. Eric's schedule_timeout_uninterruptible(1);
> 
> real    0m 3.00s
> user    0m 0.00s
> sys     0m 0.00s
> 
> 3. Simple synchronize_rcu_expedited()
> 
> This doesn't seem to work well in the UP non-preempt case, since
> synchronize_rcu_expedited() is a no-op there, turning
> netdev_wait_allrefs() into a while(1) loop.

Indeed -- but then again, in the UP case, synchronize_rcu() itself
is pretty much a no-op.  So if your main target is UP, you should
be able to have seriously fast RCU updates.

(I know, I know, you want SMP to run fast as well...)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24 23:49                         ` Octavian Purdila
  2009-10-25  4:47                           ` Paul E. McKenney
@ 2009-10-25  8:35                           ` Eric Dumazet
  2009-10-25 15:19                             ` Octavian Purdila
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2009-10-25  8:35 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

Octavian Purdila wrote:
> 
> Got some time today and did some experiments myself. The test deletes 1000
> dummy interfaces (interface status down, no IP/IPv6 addresses assigned) on a
> UP non-preempt ppc750 @ 800 MHz system.
> 
> 1. Ben's patch:
> 
> real    0m 3.42s
> user    0m 0.00s
> sys     0m 0.00s
> 
> 2. Eric's schedule_timeout_uninterruptible(1);
> 
> real    0m 3.00s
> user    0m 0.00s
> sys     0m 0.00s
> 
> 3. Simple synchronize_rcu_expedited()
> 
> This doesn't seem to work well in the UP non-preempt case, since
> synchronize_rcu_expedited() is a no-op there, turning
> netdev_wait_allrefs() into a while(1) loop.
> 

Thanks for these numbers. I presume the HZ value is 1000 on this platform?

Could you give us your scripts so that we can use the same "benchmark"?

BTW, I found I could not use IPv6 with many devices on x86_32, because of
the huge per_cpu allocations (with IPv6, each device has per-cpu SNMP counters).



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-25  8:35                           ` Eric Dumazet
@ 2009-10-25 15:19                             ` Octavian Purdila
  2009-10-25 19:28                               ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Octavian Purdila @ 2009-10-25 15:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

[-- Attachment #1: Type: Text/Plain, Size: 1468 bytes --]

On Sunday 25 October 2009 10:35:10 you wrote:
> > Got some time today and did some experiments myself. The test deletes
> > 1000 dummy interfaces (interface status down, no IP/IPv6 addresses
> > assigned) on a UP non-preempt ppc750 @ 800 MHz system.
> >
> > 1. Ben's patch:
> >
> > real    0m 3.42s
> > user    0m 0.00s
> > sys     0m 0.00s
> >
> > 2. Eric's schedule_timeout_uninterruptible(1);
> >
> > real    0m 3.00s
> > user    0m 0.00s
> > sys     0m 0.00s
> >
> > 3. Simple synchronize_rcu_expedited()
> >
> > This doesn't seem to work well in the UP non-preempt case, since
> > synchronize_rcu_expedited() is a no-op there, turning
> > netdev_wait_allrefs() into a while(1) loop.
> 
> Thanks for these numbers. I presume the HZ value is 1000 on this platform?
> 

Yes. I've attached the full config to this email as well.

> Could you give us your scripts so that we can use the same "benchmark"?
> 

Sure, I've attached the hack module code I've used. 

For creating interfaces: echo 1000 > /proc/sys/net/ndst/add
For deleting interfaces: echo start_ifindex stop_ifindex > /proc/sys/net/ndst/del
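
For anyone driving the benchmark from C rather than the shell, a minimal sketch of the two echo commands follows. It assumes the ndst module below is loaded and simply writes the same strings to the same sysctl files; the helper names are hypothetical.

```c
#include <stdio.h>

/* Write cmd to a sysctl file, like `echo cmd > path`. 0 on success. */
static int ndst_write(const char *path, const char *cmd)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(cmd, f) >= 0;
	/* stdio buffers the data, so the write(2) that triggers the
	 * sysctl handler normally happens at fclose(); check it too */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Format the "start_ifindex stop_ifindex" argument for .../ndst/del. */
static int ndst_format_del(char *buf, size_t len, int start, int stop)
{
	return snprintf(buf, len, "%d %d\n", start, stop);
}
```

Usage would look like ndst_write("/proc/sys/net/ndst/add", "1000\n"), then formatting an ifindex range with ndst_format_del() and writing it to /proc/sys/net/ndst/del.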

Some more information:

- on our old, optimized kernel I am getting 0.4s for creating 128000
interfaces and 0.57s for deleting them

- the 2.6.31 kernel on which I got the 3s numbers does have some patches to
speed up interface creation and deletion (removal of per-device sysctl and
dev_snmp6 entries)
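
Normalizing the reported totals per interface makes the gap easier to see. The sketch below only divides figures quoted in this message and in the earlier results (0.4 s and 0.57 s for 128000 interfaces on the old kernel, 3.00 s for 1000 deletions with the schedule_timeout_uninterruptible(1) patch on 2.6.31); the helper name is hypothetical.

```c
/* Average cost per interface, in microseconds (n_ifaces > 0). */
static double per_iface_us(double total_s, unsigned long n_ifaces)
{
	return total_s * 1e6 / (double)n_ifaces;
}
```

That works out to about 3.1 microseconds per creation and 4.45 microseconds per deletion on the old kernel, versus about 3000 microseconds per deletion on 2.6.31, roughly a 670x difference. The two runs differ in kernel version and patch set, so this is not a like-for-like comparison.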

I'll start posting the patches we have as RFC.

Thanks,
tavi

[-- Attachment #2: ndst.c --]
[-- Type: text/x-csrc, Size: 4494 bytes --]

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/if_arp.h>
#include <linux/inetdevice.h>
#include <linux/rtnetlink.h>
#include <linux/ip.h>
#include <net/route.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/version.h>
#include <net/ip.h>
#include <net/flow.h>
#include <net/ipv6.h>
#include <linux/netfilter_ipv6.h>
#include <net/ip6_route.h>
#include <net/addrconf.h>

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,24)
#define __INIT_NET(x) x
#else
#include <net/net_namespace.h>
#define __INIT_NET(x) &init_net,x
#endif

#ifndef CONFIG_IXIA_CONSOLE
static inline void 
netdev_set_header_ops(struct net_device *dev, const struct header_ops *hops)
{
	dev->header_ops = hops;
}

static inline void
netdev_set_ops(struct net_device *dev, const struct net_device_ops *ops)
{
	dev->netdev_ops = ops;
}
#endif


static struct net_device_ops ndst_ops = {
};

static int ndst_add(int n)
{
	int err, i;
	struct net_device * dev = NULL;
	char name[IFNAMSIZ];

	for(i = 0; i < n; i++) {
		/* temporary hack until we fix __dev_alloc_name - it is O(n) ! */
		rtnl_lock();
		do {
			static unsigned counter = 1;
			snprintf(name, IFNAMSIZ, "ixtest%d", counter++);
		} while(__dev_get_by_name(__INIT_NET(name)));
		rtnl_unlock();
	     
		dev = alloc_netdev(0, name, ether_setup);
		if (dev == NULL) {
			err = -ENOMEM;
			goto err;
		}

		netdev_set_ops(dev, &ndst_ops);
		netdev_set_header_ops(dev, NULL);

		err = register_netdev(dev);
		if (err)
			goto err;
	}

	return 0;

	/* Error handling: devices registered by earlier iterations are kept. */
 err:
	if (dev)
		free_netdev(dev);	/* dev failed to register, so it is ours to free */
	printk(KERN_ERR "%s: failed to register netdev: %d\n", __func__, err);
	return err;
}


static int ndst_del(int start, int stop)
{
	struct net_device *dev;
	int i;

	for(i = start; i <= stop; i++) {
		rtnl_lock();
		dev = __dev_get_by_index(__INIT_NET(i));
		if (!dev) {
			rtnl_unlock();
			return -EINVAL;
		}
		unregister_netdevice(dev);
		rtnl_unlock();
		free_netdev(dev);
	}

	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
static int proc_do_add(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp)
#else
static int proc_do_add(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp, loff_t *ppos)
#endif
{
	int ret = 0;
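	int data = 0;	/* proc_dointvec() stores an int; 0 if nothing parsed */

	ctl->data = &data;

	if (write) {
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		ret = proc_dointvec(ctl, write, filp, buffer, lenp);
#else
		ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
#endif
		/* only act on the value if it was parsed successfully */
		if (ret == 0)
			ret = ndst_add(data);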
	}
	
	return ret;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
static int proc_do_del(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp)
#else
static int proc_do_del(struct ctl_table *ctl, int write, struct file * filp,
		       void __user *buffer, size_t *lenp, loff_t *ppos)
#endif
{
	int ret = 0;
	int data[2] = { 0, 0 };	/* start and stop ifindex */

	ctl->data = data;

	if (write) {
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		ret = proc_dointvec(ctl, write, filp, buffer, lenp);
#else
		ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos);
#endif
		/* only act on the range if it was parsed successfully */
		if (ret == 0)
			ret = ndst_del(data[0], data[1]);
	} 
	
	return ret;
}

static ctl_table ndst_sysctl_table[] = {
	{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		.ctl_name = 1,
#endif
		.procname = "add",
		.maxlen = sizeof(uint32_t),
		.mode = 0200,
		.proc_handler = &proc_do_add,
	},
	{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		.ctl_name = 2,
#endif
		.procname = "del",
		.maxlen = 2*sizeof(uint32_t),
		.mode = 0200,
		.proc_handler = &proc_do_del,
	},

	{}
};

static ctl_table ndst_sysctl_net_table[] = {
	{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
		.ctl_name = 1024,
#endif
		.procname = "ndst",
		.data = NULL,
		.maxlen = 0,
		.mode = 0555,
		.child = ndst_sysctl_table
	},

	{}
};


static ctl_table ndst_sysctl_root[] = {
	{
		.ctl_name = CTL_NET,
		.procname = "net",
		.data = NULL,
		.maxlen = 0,
		.mode = 0555,
		.child = ndst_sysctl_net_table
	},

	{}
};

static struct ctl_table_header *ndst_sysctl_hdr;


static int __init ndst_init(void)
{
#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,27)
	ndst_sysctl_hdr = register_sysctl_table(ndst_sysctl_root, 0);
#else
	ndst_sysctl_hdr = register_sysctl_table(ndst_sysctl_root);
#endif

	if (ndst_sysctl_hdr == NULL)
		return -EINVAL;

	return 0;
}

static void __exit ndst_cleanup(void)
{
	unregister_sysctl_table(ndst_sysctl_hdr);
}


module_init(ndst_init);
module_exit(ndst_cleanup);

[-- Attachment #3: .config --]
[-- Type: text/plain, Size: 22178 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.31
# Sat Oct 24 20:54:34 2009
#
# CONFIG_PPC64 is not set

#
# Processor support
#
CONFIG_PPC_BOOK3S_32=y
# CONFIG_PPC_85xx is not set
# CONFIG_PPC_8xx is not set
# CONFIG_40x is not set
# CONFIG_44x is not set
# CONFIG_E200 is not set
CONFIG_PPC_BOOK3S=y
CONFIG_6xx=y
CONFIG_PPC_FPU=y
# CONFIG_ALTIVEC is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_32=y
# CONFIG_PPC_MM_SLICES is not set
CONFIG_PPC_HAVE_PMU_SUPPORT=y
# CONFIG_SMP is not set
CONFIG_PPC32=y
CONFIG_WORD_SIZE=32
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
CONFIG_MMU=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not set
CONFIG_IRQ_PER_CPU=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
# CONFIG_ARCH_NO_VIRT_TO_BUS is not set
CONFIG_PPC=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_NVRAM=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_OF=y
CONFIG_OF=y
# CONFIG_PPC_UDBG_16550 is not set
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_DTC=y
# CONFIG_DEFAULT_UIMAGE is not set
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_GROUP_SCHED is not set
# CONFIG_CGROUPS is not set
# CONFIG_RELAY is not set
# CONFIG_NAMESPACES is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_EMBEDDED=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
# CONFIG_HOTPLUG is not set
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
# CONFIG_SIGNALFD is not set
CONFIG_TIMERFD=y
# CONFIG_EVENTFD is not set
# CONFIG_SHMEM is not set
# CONFIG_AIO is not set
CONFIG_HAVE_PERF_COUNTERS=y

#
# Performance Counters
#
# CONFIG_PERF_COUNTERS is not set
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_PCI_QUIRKS is not set
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_COMPAT_BRK=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
CONFIG_OPROFILE=y
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_SLOW_WORK is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_BLOCK is not set
# CONFIG_FREEZER is not set

#
# Platform support
#
# CONFIG_PPC_CHRP is not set
# CONFIG_MPC5121_ADS is not set
# CONFIG_MPC5121_GENERIC is not set
# CONFIG_PPC_MPC52xx is not set
# CONFIG_PPC_PMAC is not set
# CONFIG_PPC_CELL is not set
# CONFIG_PPC_CELL_NATIVE is not set
# CONFIG_PPC_82xx is not set
# CONFIG_PQ2ADS is not set
# CONFIG_PPC_83xx is not set
# CONFIG_PPC_86xx is not set
# CONFIG_EMBEDDED6xx is not set
# CONFIG_AMIGAONE is not set
CONFIG_PPC_IXIA=y
# CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
# CONFIG_IPIC is not set
# CONFIG_MPIC is not set
# CONFIG_MPIC_WEIRD is not set
# CONFIG_PPC_I8259 is not set
# CONFIG_PPC_RTAS is not set
# CONFIG_MMIO_NVRAM is not set
# CONFIG_PPC_MPC106 is not set
# CONFIG_PPC_970_NAP is not set
# CONFIG_PPC_INDIRECT_IO is not set
# CONFIG_GENERIC_IOMAP is not set
# CONFIG_CPU_FREQ is not set
# CONFIG_TAU is not set
# CONFIG_FSL_ULI1575 is not set
# CONFIG_SIMPLE_GPIO is not set

#
# Kernel options
#
# CONFIG_HIGHMEM is not set
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
# CONFIG_BINFMT_MISC is not set
# CONFIG_IOMMU_HELPER is not set
# CONFIG_SWIOTLB is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_HAS_WALK_MEMORY=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_MIGRATION is not set
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_VIRT_TO_BUS=y
CONFIG_HAVE_MLOCK=y
CONFIG_HAVE_MLOCKED_PAGE_BIT=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_PPC_4K_PAGES=y
# CONFIG_PPC_16K_PAGES is not set
# CONFIG_PPC_64K_PAGES is not set
# CONFIG_PPC_256K_PAGES is not set
CONFIG_FORCE_MAX_ZONEORDER=11
# CONFIG_PROC_DEVICETREE is not set
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="console=ttyS0 rootfstype=ramfs powersave=off"
CONFIG_EXTRA_TARGETS=""
# CONFIG_PM is not set
# CONFIG_SECCOMP is not set
CONFIG_ISA_DMA_API=y

#
# Bus options
#
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
# CONFIG_PPC_INDIRECT_PCI is not set
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCI_SYSCALL=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
# CONFIG_PCI_LEGACY is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
# CONFIG_PCI_IOV is not set
# CONFIG_HAS_RAPIDIO is not set

#
# Advanced setup
#
CONFIG_ADVANCED_OPTIONS=y
CONFIG_LOWMEM_SIZE_BOOL=y
CONFIG_LOWMEM_SIZE=0x70000000
CONFIG_PAGE_OFFSET_BOOL=y
CONFIG_PAGE_OFFSET=0x80000000
CONFIG_KERNEL_START_BOOL=y
CONFIG_KERNEL_START=0x80000000
CONFIG_PHYSICAL_START=0x00000000
CONFIG_TASK_SIZE_BOOL=y
CONFIG_TASK_SIZE=0x70000000
CONFIG_NET=y

#
# Networking options
#
# CONFIG_NET_SYSCTL_DEV is not set
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_IXIA_ROUTING=y
# CONFIG_IP_ROUTE_MULTIPATH is not set
# CONFIG_IP_ROUTE_VERBOSE is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
# CONFIG_INET_LRO is not set
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
# CONFIG_NETFILTER_ADVANCED is not set

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
CONFIG_NETFILTER_XTABLES=m
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_FILTER=m
# CONFIG_IP_NF_TARGET_REJECT is not set
# CONFIG_IP_NF_TARGET_LOG is not set
# CONFIG_IP_NF_TARGET_ULOG is not set
# CONFIG_IP_NF_MANGLE is not set

#
# IPv6: Netfilter Configuration
#
CONFIG_IP6_NF_IPTABLES=m
# CONFIG_IP6_NF_MATCH_IPV6HEADER is not set
# CONFIG_IP6_NF_TARGET_LOG is not set
CONFIG_IP6_NF_FILTER=m
# CONFIG_IP6_NF_TARGET_REJECT is not set
# CONFIG_IP6_NF_MANGLE is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=y
CONFIG_LLC2=y
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
CONFIG_NET_SCH_TBF=m
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
CONFIG_NET_SCH_INGRESS=m

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
# CONFIG_CLS_U32_PERF is not set
# CONFIG_CLS_U32_MARK is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_EMATCH is not set
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=y
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_CLS_IND is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
CONFIG_NET_TXTIMESTAMP=y
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
CONFIG_OF_DEVICE=y
# CONFIG_PARPORT is not set
# CONFIG_MISC_DEVICES is not set
CONFIG_HAVE_IDE=y

#
# SCSI device support
#
# CONFIG_SCSI_DMA is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#

#
# You can enable one or both FireWire driver stacks.
#

#
# See the help texts for more information.
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
# CONFIG_IFB is not set
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_VETH is not set
# CONFIG_ARCNET is not set
# CONFIG_NET_ETHERNET is not set
# CONFIG_NETDEV_1000 is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
# CONFIG_INPUT_POLLDEV is not set

#
# Userland interfaces
#
# CONFIG_INPUT_MOUSEDEV is not set
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
# CONFIG_INPUT_KEYBOARD is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set

#
# Hardware I/O ports
#
# CONFIG_SERIO is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
# CONFIG_VT is not set
CONFIG_DEVKMEM=y
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
# CONFIG_SERIAL_8250 is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_UARTLITE is not set
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
# CONFIG_HVC_UDBG is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=m
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_NVRAM is not set
# CONFIG_GEN_RTC is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_TCG_TPM is not set
CONFIG_DEVPORT=y
CONFIG_IXIA_CONSOLE=y
# CONFIG_I2C is not set
# CONFIG_SPI is not set

#
# PPS support
#
# CONFIG_PPS is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_SUPPLY is not set
# CONFIG_HWMON is not set
# CONFIG_THERMAL is not set
# CONFIG_THERMAL_HWMON is not set
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_REGULATOR is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_AGP is not set
# CONFIG_DRM is not set
# CONFIG_VGASTATE is not set
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
# CONFIG_FB is not set
# CONFIG_BACKLIGHT_LCD_SUPPORT is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set
# CONFIG_SOUND is not set
# CONFIG_HID_SUPPORT is not set
# CONFIG_USB_SUPPORT is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
# CONFIG_RTC_CLASS is not set
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set

#
# TI VLYNQ
#
# CONFIG_STAGING is not set

#
# File systems
#
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
# CONFIG_INOTIFY is not set
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
# CONFIG_FUSE_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
# CONFIG_PROC_PAGE_MONITOR is not set
# CONFIG_SYSFS is not set
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLB_PAGE is not set
# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
# CONFIG_NFS_V4 is not set
# CONFIG_NFSD is not set
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_NLS is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_GENERIC_FIND_LAST_BIT=y
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
# CONFIG_CRC_T10DIF is not set
# CONFIG_CRC_ITU_T is not set
# CONFIG_CRC32 is not set
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_HAVE_LMB=y
CONFIG_NLATTR=y
CONFIG_GENERIC_ATOMIC64=y

#
# Kernel hacking
#
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=1024
# CONFIG_MAGIC_SYSRQ is not set
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_SCHED_DEBUG is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_WRITECOUNT is not set
# CONFIG_DEBUG_MEMORY_INIT is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y
# CONFIG_FTRACE is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_KMEMCHECK is not set
# CONFIG_PPC_DISABLE_WERROR is not set
CONFIG_PPC_WERROR=y
CONFIG_PRINT_STACK_DEPTH=64
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_PPC_EMULATED_STATS is not set
# CONFIG_CODE_PATCHING_SELFTEST is not set
# CONFIG_FTR_FIXUP_SELFTEST is not set
# CONFIG_MSI_BITMAP_SELFTEST is not set
# CONFIG_XMON is not set
# CONFIG_IRQSTACKS is not set
# CONFIG_VIRQ_DEBUG is not set
# CONFIG_BDI_SWITCH is not set
# CONFIG_BOOTX_TEXT is not set
# CONFIG_PPC_EARLY_DEBUG is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITYFS is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_HW is not set
# CONFIG_PPC_CLOCK is not set
# CONFIG_VIRTUALIZATION is not set


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-25 15:19                             ` Octavian Purdila
@ 2009-10-25 19:28                               ` Eric Dumazet
  0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2009-10-25 19:28 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: paulmck, Benjamin LaHaise, netdev, Cosmin Ratiu

Octavian Purdila wrote:
> On Sunday 25 October 2009 10:35:10 you wrote:
>>> Got some time today and did some experiments myself. The test is deleting
>>> 1000 dummy interfaces (interface status down, no IP/IPv6 addresses
>>> assigned) on a UP non-preempt 800 MHz ppc750 system.
>>>
>>> 1. Ben's patch:
>>>
>>> real    0m 3.42s
>>> user    0m 0.00s
>>> sys     0m 0.00s
>>>
>>> 2. Eric's schedule_timeout_uninterruptible(1);
>>>
>>> real    0m 3.00s
>>> user    0m 0.00s
>>> sys     0m 0.00s
>>>
>>> 3. Simple synchronize_rcu_expedited()
>>>
>>> This doesn't seem to work well in the UP non-preempt case, since
>>> synchronize_rcu_expedited() is a no-op there, which turns
>>> netdev_wait_allrefs() into a while(1) loop.
>> Thanks for these numbers. I presume the HZ value is 1000 on this platform?
>>
> 
> Yes. I've attached the full config to this email as well.
> 
>> Could you give us your scripts so that we can use the same "benchmark"?
>>
> 
> Sure, I've attached the hack module code I've used. 
> 
> To create interfaces: echo 1000 > /proc/sys/net/ndst/add
> To delete interfaces: echo start_ifindex stop_ifindex > /proc/sys/net/ndst/del
> 
> Some more information:
> 
> - on our old, optimized kernel I get 0.4s for creating 128000
> interfaces and 0.57s for deleting them
> 
> - the 2.6.31 kernel on which I got the 3s numbers does have some patches to
> speed up interface creation and deletion (removal of per-device sysctl and
> dev_snmp6 entries)
> 
> I'll start posting the patches we have as RFC.
> 
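Normalising the figures quoted above to a per-interface cost puts them side by side: about 3.4 ms (Ben's patch) and 3.0 ms (one-jiffy sleep) per deleted interface on the 2.6.31 test kernel, against roughly 4.5 µs per deleted interface (0.57 s for 128000) on the older optimized kernel. The helper below is a minimal C sketch of that conversion; its name is hypothetical, and it assumes the interfaces are processed back to back so that the wall time divides evenly.

```c
#include <assert.h>

/*
 * Hypothetical helper: average wall-clock cost per interface, in
 * microseconds, for a bulk operation that took total_s seconds for
 * n interfaces processed back to back.
 */
static double us_per_iface(double total_s, unsigned long n)
{
	assert(n > 0);
	return 1e6 * total_s / (double)n;
}
```

For example, us_per_iface(3.00, 1000) gives 3000 µs, while us_per_iface(0.57, 128000) gives about 4.45 µs.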

OK, thanks. I thought you were using the dummy module:

$ time insmod drivers/net/dummy.ko numdummies=100

real    0m2.493s
user    0m0.001s
sys     0m0.021s

$ time rmmod dummy

real    0m1.610s
user    0m0.000s
sys     0m0.001s

$ time insmod drivers/net/dummy.ko numdummies=200

real    0m10.118s
user    0m0.000s
sys     0m0.015s

$ time rmmod dummy

real    0m3.218s
user    0m0.000s
sys     0m0.001s

$ time insmod drivers/net/dummy.ko numdummies=300

real    0m22.564s
user    0m0.000s
sys     0m0.034s

$ time rmmod dummy

real    0m4.755s
user    0m0.000s
sys     0m0.006s

$ perf record -f insmod drivers/net/dummy.ko numdummies=300
$ perf report
# Samples: 898
#
# Overhead  Command           Shared Object  Symbol
# ........  .......  ......................  ......
#
    41.65%   insmod  [kernel]                [k] __register_sysctl_paths
    22.83%   insmod  [kernel]                [k] strcmp
     5.46%   insmod  [kernel]                [k] pcpu_alloc
     2.23%   insmod  [kernel]                [k] sysfs_find_dirent
     1.56%   insmod  [kernel]                [k] __sysfs_add_one
     1.11%   insmod  [kernel]                [k] pcpu_alloc_area
     1.11%   insmod  [kernel]                [k] _spin_lock
     1.00%   insmod  [kernel]                [k] kmemdup
     1.00%   insmod  [kernel]                [k] kmem_cache_alloc
     0.67%   insmod  [kernel]                [k] find_symbol_in_section
     0.67%   insmod  [kernel]                [k] find_next_zero_bit
     0.67%   insmod  [kernel]                [k] idr_get_empty_slot
     0.67%   insmod  [kernel]                [k] mutex_lock
     0.67%   insmod  [kernel]                [k] mutex_unlock
     0.56%   insmod  [kernel]                [k] vunmap_page_range
     0.56%   insmod  [kernel]                [k] __slab_alloc
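One way to read the dummy-module timings above is to compare how each run grows against what a linear or a quadratic cost would predict. The C helpers below are a minimal sketch that reuses only the numbers printed in this message; the names are hypothetical. Going from 100 to 200 and 300 interfaces, insmod time grows by about 4.06x and 9.05x (close to n^2), while rmmod grows by about 2.0x and 2.95x and stays between roughly 15.9 and 16.1 ms per interface (close to linear).

```c
#include <assert.h>
#include <stddef.h>

/* Timings copied from the insmod/rmmod runs above (seconds of wall time). */
static const int    dummy_counts[] = { 100, 200, 300 };
static const double insmod_secs[]  = { 2.493, 10.118, 22.564 };
static const double rmmod_secs[]   = { 1.610, 3.218, 4.755 };
#define NRUNS (sizeof(dummy_counts) / sizeof(dummy_counts[0]))

/* Ratio of run i's time to the 100-interface baseline (run 0). */
static double growth(const double *secs, size_t i)
{
	assert(i < NRUNS);
	return secs[i] / secs[0];
}

/* Growth that a cost of the form c * n^k would predict for run i. */
static double predicted(size_t i, int k)
{
	double scale = (double)dummy_counts[i] / dummy_counts[0];
	double r = 1.0;

	assert(i < NRUNS && k >= 0);
	while (k-- > 0)
		r *= scale;
	return r;
}

/* Average wall time per interface for run i, in milliseconds. */
static double ms_per_iface(const double *secs, size_t i)
{
	assert(i < NRUNS);
	return 1000.0 * secs[i] / dummy_counts[i];
}
```

The quadratic insmod growth would fit a registration step whose cost rises with the number of devices already registered, which the __register_sysctl_paths and strcmp samples in the profile hint at; that reading is an inference from the profile, not something measured here.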


* Re: [PATCH] net: allow netdev_wait_allrefs() to run faster
  2009-10-24  4:35               ` Eric Dumazet
  2009-10-24  5:49                 ` Paul E. McKenney
@ 2009-10-24 20:22                 ` Stephen Hemminger
  1 sibling, 0 replies; 41+ messages in thread
From: Stephen Hemminger @ 2009-10-24 20:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: paulmck, Octavian Purdila, Benjamin LaHaise, netdev, Cosmin Ratiu

On Sat, 24 Oct 2009 06:35:53 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Paul E. McKenney wrote:
> > On Wed, Oct 21, 2009 at 05:40:07PM +0200, Eric Dumazet wrote:
> >> [PATCH] net: allow netdev_wait_allrefs() to run faster
> >>
> >> netdev_wait_allrefs() waits until all references to a device vanish.
> >>
> >> It currently uses a _very_ pessimistic 250 ms delay between each probe.
> >> Some users report that no more than 4 devices can be dismantled per second;
> >> this is a pretty serious problem for extreme setups.
> >>
> >> Most likely, references only wait for an RCU grace period that should come
> >> fast, so use schedule_timeout_uninterruptible(1) to allow faster recovery.
> > 
> > Is this a place where synchronize_rcu_expedited() is appropriate?
> > (It went in to 2.6.32-rc1.)
> > 
> 
> Thanks for the tip, Paul.
> 
> I believe netdev_wait_allrefs() is not a perfect candidate, because
> synchronize_sched_expedited() seems really expensive.
> 
> Maybe we could call it only once, after we have already had to wait
> one jiffy?
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fa88dcd..9b04b9a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4970,6 +4970,7 @@ EXPORT_SYMBOL(register_netdev);
>  static void netdev_wait_allrefs(struct net_device *dev)
>  {
>  	unsigned long rebroadcast_time, warning_time;
> +	unsigned int count = 0;
>  
>  	rebroadcast_time = warning_time = jiffies;
>  	while (atomic_read(&dev->refcnt) != 0) {
> @@ -4995,7 +4996,10 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  			rebroadcast_time = jiffies;
>  		}
>  
> -		msleep(250);
> +		if (count++ == 1)
> +			synchronize_rcu_expedited();
> +		else
> +			schedule_timeout_uninterruptible(1);
>  
>  		if (time_after(jiffies, warning_time + 10 * HZ)) {
>  			printk(KERN_EMERG "unregister_netdevice: "

Actually, anything that requires more than one pass through the loop is
broken. Devices and protocols should be cleaning up on the first notifier.
The worst offender seems to be the dst cache gc code.
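The "4 devices per second" figure in the quoted changelog follows directly from the sleep interval, and the count++ == 1 test decides which pass gets the expedited grace period. The C below is a userspace sketch of both points, not kernel code. The enum and helper names are hypothetical. The rate ceiling assumes devices are torn down one after another by a loop that sleeps the same interval on every pass, as with the original msleep(250) or a plain one-jiffy sleep.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical labels for what one pass of the wait loop does. */
enum wait_action { SLEEP_ONE_JIFFY, EXPEDITE_RCU };

/*
 * Mirrors the quoted "if (count++ == 1) synchronize_rcu_expedited();
 * else schedule_timeout_uninterruptible(1);" with count starting at 0:
 * pass 0 sleeps, pass 1 expedites a grace period once, later passes sleep.
 */
static enum wait_action action_for_pass(unsigned int pass)
{
	return pass == 1 ? EXPEDITE_RCU : SLEEP_ONE_JIFFY;
}

/* Length of one jiffy in milliseconds for a given HZ. */
static double jiffy_ms(unsigned int hz)
{
	assert(hz > 0);
	return 1000.0 / hz;
}

/*
 * Ceiling on devices unregistered per second, one after another, when each
 * device needs `passes` trips through the loop and every trip sleeps
 * `sleep_ms`. Time spent outside the sleeps is ignored. With zero passes
 * the sleep never runs, so it imposes no ceiling (returns INFINITY).
 */
static double max_devices_per_sec(unsigned int passes, double sleep_ms)
{
	assert(sleep_ms > 0.0);
	if (passes == 0)
		return INFINITY;
	return 1000.0 / (passes * sleep_ms);
}
```

With msleep(250), even a single pass costs a quarter of a second, hence the ceiling of 4 per second. A one-jiffy sleep raises it to HZ per second, i.e. 1000 with the CONFIG_HZ=1000 configuration attached earlier in this thread. Stephen's point maps to passes == 0: if every device and protocol drops its references on the first notifier, the sleep interval stops mattering.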


-- 


end of thread

Thread overview: 41+ messages
2009-10-17 22:18 [PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second Benjamin LaHaise
2009-10-18  4:26 ` Eric Dumazet
2009-10-18 16:13   ` Benjamin LaHaise
2009-10-18 17:51     ` Eric Dumazet
2009-10-18 18:21       ` Benjamin LaHaise
2009-10-18 19:36         ` Eric Dumazet
2009-10-21 12:39         ` Octavian Purdila
2009-10-21 15:40           ` [PATCH] net: allow netdev_wait_allrefs() to run faster Eric Dumazet
2009-10-21 16:09             ` Eric Dumazet
2009-10-21 16:51             ` Benjamin LaHaise
2009-10-21 19:54               ` Eric Dumazet
2009-10-29 23:07               ` Eric W. Biederman
2009-10-29 23:38                 ` Benjamin LaHaise
2009-10-30  1:45                   ` Eric W. Biederman
2009-10-30 14:35                     ` Benjamin LaHaise
2009-10-30 14:43                       ` Eric Dumazet
2009-10-30 23:25                       ` Eric W. Biederman
2009-10-30 23:53                         ` Benjamin LaHaise
2009-10-31  0:37                           ` Eric W. Biederman
2010-08-09 17:23                   ` Ben Greear
2010-08-09 17:34                     ` Benjamin LaHaise
2010-08-09 17:44                       ` Ben Greear
2010-08-09 17:48                         ` Benjamin LaHaise
2010-08-09 18:03                           ` Ben Greear
2010-08-09 19:59                       ` Eric W. Biederman
2010-08-09 21:03                         ` Benjamin LaHaise
2010-08-09 21:17                           ` Eric W. Biederman
2009-10-21 16:55             ` Octavian Purdila
2009-10-23 21:13             ` Paul E. McKenney
2009-10-24  4:35               ` Eric Dumazet
2009-10-24  5:49                 ` Paul E. McKenney
2009-10-24  8:49                   ` Eric Dumazet
2009-10-24 13:52                     ` Paul E. McKenney
2009-10-24 14:24                       ` Eric Dumazet
2009-10-24 14:46                         ` Paul E. McKenney
2009-10-24 23:49                         ` Octavian Purdila
2009-10-25  4:47                           ` Paul E. McKenney
2009-10-25  8:35                           ` Eric Dumazet
2009-10-25 15:19                             ` Octavian Purdila
2009-10-25 19:28                               ` Eric Dumazet
2009-10-24 20:22                 ` Stephen Hemminger
