Netdev List
 help / color / mirror / Atom feed
* [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
@ 2007-08-16 12:09 Eric Dumazet
  2007-08-16 12:41 ` Herbert Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2007-08-16 12:09 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi all

When the periodic route cache flush is done, this machine suffers a lot and eventually triggers the "soft lockup" message.

dst_run_gc() is doing a scan of a possibly huge list of dst_entries, eventually freeing some (less than 1%) of them, but holding the dst_lock spinlock for the whole scan.

Then it rearms a timer to redo the full thing 1/10 s later. The slowdown can last one minute or so.
No need to say that machine is not really usable in the meantime.

I already looked at the code and I am testing a patch (included in this mail) where I limit the number of entries that can be scanned at each dst_run_gc() call, and not holding dst_lock during the scan. But then I dont know how to address the dst_dev_event() problem, since we must be able to dst_ifdown() all entries attached to one dev.

I am wondering how to really solve the problem. Could a workqueue be used here instead of a timer ?


Aug 16 06:21:37 SRV1 kernel: BUG: soft lockup detected on CPU#0!
Aug 16 06:21:37 SRV1 kernel: 
Aug 16 06:21:37 SRV1 kernel: Call Trace:
Aug 16 06:21:37 SRV1 kernel:  <IRQ>  [<ffffffff802286f0>] wake_up_process+0x10/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80251e09>] softlockup_tick+0xe9/0x110
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd380>] dst_run_gc+0x0/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802376f3>] run_local_timers+0x13/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802379c7>] update_process_times+0x57/0x90
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80216034>] smp_local_timer_interrupt+0x34/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802165cc>] smp_apic_timer_interrupt+0x5c/0x80
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8020a816>] apic_timer_interrupt+0x66/0x70
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd3d3>] dst_run_gc+0x53/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd3c6>] dst_run_gc+0x46/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80237148>] run_timer_softirq+0x148/0x1c0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8023340c>] __do_softirq+0x6c/0xe0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8020ad6c>] call_softirq+0x1c/0x30
Aug 16 06:21:37 SRV1 kernel:  <EOI>  [<ffffffff8020cb34>] do_softirq+0x34/0x90
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802331cf>] local_bh_enable_ip+0x3f/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422913>] _spin_unlock_bh+0x13/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803dfde8>] rt_garbage_collect+0x1d8/0x320
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd4dd>] dst_alloc+0x1d/0xa0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803e1433>] __ip_route_output_key+0x573/0x800
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c02e2>] sock_common_recvmsg+0x32/0x50
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803e16dc>] ip_route_output_flow+0x1c/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80400160>] tcp_v4_connect+0x150/0x610
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803ebf07>] inet_bind_bucket_create+0x17/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8040cd16>] inet_stream_connect+0xa6/0x2c0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422981>] _spin_lock_bh+0x11/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c0bdf>] lock_sock_nested+0xcf/0xe0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422981>] _spin_lock_bh+0x11/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803be551>] sys_connect+0x71/0xa0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803eee3f>] tcp_setsockopt+0x1f/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c030f>] sock_common_setsockopt+0xf/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803be4bd>] sys_setsockopt+0x9d/0xc0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8028881e>] sys_ioctl+0x5e/0x80
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80209c4e>] system_call+0x7e/0x83

Prelimary patch :

--- linux-2.6.22/net/core/dst.c
+++ linux-2.6.22-ed/net/core/dst.c
@@ -28,6 +28,9 @@
  * 4) All operations modify state, so a spinlock is used.
  */
 static struct dst_entry 	*dst_garbage_list;
+static struct dst_entry     *dst_garbage_wrk;
+static struct dst_entry     *dst_garbage_inuse;
+static int dst_gc_running;
 #if RT_CACHE_DEBUG >= 2
 static atomic_t			 dst_total = ATOMIC_INIT(0);
 #endif
@@ -42,26 +45,42 @@
 
 static void dst_run_gc(unsigned long dummy)
 {
-	int    delayed = 0;
-	int    work_performed;
-	struct dst_entry * dst, **dstp;
-
-	if (!spin_trylock(&dst_lock)) {
-		mod_timer(&dst_gc_timer, jiffies + HZ/10);
-		return;
-	}
-
+	int    quota = 1000;
+	int    work_performed = 0;
+	struct dst_entry *dst, *next;
+	struct dst_entry *first = NULL, *last = NULL;
+#if RT_CACHE_DEBUG >= 2
+	unsigned long t0 = jiffies;
+#endif
+	spin_lock(&dst_lock);
+	if (dst_gc_running)
+		goto out;
 	del_timer(&dst_gc_timer);
-	dstp = &dst_garbage_list;
-	work_performed = 0;
-	while ((dst = *dstp) != NULL) {
-		if (atomic_read(&dst->__refcnt)) {
-			dstp = &dst->next;
-			delayed++;
+	if (!dst_garbage_wrk) {
+		dst_garbage_wrk = dst_garbage_list;
+		dst_garbage_list = dst_garbage_inuse;
+		dst_garbage_inuse = NULL;
+	}
+	/*
+	 * before releasing dst_lock, tell others we are currently running
+	 * so they wont start timer or mess dst_garbage_wrk
+	 */
+	dst_gc_running = 1;
+	next = dst_garbage_wrk;
+	spin_unlock(&dst_lock);
+
+	while ((dst = next) != NULL) {
+		next = dst->next;
+		if (likely(atomic_read(&dst->__refcnt))) {
+			if (unlikely(last == NULL))
+				last = dst;
+			dst->next = first;
+			first = dst;
+			if (--quota == 0)
+				break;
 			continue;
 		}
-		*dstp = dst->next;
-		work_performed = 1;
+		work_performed++;
 
 		dst = dst_destroy(dst);
 		if (dst) {
@@ -77,16 +96,26 @@
 				continue;
 
 			___dst_free(dst);
-			dst->next = *dstp;
-			*dstp = dst;
-			dstp = &dst->next;
+			dst->next = next;
+			next = dst;
 		}
 	}
-	if (!dst_garbage_list) {
+	dst_garbage_wrk = next;
+
+	spin_lock(&dst_lock);
+	dst_gc_running = 0;
+	if (last != NULL) {
+		last->next = dst_garbage_inuse;
+		dst_garbage_inuse = first;
+	}
+	/*
+	 * If all lists are empty, no need to rearm a timer
+	 */
+	if (!dst_garbage_list && !dst_garbage_wrk && !dst_garbage_inuse) {
 		dst_gc_timer_inc = DST_GC_MAX;
 		goto out;
 	}
-	if (!work_performed) {
+	if (!work_performed && !dst_garbage_wrk) {
 		if ((dst_gc_timer_expires += dst_gc_timer_inc) > DST_GC_MAX)
 			dst_gc_timer_expires = DST_GC_MAX;
 		dst_gc_timer_inc += DST_GC_INC;
@@ -95,8 +124,12 @@
 		dst_gc_timer_expires = DST_GC_MIN;
 	}
 #if RT_CACHE_DEBUG >= 2
-	printk("dst_total: %d/%d %ld\n",
-	       atomic_read(&dst_total), delayed,  dst_gc_timer_expires);
+	if (net_msg_warn)
+		printk(KERN_DEBUG "dst_run_gc: "
+		"dst_total=%d quota=%d perf=%d expires=%ld elapsed=%lu\n",
+		atomic_read(&dst_total),
+		quota,  work_performed,
+		dst_gc_timer_expires, jiffies - t0);
 #endif
 	/* if the next desired timer is more than 4 seconds in the future
 	 * then round the timer to whole seconds
@@ -157,7 +190,7 @@
 	___dst_free(dst);
 	dst->next = dst_garbage_list;
 	dst_garbage_list = dst;
-	if (dst_gc_timer_inc > DST_GC_INC) {
+	if (dst_gc_timer_inc > DST_GC_INC && !dst_gc_running) {
 		dst_gc_timer_inc = DST_GC_INC;
 		dst_gc_timer_expires = DST_GC_MIN;
 		mod_timer(&dst_gc_timer, jiffies + dst_gc_timer_expires);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-16 12:09 [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ? Eric Dumazet
@ 2007-08-16 12:41 ` Herbert Xu
  2007-08-16 13:36   ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2007-08-16 12:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Eric Dumazet <dada1@cosmosbay.com> wrote:
>
> I am wondering how to really solve the problem. Could a workqueue be used here instead of a timer ?

I think so.

A mutex would separate the GC and dst_ifdown.  You'd take the
spin lock in the GC only to replace the garbage list with NULL.
You then drop the lock to process it.  Once it's done you take
the lock again and join it with whatever that's been added in
the mean time.  This is easy because you should already have
the tail after the GC process.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-16 12:41 ` Herbert Xu
@ 2007-08-16 13:36   ` Eric Dumazet
  2007-08-16 13:59     ` Herbert Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2007-08-16 13:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Thu, 16 Aug 2007 20:41:58 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Eric Dumazet <dada1@cosmosbay.com> wrote:
> >
> > I am wondering how to really solve the problem. Could a workqueue be used here instead of a timer ?
> 
> I think so.
> 
> A mutex would separate the GC and dst_ifdown.  You'd take the
> spin lock in the GC only to replace the garbage list with NULL.
> You then drop the lock to process it.  Once it's done you take
> the lock again and join it with whatever that's been added in
> the mean time.  This is easy because you should already have
> the tail after the GC process.

Thanks Herbert

Yes, I already did this (with the current softirq based timer model),
 but how can dst_dev_event() do its work, since the GC is using 
a private list. (In my patch, time to GC process XXX.000 entries is about XX seconds.)

We would have to change dst_dev_event() to :
- Signal to GC it has to stop as soon as possible.
- Wait for GC be stoped (busy wait I suspect we cannot sleep in dst_dev_event() ? )
, giving us a full garbage_list.
- Process the whole list.
- Restart GC


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-16 13:36   ` Eric Dumazet
@ 2007-08-16 13:59     ` Herbert Xu
  2007-08-16 15:40       ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2007-08-16 13:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: herbert, netdev

Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> Yes, I already did this (with the current softirq based timer model),
> but how can dst_dev_event() do its work, since the GC is using 
> a private list. (In my patch, time to GC process XXX.000 entries is about XX seconds.)
> 
> We would have to change dst_dev_event() to :
> - Signal to GC it has to stop as soon as possible.
> - Wait for GC be stoped (busy wait I suspect we cannot sleep in dst_dev_event() ? )

You can sleep in dst_dev_event so just use a mutex to separate
it from the GC.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-16 13:59     ` Herbert Xu
@ 2007-08-16 15:40       ` Eric Dumazet
  2007-08-16 23:33         ` Herbert Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2007-08-16 15:40 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Thu, 16 Aug 2007 21:59:43 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Eric Dumazet <dada1@cosmosbay.com> wrote:
> > 
> > Yes, I already did this (with the current softirq based timer model),
> > but how can dst_dev_event() do its work, since the GC is using 
> > a private list. (In my patch, time to GC process XXX.000 entries is about XX seconds.)
> > 
> > We would have to change dst_dev_event() to :
> > - Signal to GC it has to stop as soon as possible.
> > - Wait for GC be stoped (busy wait I suspect we cannot sleep in dst_dev_event() ? )
> 
> You can sleep in dst_dev_event so just use a mutex to separate
> it from the GC.

Thanks for the suggestion, I added dst_mutex and attempt a mutex_trylock() inside dst_run_gc().

So do you think this patch is enough or should we convert dst_run_gc processing from softirq to workqueue too ?

Eric

[PATCH] net/core/dst.c : let dst_run_gc() do its job incrementally

When the periodic route cache flush is done, we can feed dst_garbage_list with a huge number of dst_entry structs.

Then dst_run_gc() is doing a scan of a possibly huge list of dst_entries, eventually freeing some (less than 1%) of them, while holding the dst_lock spinlock for the whole scan. This sometimes triggers a soft lockup. 

Then it rearms a timer to redo the full thing 1/10 s later. The slowdown can last one minute or so.
No need to say that machine is not really usable in the meantime.

This patch attempts to solve the problem giving dst_run_gc() the ability to perform its work by chunks instead of whole list scan. We limit a chunk by the number of entries that could not be freed in a run. This should limit the CPU cache trashing (currently 128 bytes per entry on x86_64), yet giving a chance to free unreferenced entries (not included in the quota) if the load is really high.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

--- linux-2.6.22/net/core/dst.c.orig
+++ linux-2.6.22/net/core/dst.c
@@ -9,6 +9,7 @@
 #include <linux/errno.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/mutex.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
@@ -28,10 +29,17 @@
  * 4) All operations modify state, so a spinlock is used.
  */
 static struct dst_entry 	*dst_garbage_list;
+static struct dst_entry     *dst_garbage_wrk;
+static struct dst_entry     *dst_garbage_inuse;
+static int dst_gc_running;
 #if RT_CACHE_DEBUG >= 2
 static atomic_t			 dst_total = ATOMIC_INIT(0);
 #endif
 static DEFINE_SPINLOCK(dst_lock);
+/*
+ * A mutex is used to synchronize GC with dst_dev_event()
+ */
+static DEFINE_MUTEX(dst_mutex);
 
 static unsigned long dst_gc_timer_expires;
 static unsigned long dst_gc_timer_inc = DST_GC_MAX;
@@ -42,26 +50,39 @@ static DEFINE_TIMER(dst_gc_timer, dst_ru
 
 static void dst_run_gc(unsigned long dummy)
 {
-	int    delayed = 0;
-	int    work_performed;
-	struct dst_entry * dst, **dstp;
+	int    quota = 1000;
+	int    work_performed = 0;
+	struct dst_entry *dst, *next;
+	struct dst_entry *first = NULL, *last = NULL;
 
-	if (!spin_trylock(&dst_lock)) {
+	if (!mutex_trylock(&dst_mutex)) {
 		mod_timer(&dst_gc_timer, jiffies + HZ/10);
 		return;
 	}
+	dst_gc_running = 1;
 
+	spin_lock(&dst_lock);
 	del_timer(&dst_gc_timer);
-	dstp = &dst_garbage_list;
-	work_performed = 0;
-	while ((dst = *dstp) != NULL) {
-		if (atomic_read(&dst->__refcnt)) {
-			dstp = &dst->next;
-			delayed++;
+	if (!dst_garbage_wrk) {
+		dst_garbage_wrk = dst_garbage_list;
+		dst_garbage_list = dst_garbage_inuse;
+		dst_garbage_inuse = NULL;
+	}
+	next = dst_garbage_wrk;
+	spin_unlock(&dst_lock);
+
+	while ((dst = next) != NULL) {
+		next = dst->next;
+		if (likely(atomic_read(&dst->__refcnt))) {
+			if (unlikely(last == NULL))
+				last = dst;
+			dst->next = first;
+			first = dst;
+			if (--quota == 0)
+				break;
 			continue;
 		}
-		*dstp = dst->next;
-		work_performed = 1;
+		work_performed++;
 
 		dst = dst_destroy(dst);
 		if (dst) {
@@ -77,16 +98,26 @@ static void dst_run_gc(unsigned long dum
 				continue;
 
 			___dst_free(dst);
-			dst->next = *dstp;
-			*dstp = dst;
-			dstp = &dst->next;
+			dst->next = next;
+			next = dst;
 		}
 	}
-	if (!dst_garbage_list) {
+	dst_garbage_wrk = next;
+
+	spin_lock(&dst_lock);
+	dst_gc_running = 0;
+	if (last != NULL) {
+		last->next = dst_garbage_inuse;
+		dst_garbage_inuse = first;
+	}
+	/*
+	 * If all lists are empty, no need to rearm a timer
+	 */
+	if (!dst_garbage_list && !dst_garbage_wrk && !dst_garbage_inuse) {
 		dst_gc_timer_inc = DST_GC_MAX;
 		goto out;
 	}
-	if (!work_performed) {
+	if (!work_performed && !dst_garbage_wrk) {
 		if ((dst_gc_timer_expires += dst_gc_timer_inc) > DST_GC_MAX)
 			dst_gc_timer_expires = DST_GC_MAX;
 		dst_gc_timer_inc += DST_GC_INC;
@@ -95,8 +126,11 @@ static void dst_run_gc(unsigned long dum
 		dst_gc_timer_expires = DST_GC_MIN;
 	}
 #if RT_CACHE_DEBUG >= 2
-	printk("dst_total: %d/%d %ld\n",
-	       atomic_read(&dst_total), delayed,  dst_gc_timer_expires);
+	if (net_msg_warn)
+		printk(KERN_DEBUG "dst_run_gc: "
+		"dst_total=%d quota=%d perf=%d expires=%ld\n",
+		atomic_read(&dst_total), quota,
+		work_performed, dst_gc_timer_expires);
 #endif
 	/* if the next desired timer is more than 4 seconds in the future
 	 * then round the timer to whole seconds
@@ -109,6 +143,7 @@ static void dst_run_gc(unsigned long dum
 
 out:
 	spin_unlock(&dst_lock);
+	mutex_unlock(&dst_mutex);
 }
 
 static int dst_discard(struct sk_buff *skb)
@@ -157,7 +192,7 @@ void __dst_free(struct dst_entry * dst)
 	___dst_free(dst);
 	dst->next = dst_garbage_list;
 	dst_garbage_list = dst;
-	if (dst_gc_timer_inc > DST_GC_INC) {
+	if (dst_gc_timer_inc > DST_GC_INC && !dst_gc_running) {
 		dst_gc_timer_inc = DST_GC_INC;
 		dst_gc_timer_expires = DST_GC_MIN;
 		mod_timer(&dst_gc_timer, jiffies + dst_gc_timer_expires);
@@ -224,7 +259,7 @@ again:
  *
  * Commented and originally written by Alexey.
  */
-static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
+static void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			      int unregister)
 {
 	if (dst->ops->ifdown)
@@ -251,15 +286,34 @@ static int dst_dev_event(struct notifier
 {
 	struct net_device *dev = ptr;
 	struct dst_entry *dst;
+	struct dst_entry *first, *last;
 
 	switch (event) {
 	case NETDEV_UNREGISTER:
 	case NETDEV_DOWN:
+		mutex_lock(&dst_mutex); /* guard against GC */
+
 		spin_lock_bh(&dst_lock);
-		for (dst = dst_garbage_list; dst; dst = dst->next) {
-			dst_ifdown(dst, dev, event != NETDEV_DOWN);
+		first = dst_garbage_list;
+		if (first) {
+			dst_garbage_list = NULL;
+			spin_unlock_bh(&dst_lock);
+			for (dst = first; dst; dst = dst->next) {
+				last = dst;
+				dst_ifdown(dst, dev, event != NETDEV_DOWN);
+			}
+			spin_lock_bh(&dst_lock);
+			last->next = dst_garbage_list;
+			dst_garbage_list = first;
 		}
 		spin_unlock_bh(&dst_lock);
+		for (dst = dst_garbage_wrk; dst; dst = dst->next) {
+			dst_ifdown(dst, dev, event != NETDEV_DOWN);
+		}
+		for (dst = dst_garbage_inuse; dst; dst = dst->next) {
+			dst_ifdown(dst, dev, event != NETDEV_DOWN);
+		}
+		mutex_unlock(&dst_mutex);
 		break;
 	}
 	return NOTIFY_DONE;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-16 15:40       ` Eric Dumazet
@ 2007-08-16 23:33         ` Herbert Xu
  2007-08-17  8:10           ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2007-08-16 23:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, Aug 16, 2007 at 05:40:44PM +0200, Eric Dumazet wrote:
> 
> So do you think this patch is enough or should we convert dst_run_gc processing from softirq to workqueue too ?

I think a workqueue would be the best solution since with
that you wouldn't have to worry about processing things in
chunks.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-16 23:33         ` Herbert Xu
@ 2007-08-17  8:10           ` Eric Dumazet
  2007-08-17  8:15             ` Herbert Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2007-08-17  8:10 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Fri, 17 Aug 2007 07:33:39 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Thu, Aug 16, 2007 at 05:40:44PM +0200, Eric Dumazet wrote:
> > 
> > So do you think this patch is enough or should we convert dst_run_gc processing from softirq to workqueue too ?
> 
> I think a workqueue would be the best solution since with
> that you wouldn't have to worry about processing things in
> chunks.

Will a workqueue react the same in case of a DDOS situation, 
where softirq could use all CPU cycles to handle incoming
packets and feed the GC list, and GC would never
have a chance to scan and free some items ?

About chunk processing, I did it on purpose, to not throw away
all CPU cache. Goal is to process entries, but not all of them
in a row, especially if we find many yet referenced entries
(and thus not candidates to freeing)

Thanks



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-17  8:10           ` Eric Dumazet
@ 2007-08-17  8:15             ` Herbert Xu
  2007-08-17  8:31               ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2007-08-17  8:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Fri, Aug 17, 2007 at 10:10:30AM +0200, Eric Dumazet wrote:
>
> Will a workqueue react the same in case of a DDOS situation, 
> where softirq could use all CPU cycles to handle incoming
> packets and feed the GC list, and GC would never
> have a chance to scan and free some items ?

Well when that happens the softirqs will be deferred to
ksoftirqd which should share the CPU fairly with the
workqueue.

> About chunk processing, I did it on purpose, to not throw away
> all CPU cache. Goal is to process entries, but not all of them
> in a row, especially if we find many yet referenced entries
> (and thus not candidates to freeing)

I agree that chunks are desirable for a timer since you'd
be hogging the CPU otherwise.  However, if you went to a
workqueue then it's less of a concern and would simplify
things.  In particular, you won't have to pick a good
chunk size :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
  2007-08-17  8:15             ` Herbert Xu
@ 2007-08-17  8:31               ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2007-08-17  8:31 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Fri, 17 Aug 2007 16:15:22 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Fri, Aug 17, 2007 at 10:10:30AM +0200, Eric Dumazet wrote:
> >
> > Will a workqueue react the same in case of a DDOS situation, 
> > where softirq could use all CPU cycles to handle incoming
> > packets and feed the GC list, and GC would never
> > have a chance to scan and free some items ?
> 
> Well when that happens the softirqs will be deferred to
> ksoftirqd which should share the CPU fairly with the
> workqueue.

Thats nice :)

I'll code a workqueue based thing in about 10 days after my hollidays,
and perform DOS tests as well.

Thanks for the feedback.

> 
> > About chunk processing, I did it on purpose, to not throw away
> > all CPU cache. Goal is to process entries, but not all of them
> > in a row, especially if we find many yet referenced entries
> > (and thus not candidates to freeing)
> 
> I agree that chunks are desirable for a timer since you'd
> be hogging the CPU otherwise.  However, if you went to a
> workqueue then it's less of a concern and would simplify
> things.  In particular, you won't have to pick a good
> chunk size :)

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2007-08-17  8:31 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-16 12:09 [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ? Eric Dumazet
2007-08-16 12:41 ` Herbert Xu
2007-08-16 13:36   ` Eric Dumazet
2007-08-16 13:59     ` Herbert Xu
2007-08-16 15:40       ` Eric Dumazet
2007-08-16 23:33         ` Herbert Xu
2007-08-17  8:10           ` Eric Dumazet
2007-08-17  8:15             ` Herbert Xu
2007-08-17  8:31               ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox