public inbox for linux-kernel@vger.kernel.org
* [patch] softirq performance fixes, cleanups, 2.4.10.
@ 2001-09-26 16:44 Ingo Molnar
  2001-09-26 17:48 ` Mike Kravetz
                   ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-26 16:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Alan Cox, Alexey Kuznetsov, Ben LaHaise,
	Andrea Arcangeli

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4179 bytes --]


the Linux softirq code still has a number of performance and latency
problems as of 2.4.10.

one issue is that there are still places in the kernel that disable/enable
softirq processing, but do not restart softirqs. This creates softirq
processing latencies, which can show up e.g. as 'stuttering' packet
processing. Longer latencies between hard interrupts and soft interrupt
processing also decrease caching efficiency - if e.g. a socket buffer was
touched in a network driver, it might get dropped from the cache by the
time the skb is processed by its softirq handler.

another problem is increased scheduling and softirq handling overhead due
to ksoftirqd, and related performance degradation in high-speed network
environments. (Performance drops of more than 10% were reported with
certain gigabit cards.) Under various multi-process networking loads
ksoftirqd is very active.

the attached softirq-2.4.10-A6 patch solves these two main problems and
also cleans up softirq.c.

main changes in softirq handling:

 - softirq handling can now be restarted N times within do_softirq(), if a
   softirq gets reactivated while it's being handled.

 - implemented a new scheduler mechanism, 'unwakeup()', to undo ksoftirqd
   wakeups if softirqs happen to be fully handled before ksoftirqd runs.
   (unwakeup() does not touch the runqueue lock if the task in question is
   already running.)

 - cpu_raise_softirq() used to wake ksoftirqd up - instead of handling
   softirqs immediately. All softirq users are using __cpu_raise_softirq()
   now, and have to call rerun_softirqs() after the softirq-atomic section
   has finished.

none of these changes alters tasklet or bottom-half semantics.

the HTTP load situation i tested shows the following changes in scheduling
frequency:

                         context switches per second
                         (measured over a period of 10 seconds,
                          repeated 10 times and averaged.)

 2.4.10-vanilla:         39299

 2.4.10-softirq-A6:      35552

a 10.5% improvement. HTTP performance increased by 2%, but the system had
idle time left. Kernels with the softirq-A6 patch applied show almost no
ksoftirqd activity, while vanilla 2.4.10 shows frequent ksoftirqd
activation.

other fixes/cleanups to softirq.c:

 - removed 'mask' handling from do_softirq() - it's unnecessary due to the
   restarts. this further simplifies the code.

 - tasklet_hi_schedule() and tasklet_lo_schedule() are now rerunning
   softirqs, instead of just kicking ksoftirqd.

 - removed raise_softirq() and cpu_raise_softirq(), they are not used by
   any other code anymore. unexported them.

 - simplified argument passing between spawn_ksoftirqd() and ksoftirqd(),
   passing an argument by pointer and waiting for ksoftirqd tasks to start
   up is unnecessary.

 - it's unnecessary to spin in a scheduling loop during ksoftirqd()
   startup, waiting for the process to migrate - it's enough to call
   schedule() once, the scheduler will not run the task on the wrong CPU.

 - '[ksoftirqd_CPU0]' is confusing on UP systems, changed it to
   '[ksoftirqd]' instead.

 - simplified ksoftirqd()'s loop, it's both shorter and faster by a few
   instructions now.

 - __netif_schedule() is using __cpu_raise_softirq(), instead of
   cpu_raise_softirq() [which did not restart softirq handling, it only
   woke ksoftirqd up].

 - dev_kfree_skb_irq() ditto. (although this function is mostly called
   from IRQ contexts, where softirq restarts are not possible - but the
   IRQ code will restart them nevertheless, on IRQ exit.)

 - the generic definition of __cpu_raise_softirq() used to override
   any lowlevel definitions done in asm/softirq.h. It's now conditional so
   the architecture definitions should actually be used.

i've tested the patch both on UP and SMP systems, and saw no problems at
all. The changes decrease the size of softirq object code by ~8%. Network
packet handling appears to be smoother (this is subjective and hard to
measure). Ben, does this patch fix gigabit performance in your test, or
is something else still going on as well?

(The patch also applies cleanly to the -ac tree.)

	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 10815 bytes --]

--- linux/kernel/ksyms.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/ksyms.c	Wed Sep 26 17:04:48 2001
@@ -538,8 +538,6 @@
 EXPORT_SYMBOL(tasklet_kill);
 EXPORT_SYMBOL(__run_task_queue);
 EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
 EXPORT_SYMBOL(__tasklet_schedule);
 EXPORT_SYMBOL(__tasklet_hi_schedule);
 
--- linux/kernel/sched.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/sched.c	Wed Sep 26 17:04:48 2001
@@ -366,6 +366,28 @@
 }
 
 /**
+ * unwakeup - undo wakeup if possible.
+ * @p: task
+ * @state: new task state
+ *
+ * Undo a previous wakeup of the specified task - if the process
+ * is not running already. The main interface to be used is
+ * unwakeup_process(), it will do a lockless test whether the task
+ * is on the runqueue.
+ */
+void __unwakeup_process(struct task_struct * p, long state)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&runqueue_lock, flags);
+	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {
+		del_from_runqueue(p);
+		p->state = state;
+	}
+	spin_unlock_irqrestore(&runqueue_lock, flags);
+}
+
+/**
  * schedule_timeout - sleep until timeout
  * @timeout: timeout value in jiffies
  *
--- linux/kernel/softirq.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/softirq.c	Wed Sep 26 17:45:00 2001
@@ -58,12 +58,35 @@
 		wake_up_process(tsk);
 }
 
+/*
+ * If a softirqs were fully handled after ksoftirqd was woken
+ * up then try to undo the wakeup.
+ */
+static inline void unwakeup_softirqd(unsigned cpu)
+{
+	struct task_struct * tsk = ksoftirqd_task(cpu);
+
+	if (tsk)
+		unwakeup_process(tsk, TASK_INTERRUPTIBLE);
+}
+
+/*
+ * We restart softirq processing MAX_SOFTIRQ_RESTART times,
+ * and we fall back to softirqd after that.
+ *
+ * This number has been established via experimentation.
+ * The two things to balance is latency against fairness -
+ * we want to handle softirqs as soon as possible, but they
+ * should not be able to lock up the box.
+ */
+#define MAX_SOFTIRQ_RESTART 10
+
 asmlinkage void do_softirq()
 {
+	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu = smp_processor_id();
-	__u32 pending;
+	__u32 pending, mask;
 	long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -95,55 +118,37 @@
 		local_irq_disable();
 
 		pending = softirq_pending(cpu);
-		if (pending & mask) {
-			mask &= ~pending;
+		if (pending && --max_restart)
 			goto restart;
-		}
 		__local_bh_enable();
 
 		if (pending)
+			/*
+			 * In the normal case ksoftirqd is rarely activated,
+			 * increased scheduling hurts performance.
+			 * It's a safety measure: if external load starts
+			 * to flood the system with softirqs then we
+			 * will mitigate softirq work to the softirq thread.
+			 */
 			wakeup_softirqd(cpu);
+		else
+			/*
+			 * All softirqs are handled - undo any possible
+			 * wakeup of softirqd. This reduces context switch
+			 * overhead.
+			 */
+			unwakeup_softirqd(cpu);
 	}
 
 	local_irq_restore(flags);
 }
 
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
-	__cpu_raise_softirq(cpu, nr);
-
-	/*
-	 * If we're in an interrupt or bh, we're done
-	 * (this also catches bh-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or bh.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!(local_irq_count(cpu) | local_bh_count(cpu)))
-		wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
-	long flags;
-
-	local_irq_save(flags);
-	cpu_raise_softirq(smp_processor_id(), nr);
-	local_irq_restore(flags);
-}
-
 void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
 {
 	softirq_vec[nr].data = data;
 	softirq_vec[nr].action = action;
 }
 
-
 /* Tasklets */
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
@@ -157,8 +162,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_vec[cpu].list;
 	tasklet_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+	__cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 void __tasklet_hi_schedule(struct tasklet_struct *t)
@@ -169,8 +175,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_hi_vec[cpu].list;
 	tasklet_hi_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, HI_SOFTIRQ);
+	__cpu_raise_softirq(cpu, HI_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 static void tasklet_action(struct softirq_action *a)
@@ -241,7 +248,6 @@
 	}
 }
 
-
 void tasklet_init(struct tasklet_struct *t,
 		  void (*func)(unsigned long), unsigned long data)
 {
@@ -268,8 +274,6 @@
 	clear_bit(TASKLET_STATE_SCHED, &t->state);
 }
 
-
-
 /* Old style BHs */
 
 static void (*bh_base[32])(void);
@@ -325,7 +329,7 @@
 {
 	int i;
 
-	for (i=0; i<32; i++)
+	for (i = 0; i < 32; i++)
 		tasklet_init(bh_task_vec+i, bh_action, i);
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -361,56 +365,52 @@
 
 static int ksoftirqd(void * __bind_cpu)
 {
-	int bind_cpu = *(int *) __bind_cpu;
-	int cpu = cpu_logical_map(bind_cpu);
+	int cpu = cpu_logical_map((int)__bind_cpu);
 
 	daemonize();
-	current->nice = 19;
+
 	sigfillset(&current->blocked);
 
 	/* Migrate to the right CPU */
-	current->cpus_allowed = 1UL << cpu;
-	while (smp_processor_id() != cpu)
-		schedule();
+	current->cpus_allowed = 1 << cpu;
 
-	sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
+#if CONFIG_SMP
+	sprintf(current->comm, "ksoftirqd CPU%d", cpu);
+#else
+	sprintf(current->comm, "ksoftirqd");
+#endif
 
+	current->nice = 19;
+	schedule();
 	__set_current_state(TASK_INTERRUPTIBLE);
-	mb();
-
 	ksoftirqd_task(cpu) = current;
 
 	for (;;) {
-		if (!softirq_pending(cpu))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
-
-		while (softirq_pending(cpu)) {
+back:
+		do {
 			do_softirq();
 			if (current->need_resched)
-				schedule();
-		}
-
+				goto preempt;
+		} while (softirq_pending(cpu));
+		schedule();
 		__set_current_state(TASK_INTERRUPTIBLE);
 	}
+
+preempt:
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	__set_current_state(TASK_INTERRUPTIBLE);
+	goto back;
 }
 
 static __init int spawn_ksoftirqd(void)
 {
 	int cpu;
 
-	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
-		if (kernel_thread(ksoftirqd, (void *) &cpu,
+	for (cpu = 0; cpu < smp_num_cpus; cpu++)
+		if (kernel_thread(ksoftirqd, (void *) cpu,
 				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
-			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
-		else {
-			while (!ksoftirqd_task(cpu_logical_map(cpu))) {
-				current->policy |= SCHED_YIELD;
-				schedule();
-			}
-		}
-	}
+			BUG();
 
 	return 0;
 }
--- linux/include/linux/netdevice.h.orig	Wed Sep 26 17:04:36 2001
+++ linux/include/linux/netdevice.h	Wed Sep 26 17:08:20 2001
@@ -486,8 +486,9 @@
 		local_irq_save(flags);
 		dev->next_sched = softnet_data[cpu].output_queue;
 		softnet_data[cpu].output_queue = dev;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
@@ -535,8 +536,9 @@
 		local_irq_save(flags);
 		skb->next = softnet_data[cpu].completion_queue;
 		softnet_data[cpu].completion_queue = skb;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
--- linux/include/linux/interrupt.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/interrupt.h	Wed Sep 26 17:45:23 2001
@@ -74,9 +74,16 @@
 asmlinkage void do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
 extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+#ifndef __cpu_raise_softirq
+#define __cpu_raise_softirq(cpu, nr) \
+		do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+#endif
+
+#define rerun_softirqs(cpu) 					\
+do {								\
+	if (!(local_irq_count(cpu) | local_bh_count(cpu)))	\
+		do_softirq();					\
+} while (0);
 
 
 
--- linux/include/linux/sched.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/sched.h	Wed Sep 26 17:08:16 2001
@@ -556,6 +556,7 @@
 
 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__unwakeup_process(struct task_struct * p, long state));
 extern void FASTCALL(sleep_on(wait_queue_head_t *q));
 extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
 				      signed long timeout));
@@ -574,6 +575,13 @@
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE,  nr)
+
+#define unwakeup_process(tsk,state)		\
+do {						\
+	if (task_on_runqueue(tsk))		\
+		__unwakeup_process(tsk,state);	\
+} while (0)
+
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
--- linux/include/asm-i386/softirq.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/asm-i386/softirq.h	Wed Sep 26 17:08:16 2001
@@ -45,4 +45,9 @@
 		/* no registers clobbered */ );				\
 } while (0)
 
+
+/* It's using __set_bit() because it only needs to be IRQ-atomic. */
+
+#define __cpu_raise_softirq(cpu, nr) __set_bit(nr, &softirq_pending(cpu))
+
 #endif	/* __ASM_SOFTIRQ_H */
--- linux/net/core/dev.c.orig	Wed Sep 26 17:04:41 2001
+++ linux/net/core/dev.c	Wed Sep 26 17:04:48 2001
@@ -1218,8 +1218,9 @@
 			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			/* Runs from irqs or BH's, no need to wake BH */
-			cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
+			rerun_softirqs(this_cpu);
 #ifndef OFFLINE_SAMPLE
 			get_sample_stats(this_cpu);
 #endif
@@ -1529,8 +1530,9 @@
 	local_irq_disable();
 	netdev_rx_stat[this_cpu].time_squeeze++;
 	/* This already runs in BH context, no need to wake up BH's */
-	cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+	__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 	local_irq_enable();
+	rerun_softirqs(this_cpu);
 
 	NET_PROFILE_LEAVE(softnet_process);
 	return;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-26 16:44 [patch] softirq performance fixes, cleanups, 2.4.10 Ingo Molnar
@ 2001-09-26 17:48 ` Mike Kravetz
  2001-09-26 18:48   ` Ingo Molnar
  2001-09-26 18:55 ` Russell King
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Mike Kravetz @ 2001-09-26 17:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise, Andrea Arcangeli

On Wed, Sep 26, 2001 at 06:44:03PM +0200, Ingo Molnar wrote:
> +void __unwakeup_process(struct task_struct * p, long state)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&runqueue_lock, flags);
> +	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {
> +		del_from_runqueue(p);
> +		p->state = state;
> +	}
> +	spin_unlock_irqrestore(&runqueue_lock, flags);
> +}

Is it really possible for a task to be 'current' without having
'has_cpu' set?  If so, then don't you need to compare p to
'current' on all CPUs since 'current' is CPU specific?

-- 
Mike Kravetz                                  kravetz@us.ibm.com
IBM Linux Technology Center       (we're not at Sequent anymore)


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-26 17:48 ` Mike Kravetz
@ 2001-09-26 18:48   ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-26 18:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise, Andrea Arcangeli


On Wed, 26 Sep 2001, Mike Kravetz wrote:

> > +	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {

> Is it really possible for a task to be 'current' without having
> 'has_cpu' set?  If so, then don't you need to compare p to 'current'
> on all CPUs since 'current' is CPU specific?

you are correct that the check is redundant on SMP, but ->has_cpu is not
set on UP kernels, so the 'current' check is needed there. I did not want
to introduce yet another #if CONFIG_SMP ifdef.

	Ingo



* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-26 16:44 [patch] softirq performance fixes, cleanups, 2.4.10 Ingo Molnar
  2001-09-26 17:48 ` Mike Kravetz
@ 2001-09-26 18:55 ` Russell King
  2001-09-26 19:14   ` Ingo Molnar
  2001-09-27 23:31 ` Andrea Arcangeli
  2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
  3 siblings, 1 reply; 44+ messages in thread
From: Russell King @ 2001-09-26 18:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise, Andrea Arcangeli

On Wed, Sep 26, 2001 at 06:44:03PM +0200, Ingo Molnar wrote:
>  - the generic definition of __cpu_raise_softirq() used to override
>    any lowlevel definitions done in asm/softirq.h. It's now conditional so
>    the architecture definitions should actually be used.

Ingo,

The generic definition is the one to use - we used to allow
__cpu_raise_softirq from outside IRQ context, but all RISC architectures
ended up with code as follows:

	restore irqs
	save + disable irqs
	set bit
	restore irqs

In the latest kernels, you will notice that __cpu_raise_softirq is always
within an IRQ protected region (thanks to Andrea for that work), so there
is no reason for it to protect itself from interrupts.  Hence the comment
above cpu_raise_softirq:

 * This function must run with irq disabled!


--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-26 18:55 ` Russell King
@ 2001-09-26 19:14   ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-26 19:14 UTC (permalink / raw)
  To: Russell King
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise, Andrea Arcangeli

[-- Attachment #1: Type: TEXT/PLAIN, Size: 271 bytes --]


On Wed, 26 Sep 2001, Russell King wrote:

> The generic definition is the one to use [...]

right, i forgot, thanks for pointing it out. New patch attached that
removes all the old definitions from other architectures and makes the
generic version unconditional.

	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 11338 bytes --]

--- linux/kernel/ksyms.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/ksyms.c	Wed Sep 26 17:04:48 2001
@@ -538,8 +538,6 @@
 EXPORT_SYMBOL(tasklet_kill);
 EXPORT_SYMBOL(__run_task_queue);
 EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
 EXPORT_SYMBOL(__tasklet_schedule);
 EXPORT_SYMBOL(__tasklet_hi_schedule);
 
--- linux/kernel/sched.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/sched.c	Wed Sep 26 17:04:48 2001
@@ -366,6 +366,28 @@
 }
 
 /**
+ * unwakeup - undo wakeup if possible.
+ * @p: task
+ * @state: new task state
+ *
+ * Undo a previous wakeup of the specified task - if the process
+ * is not running already. The main interface to be used is
+ * unwakeup_process(), it will do a lockless test whether the task
+ * is on the runqueue.
+ */
+void __unwakeup_process(struct task_struct * p, long state)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&runqueue_lock, flags);
+	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {
+		del_from_runqueue(p);
+		p->state = state;
+	}
+	spin_unlock_irqrestore(&runqueue_lock, flags);
+}
+
+/**
  * schedule_timeout - sleep until timeout
  * @timeout: timeout value in jiffies
  *
--- linux/kernel/softirq.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/softirq.c	Wed Sep 26 17:45:00 2001
@@ -58,12 +58,35 @@
 		wake_up_process(tsk);
 }
 
+/*
+ * If a softirqs were fully handled after ksoftirqd was woken
+ * up then try to undo the wakeup.
+ */
+static inline void unwakeup_softirqd(unsigned cpu)
+{
+	struct task_struct * tsk = ksoftirqd_task(cpu);
+
+	if (tsk)
+		unwakeup_process(tsk, TASK_INTERRUPTIBLE);
+}
+
+/*
+ * We restart softirq processing MAX_SOFTIRQ_RESTART times,
+ * and we fall back to softirqd after that.
+ *
+ * This number has been established via experimentation.
+ * The two things to balance is latency against fairness -
+ * we want to handle softirqs as soon as possible, but they
+ * should not be able to lock up the box.
+ */
+#define MAX_SOFTIRQ_RESTART 10
+
 asmlinkage void do_softirq()
 {
+	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu = smp_processor_id();
-	__u32 pending;
+	__u32 pending, mask;
 	long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -95,55 +118,37 @@
 		local_irq_disable();
 
 		pending = softirq_pending(cpu);
-		if (pending & mask) {
-			mask &= ~pending;
+		if (pending && --max_restart)
 			goto restart;
-		}
 		__local_bh_enable();
 
 		if (pending)
+			/*
+			 * In the normal case ksoftirqd is rarely activated,
+			 * increased scheduling hurts performance.
+			 * It's a safety measure: if external load starts
+			 * to flood the system with softirqs then we
+			 * will mitigate softirq work to the softirq thread.
+			 */
 			wakeup_softirqd(cpu);
+		else
+			/*
+			 * All softirqs are handled - undo any possible
+			 * wakeup of softirqd. This reduces context switch
+			 * overhead.
+			 */
+			unwakeup_softirqd(cpu);
 	}
 
 	local_irq_restore(flags);
 }
 
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
-	__cpu_raise_softirq(cpu, nr);
-
-	/*
-	 * If we're in an interrupt or bh, we're done
-	 * (this also catches bh-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or bh.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!(local_irq_count(cpu) | local_bh_count(cpu)))
-		wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
-	long flags;
-
-	local_irq_save(flags);
-	cpu_raise_softirq(smp_processor_id(), nr);
-	local_irq_restore(flags);
-}
-
 void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
 {
 	softirq_vec[nr].data = data;
 	softirq_vec[nr].action = action;
 }
 
-
 /* Tasklets */
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
@@ -157,8 +162,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_vec[cpu].list;
 	tasklet_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+	__cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 void __tasklet_hi_schedule(struct tasklet_struct *t)
@@ -169,8 +175,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_hi_vec[cpu].list;
 	tasklet_hi_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, HI_SOFTIRQ);
+	__cpu_raise_softirq(cpu, HI_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 static void tasklet_action(struct softirq_action *a)
@@ -241,7 +248,6 @@
 	}
 }
 
-
 void tasklet_init(struct tasklet_struct *t,
 		  void (*func)(unsigned long), unsigned long data)
 {
@@ -268,8 +274,6 @@
 	clear_bit(TASKLET_STATE_SCHED, &t->state);
 }
 
-
-
 /* Old style BHs */
 
 static void (*bh_base[32])(void);
@@ -325,7 +329,7 @@
 {
 	int i;
 
-	for (i=0; i<32; i++)
+	for (i = 0; i < 32; i++)
 		tasklet_init(bh_task_vec+i, bh_action, i);
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -361,56 +365,52 @@
 
 static int ksoftirqd(void * __bind_cpu)
 {
-	int bind_cpu = *(int *) __bind_cpu;
-	int cpu = cpu_logical_map(bind_cpu);
+	int cpu = cpu_logical_map((int)__bind_cpu);
 
 	daemonize();
-	current->nice = 19;
+
 	sigfillset(&current->blocked);
 
 	/* Migrate to the right CPU */
-	current->cpus_allowed = 1UL << cpu;
-	while (smp_processor_id() != cpu)
-		schedule();
+	current->cpus_allowed = 1 << cpu;
 
-	sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
+#if CONFIG_SMP
+	sprintf(current->comm, "ksoftirqd CPU%d", cpu);
+#else
+	sprintf(current->comm, "ksoftirqd");
+#endif
 
+	current->nice = 19;
+	schedule();
 	__set_current_state(TASK_INTERRUPTIBLE);
-	mb();
-
 	ksoftirqd_task(cpu) = current;
 
 	for (;;) {
-		if (!softirq_pending(cpu))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
-
-		while (softirq_pending(cpu)) {
+back:
+		do {
 			do_softirq();
 			if (current->need_resched)
-				schedule();
-		}
-
+				goto preempt;
+		} while (softirq_pending(cpu));
+		schedule();
 		__set_current_state(TASK_INTERRUPTIBLE);
 	}
+
+preempt:
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	__set_current_state(TASK_INTERRUPTIBLE);
+	goto back;
 }
 
 static __init int spawn_ksoftirqd(void)
 {
 	int cpu;
 
-	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
-		if (kernel_thread(ksoftirqd, (void *) &cpu,
+	for (cpu = 0; cpu < smp_num_cpus; cpu++)
+		if (kernel_thread(ksoftirqd, (void *) cpu,
 				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
-			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
-		else {
-			while (!ksoftirqd_task(cpu_logical_map(cpu))) {
-				current->policy |= SCHED_YIELD;
-				schedule();
-			}
-		}
-	}
+			BUG();
 
 	return 0;
 }
--- linux/include/linux/netdevice.h.orig	Wed Sep 26 17:04:36 2001
+++ linux/include/linux/netdevice.h	Wed Sep 26 17:08:20 2001
@@ -486,8 +486,9 @@
 		local_irq_save(flags);
 		dev->next_sched = softnet_data[cpu].output_queue;
 		softnet_data[cpu].output_queue = dev;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
@@ -535,8 +536,9 @@
 		local_irq_save(flags);
 		skb->next = softnet_data[cpu].completion_queue;
 		softnet_data[cpu].completion_queue = skb;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
--- linux/include/linux/interrupt.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/interrupt.h	Wed Sep 26 20:57:12 2001
@@ -74,9 +74,14 @@
 asmlinkage void do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
 extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+#define __cpu_raise_softirq(cpu, nr) \
+		do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+
+#define rerun_softirqs(cpu) 					\
+do {								\
+	if (!(local_irq_count(cpu) | local_bh_count(cpu)))	\
+		do_softirq();					\
+} while (0);
 
 
 
--- linux/include/linux/sched.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/sched.h	Wed Sep 26 17:08:16 2001
@@ -556,6 +556,7 @@
 
 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__unwakeup_process(struct task_struct * p, long state));
 extern void FASTCALL(sleep_on(wait_queue_head_t *q));
 extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
 				      signed long timeout));
@@ -574,6 +575,13 @@
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE,  nr)
+
+#define unwakeup_process(tsk,state)		\
+do {						\
+	if (task_on_runqueue(tsk))		\
+		__unwakeup_process(tsk,state);	\
+} while (0)
+
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
--- linux/include/asm-mips/softirq.h.orig	Wed Sep 26 20:58:00 2001
+++ linux/include/asm-mips/softirq.h	Wed Sep 26 20:58:07 2001
@@ -40,6 +40,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-#define __cpu_raise_softirq(cpu, nr)	set_bit(nr, &softirq_pending(cpu))
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/include/asm-mips64/softirq.h.orig	Wed Sep 26 20:58:20 2001
+++ linux/include/asm-mips64/softirq.h	Wed Sep 26 20:58:26 2001
@@ -39,19 +39,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-extern inline void __cpu_raise_softirq(int cpu, int nr)
-{
-	unsigned int *m = (unsigned int *) &softirq_pending(cpu);
-	unsigned int temp;
-
-	__asm__ __volatile__(
-		"1:\tll\t%0, %1\t\t\t# __cpu_raise_softirq\n\t"
-		"or\t%0, %2\n\t"
-		"sc\t%0, %1\n\t"
-		"beqz\t%0, 1b"
-		: "=&r" (temp), "=m" (*m)
-		: "ir" (1UL << nr), "m" (*m)
-		: "memory");
-}
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/net/core/dev.c.orig	Wed Sep 26 17:04:41 2001
+++ linux/net/core/dev.c	Wed Sep 26 17:04:48 2001
@@ -1218,8 +1218,9 @@
 			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			/* Runs from irqs or BH's, no need to wake BH */
-			cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
+			rerun_softirqs(this_cpu);
 #ifndef OFFLINE_SAMPLE
 			get_sample_stats(this_cpu);
 #endif
@@ -1529,8 +1530,9 @@
 	local_irq_disable();
 	netdev_rx_stat[this_cpu].time_squeeze++;
 	/* This already runs in BH context, no need to wake up BH's */
-	cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+	__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 	local_irq_enable();
+	rerun_softirqs(this_cpu);
 
 	NET_PROFILE_LEAVE(softnet_process);
 	return;


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
@ 2001-09-27 23:29 Oleg Nesterov
  2001-09-28  0:03 ` Andrea Arcangeli
  2001-09-28  6:57 ` Ingo Molnar
  0 siblings, 2 replies; 44+ messages in thread
From: Oleg Nesterov @ 2001-09-27 23:29 UTC (permalink / raw)
  To: linux-kernel, mingo

Hello.

I am trying to understand the basics of softirq handling.

It seems to me that ksoftirqd()'s loop can be cleaned up a bit
with the following (untested) patch on top of 2.4.10-softirq-A7.

It also removes the 'mask' variable in do_softirq().

	Oleg

--- 2.4.10-softirq-A7/kernel/softirq.c.orig	Thu Sep 27 22:31:06 2001
+++ 2.4.10-softirq-A7/kernel/softirq.c	Thu Sep 27 22:54:37 2001
@@ -85,7 +85,7 @@
 {
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu = smp_processor_id();
-	__u32 pending, mask;
+	__u32 pending;
 	long flags;
 
 	if (in_interrupt())
@@ -98,7 +98,6 @@
 	if (pending) {
 		struct softirq_action *h;
 
-		mask = ~pending;
 		local_bh_disable();
 restart:
 		/* Reset the pending bitmask before enabling irqs */
@@ -381,26 +380,22 @@
 #endif
 
 	current->nice = 19;
-	schedule();
-	__set_current_state(TASK_INTERRUPTIBLE);
 	ksoftirqd_task(cpu) = current;
 
 	for (;;) {
-back:
+		schedule();
+		__set_current_state(TASK_INTERRUPTIBLE);
+
 		do {
 			do_softirq();
 			if (current->need_resched)
 				goto preempt;
 		} while (softirq_pending(cpu));
-		schedule();
-		__set_current_state(TASK_INTERRUPTIBLE);
-	}
 
+		continue;
 preempt:
-	__set_current_state(TASK_RUNNING);
-	schedule();
-	__set_current_state(TASK_INTERRUPTIBLE);
-	goto back;
+		__set_current_state(TASK_RUNNING);
+	}
 }
 
 static __init int spawn_ksoftirqd(void)

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-26 16:44 [patch] softirq performance fixes, cleanups, 2.4.10 Ingo Molnar
  2001-09-26 17:48 ` Mike Kravetz
  2001-09-26 18:55 ` Russell King
@ 2001-09-27 23:31 ` Andrea Arcangeli
  2001-09-28  3:20   ` Andrea Arcangeli
  2001-09-28  7:18   ` [patch] softirq-2.4.10-B2 Ingo Molnar
  2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
  3 siblings, 2 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-27 23:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise

On Wed, Sep 26, 2001 at 06:44:03PM +0200, Ingo Molnar wrote:

some comment after reading your softirq-2.4.10-A7.

>  - softirq handling can now be restarted N times within do_softirq(), if a
>    softirq gets reactivated while it's being handled.

is this really necessary after introducing the unwakeup logic? What do
you get if you allow at most one softirq pass, as before?

>  - '[ksoftirqd_CPU0]' is confusing on UP systems, changed it to
>    '[ksoftirqd]' instead.

"confusing" for you maybe, not for me, but I don't care about this one
anyways :).

>  - simplified ksoftirqd()'s loop, it's both shorter and faster by a few
>    instructions now.

only one detail: ksoftirqd can show up as sleeping in /proc while it's
runnable, but I don't think that's a problem, and saving on the state
clobbering is probably more sensible.

no other obvious issues, except that I preferred to wait for each
ksoftirqd to start up successfully, to be stricter, and I'd also put an
assert after the schedule() to verify ksoftirqd is running on the right
CPU.

Andrea

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-27 23:29 Oleg Nesterov
@ 2001-09-28  0:03 ` Andrea Arcangeli
  2001-09-28  6:57 ` Ingo Molnar
  1 sibling, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28  0:03 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, mingo

On Fri, Sep 28, 2001 at 03:29:13AM +0400, Oleg Nesterov wrote:
> --- 2.4.10-softirq-A7/kernel/softirq.c.orig	Thu Sep 27 22:31:06 2001
> +++ 2.4.10-softirq-A7/kernel/softirq.c	Thu Sep 27 22:54:37 2001
> @@ -85,7 +85,7 @@
>  {
>  	int max_restart = MAX_SOFTIRQ_RESTART;
>  	int cpu = smp_processor_id();
> -	__u32 pending, mask;
> +	__u32 pending;
>  	long flags;
>  
>  	if (in_interrupt())
> @@ -98,7 +98,6 @@
>  	if (pending) {
>  		struct softirq_action *h;
>  
> -		mask = ~pending;
>  		local_bh_disable();
>  restart:
>  		/* Reset the pending bitmask before enabling irqs */

correct.

> @@ -381,26 +380,22 @@
>  #endif
>  
>  	current->nice = 19;
> -	schedule();
> -	__set_current_state(TASK_INTERRUPTIBLE);

buggy (check cpus_allowed).

>  	for (;;) {
> -back:
> +		schedule();
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +
>  		do {
>  			do_softirq();
>  			if (current->need_resched)
>  				goto preempt;
>  		} while (softirq_pending(cpu));
> -		schedule();
> -		__set_current_state(TASK_INTERRUPTIBLE);
> -	}
>  
> +		continue;
>  preempt:
> -	__set_current_state(TASK_RUNNING);
> -	schedule();
> -	__set_current_state(TASK_INTERRUPTIBLE);
> -	goto back;
> +		__set_current_state(TASK_RUNNING);
> +	}
>  }

you dropped Ingo's optimization (but you resurrected the stricter /proc
statistics).

Andrea

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
@ 2001-09-28  2:50 Oleg Nesterov
  2001-09-28  7:56 ` Ingo Molnar
  0 siblings, 1 reply; 44+ messages in thread
From: Oleg Nesterov @ 2001-09-28  2:50 UTC (permalink / raw)
  To: andrea, linux-kernel

Hello.

Thanks a lot for your response.

On Fri, Sep 28, 2001 at 04:03:39 +0400, Andrea Arcangeli wrote:
> > @@ -381,26 +380,22 @@
> >  #endif
> >  
> >  	current->nice = 19;
> > -	schedule();
> > -	__set_current_state(TASK_INTERRUPTIBLE);
> 
> buggy (check cpus_allowed).

Why? The next three lines are

	for (;;) {
		schedule();
		__set_current_state(TASK_INTERRUPTIBLE);

And unless I misunderstand schedule()'s

still_running:
	if (!(prev->cpus_allowed & (1UL << this_cpu)))
		goto still_running_back;

Ingo's patch was buggy as well.

> you dropped Ingo's optimization (but you resurrected the strictier /proc
> statistics).

Again, I don't understand. The new loop

	for (;;) {
		schedule();
		__set_current_state(TASK_INTERRUPTIBLE);

		do {
			do_softirq();
			if (current->need_resched)
				goto preempt;
		} while (softirq_pending(cpu));

		continue;
preempt:
		__set_current_state(TASK_RUNNING);
	}

seems to be _equivalent_ to Ingo's ... 

What am I missing? I apologize in advance if it is
something obvious.

Oleg

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-27 23:31 ` Andrea Arcangeli
@ 2001-09-28  3:20   ` Andrea Arcangeli
  2001-09-28  7:34     ` Ingo Molnar
  2001-09-28  7:18   ` [patch] softirq-2.4.10-B2 Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28  3:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise

On Fri, Sep 28, 2001 at 01:31:06AM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 26, 2001 at 06:44:03PM +0200, Ingo Molnar wrote:
> 
> some comment after reading your softirq-2.4.10-A7.
> 
> >  - softirq handling can now be restarted N times within do_softirq(), if a
> >    softirq gets reactivated while it's being handled.
> 
> is this really necessary after introducing the unwakeup logic? What do
> you get if you allow at max 1 softirq pass as before?

I'm just curious, what are the numbers of your A7 patch compared with
this one?

diff -urN 2.4.10/include/asm-mips/softirq.h softirq/include/asm-mips/softirq.h
--- 2.4.10/include/asm-mips/softirq.h	Sun Sep 23 21:11:41 2001
+++ softirq/include/asm-mips/softirq.h	Fri Sep 28 05:18:45 2001
@@ -40,6 +40,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-#define __cpu_raise_softirq(cpu, nr)	set_bit(nr, &softirq_pending(cpu))
-
 #endif /* _ASM_SOFTIRQ_H */
diff -urN 2.4.10/include/asm-mips64/softirq.h softirq/include/asm-mips64/softirq.h
--- 2.4.10/include/asm-mips64/softirq.h	Sun Sep 23 21:11:41 2001
+++ softirq/include/asm-mips64/softirq.h	Fri Sep 28 05:18:45 2001
@@ -39,19 +39,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-extern inline void __cpu_raise_softirq(int cpu, int nr)
-{
-	unsigned int *m = (unsigned int *) &softirq_pending(cpu);
-	unsigned int temp;
-
-	__asm__ __volatile__(
-		"1:\tll\t%0, %1\t\t\t# __cpu_raise_softirq\n\t"
-		"or\t%0, %2\n\t"
-		"sc\t%0, %1\n\t"
-		"beqz\t%0, 1b"
-		: "=&r" (temp), "=m" (*m)
-		: "ir" (1UL << nr), "m" (*m)
-		: "memory");
-}
-
 #endif /* _ASM_SOFTIRQ_H */
diff -urN 2.4.10/include/linux/interrupt.h softirq/include/linux/interrupt.h
--- 2.4.10/include/linux/interrupt.h	Fri Sep 28 04:08:52 2001
+++ softirq/include/linux/interrupt.h	Fri Sep 28 05:18:45 2001
@@ -74,9 +74,14 @@
 asmlinkage void do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
 extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+#define __cpu_raise_softirq(cpu, nr) \
+		do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+
+#define rerun_softirqs(cpu) 					\
+do {								\
+	if (!(local_irq_count(cpu) | local_bh_count(cpu)))	\
+		do_softirq();					\
+} while (0);
 
 
 
diff -urN 2.4.10/include/linux/netdevice.h softirq/include/linux/netdevice.h
--- 2.4.10/include/linux/netdevice.h	Fri Sep 28 04:09:37 2001
+++ softirq/include/linux/netdevice.h	Fri Sep 28 05:18:45 2001
@@ -486,8 +486,9 @@
 		local_irq_save(flags);
 		dev->next_sched = softnet_data[cpu].output_queue;
 		softnet_data[cpu].output_queue = dev;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
@@ -535,8 +536,9 @@
 		local_irq_save(flags);
 		skb->next = softnet_data[cpu].completion_queue;
 		softnet_data[cpu].completion_queue = skb;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
diff -urN 2.4.10/include/linux/sched.h softirq/include/linux/sched.h
--- 2.4.10/include/linux/sched.h	Sun Sep 23 21:11:42 2001
+++ softirq/include/linux/sched.h	Fri Sep 28 05:18:45 2001
@@ -556,6 +556,7 @@
 
 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__unwakeup_process(struct task_struct * p, long state));
 extern void FASTCALL(sleep_on(wait_queue_head_t *q));
 extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
 				      signed long timeout));
@@ -574,6 +575,13 @@
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE,  nr)
+
+#define unwakeup_process(tsk,state)		\
+do {						\
+	if (task_on_runqueue(tsk))		\
+		__unwakeup_process(tsk,state);	\
+} while (0)
+
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
diff -urN 2.4.10/kernel/ksyms.c softirq/kernel/ksyms.c
--- 2.4.10/kernel/ksyms.c	Sun Sep 23 21:11:43 2001
+++ softirq/kernel/ksyms.c	Fri Sep 28 05:18:45 2001
@@ -538,8 +538,6 @@
 EXPORT_SYMBOL(tasklet_kill);
 EXPORT_SYMBOL(__run_task_queue);
 EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
 EXPORT_SYMBOL(__tasklet_schedule);
 EXPORT_SYMBOL(__tasklet_hi_schedule);
 
diff -urN 2.4.10/kernel/sched.c softirq/kernel/sched.c
--- 2.4.10/kernel/sched.c	Sun Sep 23 21:11:43 2001
+++ softirq/kernel/sched.c	Fri Sep 28 05:18:46 2001
@@ -366,6 +366,28 @@
 }
 
 /**
+ * unwakeup - undo wakeup if possible.
+ * @p: task
+ * @state: new task state
+ *
+ * Undo a previous wakeup of the specified task - if the process
+ * is not running already. The main interface to be used is
+ * unwakeup_process(), it will do a lockless test whether the task
+ * is on the runqueue.
+ */
+void __unwakeup_process(struct task_struct * p, long state)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&runqueue_lock, flags);
+	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {
+		del_from_runqueue(p);
+		p->state = state;
+	}
+	spin_unlock_irqrestore(&runqueue_lock, flags);
+}
+
+/**
  * schedule_timeout - sleep until timeout
  * @timeout: timeout value in jiffies
  *
diff -urN 2.4.10/kernel/softirq.c softirq/kernel/softirq.c
--- 2.4.10/kernel/softirq.c	Sun Sep 23 21:11:43 2001
+++ softirq/kernel/softirq.c	Fri Sep 28 05:18:47 2001
@@ -58,12 +58,23 @@
 		wake_up_process(tsk);
 }
 
+/*
+ * If a softirqs were fully handled after ksoftirqd was woken
+ * up then try to undo the wakeup.
+ */
+static inline void unwakeup_softirqd(unsigned cpu)
+{
+	struct task_struct * tsk = ksoftirqd_task(cpu);
+
+	if (tsk)
+		unwakeup_process(tsk, TASK_INTERRUPTIBLE);
+}
+
 asmlinkage void do_softirq()
 {
 	int cpu = smp_processor_id();
-	__u32 pending;
+	__u32 pending, mask;
 	long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -102,48 +113,32 @@
 		__local_bh_enable();
 
 		if (pending)
+			/*
+			 * In the normal case ksoftirqd is rarely activated,
+			 * increased scheduling hurts performance.
+			 * It's a safety measure: if external load starts
+			 * to flood the system with softirqs then we
+			 * will mitigate softirq work to the softirq thread.
+			 */
 			wakeup_softirqd(cpu);
+		else
+			/*
+			 * All softirqs are handled - undo any possible
+			 * wakeup of softirqd. This reduces context switch
+			 * overhead.
+			 */
+			unwakeup_softirqd(cpu);
 	}
 
 	local_irq_restore(flags);
 }
 
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
-	__cpu_raise_softirq(cpu, nr);
-
-	/*
-	 * If we're in an interrupt or bh, we're done
-	 * (this also catches bh-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or bh.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!(local_irq_count(cpu) | local_bh_count(cpu)))
-		wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
-	long flags;
-
-	local_irq_save(flags);
-	cpu_raise_softirq(smp_processor_id(), nr);
-	local_irq_restore(flags);
-}
-
 void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
 {
 	softirq_vec[nr].data = data;
 	softirq_vec[nr].action = action;
 }
 
-
 /* Tasklets */
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
@@ -157,8 +152,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_vec[cpu].list;
 	tasklet_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+	__cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 void __tasklet_hi_schedule(struct tasklet_struct *t)
@@ -169,8 +165,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_hi_vec[cpu].list;
 	tasklet_hi_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, HI_SOFTIRQ);
+	__cpu_raise_softirq(cpu, HI_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 static void tasklet_action(struct softirq_action *a)
@@ -241,7 +238,6 @@
 	}
 }
 
-
 void tasklet_init(struct tasklet_struct *t,
 		  void (*func)(unsigned long), unsigned long data)
 {
@@ -268,8 +264,6 @@
 	clear_bit(TASKLET_STATE_SCHED, &t->state);
 }
 
-
-
 /* Old style BHs */
 
 static void (*bh_base[32])(void);
@@ -325,7 +319,7 @@
 {
 	int i;
 
-	for (i=0; i<32; i++)
+	for (i = 0; i < 32; i++)
 		tasklet_init(bh_task_vec+i, bh_action, i);
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -361,39 +355,45 @@
 
 static int ksoftirqd(void * __bind_cpu)
 {
-	int bind_cpu = *(int *) __bind_cpu;
-	int cpu = cpu_logical_map(bind_cpu);
+	int cpu = cpu_logical_map((int)__bind_cpu);
 
 	daemonize();
-	current->nice = 19;
+
 	sigfillset(&current->blocked);
 
 	/* Migrate to the right CPU */
 	current->cpus_allowed = 1UL << cpu;
-	while (smp_processor_id() != cpu)
-		schedule();
 
-	sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
+#if CONFIG_SMP
+	sprintf(current->comm, "ksoftirqd CPU%d", cpu);
+#else
+	sprintf(current->comm, "ksoftirqd");
+#endif
 
+	current->nice = 19;
+	schedule();
+	if (smp_processor_id() != cpu)
+		BUG();
 	__set_current_state(TASK_INTERRUPTIBLE);
 	mb();
-
 	ksoftirqd_task(cpu) = current;
 
 	for (;;) {
-		if (!softirq_pending(cpu))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
-
-		while (softirq_pending(cpu)) {
+		schedule();
+		do {
 			do_softirq();
 			if (current->need_resched)
-				schedule();
-		}
-
+				goto preempt;
+		back:
+			;
+		} while (softirq_pending(cpu));
 		__set_current_state(TASK_INTERRUPTIBLE);
 	}
+
+preempt:
+	__set_current_state(TASK_RUNNING);
+	schedule();
+	goto back;
 }
 
 static __init int spawn_ksoftirqd(void)
@@ -401,7 +401,7 @@
 	int cpu;
 
 	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
-		if (kernel_thread(ksoftirqd, (void *) &cpu,
+		if (kernel_thread(ksoftirqd, (void *) cpu,
 				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
 			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
 		else {
diff -urN 2.4.10/net/core/dev.c softirq/net/core/dev.c
--- 2.4.10/net/core/dev.c	Sun Sep 23 21:11:43 2001
+++ softirq/net/core/dev.c	Fri Sep 28 05:18:46 2001
@@ -1218,8 +1218,9 @@
 			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			/* Runs from irqs or BH's, no need to wake BH */
-			cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
+			rerun_softirqs(this_cpu);
 #ifndef OFFLINE_SAMPLE
 			get_sample_stats(this_cpu);
 #endif
@@ -1529,8 +1530,9 @@
 	local_irq_disable();
 	netdev_rx_stat[this_cpu].time_squeeze++;
 	/* This already runs in BH context, no need to wake up BH's */
-	cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+	__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 	local_irq_enable();
+	rerun_softirqs(this_cpu);
 
 	NET_PROFILE_LEAVE(softnet_process);
 	return;

Andrea

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-27 23:29 Oleg Nesterov
  2001-09-28  0:03 ` Andrea Arcangeli
@ 2001-09-28  6:57 ` Ingo Molnar
  1 sibling, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28  6:57 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel


On Fri, 28 Sep 2001, Oleg Nesterov wrote:

> I am trying to understand the basics of softirq handling.
>
> It seems to me that ksoftirqd()'s loop can be cleaned up a bit with
> the following (untested) patch on top of 2.4.10-softirq-A7.

thanks - the patch looks mostly correct.

> It also removes	the 'mask' variable in do_softirq().

(right, applied.)

> -	schedule();
> -	__set_current_state(TASK_INTERRUPTIBLE);

we need a single schedule() to migrate the task to the right CPU. Btw,
it's not a big problem to run on the wrong CPU: do_softirq() will still
run correctly. But there won't be a ksoftirqd executing on the 'right'
CPU (for the short amount of time until it schedules once), and we might
miss a single instance of softirq processing. So it's cleaner to do a
schedule() that guarantees that we migrate to the target CPU.

> +		__set_current_state(TASK_RUNNING);
> +	}

well, this basically reverts it to the original code and removes the
optimization - but i don't have strong feelings either way. I think
Andrea is right that it's more correct to set it to TASK_RUNNING, for
task-statistics reasons. Reverted it to the old code.

and in fact while doing that i found another bug in the old code:
TASK_INTERRUPTIBLE was set in a non-IRQ-atomic way, so theoretically it
could be possible that the compiler reorders the setting of
TASK_INTERRUPTIBLE and the test for softirq_pending(), and if an IRQ got
in between then we could miss processing softirqs for 1/HZ again. Added
an optimization barrier(). (SMP-consistency is not needed, since we are
per-CPU.)
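that ordering hazard can be sketched in plain userspace C. This is my own toy model, not the kernel code: task_state, pending and fake_irq() are stand-ins, and barrier() is the usual GCC compiler barrier.

```c
#include <assert.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

volatile int task_state;        /* 0 = TASK_RUNNING, 1 = TASK_INTERRUPTIBLE */
volatile unsigned int pending;  /* softirq_pending(cpu) stand-in */

/* A hardirq raising a softirq at the worst possible moment. */
void fake_irq(void)
{
	pending |= 1;
}

/* ksoftirqd's sleep path: mark ourselves sleeping first, then re-check
 * pending. The barrier() keeps the compiler from hoisting the pending
 * test above the state store; without it, a reordered test could run
 * before the IRQ lands and the activation would sit unprocessed for
 * up to 1/HZ. Returns 1 if the late activation was caught. */
int sleep_path(void)
{
	task_state = 1;         /* __set_current_state(TASK_INTERRUPTIBLE) */
	barrier();
	fake_irq();             /* IRQ lands after the state was set */
	if (pending) {          /* re-check catches the late softirq */
		task_state = 0;
		return 1;
	}
	return 0;               /* nothing pending: really go to sleep */
}
```

with the barrier in place, the re-check is guaranteed to happen after the state store, so the deterministic fake_irq() here is always caught.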

	Ingo


* [patch] softirq-2.4.10-B2
  2001-09-27 23:31 ` Andrea Arcangeli
  2001-09-28  3:20   ` Andrea Arcangeli
@ 2001-09-28  7:18   ` Ingo Molnar
  2001-09-28 15:58     ` Andrea Arcangeli
  2001-09-28 18:36     ` Simon Kirby
  1 sibling, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28  7:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5426 bytes --]


On Fri, 28 Sep 2001, Andrea Arcangeli wrote:

> some comment after reading your softirq-2.4.10-A7.
>
> >  - softirq handling can now be restarted N times within do_softirq(), if a
> >    softirq gets reactivated while it's being handled.
>
> is this really necessary after introducing the unwakeup logic? What do
> you get if you allow at max 1 softirq pass as before?

yes, of course it's necessary. The reason is really simple: softirqs have
a natural ordering, and if we are handling softirq #2 while softirq #1
gets reactivated, nothing will process softirq #1 if we do only a single
loop. I explained this in full detail during my previous softirq patch a
few months ago. The unwakeup logic only reverts the wakeup of ksoftirqd,
it will not magically process pending softirqs! I explained all these
effects in detail when i wrote the first softirq-looping patch, and when
you originally came up with ksoftirqd.

(Even with just a single softirq activated, if we are just on the way out
of handling softirqs (the first and last time) at the syscall level, and a
hardirq comes in that activates this softirq, then there is nothing that
will process the softirq: 1) the hardirq's own handling of softirqs is
inhibited due to the softirq-atomic section within do_softirq(), and 2)
this loop will not process the new work since it's on the way out.)

basically the problem is that there is a big 'gap' between the activation
of softirqs, and the time when ksoftirqd starts running. There are a
number of mechanisms within the networking stack that are quite
timing-sensitive. And generally, if there is work A and work B that are
related, and we've executed work A (the hardware interrupt), then it's
almost always the best idea to execute work B as soon as possible. Any
'delaying' of work B should only be done for non-performance reasons:
eg. fairness in this case. Now it's MAX_SOFTIRQ_RESTART that balances
performance against fairness. (in most kernel subsystems we almost always
give preference to performance over fairness - without ignoring fairness
of course.)
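as a rough illustration of that balance, here is a userspace toy model (my own sketch, not kernel code): work that keeps getting re-raised while being handled is processed for up to MAX_SOFTIRQ_RESTART passes, and anything still pending after that is deferred to ksoftirqd.

```c
#include <assert.h>

#define MAX_SOFTIRQ_RESTART 10

/* rearm_passes: how many times the work is re-activated while it is
 * being handled. *deferred is set to 1 if work was left over for
 * ksoftirqd. Returns the number of passes done inside this
 * "do_softirq" invocation. */
int do_softirq_model(int rearm_passes, int *deferred)
{
	int max_restart = MAX_SOFTIRQ_RESTART;
	int pending = 1;
	int passes = 0;

	while (pending) {
		passes++;                       /* handle pending work */
		pending = (rearm_passes-- > 0); /* re-raised meanwhile? */
		if (pending && !--max_restart)
			break;  /* fairness limit hit: wake ksoftirqd */
	}
	*deferred = pending;
	return passes;
}
```

so a burst that rearms itself three times is fully drained in four passes with no context switch, while an endless flood is capped at ten passes and then mitigated to the thread.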

there is another bad side-effect of ksoftirqd as well: if it's
relatively inactive for some time then it will 'collect' current->counter
scheduler ticks, basically boosting its priority way above that of the
intended ->nice = 19. It will then often 'suck' softirq handling to
itself, due to its more aggressive scheduling position. To combat this
effect, i've modified ksoftirqd to do:

	if (current->counter > 1)
		current->counter = 1;

(this is a tiny bit racy wrt. the timer interrupt, but it's harmless.)

the current form of softirqs was designed by Alexey and David for the
purposes of high-performance networking, as part of the 'softnet' effort.
Networking remains the biggest user of softirqs - while there are a few
cases of high-frequency tasklet use, generally it's the network stack's
TX_SOFTIRQ and RX_SOFTIRQ workload that we care about most - and tasklets.
(see the tasklet fixes in the patch.) With TX-completion-IRQ capable
cards, there can be a constant and separate TX and RX softirq workload.

especially under high loads, the work done in the 'later' net-softirq,
NET_RX_SOFTIRQ, can mount up, and thus the amount of pending work within
NET_TX_SOFTIRQ can mount up too. Furthermore, there is a mechanism within
both the tx and rx softirqs that can break out of softirq handling before
all work has been handled: if a jiffy (10 msecs) has passed, or if we have
processed more than netdev_max_backlog (default: 300) packets.
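a minimal userspace model of that break-out logic, under the stated defaults (net_rx_model and jiffy_step are invented names, and time is faked by advancing a jiffies counter per packet):

```c
#include <assert.h>

#define NETDEV_MAX_BACKLOG 300

/* Process up to `queued` packets, stopping early once the backlog
 * budget is spent or one jiffy has elapsed, whichever comes first.
 * jiffy_step fakes how many jiffies pass per packet processed.
 * Returns the number of packets handled in this run; the remainder
 * would stay queued with the softirq raised again. */
int net_rx_model(int queued, int jiffy_step)
{
	unsigned long start = 0, jiffies = 0;
	int budget = NETDEV_MAX_BACKLOG;
	int done = 0;

	while (queued-- > 0) {
		done++;                 /* handle one packet */
		jiffies += jiffy_step;  /* time passes */
		if (--budget <= 0 || jiffies - start >= 1)
			break;          /* out of budget or out of time */
	}
	return done;
}
```

a light queue drains completely, a heavy one is cut off at 300 packets, and a slow consumer is cut off as soon as a jiffy has elapsed.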

there are a number of other options i experimented with:

 - handling softirqs in schedule(), before runqueue_lock is taken, in a
   softirq- and irq- atomic way, unless ->need_resched is set. This was
   done in earlier kernels, and might be a good idea to do again =>
   especially with unwakeup(). The downside is extra cost within
   schedule().

 - tuning the amount of work within the tx/rx handlers, both increasing
   and decreasing the amount of packets. Decreasing the amount of work has
   the effect of decreasing the latency of processing RX-triggered TX
   events (such as ACK), and generally handling TX/RX events more
   smoothly, but it also has the effect of increasing the cache footprint.

 - exchanging the order of tx and rx softirqs.

 - using jiffies within do_softirq() to make sure it does not execute for
   more than 10-20 msecs.

 - feeding back a 'work left' integer through the ->action functions to
   do_softirq() - which can then decide which softirq to restart.
   (basically a mini softirq scheduler.)

this latter one looked pretty powerful because it provides more
information to the generic layer - but it's something i think might be too
intrusive for 2.4. For now, the simplest and most effective method of all
was the looping.

- i've done one more refinement to the current patch: do_softirq() now
  checks current->need_resched and it will break out of softirq processing
  if it's 1. Note that do_softirq() is a rare function which *must not* do
  '!current->need_resched': poll_idle() uses need_resched == -1 as a
  special value. (but normally irq-level code does not check
  ->need_resched so this is a special case). This lets irqs that hit
  the idle-poll task do normal softirq processing, and not break out
  after one loop.
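the `!= 1` subtlety can be modeled in a few lines of userspace C (my sketch; restart_passes and its arguments are invented): a task with need_resched == 1 gets exactly one pass, while the idle-poll task's special -1 value still allows full looping.

```c
#include <assert.h>

/* How many handling passes the restart condition
 *     pending && --max_restart && (need_resched != 1)
 * grants. `work` is how many passes it takes to drain the softirq. */
int restart_passes(int need_resched, int work)
{
	int max_restart = 10;
	int passes = 0;
	int pending;

	do {
		passes++;               /* one softirq handling pass */
		pending = (--work > 0); /* still work left afterwards? */
	} while (pending && --max_restart && need_resched != 1);

	return passes;
}
```

had the condition been written as `!need_resched`, the idle task's -1 would also have forced a break after one pass, which is exactly what the special-casing avoids.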

i've attached softirq-2.4.10-B2, which has your TASK_RUNNING suggestion,
Oleg's fixes and this change included.

	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 11556 bytes --]

--- linux/kernel/ksyms.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/ksyms.c	Wed Sep 26 17:04:48 2001
@@ -538,8 +538,6 @@
 EXPORT_SYMBOL(tasklet_kill);
 EXPORT_SYMBOL(__run_task_queue);
 EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
 EXPORT_SYMBOL(__tasklet_schedule);
 EXPORT_SYMBOL(__tasklet_hi_schedule);
 
--- linux/kernel/sched.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/sched.c	Wed Sep 26 17:04:48 2001
@@ -366,6 +366,28 @@
 }
 
 /**
+ * unwakeup - undo wakeup if possible.
+ * @p: task
+ * @state: new task state
+ *
+ * Undo a previous wakeup of the specified task - if the process
+ * is not running already. The main interface to be used is
+ * unwakeup_process(), it will do a lockless test whether the task
+ * is on the runqueue.
+ */
+void __unwakeup_process(struct task_struct * p, long state)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&runqueue_lock, flags);
+	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {
+		del_from_runqueue(p);
+		p->state = state;
+	}
+	spin_unlock_irqrestore(&runqueue_lock, flags);
+}
+
+/**
  * schedule_timeout - sleep until timeout
  * @timeout: timeout value in jiffies
  *
--- linux/kernel/softirq.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/softirq.c	Fri Sep 28 08:56:08 2001
@@ -58,12 +58,35 @@
 		wake_up_process(tsk);
 }
 
+/*
+ * If a softirqs were fully handled after ksoftirqd was woken
+ * up then try to undo the wakeup.
+ */
+static inline void unwakeup_softirqd(unsigned cpu)
+{
+	struct task_struct * tsk = ksoftirqd_task(cpu);
+
+	if (tsk)
+		unwakeup_process(tsk, TASK_INTERRUPTIBLE);
+}
+
+/*
+ * We restart softirq processing MAX_SOFTIRQ_RESTART times,
+ * and we fall back to softirqd after that.
+ *
+ * This number has been established via experimentation.
+ * The two things to balance is latency against fairness -
+ * we want to handle softirqs as soon as possible, but they
+ * should not be able to lock up the box.
+ */
+#define MAX_SOFTIRQ_RESTART 10
+
 asmlinkage void do_softirq()
 {
+	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu = smp_processor_id();
 	__u32 pending;
 	long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -75,7 +98,6 @@
 	if (pending) {
 		struct softirq_action *h;
 
-		mask = ~pending;
 		local_bh_disable();
 restart:
 		/* Reset the pending bitmask before enabling irqs */
@@ -95,55 +117,37 @@
 		local_irq_disable();
 
 		pending = softirq_pending(cpu);
-		if (pending & mask) {
-			mask &= ~pending;
+		if (pending && --max_restart && (current->need_resched != 1))
 			goto restart;
-		}
 		__local_bh_enable();
 
 		if (pending)
+			/*
+			 * In the normal case ksoftirqd is rarely activated,
+			 * increased scheduling hurts performance.
+			 * It's a safety measure: if external load starts
+			 * to flood the system with softirqs then we
+			 * will mitigate softirq work to the softirq thread.
+			 */
 			wakeup_softirqd(cpu);
+		else
+			/*
+			 * All softirqs are handled - undo any possible
+			 * wakeup of softirqd. This reduces context switch
+			 * overhead.
+			 */
+			unwakeup_softirqd(cpu);
 	}
 
 	local_irq_restore(flags);
 }
 
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
-	__cpu_raise_softirq(cpu, nr);
-
-	/*
-	 * If we're in an interrupt or bh, we're done
-	 * (this also catches bh-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or bh.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!(local_irq_count(cpu) | local_bh_count(cpu)))
-		wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
-	long flags;
-
-	local_irq_save(flags);
-	cpu_raise_softirq(smp_processor_id(), nr);
-	local_irq_restore(flags);
-}
-
 void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
 {
 	softirq_vec[nr].data = data;
 	softirq_vec[nr].action = action;
 }
 
-
 /* Tasklets */
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
@@ -157,8 +161,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_vec[cpu].list;
 	tasklet_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+	__cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 void __tasklet_hi_schedule(struct tasklet_struct *t)
@@ -169,8 +174,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_hi_vec[cpu].list;
 	tasklet_hi_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, HI_SOFTIRQ);
+	__cpu_raise_softirq(cpu, HI_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 static void tasklet_action(struct softirq_action *a)
@@ -241,7 +247,6 @@
 	}
 }
 
-
 void tasklet_init(struct tasklet_struct *t,
 		  void (*func)(unsigned long), unsigned long data)
 {
@@ -268,8 +273,6 @@
 	clear_bit(TASKLET_STATE_SCHED, &t->state);
 }
 
-
-
 /* Old style BHs */
 
 static void (*bh_base[32])(void);
@@ -325,7 +328,7 @@
 {
 	int i;
 
-	for (i=0; i<32; i++)
+	for (i = 0; i < 32; i++)
 		tasklet_init(bh_task_vec+i, bh_action, i);
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -361,38 +364,42 @@
 
 static int ksoftirqd(void * __bind_cpu)
 {
-	int bind_cpu = *(int *) __bind_cpu;
-	int cpu = cpu_logical_map(bind_cpu);
+	int cpu = cpu_logical_map((int)__bind_cpu);
 
 	daemonize();
-	current->nice = 19;
+
 	sigfillset(&current->blocked);
 
 	/* Migrate to the right CPU */
-	current->cpus_allowed = 1UL << cpu;
-	while (smp_processor_id() != cpu)
-		schedule();
+	current->cpus_allowed = 1 << cpu;
 
-	sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
-
-	__set_current_state(TASK_INTERRUPTIBLE);
-	mb();
+#if CONFIG_SMP
+	sprintf(current->comm, "ksoftirqd CPU%d", cpu);
+#else
+	sprintf(current->comm, "ksoftirqd");
+#endif
 
+	current->nice = 19;
+	schedule();
 	ksoftirqd_task(cpu) = current;
 
 	for (;;) {
-		if (!softirq_pending(cpu))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
-
 		while (softirq_pending(cpu)) {
 			do_softirq();
 			if (current->need_resched)
-				schedule();
+				goto preempt;
 		}
 
 		__set_current_state(TASK_INTERRUPTIBLE);
+		/* This has to be here to make the test IRQ-correct. */
+		barrier();
+		if (!softirq_pending(cpu)) {
+preempt:
+			if (current->counter > 1)
+				current->counter = 1;
+			schedule();
+		}
+		__set_current_state(TASK_RUNNING);
 	}
 }
 
@@ -400,17 +407,10 @@
 {
 	int cpu;
 
-	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
-		if (kernel_thread(ksoftirqd, (void *) &cpu,
+	for (cpu = 0; cpu < smp_num_cpus; cpu++)
+		if (kernel_thread(ksoftirqd, (void *) cpu,
 				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
-			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
-		else {
-			while (!ksoftirqd_task(cpu_logical_map(cpu))) {
-				current->policy |= SCHED_YIELD;
-				schedule();
-			}
-		}
-	}
+			BUG();
 
 	return 0;
 }
--- linux/include/linux/netdevice.h.orig	Wed Sep 26 17:04:36 2001
+++ linux/include/linux/netdevice.h	Fri Sep 28 07:44:01 2001
@@ -486,8 +486,9 @@
 		local_irq_save(flags);
 		dev->next_sched = softnet_data[cpu].output_queue;
 		softnet_data[cpu].output_queue = dev;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
@@ -535,8 +536,9 @@
 		local_irq_save(flags);
 		skb->next = softnet_data[cpu].completion_queue;
 		softnet_data[cpu].completion_queue = skb;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
--- linux/include/linux/interrupt.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/interrupt.h	Fri Sep 28 07:44:01 2001
@@ -74,9 +74,14 @@
 asmlinkage void do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
 extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+#define __cpu_raise_softirq(cpu, nr) \
+		do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+
+#define rerun_softirqs(cpu) 					\
+do {								\
+	if (!(local_irq_count(cpu) | local_bh_count(cpu)))	\
+		do_softirq();					\
+} while (0);
 
 
 
--- linux/include/linux/sched.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/sched.h	Fri Sep 28 07:44:01 2001
@@ -556,6 +556,7 @@
 
 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__unwakeup_process(struct task_struct * p, long state));
 extern void FASTCALL(sleep_on(wait_queue_head_t *q));
 extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
 				      signed long timeout));
@@ -574,6 +575,13 @@
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE,  nr)
+
+#define unwakeup_process(tsk,state)		\
+do {						\
+	if (task_on_runqueue(tsk))		\
+		__unwakeup_process(tsk,state);	\
+} while (0)
+
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
--- linux/include/asm-mips/softirq.h.orig	Wed Sep 26 20:58:00 2001
+++ linux/include/asm-mips/softirq.h	Wed Sep 26 20:58:07 2001
@@ -40,6 +40,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-#define __cpu_raise_softirq(cpu, nr)	set_bit(nr, &softirq_pending(cpu))
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/include/asm-mips64/softirq.h.orig	Wed Sep 26 20:58:20 2001
+++ linux/include/asm-mips64/softirq.h	Wed Sep 26 20:58:26 2001
@@ -39,19 +39,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-extern inline void __cpu_raise_softirq(int cpu, int nr)
-{
-	unsigned int *m = (unsigned int *) &softirq_pending(cpu);
-	unsigned int temp;
-
-	__asm__ __volatile__(
-		"1:\tll\t%0, %1\t\t\t# __cpu_raise_softirq\n\t"
-		"or\t%0, %2\n\t"
-		"sc\t%0, %1\n\t"
-		"beqz\t%0, 1b"
-		: "=&r" (temp), "=m" (*m)
-		: "ir" (1UL << nr), "m" (*m)
-		: "memory");
-}
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/net/core/dev.c.orig	Wed Sep 26 17:04:41 2001
+++ linux/net/core/dev.c	Wed Sep 26 17:04:48 2001
@@ -1218,8 +1218,9 @@
 			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			/* Runs from irqs or BH's, no need to wake BH */
-			cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
+			rerun_softirqs(this_cpu);
 #ifndef OFFLINE_SAMPLE
 			get_sample_stats(this_cpu);
 #endif
@@ -1529,8 +1530,9 @@
 	local_irq_disable();
 	netdev_rx_stat[this_cpu].time_squeeze++;
 	/* This already runs in BH context, no need to wake up BH's */
-	cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+	__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 	local_irq_enable();
+	rerun_softirqs(this_cpu);
 
 	NET_PROFILE_LEAVE(softnet_process);
 	return;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28  3:20   ` Andrea Arcangeli
@ 2001-09-28  7:34     ` Ingo Molnar
  2001-09-28 15:17       ` Andrea Arcangeli
  0 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28  7:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise


On Fri, 28 Sep 2001, Andrea Arcangeli wrote:

> I'm just curious, what are the numbers of your A7 patch compared with
> this one?

well, a quick glance shows that this is mostly the -A7 patch with the
single-loop omitted, right? I tested it before (the unwakeup mechanism is
orthogonal to the looping concept), and while unwakeup alone helps
somewhat, its effect cannot be compared to the effects of looping. Please
check my previous mail for details, i tried a large set of other
combinations as well.

frankly, i don't understand what your problem is with the looping concept.
It's actually simpler than the mask-bitmap version, and it ensures
fine-grained handling and low latencies of softirq handlers. We loop in the
softirq handlers themselves already. Ideally we'd like to loop until all
work is done - but we exit on some limits, and it's generally a good idea
to not let the handlers themselves loop for too long (to get good
scheduling latencies). And we can increase/decrease MAX_SOFTIRQ_RESTART if
it's ever shown to be excessive/insufficient. (We could even runtime-tune
it - but i thought that would be overdesigning for a nonexistent problem.)

(i had an interim version of the patch which had a sysctl-tunable
MAX_SOFTIRQ_RESTART - this is where the value of '10' comes from. A value
of '1' means a single-loop. Just to make sure i also tested the 'mask'
version which is a bit more than just a single loop in the current patch -
but for tx/rx purposes it's almost equivalent to a single loop.)

	Ingo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28  2:50 Oleg Nesterov
@ 2001-09-28  7:56 ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28  7:56 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: andrea, linux-kernel


On Fri, 28 Sep 2001, Oleg Nesterov wrote:

> > >  	current->nice = 19;
> > > -	schedule();
> > > -	__set_current_state(TASK_INTERRUPTIBLE);

your change is indeed correct. (the history of this is that the schedule()
was not always at the top of the loop - eg. it's not at the top of it in
the current -B2 patch.)

	Ingo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28  7:34     ` Ingo Molnar
@ 2001-09-28 15:17       ` Andrea Arcangeli
  0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise

On Fri, Sep 28, 2001 at 09:34:52AM +0200, Ingo Molnar wrote:
> frankly, i dont understand what your problem is with the looping concept.

There is no problem, it's just that I am _curious_ about the
"difference" in "numbers", not words. I'm not claiming there is no
difference, of course; obviously there is a difference, I expect that.

Andrea

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq-2.4.10-B2
  2001-09-28  7:18   ` [patch] softirq-2.4.10-B2 Ingo Molnar
@ 2001-09-28 15:58     ` Andrea Arcangeli
  2001-09-28 18:36     ` Simon Kirby
  1 sibling, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28 15:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Alan Cox, Alexey Kuznetsov,
	Ben LaHaise

On Fri, Sep 28, 2001 at 09:18:17AM +0200, Ingo Molnar wrote:
> i've attached the softirq-2.4.10-B2 that has your TASK_RUNNING suggestion,
> Oleg's fixes and this change included.

please include this safety check too:

--- ./kernel/softirq.c.~1~	Fri Sep 28 17:42:07 2001
+++ ./kernel/softirq.c	Fri Sep 28 17:46:32 2001
@@ -381,6 +381,8 @@
 
 	current->nice = 19;
 	schedule();
+	if (smp_processor_id() != cpu)
+		BUG();
 	ksoftirqd_task(cpu) = current;
 
 	for (;;) {

Andrea

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-26 16:44 [patch] softirq performance fixes, cleanups, 2.4.10 Ingo Molnar
                   ` (2 preceding siblings ...)
  2001-09-27 23:31 ` Andrea Arcangeli
@ 2001-09-28 16:18 ` kuznet
  2001-09-28 16:31   ` Ingo Molnar
                     ` (3 more replies)
  3 siblings, 4 replies; 44+ messages in thread
From: kuznet @ 2001-09-28 16:18 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, alan, bcrl, andrea

Hello!

> >  - removed 'mask' handling from do_softirq() - it's unnecessary due to the
>    restarts. this further simplifies the code.

Ingo, but this means that only the first softirq is handled.
"mask" implements round-robin and this is necessary.


>  - '[ksoftirqd_CPU0]' is confusing on UP systems, changed it to
>    '[ksoftirqd]' instead.

It is useless to argue about preferences, but the universal naming scheme
still looks less confusing to me. :-)


Generally, I dislike this patch. It is utterly ugly.

Also, you did not assure me that you interpret the problem correctly.
netif_rx() is __insensitive__ to latencies due to blocked softirq restarts.
It stops spinning only when it becomes a true CPU hog. And scheduling ksoftirqd
is the only variant here.

Most likely, your problem will disappear when you renice ksoftirqd
to higher priority f.e. equal to priority of tux threads.

Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
@ 2001-09-28 16:31   ` Ingo Molnar
  2001-09-28 17:04     ` kuznet
  2001-09-28 16:32   ` Andrea Arcangeli
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 16:31 UTC (permalink / raw)
  To: kuznet; +Cc: torvalds, linux-kernel, alan, bcrl, andrea


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> >  - removed 'mask' handling from do_softirq() - it's unnecessary due to the
> >    restarts. this further simplifies the code.
>
> Ingo, but this means that only the first softirq is handled.
> "mask" implements round-robin and this is necessary.

it does not, please read the code again. We iterate over all active bits
in the 'pending' bitmask. All softirqs that are active are handled. This
inner loop is repeated up to MAX_SOFTIRQ_RESTART times, if any softirq got
reactivated meanwhile.

> Also, you did not assure me that you interpret the problem correctly.
> netif_rx() is __insensitive__ to latencies due to blocked softirq
> restarts. It stops spinning only when it becomes a true CPU hog. [...]

(i suspect you meant net_rx_action().)

net_rx_action() stops spinning if 1) a jiffy has passed or 2) 300 packets
have been processed. [the jiffy test is not completely accurate as it can
break out of the loop after a short amount of time, but that is a minor
issue.]

there is nothing sacred about the old method of processing NET_RX_SOFTIRQ,
then NET_TX_SOFTIRQ, then breaking out of do_softirq() (the mechanism was
a bit more complex than that, but in insignificant ways). It's just as
arbitrary as 10 loops - with the difference that 10 loops perform better.

> And scheduling ksoftirqd is the only variant here.

scheduling ksoftirqd is the only variant once we have decided that
softirqs are a CPU hog and we want to make progress in non-irq contexts.
A single loop over softirqs is just as arbitrary as 10 loops, this all is
a matter of balancing.

> Most likely, your problem will disappear when you renice ksoftirqd
> to higher priority f.e. equal to priority of tux threads.

(no. see the previous mails.)

	Ingo



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
  2001-09-28 16:31   ` Ingo Molnar
@ 2001-09-28 16:32   ` Andrea Arcangeli
  2001-09-28 16:46     ` Ingo Molnar
  2001-09-28 16:35   ` [patch] softirq-2.4.10-B3 Ingo Molnar
  2001-09-29 11:03   ` [patch] softirq performance fixes, cleanups, 2.4.10 Rusty Russell
  3 siblings, 1 reply; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28 16:32 UTC (permalink / raw)
  To: kuznet; +Cc: mingo, torvalds, linux-kernel, alan, bcrl

On Fri, Sep 28, 2001 at 08:18:20PM +0400, kuznet@ms2.inr.ac.ru wrote:
> Hello!
> 
> >  - removed 'mask' handling from do_softirq() - it's unnecessary due to the
> >    restarts. this further simplifies the code.
> 
> Ingo, but this means that only the first softirq is handled.
> "mask" implements round-robin and this is necessary.

he's allowing the loop to repeat more than once to hide it; to do the
"mask" with repetition correctly we'd need a per-softirq counter, not
just a bitmask, so it wouldn't be handy to allocate on the stack, but
it's nothing unfixable.

However I also preferred the previous behaviour; I think it was much
nicer for general-purpose use (non-specweb99, non-gigabit-like scenarios).

> >  - '[ksoftirqd_CPU0]' is confusing on UP systems, changed it to
> >    '[ksoftirqd]' instead.
> 
> It is useless to argue about preferences, but the universal naming scheme
> still looks less confusing to me. :-)

Agreed.

> Generally, I dislike this patch. It is utterly ugly.

I also dislike it overall but I can see why it improves performance, and
the deschedule thing makes sense for the flooding case.

I would be very comfortable with merging only the deschedule part, and this
is why I asked Ingo if he could measure the difference.

Andrea

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [patch] softirq-2.4.10-B3
  2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
  2001-09-28 16:31   ` Ingo Molnar
  2001-09-28 16:32   ` Andrea Arcangeli
@ 2001-09-28 16:35   ` Ingo Molnar
  2001-09-29  0:40     ` J . A . Magallon
  2001-09-29 11:03   ` [patch] softirq performance fixes, cleanups, 2.4.10 Rusty Russell
  3 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 16:35 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Linus Torvalds, linux-kernel, Alan Cox, bcrl, Andrea Arcangeli

[-- Attachment #1: Type: TEXT/PLAIN, Size: 418 bytes --]


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> >  - '[ksoftirqd_CPU0]' is confusing on UP systems, changed it to
> >    '[ksoftirqd]' instead.
>
> It is useless to argue about preferences, but the universal naming scheme
> still looks less confusing to me. :-)

since you are the second one to complain, i'm convinced :) reverted.

i've also added back the BUG() check for smp_processor_id() == cpu.

-B3 attached.

	Ingo

[-- Attachment #2: Type: TEXT/PLAIN, Size: 28319 bytes --]

--- linux/kernel/ksyms.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/ksyms.c	Wed Sep 26 17:04:48 2001
@@ -538,8 +538,6 @@
 EXPORT_SYMBOL(tasklet_kill);
 EXPORT_SYMBOL(__run_task_queue);
 EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
 EXPORT_SYMBOL(__tasklet_schedule);
 EXPORT_SYMBOL(__tasklet_hi_schedule);
 
--- linux/kernel/sched.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/sched.c	Wed Sep 26 17:04:48 2001
@@ -366,6 +366,28 @@
 }
 
 /**
+ * unwakeup - undo wakeup if possible.
+ * @p: task
+ * @state: new task state
+ *
+ * Undo a previous wakeup of the specified task - if the process
+ * is not running already. The main interface to be used is
+ * unwakeup_process(), it will do a lockless test whether the task
+ * is on the runqueue.
+ */
+void __unwakeup_process(struct task_struct * p, long state)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&runqueue_lock, flags);
+	if (!p->has_cpu && (p != current) && task_on_runqueue(p)) {
+		del_from_runqueue(p);
+		p->state = state;
+	}
+	spin_unlock_irqrestore(&runqueue_lock, flags);
+}
+
+/**
  * schedule_timeout - sleep until timeout
  * @timeout: timeout value in jiffies
  *
--- linux/kernel/softirq.c.orig	Wed Sep 26 17:04:40 2001
+++ linux/kernel/softirq.c	Fri Sep 28 18:16:21 2001
@@ -58,12 +58,35 @@
 		wake_up_process(tsk);
 }
 
+/*
+ * If softirqs were fully handled after ksoftirqd was woken
+ * up then try to undo the wakeup.
+ */
+static inline void unwakeup_softirqd(unsigned cpu)
+{
+	struct task_struct * tsk = ksoftirqd_task(cpu);
+
+	if (tsk)
+		unwakeup_process(tsk, TASK_INTERRUPTIBLE);
+}
+
+/*
+ * We restart softirq processing MAX_SOFTIRQ_RESTART times,
+ * and we fall back to softirqd after that.
+ *
+ * This number has been established via experimentation.
+ * The two things to balance is latency against fairness -
+ * we want to handle softirqs as soon as possible, but they
+ * should not be able to lock up the box.
+ */
+#define MAX_SOFTIRQ_RESTART 10
+
 asmlinkage void do_softirq()
 {
+	int max_restart = MAX_SOFTIRQ_RESTART;
 	int cpu = smp_processor_id();
 	__u32 pending;
 	long flags;
-	__u32 mask;
 
 	if (in_interrupt())
 		return;
@@ -75,7 +98,6 @@
 	if (pending) {
 		struct softirq_action *h;
 
-		mask = ~pending;
 		local_bh_disable();
 restart:
 		/* Reset the pending bitmask before enabling irqs */
@@ -95,55 +117,37 @@
 		local_irq_disable();
 
 		pending = softirq_pending(cpu);
-		if (pending & mask) {
-			mask &= ~pending;
+		if (pending && --max_restart && (current->need_resched != 1))
 			goto restart;
-		}
 		__local_bh_enable();
 
 		if (pending)
+			/*
+			 * In the normal case ksoftirqd is rarely activated,
+			 * increased scheduling hurts performance.
+			 * It's a safety measure: if external load starts
+			 * to flood the system with softirqs then we
+			 * will mitigate softirq work to the softirq thread.
+			 */
 			wakeup_softirqd(cpu);
+		else
+			/*
+			 * All softirqs are handled - undo any possible
+			 * wakeup of softirqd. This reduces context switch
+			 * overhead.
+			 */
+			unwakeup_softirqd(cpu);
 	}
 
 	local_irq_restore(flags);
 }
 
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
-	__cpu_raise_softirq(cpu, nr);
-
-	/*
-	 * If we're in an interrupt or bh, we're done
-	 * (this also catches bh-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or bh.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!(local_irq_count(cpu) | local_bh_count(cpu)))
-		wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
-	long flags;
-
-	local_irq_save(flags);
-	cpu_raise_softirq(smp_processor_id(), nr);
-	local_irq_restore(flags);
-}
-
 void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
 {
 	softirq_vec[nr].data = data;
 	softirq_vec[nr].action = action;
 }
 
-
 /* Tasklets */
 
 struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
@@ -157,8 +161,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_vec[cpu].list;
 	tasklet_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
+	__cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 void __tasklet_hi_schedule(struct tasklet_struct *t)
@@ -169,8 +174,9 @@
 	local_irq_save(flags);
 	t->next = tasklet_hi_vec[cpu].list;
 	tasklet_hi_vec[cpu].list = t;
-	cpu_raise_softirq(cpu, HI_SOFTIRQ);
+	__cpu_raise_softirq(cpu, HI_SOFTIRQ);
 	local_irq_restore(flags);
+	rerun_softirqs(cpu);
 }
 
 static void tasklet_action(struct softirq_action *a)
@@ -241,7 +247,6 @@
 	}
 }
 
-
 void tasklet_init(struct tasklet_struct *t,
 		  void (*func)(unsigned long), unsigned long data)
 {
@@ -268,8 +273,6 @@
 	clear_bit(TASKLET_STATE_SCHED, &t->state);
 }
 
-
-
 /* Old style BHs */
 
 static void (*bh_base[32])(void);
@@ -325,7 +328,7 @@
 {
 	int i;
 
-	for (i=0; i<32; i++)
+	for (i = 0; i < 32; i++)
 		tasklet_init(bh_task_vec+i, bh_action, i);
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -361,38 +364,41 @@
 
 static int ksoftirqd(void * __bind_cpu)
 {
-	int bind_cpu = *(int *) __bind_cpu;
-	int cpu = cpu_logical_map(bind_cpu);
+	int cpu = cpu_logical_map((int)__bind_cpu);
 
 	daemonize();
-	current->nice = 19;
+
 	sigfillset(&current->blocked);
 
 	/* Migrate to the right CPU */
-	current->cpus_allowed = 1UL << cpu;
-	while (smp_processor_id() != cpu)
-		schedule();
-
-	sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
+	current->cpus_allowed = 1 << cpu;
 
-	__set_current_state(TASK_INTERRUPTIBLE);
-	mb();
+	sprintf(current->comm, "ksoftirqd CPU%d", cpu);
 
+	current->nice = 19;
+	schedule();
 	ksoftirqd_task(cpu) = current;
 
-	for (;;) {
-		if (!softirq_pending(cpu))
-			schedule();
-
-		__set_current_state(TASK_RUNNING);
+	if (smp_processor_id() != cpu)
+		BUG();
 
+	for (;;) {
 		while (softirq_pending(cpu)) {
 			do_softirq();
 			if (current->need_resched)
-				schedule();
+				goto preempt;
 		}
 
 		__set_current_state(TASK_INTERRUPTIBLE);
+		/* This has to be here to make the test IRQ-correct. */
+		barrier();
+		if (!softirq_pending(cpu)) {
+preempt:
+			if (current->counter > 1)
+				current->counter = 1;
+			schedule();
+		}
+		__set_current_state(TASK_RUNNING);
 	}
 }
 
@@ -400,17 +406,10 @@
 {
 	int cpu;
 
-	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
-		if (kernel_thread(ksoftirqd, (void *) &cpu,
+	for (cpu = 0; cpu < smp_num_cpus; cpu++)
+		if (kernel_thread(ksoftirqd, (void *) cpu,
 				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
-			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
-		else {
-			while (!ksoftirqd_task(cpu_logical_map(cpu))) {
-				current->policy |= SCHED_YIELD;
-				schedule();
-			}
-		}
-	}
+			BUG();
 
 	return 0;
 }
--- linux/include/linux/netdevice.h.orig	Wed Sep 26 17:04:36 2001
+++ linux/include/linux/netdevice.h	Fri Sep 28 07:44:01 2001
@@ -486,8 +486,9 @@
 		local_irq_save(flags);
 		dev->next_sched = softnet_data[cpu].output_queue;
 		softnet_data[cpu].output_queue = dev;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
@@ -535,8 +536,9 @@
 		local_irq_save(flags);
 		skb->next = softnet_data[cpu].completion_queue;
 		softnet_data[cpu].completion_queue = skb;
-		cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+		__cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
 		local_irq_restore(flags);
+		rerun_softirqs(cpu);
 	}
 }
 
--- linux/include/linux/interrupt.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/interrupt.h	Fri Sep 28 07:44:01 2001
@@ -74,9 +74,14 @@
 asmlinkage void do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
 extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+#define __cpu_raise_softirq(cpu, nr) \
+		do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+
+#define rerun_softirqs(cpu) 					\
+do {								\
+	if (!(local_irq_count(cpu) | local_bh_count(cpu)))	\
+		do_softirq();					\
+} while (0);
 
 
 
--- linux/include/linux/sched.h.orig	Wed Sep 26 17:04:40 2001
+++ linux/include/linux/sched.h	Fri Sep 28 07:44:01 2001
@@ -556,6 +556,7 @@
 
 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__unwakeup_process(struct task_struct * p, long state));
 extern void FASTCALL(sleep_on(wait_queue_head_t *q));
 extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
 				      signed long timeout));
@@ -574,6 +575,13 @@
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
 #define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE,  nr)
+
+#define unwakeup_process(tsk,state)		\
+do {						\
+	if (task_on_runqueue(tsk))		\
+		__unwakeup_process(tsk,state);	\
+} while (0)
+
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
 
 extern int in_group_p(gid_t);
--- linux/include/asm-mips/softirq.h.orig	Wed Sep 26 20:58:00 2001
+++ linux/include/asm-mips/softirq.h	Wed Sep 26 20:58:07 2001
@@ -40,6 +40,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-#define __cpu_raise_softirq(cpu, nr)	set_bit(nr, &softirq_pending(cpu))
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/include/asm-mips64/softirq.h.orig	Wed Sep 26 20:58:20 2001
+++ linux/include/asm-mips64/softirq.h	Wed Sep 26 20:58:26 2001
@@ -39,19 +39,4 @@
 
 #define in_softirq() (local_bh_count(smp_processor_id()) != 0)
 
-extern inline void __cpu_raise_softirq(int cpu, int nr)
-{
-	unsigned int *m = (unsigned int *) &softirq_pending(cpu);
-	unsigned int temp;
-
-	__asm__ __volatile__(
-		"1:\tll\t%0, %1\t\t\t# __cpu_raise_softirq\n\t"
-		"or\t%0, %2\n\t"
-		"sc\t%0, %1\n\t"
-		"beqz\t%0, 1b"
-		: "=&r" (temp), "=m" (*m)
-		: "ir" (1UL << nr), "m" (*m)
-		: "memory");
-}
-
 #endif /* _ASM_SOFTIRQ_H */
--- linux/net/core/dev.c.orig	Wed Sep 26 17:04:41 2001
+++ linux/net/core/dev.c	Wed Sep 26 17:04:48 2001
@@ -1218,8 +1218,9 @@
 			dev_hold(skb->dev);
 			__skb_queue_tail(&queue->input_pkt_queue,skb);
 			/* Runs from irqs or BH's, no need to wake BH */
-			cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+			__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 			local_irq_restore(flags);
+			rerun_softirqs(this_cpu);
 #ifndef OFFLINE_SAMPLE
 			get_sample_stats(this_cpu);
 #endif
@@ -1529,8 +1530,9 @@
 	local_irq_disable();
 	netdev_rx_stat[this_cpu].time_squeeze++;
 	/* This already runs in BH context, no need to wake up BH's */
-	cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+	__cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
 	local_irq_enable();
+	rerun_softirqs(this_cpu);
 
 	NET_PROFILE_LEAVE(softnet_process);
 	return;
--- linux/linux/include/linux/netdevice.h.orig	Wed Sep 26 21:14:55 2001
+++ linux/linux/include/linux/netdevice.h	Thu Sep 27 07:51:00 2001
@@ -390,6 +390,8 @@
 						     unsigned char *haddr);
 	int			(*neigh_setup)(struct net_device *dev, struct neigh_parms *);
 	int			(*accept_fastpath)(struct net_device *, struct dst_entry*);
+#define HAVE_POLL_CONTROLLER
+	void			(*poll_controller)(struct net_device *dev);
 
 	/* open/release and usage marking */
 	struct module *owner;
--- linux/linux/drivers/net/tulip/tulip_core.c.orig	Wed Sep 26 21:14:58 2001
+++ linux/linux/drivers/net/tulip/tulip_core.c	Fri Sep 28 09:23:22 2001
@@ -243,6 +243,7 @@
 static struct net_device_stats *tulip_get_stats(struct net_device *dev);
 static int private_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
 static void set_rx_mode(struct net_device *dev);
+static void poll_tulip(struct net_device *dev);
 
 
 
@@ -1671,6 +1672,9 @@
 	dev->get_stats = tulip_get_stats;
 	dev->do_ioctl = private_ioctl;
 	dev->set_multicast_list = set_rx_mode;
+#ifdef HAVE_POLL_CONTROLLER
+	dev->poll_controller = &poll_tulip;
+#endif
 
 	if (register_netdev(dev))
 		goto err_out_free_ring;
@@ -1839,6 +1843,24 @@
 
 	/* pci_power_off (pdev, -1); */
 }
+
+
+#ifdef HAVE_POLL_CONTROLLER
+
+/*
+ * Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+
+static void poll_tulip (struct net_device *dev)
+{
+       disable_irq(dev->irq);
+       tulip_interrupt (dev->irq, dev, NULL);
+       enable_irq(dev->irq);
+}
+
+#endif
 
 
 static struct pci_driver tulip_driver = {
--- linux/linux/drivers/net/eepro100.c.orig	Wed Sep 26 21:14:58 2001
+++ linux/linux/drivers/net/eepro100.c	Thu Sep 27 08:11:41 2001
@@ -542,6 +542,7 @@
 static int speedo_rx(struct net_device *dev);
 static void speedo_tx_buffer_gc(struct net_device *dev);
 static void speedo_interrupt(int irq, void *dev_instance, struct pt_regs *regs);
+static void poll_speedo (struct net_device *dev);
 static int speedo_close(struct net_device *dev);
 static struct net_device_stats *speedo_get_stats(struct net_device *dev);
 static int speedo_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
@@ -837,6 +838,9 @@
 	dev->get_stats = &speedo_get_stats;
 	dev->set_multicast_list = &set_rx_mode;
 	dev->do_ioctl = &speedo_ioctl;
+#ifdef HAVE_POLL_CONTROLLER
+	dev->poll_controller = &poll_speedo;
+#endif
 
 	return 0;
 }
@@ -1594,6 +1598,23 @@
 	clear_bit(0, (void*)&sp->in_interrupt);
 	return;
 }
+
+#ifdef HAVE_POLL_CONTROLLER
+
+/*
+ * Polling 'interrupt' - used by things like netconsole to send skbs
+ * without having to re-enable interrupts. It's not called while
+ * the interrupt routine is executing.
+ */
+
+static void poll_speedo (struct net_device *dev)
+{
+	disable_irq(dev->irq);
+	speedo_interrupt (dev->irq, dev, NULL);
+	enable_irq(dev->irq);
+}
+
+#endif
 
 static inline struct RxFD *speedo_rx_alloc(struct net_device *dev, int entry)
 {
--- linux/linux/drivers/net/netconsole.c.orig	Wed Sep 26 21:15:15 2001
+++ linux/linux/drivers/net/netconsole.c	Thu Sep 27 16:12:10 2001
@@ -0,0 +1,331 @@
+/*
+ *  linux/drivers/net/netconsole.c
+ *
+ *  Copyright (C) 2001  Ingo Molnar <mingo@redhat.com>
+ *
+ *  This file contains the implementation of an IRQ-safe, crash-safe
+ *  kernel console implementation that outputs kernel messages to the
+ *  network.
+ *
+ * Modification history:
+ *
+ * 2001-09-17    started by Ingo Molnar.
+ */
+
+/****************************************************************
+ *      This program is free software; you can redistribute it and/or modify
+ *      it under the terms of the GNU General Public License as published by
+ *      the Free Software Foundation; either version 2, or (at your option)
+ *      any later version.
+ *
+ *      This program is distributed in the hope that it will be useful,
+ *      but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *      GNU General Public License for more details.
+ *
+ *      You should have received a copy of the GNU General Public License
+ *      along with this program; if not, write to the Free Software
+ *      Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ ****************************************************************/
+
+#include <net/tcp.h>
+#include <net/udp.h>
+#include <linux/mm.h>
+#include <linux/tty.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <asm/unaligned.h>
+#include <linux/console.h>
+#include <linux/smp_lock.h>
+#include <linux/netdevice.h>
+#include <linux/tty_driver.h>
+#include <linux/etherdevice.h>
+
+static struct net_device *netconsole_dev;
+static u16 source_port, target_port;
+static u32 source_ip, target_ip;
+static unsigned char daddr[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff} ;
+
+#define NETCONSOLE_VERSION 0x01
+#define HEADER_LEN 5
+
+#define MAX_UDP_CHUNK 1460
+#define MAX_PRINT_CHUNK (MAX_UDP_CHUNK-HEADER_LEN)
+
+/*
+ * We maintain a small pool of fully-sized skbs,
+ * to make sure the message gets out even in
+ * extreme OOM situations.
+ */
+#define MAX_NETCONSOLE_SKBS 32
+
+static spinlock_t netconsole_lock = SPIN_LOCK_UNLOCKED;
+static int nr_netconsole_skbs;
+static struct sk_buff *netconsole_skbs;
+
+#define MAX_SKB_SIZE \
+		(MAX_UDP_CHUNK + sizeof(struct udphdr) + \
+				sizeof(struct iphdr) + sizeof(struct ethhdr))
+
+static void __refill_netconsole_skbs(void)
+{
+	struct sk_buff *skb;
+	unsigned long flags;
+
+	spin_lock_irqsave(&netconsole_lock, flags);
+	while (nr_netconsole_skbs < MAX_NETCONSOLE_SKBS) {
+		skb = alloc_skb(MAX_SKB_SIZE, GFP_ATOMIC);
+		if (!skb)
+			break;
+		skb->next = netconsole_skbs;
+		netconsole_skbs = skb;
+		nr_netconsole_skbs++;
+	}
+	spin_unlock_irqrestore(&netconsole_lock, flags);
+}
+
+static struct sk_buff * get_netconsole_skb(void)
+{
+	struct sk_buff *skb;
+
+	unsigned long flags;
+
+	spin_lock_irqsave(&netconsole_lock, flags);
+	skb = netconsole_skbs;
+	if (skb) {
+		netconsole_skbs = skb->next;
+		skb->next = NULL;
+		nr_netconsole_skbs--;
+	}
+	spin_unlock_irqrestore(&netconsole_lock, flags);
+
+	return skb;
+}
+
+static spinlock_t sequence_lock = SPIN_LOCK_UNLOCKED;
+static unsigned int offset;
+
+static void send_netconsole_skb(struct net_device *dev, const char *msg, unsigned int msg_len)
+{
+	int total_len, eth_len, ip_len, udp_len;
+	unsigned long flags;
+	struct sk_buff *skb;
+	struct udphdr *udph;
+	struct iphdr *iph;
+	struct ethhdr *eth;
+
+	udp_len = msg_len + HEADER_LEN + sizeof(*udph);
+	ip_len = eth_len = udp_len + sizeof(*iph);
+	total_len = eth_len + ETH_HLEN;
+
+	if (nr_netconsole_skbs < MAX_NETCONSOLE_SKBS)
+		__refill_netconsole_skbs();
+
+	skb = alloc_skb(total_len, GFP_ATOMIC);
+	if (!skb) {
+		skb = get_netconsole_skb();
+		if (!skb)
+			/* tough! */
+			return;
+	}
+
+	atomic_set(&skb->users, 1);
+	skb_reserve(skb, total_len - msg_len - HEADER_LEN);
+	skb->data[0] = NETCONSOLE_VERSION;
+
+	spin_lock_irqsave(&sequence_lock, flags);
+	put_unaligned(htonl(offset), (u32 *) (skb->data + 1));
+	offset += msg_len;
+	spin_unlock_irqrestore(&sequence_lock, flags);
+
+	memcpy(skb->data + HEADER_LEN, msg, msg_len);
+	skb->len += msg_len + HEADER_LEN;
+
+	udph = (struct udphdr *) skb_push(skb, sizeof(*udph));
+	udph->source = source_port;
+	udph->dest = target_port;
+	udph->len = htons(udp_len);
+	udph->check = 0;
+
+	iph = (struct iphdr *)skb_push(skb, sizeof(*iph));
+
+	iph->version  = 4;
+	iph->ihl      = 5;
+	iph->tos      = 0;
+	iph->tot_len  = htons(ip_len);
+	iph->id       = 0;
+	iph->frag_off = 0;
+	iph->ttl      = 64;
+	iph->protocol = IPPROTO_UDP;
+	iph->check    = 0;
+	iph->saddr    = source_ip;
+	iph->daddr    = target_ip;
+	iph->check    = ip_fast_csum((unsigned char *)iph, iph->ihl);
+
+	eth = (struct ethhdr *) skb_push(skb, ETH_HLEN);
+
+	eth->h_proto = htons(ETH_P_IP);
+	memcpy(eth->h_source, dev->dev_addr, dev->addr_len);
+	memcpy(eth->h_dest, daddr, dev->addr_len);
+
+repeat:
+	spin_lock(&dev->xmit_lock);
+	dev->xmit_lock_owner = smp_processor_id();
+
+	if (netif_queue_stopped(dev)) {
+		dev->xmit_lock_owner = -1;
+		spin_unlock(&dev->xmit_lock);
+
+		dev->poll_controller(dev);
+		goto repeat;
+	}
+
+	dev->hard_start_xmit(skb, dev);
+
+	dev->xmit_lock_owner = -1;
+	spin_unlock(&dev->xmit_lock);
+}
+
+static void write_netconsole_msg(struct console *con, const char *msg, unsigned int msg_len)
+{
+	int len, left;
+	struct net_device *dev;
+
+	dev = netconsole_dev;
+	if (!dev)
+		return;
+
+	if (dev->poll_controller && netif_running(dev)) {
+		unsigned long flags;
+
+		__save_flags(flags);
+		__cli();
+		left = msg_len;
+repeat:
+		if (left > MAX_PRINT_CHUNK)
+			len = MAX_PRINT_CHUNK;
+		else
+			len = left;
+		send_netconsole_skb(dev, msg, len);
+		msg += len;
+		left -= len;
+		if (left)
+			goto repeat;
+		__restore_flags(flags);
+	}
+}
+
+static char *dev;
+static int target_eth_byte0 = 255;
+static int target_eth_byte1 = 255;
+static int target_eth_byte2 = 255;
+static int target_eth_byte3 = 255;
+static int target_eth_byte4 = 255;
+static int target_eth_byte5 = 255;
+
+MODULE_PARM(target_ip, "i");
+MODULE_PARM(target_eth_byte0, "i");
+MODULE_PARM(target_eth_byte1, "i");
+MODULE_PARM(target_eth_byte2, "i");
+MODULE_PARM(target_eth_byte3, "i");
+MODULE_PARM(target_eth_byte4, "i");
+MODULE_PARM(target_eth_byte5, "i");
+MODULE_PARM(source_port, "h");
+MODULE_PARM(target_port, "h");
+MODULE_PARM(dev, "s");
+
+static struct console netconsole =
+	 { flags: CON_ENABLED, write: write_netconsole_msg };
+
+static int init_netconsole(void)
+{
+	struct net_device *ndev = NULL;
+	struct in_device *in_dev;
+
+	if (!dev) {
+		printk(KERN_ERR "netconsole: dev parameter not specified, aborting.\n");
+		return -1;
+	}
+	printk(KERN_INFO "netconsole: using network device <%s>\n", dev);
+	/* this will be valid once the device goes up. */
+	ndev = dev_get_by_name(dev);
+	if (!ndev) {
+		printk(KERN_ERR "netconsole: network device %s does not exist, aborting.\n", dev);
+		return -1;
+	}
+	if (!ndev->poll_controller) {
+		printk(KERN_ERR "netconsole: %s's network driver does not implement netlogging yet, aborting.\n", dev);
+		return -1;
+	}
+	in_dev = in_dev_get(ndev);
+	if (!in_dev) {
+		printk(KERN_ERR "netconsole: network device %s is not an IP protocol device, aborting.\n", dev);
+		return -1;
+	}
+	source_ip = ntohl(in_dev->ifa_list->ifa_local);
+	if (!source_ip) {
+		printk(KERN_ERR "netconsole: network device %s has no local address, aborting.\n", dev);
+		return -1;
+	}
+#define IP(x) ((char *)&source_ip)[x]
+	printk(KERN_INFO "netconsole: using source IP %i.%i.%i.%i\n",
+		IP(3), IP(2), IP(1), IP(0));
+#undef IP
+	source_ip = htonl(source_ip);
+	if (!target_ip) {
+		printk(KERN_ERR "netconsole: target_ip parameter not specified, aborting.\n");
+		return -1;
+	}
+#define IP(x) ((char *)&target_ip)[x]
+	printk(KERN_INFO "netconsole: using target IP %i.%i.%i.%i\n",
+		IP(3), IP(2), IP(1), IP(0));
+#undef IP
+	target_ip = htonl(target_ip);
+	if (!source_port) {
+		printk(KERN_ERR "netconsole: source_port parameter not specified, aborting.\n");
+		return -1;
+	}
+	printk(KERN_INFO "netconsole: using source UDP port: %i\n", source_port);
+	source_port = htons(source_port);
+	if (!target_port) {
+		printk(KERN_ERR "netconsole: target_port parameter not specified, aborting.\n");
+		return -1;
+	}
+	printk(KERN_INFO "netconsole: using target UDP port: %i\n", target_port);
+	target_port = htons(target_port);
+
+	daddr[0] = target_eth_byte0;
+	daddr[1] = target_eth_byte1;
+	daddr[2] = target_eth_byte2;
+	daddr[3] = target_eth_byte3;
+	daddr[4] = target_eth_byte4;
+	daddr[5] = target_eth_byte5;
+
+	if ((daddr[0] & daddr[1] & daddr[2] & daddr[3] & daddr[4] & daddr[5]) == 255)
+		printk(KERN_INFO "netconsole: using broadcast ethernet frames to send packets.\n");
+	else
+		printk(KERN_INFO "netconsole: using target ethernet address %02x:%02x:%02x:%02x:%02x:%02x.\n", daddr[0], daddr[1], daddr[2], daddr[3], daddr[4], daddr[5]);
+
+	netconsole_dev = ndev;
+#define STARTUP_MSG "[...network console startup...]\n"
+	write_netconsole_msg(NULL, STARTUP_MSG, strlen(STARTUP_MSG));
+
+	register_console(&netconsole);
+	printk(KERN_INFO "netconsole: network logging started up successfully!\n");
+	return 0;
+}
+
+static void cleanup_netconsole(void)
+{
+	printk(KERN_INFO "netconsole: network logging shut down.\n");
+	unregister_console(&netconsole);
+
+#define SHUTDOWN_MSG "[...network console shutdown...]\n"
+	write_netconsole_msg(NULL, SHUTDOWN_MSG, strlen(SHUTDOWN_MSG));
+	netconsole_dev = NULL;
+}
+
+module_init(init_netconsole);
+module_exit(cleanup_netconsole);
+
+int dummy = MAX_SKB_SIZE;
--- linux/linux/drivers/net/Makefile.orig	Wed Sep 26 21:14:58 2001
+++ linux/linux/drivers/net/Makefile	Wed Sep 26 21:15:15 2001
@@ -216,6 +216,8 @@
 obj-y		+= ../acorn/net/acorn-net.o
 endif
 
+obj-$(CONFIG_NETCONSOLE) += netconsole.o
+
 #
 # HIPPI adapters
 #
--- linux/linux/drivers/net/Config.in.orig	Wed Sep 26 21:14:58 2001
+++ linux/linux/drivers/net/Config.in	Wed Sep 26 21:15:15 2001
@@ -248,6 +248,8 @@
    dep_tristate '  SysKonnect FDDI PCI support' CONFIG_SKFP $CONFIG_PCI
 fi
 
+dep_tristate 'Network logging support' CONFIG_NETCONSOLE
+
 if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
    if [ "$CONFIG_INET" = "y" ]; then
       bool 'HIPPI driver support (EXPERIMENTAL)' CONFIG_HIPPI
--- linux/linux/Documentation/networking/netlogging.txt.orig	Wed Sep 26 21:15:15 2001
+++ linux/linux/Documentation/networking/netlogging.txt	Wed Sep 26 21:15:15 2001
@@ -0,0 +1,46 @@
+
+started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
+
+the network console can be configured over the module interface,
+by specifying the dev=, target_ip=, source_port=, target_port=
+module parameters. Sample use:
+
+insmod netconsole dev=eth0 target_ip=0x0a000701 \
+                  source_port=6666 target_port=6666
+
+IP addresses must be specified as hexadecimal numbers. The
+above example uses target host IP 10.0.7.1.
+
+if the module is loaded then all kernel messages are sent to the
+target host via UDP packets. The remote host should run the
+client-side 'netconsole' daemon to display & log the messages.
+
+WARNING: the default setting uses the broadcast ethernet address
+to send packets, which can cause increased load on other systems
+on the same ethernet segment. If this is not desired, then the
+target ethernet address can be specified via the target_eth_byte0,
+target_eth_byte1 ... byte5 module parameters. Eg.:
+
+insmod netconsole dev=eth0 target_ip=0x0a000701 \
+                  source_port=6666 target_port=6666 \
+                  target_eth_byte0=0x00 \
+                  target_eth_byte1=0x02 \
+                  target_eth_byte2=0xA5 \
+                  target_eth_byte3=0x13 \
+                  target_eth_byte4=0x51 \
+                  target_eth_byte5=0xDD
+
+will instruct the netconsole code to use target ethernet address
+00:02:A5:13:51:DD.
+
+NOTE: the network device (eth0 in the above case) can run any kind
+of other network traffic, netconsole is not intrusive. Netconsole
+might cause slight delays in other traffic if the volume of kernel
+messages is high, but should have no other impact.
+
+netconsole was designed to be as instantaneous as possible, to
+enable the logging of even the most critical kernel bugs. It works
+from IRQ contexts as well, and does not enable interrupts while
+sending packets. Due to these unique needs, configuration cannot
+be more automatic, and some fundamental limitations will remain:
+only IP networks, UDP packets and ethernet devices are supported.
--- linux/linux/Documentation/Configure.help.orig	Wed Sep 26 21:14:55 2001
+++ linux/linux/Documentation/Configure.help	Wed Sep 26 21:15:15 2001
@@ -9997,6 +9997,11 @@
   say M here and read Documentation/modules.txt. This is recommended.
   The module will be called skfp.o.
 
+Network logging support
+CONFIG_NETCONSOLE
+  If you want to log kernel messages over the network, then say
+  "M" here. See Documentation/networking/netlogging.txt for details.
+
 HIgh Performance Parallel Interface support (EXPERIMENTAL)
 CONFIG_HIPPI
   HIgh Performance Parallel Interface (HIPPI) is a 800Mbit/sec and

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 16:32   ` Andrea Arcangeli
@ 2001-09-28 16:46     ` Ingo Molnar
  2001-09-28 16:58       ` Andrea Arcangeli
  0 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 16:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Alexey Kuznetsov, Linus Torvalds, linux-kernel, Alan Cox, bcrl


On Fri, 28 Sep 2001, Andrea Arcangeli wrote:

> he's allowing to repeat the loop more than once to hide it, [...]

it's not done to 'hide' it in any way. I removed the mask method because
it's redundant under the new scheme.

Softirqs *cannot* be handled 'fairly', there is always going to be one
that is called sooner, and one that is called later, and even with the old
code, their position within the bitmask decides which one goes first.

> to do the "mask" with repetition correctly we'd need a per-softirq
> counter, not just a bitmask so it wouldn't be handy to allocate on the
> stack, but it's nothing unfixable.

with a repetition count of 10, there is no difference, and no reason to
add the mask mechanism again. (In fact it will statistically be more fair
because we 'interleave' the softirqs in a fair pattern.)

to make it easier to compare, here are verbal descriptions of the two
mechanisms. (for those who have *cough* trouble understanding the patch.)

the old code's "mask method": it starts processing softirqs in strict
order. It will reprocess any softirq (in strict order again) that got
reactivated meanwhile, but it will only process softirqs that were not
processed yet during this run.

it's not round-robin, and it's not fair in any way, there is a strict
priority between softirqs, with the additional twist of processing
softirqs that got activated meanwhile. This concept is completely
meaningless, it neither tries to process all pending softirqs, nor does it
try to implement some simple policy, like 'process all softirqs that are
active *now* and *once*'.

the new method does a simple loop (in priority order, first HI, then TX,
then RX and LO) over all currently active softirqs, then it does a
(limited) loop over all of them again if some softirq got activated while
these were processed.

	Ingo



* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 16:46     ` Ingo Molnar
@ 2001-09-28 16:58       ` Andrea Arcangeli
  0 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28 16:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Kuznetsov, Linus Torvalds, linux-kernel, Alan Cox, bcrl

On Fri, Sep 28, 2001 at 06:46:16PM +0200, Ingo Molnar wrote:
> 
> On Fri, 28 Sep 2001, Andrea Arcangeli wrote:
> 
> > he's allowing to repeat the loop more than once to hide it, [...]
> 
> it's not done to 'hide' it in any way. I removed the mask method because
> it's redundant under the new scheme.

What you are missing is a property provided by the old method.

We have the NET_RX_SOFTIRQ that floods very heavily, so far so good.

Then we have HI_SOFTIRQ; incidentally, HI_SOFTIRQ from irq wants to be
executed with very low latency. With your patch it _can_ be postponed to
ksoftirqd processing just because there's the NET_RX_SOFTIRQ cpu hog in
the background. With the old method it was guaranteed that the HI_SOFTIRQ
was executed with very low latency within the irq, no matter the
NET_RX_SOFTIRQ flood.

So there _is_ a difference, and the multiple-loop will tend to "hide"
the difference.

Andrea


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 16:31   ` Ingo Molnar
@ 2001-09-28 17:04     ` kuznet
  2001-09-28 17:21       ` Rik van Riel
  2001-09-28 17:31       ` Ingo Molnar
  0 siblings, 2 replies; 44+ messages in thread
From: kuznet @ 2001-09-28 17:04 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, alan, bcrl, andrea

Hello!

> it does not, please read the code again. We iterate over all active bits
> in the 'pending' bitmask.

OK.


> net_rx_action() stops spinning if 1) a jiffy has passed 2) 300 packets
> have been processed. [the jiffy test is not completely accurate

It is. The break is forced not earlier than after one jiffy, but
it may loop for up to 2 jiffies.


> there is nothing sacred about the old method of processing NET_RX_SOFTIRQ,
> then NET_TX_SOFTIRQ, then breaking out of do_softirq() (the mechanizm was
> a bit more complex than that, but in insignificant ways). It's just as
> arbitrary as 10 loops - with the difference that 10 loops perform better.

Do not cheat yourself. 10 is more than 1 by an order of magnitude.
I would eat this argument if the limit was 2. :-)


> > Most likely, your problem will disappear when you renice ksoftirqd
> > to higher priority f.e. equal to priority of tux threads.
> 
> (no. see the previous mails.)

But I do not know what exactly to look for there.

Please, explain who exactly obtains an advantage of looping.
net_rx_action()? Do you see drops in backlog?
Actually, I could see a sense in your approach if you said
something sort of: "I see drops in backlog, I want to increase
it to 3000, but it breaks latency, so let it remain 300,
but instead I will make 10 loops". This would be a sane justification
at least.

net_tx_action()? It does not look critical.

Alexey


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:04     ` kuznet
@ 2001-09-28 17:21       ` Rik van Riel
  2001-09-28 17:31         ` Andrea Arcangeli
                           ` (2 more replies)
  2001-09-28 17:31       ` Ingo Molnar
  1 sibling, 3 replies; 44+ messages in thread
From: Rik van Riel @ 2001-09-28 17:21 UTC (permalink / raw)
  To: kuznet; +Cc: mingo, torvalds, linux-kernel, alan, bcrl, andrea

On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> Please, explain who exactly obtains an advantage of looping.
> net_rx_action()? Do you see drops in backlog?

> net_tx_action()? It does not look critical.

Then how would you explain the 10% speed difference
Ben and others have seen with gigabit ethernet ?

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:04     ` kuznet
  2001-09-28 17:21       ` Rik van Riel
@ 2001-09-28 17:31       ` Ingo Molnar
  2001-09-28 17:56         ` kuznet
  1 sibling, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 17:31 UTC (permalink / raw)
  To: kuznet; +Cc: torvalds, linux-kernel, alan, bcrl, andrea


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> > have been processed. [the jiffy test is not completely accurate
>
> It is. The break is forced not earlier than after one jiffy, but
> it may loop for up to 2 jiffies.

(you are right. it does jiffies - start_time > 1. good.)

> > there is nothing sacred about the old method of processing NET_RX_SOFTIRQ,
> > then NET_TX_SOFTIRQ, then breaking out of do_softirq() (the mechanizm was
> > a bit more complex than that, but in insignificant ways). It's just as
> > arbitrary as 10 loops - with the difference that 10 loops perform better.
>
> Do not cheat yourself. 10 is more than 1 by order of magnitude.
> I would eat this argument if the limit was 2. :-)

what are you trying to argue, that it's incorrect to re-process softirqs
if they got activated during the previous pass? I told a number of
fundamental reasons why it's a good idea to loop, tell me one not to do
so, on a 'healthy' system.

(Let's assume that a loop of 10 still ensures basic safety in situations
where external load overloads the system. I will tell you that not even a
loop of 1 ensures safety, unless one tweaks the softirq handlers
themselves (netdev_max_backlog) to do less and less work per interrupt.
Is your point to make the softirq loop sysctl-tunable? I can sure
show you slow enough systems which can be overloaded via hardirqs alone.)

> Please, explain who exactly obtains an advantage of looping.

I think the technically most reasonable approach is to process softirqs
*as soon as possible*, without context switches, for obvious performance
reasons. They are called 'interrupts' after all. Or do we want to delegate
hardware interrupts to process contexts as well?

interrupts, in most cases, are 'replies to a request'. It's a pair of
code: the first part issues the work, then comes the completion of work.
Processes are usually the entities that 'issue' work, interrupts/softirqs
are the ones that complete it. It's conceptually the right thing to
complete work that was started.

Softirq work is something that has to be executed sooner or later, in a
healthy system. I prefer 'sooner' to 'later'. I prefer delaying current
contexts instead of buffering work - executing current contexts will
likely only increase the amount of work that has to be done in the future.

the only reason why the loop is not endless is reliability: weaker
systems, which can be overloaded over external interfaces, need some
protection against endless softirq loops. I measured 10 to be something
pretty close to a loop of 10000.

i prefer my system to go near its peak performance, and i do not want to
suffer from any 'safety measure' until it's absolutely necessary. and if
it's absolutely necessary i prefer to design the safety measure in a way
to make fast enough systems still go fast. Not the other way around.

please point out flaws in this thinking.

	Ingo



* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:21       ` Rik van Riel
@ 2001-09-28 17:31         ` Andrea Arcangeli
  2001-09-28 17:41         ` kuznet
  2001-09-28 18:39         ` Josh MacDonald
  2 siblings, 0 replies; 44+ messages in thread
From: Andrea Arcangeli @ 2001-09-28 17:31 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kuznet, mingo, torvalds, linux-kernel, alan, bcrl

On Fri, Sep 28, 2001 at 02:21:07PM -0300, Rik van Riel wrote:
> Then how would you explain the 10% speed difference
> Ben and others have seen with gigabit ethernet ?

partly because of the unwakeup logic. I've no problem with it; this is
why I asked to measure how much of the 10% improvement comes from the
unwakeup/ksoftirqd-deschedule logic, I was just curious about that.

Andrea


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:21       ` Rik van Riel
  2001-09-28 17:31         ` Andrea Arcangeli
@ 2001-09-28 17:41         ` kuznet
  2001-09-28 17:46           ` Ingo Molnar
  2001-09-28 18:39         ` Josh MacDonald
  2 siblings, 1 reply; 44+ messages in thread
From: kuznet @ 2001-09-28 17:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: mingo, torvalds, linux-kernel, alan, bcrl, andrea

Hello!

> Then how would you explain the 10% speed difference
> Ben and others have seen with gigabit ethernet ?

Huh... kill all the softirqs (and, even better, scheduler too) and you will
see even more effect. :-)

More seriously, the question is not why these 10% appears,
but rather why they _disappeared_ in 2.4.7.

Alexey


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:41         ` kuznet
@ 2001-09-28 17:46           ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 17:46 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Rik van Riel, Linus Torvalds, linux-kernel, Alan Cox, bcrl,
	Andrea Arcangeli


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> More seriously, the question is not why these 10% appears, but rather
> why they _disappeared_ in 2.4.7.

the effects i saw (Ben is not around unfortunately, so i can neither
confirm nor deny whether there is any correlation between the effects Ben
saw and the effects i saw) were due to ksoftirqd's generic tendency to
increase the latency between the issuing of work and the completion of it.
Increasing ksoftirqd's priority (in fact, setting current->counter = 2
every time schedule() is called :-) does not fix this fundamental property
=> it still causes 'work generators' (processes) to use more CPU time than
'work completion' (irqs, softirqs). Furthermore, it also increases the
latency between hardirqs and softirqs.

so my patch makes it sure that work is completed as soon as possible - but
i've also kept an exit door open. [which you might find insufficient, but
that i think is up to individual tuning anyway - i think i'm generally
running faster machines than you, so our perspectives are slightly
different.]

[processes are not always work generators, and irqs/softirqs do not always
work completion, but i think generally they fit nicely into these two
categories. Eg. the TCP stack often generates new work from softirqs, but
IMO this does not change the fundamental equation.]

	Ingo



* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:31       ` Ingo Molnar
@ 2001-09-28 17:56         ` kuznet
  2001-09-28 18:28           ` Ingo Molnar
  2001-09-28 18:51           ` Benjamin LaHaise
  0 siblings, 2 replies; 44+ messages in thread
From: kuznet @ 2001-09-28 17:56 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, alan, bcrl, andrea

Hello!

> what are you trying to argue, that it's incorrect to re-process softirqs
> if they got activated during the previous pass? I told a number of
> fundamental reasons

I did not find them in your mails.


> (Lets assume that a loop of 10 still ensures basic safety in situations
> where external load overloads the system.

It does not, evidently. And 1 does not. But 10 is 10 times worse.


> Is your point to make the softirq loop to be sysctl-tunable?

No. My point is to make it correct eventually.

If you repeat such attempts to "improve" it every third 2.4.x release,
it will remain broken forever.


> show you slow enough systems which can be overloaded via hardirqs alone.)

This problem has been solved ages ago. The only remaining question
is how to make this more nicely.


> I think the technically most reasonable approach is to process softirqs
> *as soon as possible*,
...
> please point out flaws in this thinking.

They are processed as soon as possible.

Ingo, I told you net_rx_action() is a small do_softirq(), restarting not
10 but no less than 300 times in a row.

If this logic is wrong, you should explain why it is wrong.
Looping dozens of times may look like a cure, but to me it still
looks rather like steroids.

Alexey


* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:56         ` kuznet
@ 2001-09-28 18:28           ` Ingo Molnar
  2001-09-28 19:23             ` kuznet
  2001-09-28 19:39             ` kuznet
  2001-09-28 18:51           ` Benjamin LaHaise
  1 sibling, 2 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 18:28 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Linus Torvalds, linux-kernel, Alan Cox, bcrl, Andrea Arcangeli


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> > Is your point to make the softirq loop sysctl-tunable?
>
> No. My point is to make it correct eventually.

there is no 'technically correct' way but to process all softirqs
immediately. Ie. an infinite loop. The only reason why we are doing less
than that is to be nice to systems that can be overloaded.

> > show you slow enough systems which can be overloaded via hardirqs alone.)
>
> This problem has been solved ages ago. The only remaining question is
> how to make this more nicely.

this problem has not been solved. hardirqs have properties, unlike
softirqs, that make it harder to overload them.

> > I think the technically most reasonable approach is to process softirqs
> > *as soon as possible*,
> ...
> > please point out flaws in this thinking.
>
> They are processed as soon as possible.

they are *not*. The old code deliberately ignores newly pending softirqs.

> Ingo, I told net_rx_action() is small do_softirq() restarting not 10,
> but not less than 300 times in row.

net_rx_action is 'small' in code, but not small in terms of cachemisses
and other, unavoidable overhead. It goes over device-side skbs which can
and will take several microseconds per packet.

> If this logic is wrong, you should explain why it is wrong. Looping
> dozen of times may look like cure, but for me it still looks rather
> like steroids.

the fundamental issue here that makes looping inevitable is the following
one. (not that looping itself is anything bad.) We enable hardirqs during
the processing of softirqs. We also guarantee exclusivity of softirq
processing. We also have multiple 'queues of work'. Ie. if a hardirq adds
some new work while we were off doing other work, then we have no choice
but to look at this other work as well. Ie. we have to loop. q.e.d.

in fact even the current code is a mini-loop of ~1.5 iterations, due to
the mask code. (which has the effect of processing 'a little bit more
work'.)

looping can of course be avoided - but only by breaking any of the above
assumptions. My patch took the 2.4 semantics of softirqs more or less as
cast into stone - for obvious reasons.

Some more radical options:

 - do not use split softirqs, use one global (but per-cpu) 'work to be
   done' queue, that lists multiple things. The downside: no 'priority'
   between softirqs.

(sidenote: i do think that priority of softirqs is overdesign and a fad,
or, in the best case, implemented incorrectly. It has no practical effect
in any workload i saw. *If* a higher-priority work queue is higher
priority, then we should abort softirq processing immediately if a
higher-priority softirq is activated. Ie. we should break out of
net_rx_action as soon as a higher priority, ie. any of the SOFTIRQ_HI or
SOFTIRQ_TX softirqs is activated. We don't do that - in fact we didn't
even go to the higher priority active softirq within the inner loop. So i
consider the softirq priority arguments to be pulled here by their hair.
softirq priorities never existed in any realistic form, for any serious
softirq load.)

so this is option #1: get rid of the distinction between 'work' types,
and just do
a global (as in type, but per-CPU) work-FIFO. This concept works really
well in a number of other design areas within the Linux kernel.

advantage: no fancy priorities. We process work as it comes. More work
means a straightforward loop over all work that need to be done. We can
break out of this loop very easily and in a finegrained way: eg. if
current->need_resched is set, or if we spend more than 1-2 jiffies
processing softirqs then we break out of the loop. 'Measuring'
softirq-related load becomes much easier as well - and if it ever becomes
a problem we can throttle it artificially. (but normal systems would never
see this throttling.)

disadvantage: there is some cost to build 'context' of processing an
event. The current method of looping over pending events that are sorted
by event type (HI, TX, RX, LO) has the advantage of preserving this
'context'. OTOH, most of the event processing cost right now is not
related to building of context, but related to all sorts of cachemisses
like touching device-DMA-ed headers first time, or touching long-untouched
skbs and sockets.

unless you have strong arguments against this approach, i will start
coding this. It's a pretty intrusive change, because all current softirq
users will have to agree on a generic event format + callback that can be
queued. This has to be embedded into skbs and bh-handling structs. What do
you think?

	Ingo



* Re: [patch] softirq-2.4.10-B2
  2001-09-28  7:18   ` [patch] softirq-2.4.10-B2 Ingo Molnar
  2001-09-28 15:58     ` Andrea Arcangeli
@ 2001-09-28 18:36     ` Simon Kirby
  2001-09-28 18:47       ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Simon Kirby @ 2001-09-28 18:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrea Arcangeli, linux-kernel, Alexey Kuznetsov

On Fri, Sep 28, 2001 at 09:18:17AM +0200, Ingo Molnar wrote:

> basically the problem is that there is a big 'gap' between the activation
> of softirqs, and the time when ksoftirqd starts running. There are a
> number of mechanisms within the networking stack that are quite
> timing-sensitive. And generally, if there is work A and work B that are
> related, and we've executed work A (the hardware interrupt), then it's
> almost always the best idea to execute work B as soon as possible. Any
> 'delaying' of work B should only be done for non-performance reasons:
> eg. fairness in this case. Now it's MAX_SOFTIRQ_RESTART that balances
> performance against fairness. (in most kernel subsystems we almost always
> give preference to performance over fairness - without ignoring fairness
> of course.)

This may be somewhat related to something I've been thinking about
lately: The fact that most machines will choke under a simple UDP
while(1) sendto() attack (eg: around 100Mbit wire speed with lots of small
packets).  It seems that what's happening is that a new packet comes in
immediately after the last interrupt is serviced (because the packets are
so small), so the CPU has no time to execute anything else and goes right
back into the interrupt handler.  This starves userspace CPU time.

Would your changes affect this at all?  It would be nice if this could
somehow be batched so that a Linux router could be capable of handling
large amounts of traffic with small packets, and wouldn't choke when lots
of small packets (most commonly from a TCP SYN bomb) go through it.

Actually, I just tested my while(1) sendto() UDP spamming program on an
Acenic card and noticed that it seems to do some sort of batching /
grouping, because the number of interrupts reported by vmstat is only
8,000 while it's easily 75,000 on other eepro100 boxes.  Interesting...

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[       sim@stormix.com       ][       sim@netnation.com        ]
[ Opinions expressed are not necessarily those of my employers. ]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:21       ` Rik van Riel
  2001-09-28 17:31         ` Andrea Arcangeli
  2001-09-28 17:41         ` kuznet
@ 2001-09-28 18:39         ` Josh MacDonald
  2 siblings, 0 replies; 44+ messages in thread
From: Josh MacDonald @ 2001-09-28 18:39 UTC (permalink / raw)
  To: linux-kernel

Quoting Rik van Riel (riel@conectiva.com.br):
> On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:
> 
> > Please, explain who exactly obtains an advantage of looping.
> > net_rx_action()? Do you see drops in backlog?
> 
> > net_tx_action()? It does not look critical.
> 
> Then how would you explain the 10% speed difference
> Ben and others have seen with gigabit ethernet ?

Could this possibly be due to I-cache improvements?  If the same 
interrupt handling code is being run 10 times at once you should
expect an improvement of that kind.

-josh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq-2.4.10-B2
  2001-09-28 18:36     ` Simon Kirby
@ 2001-09-28 18:47       ` Ingo Molnar
  2001-09-28 19:31         ` kuznet
  0 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 18:47 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Andrea Arcangeli, linux-kernel, Alexey Kuznetsov


On Fri, 28 Sep 2001, Simon Kirby wrote:

> Actually, I just tested my while(1) sendto() UDP spamming program on
> an Acenic card and noticed that it seems to do some sort of batching /
> grouping, because the number of interrupts reported by vmstat is only
> 8,000 while it's easily 75,000 on other eepro100 boxes.
> Interesting...

yes, acenic.c (and the card itself) supports irq-rate limitation and
work-coalescing. Almost all gigabit cards do. Check out the tx_coal_tick,
rx_coal_tick, max_tx_desc, max_rx_desc tunables that can be set at insmod
time. Note that this does not decrease the amount of work generated, but
it will reduce the irq-processing overhead significantly.
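The effect of coalescing is easy to quantify with a hypothetical back-of-the-envelope model (not driver code): batching N rx descriptors per interrupt divides the interrupt rate by roughly N while leaving the per-packet work untouched.

```c
#include <assert.h>

/*
 * Toy model of interrupt mitigation: one irq per 'batch' packets
 * instead of one irq per packet.  Ceiling division, so a partial
 * batch still costs one interrupt.
 */
static long irqs_per_sec(long packets_per_sec, long batch)
{
	return (packets_per_sec + batch - 1) / batch;
}
```

Plugging in Simon's numbers: at 75,000 small packets/sec, a batch of 1 means 75,000 irqs/sec, while a batch of roughly 9-10 lands near the ~8,000 irqs/sec he observed on the Acenic.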

The eepro100 card does not provide such ways - there the only way to stop
future interrupts from happening is to stop the receiver, or to not refill
the rx-skb queue. (or disable the interrupt, which is a pretty crude and
inaccurate method.)

	Ingo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 17:56         ` kuznet
  2001-09-28 18:28           ` Ingo Molnar
@ 2001-09-28 18:51           ` Benjamin LaHaise
  1 sibling, 0 replies; 44+ messages in thread
From: Benjamin LaHaise @ 2001-09-28 18:51 UTC (permalink / raw)
  To: kuznet; +Cc: mingo, torvalds, linux-kernel, alan, andrea

On Fri, Sep 28, 2001 at 09:56:24PM +0400, kuznet@ms2.inr.ac.ru wrote:
> > (Lets assume that a loop of 10 still ensures basic safety in situations
> > where external load overloads the system.
> 
> It does not, evidently. And 1 does not. But 10 is 10 times worse.

10 is not 10 times worse.  The effect is much more subtle than that 
when you factor in the amount of work done as well as the cache state.

> Ingo, I told net_rx_action() is small do_softirq() restarting not 10,
> but not less than 300 times in row.

The problem comes from net_rx_action doing less than the allowable amount 
of work if another rx interrupt comes in while softirqs were being 
run -- if the rx action is not repeated, the new packets are not 
acknowledged until tens of ms later, which is a *lot* of data in the 
future.  Running the logic via ksoftirqd alone results in a significant 
cache hit just for the context switch (our stack and task structs all 
map to the *SAME* L1 cache lines on many processors), which softirqs 
do not incur.  A blanket ban on repeats generates more overhead than it 
avoids: it makes things worse by increasing the amount of 
work that needs to be done during overload.

		-ben

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 18:28           ` Ingo Molnar
@ 2001-09-28 19:23             ` kuznet
  2001-09-28 19:48               ` Ingo Molnar
  2001-09-30  9:01               ` Ingo Molnar
  2001-09-28 19:39             ` kuznet
  1 sibling, 2 replies; 44+ messages in thread
From: kuznet @ 2001-09-28 19:23 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, alan, bcrl, andrea

Hello!

> this problem has not been solved. hardirqs have properties unlike to
> softirqs that make it harder to overload them.

What properties do you mean? I see only one property: they are "hard",
and hence they should be harder to tame. :-) :-)

[ No matter. It is off-topic, but it was you who brought this up.
  Have you ever heard of "interrupt mitigation", "hw flowcontrol",
  or "polling"? All these schemes work.
]


> they are *not*. The old code deliberately ignores newly pending softirqs.

I tell you that net_rx_action() already LOOPs so much that it is silly
to loop even more. You multiplied an infinity by 10 and dare to state
that this infinity became larger. I am sorry, it did not.


> net_rx_action is 'small' in code, but not small in terms of cachemisses
> and other, unavoidable overhead. It goes over device-side skbs which can
> and will take several microseconds per packet.

Big news for me. :-) And how is it related to the issue?



> > If this logic is wrong, you should explain why it is wrong. Looping
> > dozen of times may look like cure, but for me it still looks rather
> > like steroids.
> 
> the fundamental issue here that makes looping inevitable is the following
> one. (not that looping itself is anything bad.) We enable hardirqs during
> the processing of softirqs. We also guarantee exclusivity of softirq
> processing. We also have multiple 'queues of work'. Ie. if a hardirq adds
> some new work while we were off doing other work, then we have no choice
> but look at this other work as well.  Ie. we have to loop. q.e.d.

If an irq adds new work, it is processed immediately without any new hacks.
Look into net_rx_action().


>  - do not use split softirqs, use one global (but per-cpu) 'work to be
>    done' queue, that lists multiple things. The downside: no 'priority'
>    between softirqs.
> 
> (sidenote: i do think that priority of softirqs is overdesign

Where did you find priority there? Just because something is enumerated
does not mean that low numbers have some priority.

Argh... I see what you mean. It is not priority, it is a bug introduced
in 2.4.7. (By you? :-)) It was not present before; all the softirqs
were processed each round.

Well, just restore the old net_rx_action() from 2.4.6.

OK, now I understand your point. You mean that net_rx_action()
is not started immediately when an irq arrives while net_tx_action()
or a timer or something is running. Yes, it is really bad.

But it still does not allow you to restart net_rx_action() 10 times in a row.


> unless you have strong arguments against this approach, i will start
> coding this.

I am sorry, I do not find anything new in this approach.
It seems all that you propose is to replace a single bit in a single word
with some struct. It will make sense when we have more than one real
softirq. :-)


> queued. This has to be embedded into skbs and bh-handling structs. What do
> you think?

I think that you should change the subject to discuss this. :-)

It is not related to the problem.

Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq-2.4.10-B2
  2001-09-28 18:47       ` Ingo Molnar
@ 2001-09-28 19:31         ` kuznet
  0 siblings, 0 replies; 44+ messages in thread
From: kuznet @ 2001-09-28 19:31 UTC (permalink / raw)
  To: mingo; +Cc: sim, andrea, linux-kernel

Hello!

> The eepro100 card does not provide such ways 

It does, of course. It is just not documented.
Intel's driver does mitigate.

Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 18:28           ` Ingo Molnar
  2001-09-28 19:23             ` kuznet
@ 2001-09-28 19:39             ` kuznet
  2001-09-28 20:03               ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: kuznet @ 2001-09-28 19:39 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, alan, bcrl, andrea

Hello!

> unless you have strong arguments against this approach, i will start
> coding this. It's a pretty intrusive change, because all current softirq
> users will have to agree on a generic event format + callback that can be
> queued. This has to be embedded into skbs and bh-handling structs. What do
> you think?

Why skbs?

Anyway, scheme sort of:

do_softirq()
{
	start = jiffies;

	while (dequeue()) {
		if (jiffies - start > 1)
			goto wakeup_ksoftirq;
		process();
	}
	return;

wakeup_ksoftirq:
	wakeup_softirqd(cpu);
}

and having raise_softirq() enqueue the event at the tail, if it is not
already queued, is nice, non-intrusive and very easy.
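The same scheme works as a self-contained userspace sketch. All names here are illustrative, `fake_jiffies` stands in for the real tick counter, and the time check is done before dequeueing so that no event is dropped:

```c
#include <assert.h>
#include <stddef.h>

struct event {
	struct event *next;
};

static unsigned long fake_jiffies;	/* stand-in for the tick counter */

/* Pretend each event costs a full tick, to exercise the time budget. */
static void process(struct event *e)
{
	(void)e;
	fake_jiffies++;
}

/* Link n events into a singly-linked queue; returns the head. */
static struct event *make_queue(struct event *buf, int n)
{
	int i;

	for (i = 0; i < n - 1; i++)
		buf[i].next = &buf[i + 1];
	buf[n - 1].next = NULL;
	return &buf[0];
}

/*
 * Drain the queue, but stop once more than one jiffy has elapsed;
 * in the kernel the remainder would be left for ksoftirqd.
 * Returns the number of events processed.
 */
static int do_softirq_sketch(struct event **q)
{
	unsigned long start = fake_jiffies;
	int done = 0;

	while (*q) {
		struct event *e;

		if (fake_jiffies - start > 1)
			break;			/* "goto wakeup_ksoftirq" */
		e = *q;
		*q = e->next;
		process(e);
		done++;
	}
	return done;
}
```

With each event costing one simulated tick, the loop drains two events and leaves the rest queued for the deferred path, regardless of how long the queue is.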

Only "skbs" scares me. :-)

Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 19:23             ` kuznet
@ 2001-09-28 19:48               ` Ingo Molnar
  2001-09-29 16:35                 ` kuznet
  2001-09-30  9:01               ` Ingo Molnar
  1 sibling, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 19:48 UTC (permalink / raw)
  To: kuznet; +Cc: torvalds, linux-kernel, alan, bcrl, andrea


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> > this problem has not been solved. hardirqs have properties unlike to
> > softirqs that make it harder to overload them.
>
> What properties do you mean? I see the only property, they are "hard",
> and hence they should be harder to tame. :-) :-)

(they can be disabled at their source more easily. softirqs keep getting
new work queued via hardirqs, so it's harder to 'tame' them other than
dropping them or delaying them - both of which can cause additional
problems.)

> [ No matter. It is out of topic, but it were you who said about this.
>   Did you ever hear about "interrupt mitigation", "hw flowcontrol",
>   "polling" eventually? All the schemes work.
> ]

(sure it works. but Linux boxes still effectively lock up if bombarded
with UDP packets - ie. the problem has not been solved. But this is a
different topic - i think i understand that you meant 'has been solved'
as 'solved theoretically'. What i meant is that it's implemented in Linux
for a number of cases, yet Linux provides no automatic guarantee
against such overloads. I think we both agree and i was just too brief and
too inaccurate in specifying what i meant to say. Issue closed?)

> > they are *not*. The old code deliberately ignores newly pending softirqs.
>
> I tell you that net_rx_action() already LOOPs so much that it is silly
> to loop even more. You multiplied an infinity with 10 and dare to
> state that this infinity became larger. I am sorry, it is not.

no matter how much rx_action loops, it does not help if it's tx work that
is queueing up meanwhile and we ignore that tx work after having finished
rx processing. How many times do i have to loop over this simple packet of
a fact to receive an ACK? :-)

> > net_rx_action is 'small' in code, but not small in terms of cachemisses
> > and other, unavoidable overhead. It goes over device-side skbs which can
> > and will take several microseconds per packet.
>
> Big news for me. :-) And how is it related to the issue?

it's very much related, because it opens up a real window for other
softirqs to be re-activated - which is the basic situation my patch solves.

> > the fundamental issue here that makes looping inevitable is the following
> > one. (not that looping itself is anything bad.) We enable hardirqs during
> > the processing of softirqs. We also guarantee exclusivity of softirq
> > processing. We also have multiple 'queues of work'. Ie. if a hardirq adds
> > some new work while we were off doing other work, then we have no choice
> > but look at this other work as well.  Ie. we have to loop. q.e.d.
>
> If irq adds new work it is processed immedately without any new hacks.
> Look into net_rx_action().

it's not processed if we are on our way out eg. processing the LO softirq.
Or tx work does not get processed if we are in rx processing.

> >  - do not use split softirqs, use one global (but per-cpu) 'work to be
> >    done' queue, that lists multiple things. The downside: no 'priority'
> >    between softirqs.
> >
> > (sidenote: i do think that priority of softirqs is overdesign
>
> Where did you find priority there? [...]

i found no functional concept of priority there. But i had to react to the
following, largely bogus claim by Andrea:

----------->
> What you are missing is a property provided by the old method.
>
> We have the NET_RX_SOFTIRQ that floods very heavily, so far so good.
>
> Then we have HI_SOFTIRQ, incidentally HI_SOFTIRQ from irq wants to be
> executed with very low latency, with your patch it _can_ be postpone to
> ksoftirqd processing just because there's the NET_RX_SOFTIRQ cpu hog in
> background. With the old method it was guaranteed that the HI_SOFTIRQ
> was executed with very low latency within the irq, no matter of the
> NET_RX_SOFTIRQ flood.
<-----------

i was wrong to attribute this claim to you - sorry! I have a hard time
multitasking between your and Andrea's arguments, so sometimes the work
request and the completion interrupt get mixed up :-)


> As soon as you see something is enumerated, it still does not mean
> that low numbers have some priority.

yep.


> OK, now I understand your point. You mean that net_rx_action() is not
> started immediately, when irq arrives while net_tx_action() or a timer
> or something is running. Yes, it is really bad.

good :)

	Ingo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 19:39             ` kuznet
@ 2001-09-28 20:03               ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-28 20:03 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Linus Torvalds, linux-kernel, Alan Cox, bcrl, Andrea Arcangeli


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> Why skbs?

because i did not mean to queue 'softirq bits'. the current bitmap is
perfectly okay and equivalent to any queued scheme. (i'd say queueing
those bits is pretty stupid and only adds overhead. It might make sense
once there are more softirqs, but that can wait.)

what i really meant to queue is something much more finegrained: to queue
softirq *events*. The ones that get queued by netif_rx into
softnet_data[this_cpu]'s input_pkt_queue. We basically have the skb as the
'event context', and we have a number of queues that are used by hardirqs
to queue work towards softirq processing.

do you see the whole scope of this approach? It has a number of
disadvantages (some of which i outlined in the previous mail), but it also
has a number of (i think important) advantages. It's also in fact simpler,
once implemented.

but this means that some of the queueing, that is softnet_data & skb
pointers now, has to be generalized. netif_rx would use this generic
interface to push work towards softirq processing, and the generic softirq
engine would use a callback in the event data structure to do actual
processing of the event.

eg. a fastroute packet could set its event handler to be a function that
does dev_queue_xmit. Other packets could set their event handlers based on
their ->dev. The current demultiplexing done in net_rx_action could be
done in a more natural way, and eg. fastrouting would not add runtime
overhead.
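A minimal sketch of such a generic event queue, assuming only the shape described above (an embedded link plus a handler callback). Every name here is hypothetical, not existing kernel API:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Generic softirq event: would be embedded in an skb or a bh-handling
 * struct, with ->handler set by the producer (e.g. a dev_queue_xmit
 * wrapper for a fastroute packet).
 */
struct softirq_event {
	struct softirq_event *next;
	void (*handler)(struct softirq_event *);
};

/* One such queue would exist per cpu, fed from hardirq context. */
struct event_queue {
	struct softirq_event *head;
	struct softirq_event **tail;
};

static void queue_init(struct event_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* FIFO append: preserves the micro-ordering of rx and tx events. */
static void queue_event(struct event_queue *q, struct softirq_event *e)
{
	e->next = NULL;
	*q->tail = e;
	q->tail = &e->next;
}

/* The softirq engine demultiplexes via the event itself, not via
 * per-softirq bits.  Returns the number of events handled. */
static int run_events(struct event_queue *q)
{
	int n = 0;

	while (q->head) {
		struct softirq_event *e = q->head;

		q->head = e->next;
		if (!q->head)
			q->tail = &q->head;
		e->handler(e);
		n++;
	}
	return n;
}

/* Example handler, standing in for per-packet processing. */
static int handled;
static void count_handler(struct softirq_event *e)
{
	(void)e;
	handled++;
}
```

The point of the callback is that the dispatch done today inside net_rx_action could instead be decided by whoever queued the event.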

another effect: i think it might be useful to preserve micro-ordering of
rx and tx events. It's certainly the best way to get the most out of
existing cache footprint, but the downside is that it has a higher
'environment' cache footprint, due to rx and tx functions being called at
high frequencies and being intermixed randomly.

> Only "skbs" scares me. :-)

well :)

	Ingo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq-2.4.10-B3
  2001-09-28 16:35   ` [patch] softirq-2.4.10-B3 Ingo Molnar
@ 2001-09-29  0:40     ` J . A . Magallon
  0 siblings, 0 replies; 44+ messages in thread
From: J . A . Magallon @ 2001-09-29  0:40 UTC (permalink / raw)
  To: mingo
  Cc: Alexey Kuznetsov, Linus Torvalds, linux-kernel, Alan Cox, bcrl,
	Andrea Arcangeli


On 20010928 Ingo Molnar wrote:
>
>On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:
>
>> >  - '[ksoftirqd_CPU0]' is confusing on UP systems, changed it to
>> >    '[ksoftirqd]' instead.
>>
>> It is useless to argue about preferences, but universal naming scheme
>> looks as less confusing yet. :-)
>
>since you are the second one to complain, i'm convinced :) reverted.
>
>i've also added back the BUG() check for smp_processor_id() == cpu.
>
>-B3 attached.
>

Beware: netconsole slipped into the patch, and makes 'make xconfig' fail:

...
./tkparse < ../arch/i386/config.in >> kconfig.tk
drivers/net/Config.in: 251: can't handle dep_bool/dep_mbool/dep_tristate condition
make[1]: *** [kconfig.tk] Error 1
make[1]: Leaving directory `/usr/src/linux-2.4.10-bw/scripts'
make: *** [xconfig] Error 2

-- 
J.A. Magallon                           #  Let the source be with you...        
mailto:jamagallon@able.es
Mandrake Linux release 8.2 (Cooker) for i586
Linux werewolf 2.4.10-bw #1 SMP Thu Sep 27 00:32:53 CEST 2001 i686

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
                     ` (2 preceding siblings ...)
  2001-09-28 16:35   ` [patch] softirq-2.4.10-B3 Ingo Molnar
@ 2001-09-29 11:03   ` Rusty Russell
  3 siblings, 0 replies; 44+ messages in thread
From: Rusty Russell @ 2001-09-29 11:03 UTC (permalink / raw)
  To: kuznet; +Cc: mingo, torvalds, linux-kernel, alan, bcrl, andrea

On Fri, 28 Sep 2001 20:18:20 +0400 (MSK DST)
kuznet@ms2.inr.ac.ru wrote:

> Also, you did not assure me that you interpret problem correctly.
> netif_rx() is __insensitive__ to latencies due to blocked softirq
> restarts. It stops spinning only when it becomes true cpu hog.
> And scheduling ksoftirqd is the only variant here.

Hi Ingo et al.,

Attached is the approach I sent to l-k earlier this week: if a timer tick
goes off, back off to ksoftirqd.

This should, statistically, do the right thing, even in the case of smart
softirqs like netif_rx.

Please consider,
Rusty.

--- linux-pmac/kernel/softirq.c	Sun Sep  9 15:11:37 2001
+++ working-pmac-ksoftirq/kernel/softirq.c	Mon Sep 24 09:44:07 2001
@@ -63,11 +63,12 @@
 	int cpu = smp_processor_id();
 	__u32 pending;
 	long flags;
-	__u32 mask;
+	long start;
 
 	if (in_interrupt())
 		return;
 
+	start = jiffies;
 	local_irq_save(flags);
 
 	pending = softirq_pending(cpu);
@@ -75,32 +76,32 @@
 	if (pending) {
 		struct softirq_action *h;
 
-		mask = ~pending;
 		local_bh_disable();
-restart:
-		/* Reset the pending bitmask before enabling irqs */
-		softirq_pending(cpu) = 0;
+		do {
+			/* Reset the pending bitmask before enabling irqs */
+			softirq_pending(cpu) = 0;
 
-		local_irq_enable();
+			local_irq_enable();
 
-		h = softirq_vec;
+			h = softirq_vec;
 
-		do {
-			if (pending & 1)
-				h->action(h);
-			h++;
-			pending >>= 1;
-		} while (pending);
-
-		local_irq_disable();
-
-		pending = softirq_pending(cpu);
-		if (pending & mask) {
-			mask &= ~pending;
-			goto restart;
-		}
+			do {
+				if (pending & 1)
+					h->action(h);
+				h++;
+				pending >>= 1;
+			} while (pending);
+
+			local_irq_disable();
+
+			pending = softirq_pending(cpu);
+
+			/* Don't spin here forever... */
+		} while (pending && start == jiffies);
 		__local_bh_enable();
 
+		/* If a timer tick went off, assume we're overloaded,
+                   and kick in ksoftirqd */
 		if (pending)
 			wakeup_softirqd(cpu);
 	}



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 19:48               ` Ingo Molnar
@ 2001-09-29 16:35                 ` kuznet
  2001-09-30  9:37                   ` Kai Henningsen
  0 siblings, 1 reply; 44+ messages in thread
From: kuznet @ 2001-09-29 16:35 UTC (permalink / raw)
  To: mingo; +Cc: torvalds, linux-kernel, alan, bcrl, andrea

Hello!

[ Well, Ingo, you may go directly to the end of the mail, snipping
  historical comments. :-) ]

> (sure it works. but Linux boxes still effectively locks up if bombarded
> with UDP packets - ie. the problem has not been solved.
...
> different topic - i think i understand that you meant 'has been solved',
> as 'solved theoretically'.

For me it is solved in practice. I cannot cause a hardirq lockup
either with tulip or with acenic. If you mean eepro100, you may fix it
by implementing f.e. hw flowcontrol, which should be easy with any hardware.

BTW I cannot cause softirq lockup with current kernels as well.


> to inaccurate in specifying what i meant to say. Issue closed?)

Yes. I do not know why you started to talk about this. "I do this nasty
thing because another nasty thing exists" is never a good argument. Though
I also use it sometimes. :-)


> i found no functional concept of priority there. But i had to react to the
> following, largely bogus claim by Andrea:

Andrea is in fact not speaking about priority; he speaks rather about
fairness. Essentially, talking in your terms, we have a small number of
events, encoded in bits of a mask, and instead of queueing them in strict
order we used to process all of them each round. In fact, it is equivalent
to a FIFO with small reordering (as long as the number of events is small).

:-) The two-letter "HI" was my invention, and I bet I did not mean
that it is high priority. Actually, it is just a random name.
It is difficult to say why I selected it: a combination of
"higher than BH", the sparc assembly I had to hack at the time,
and... yes, at _that_ time it was important that it be processed
before NET_*, because timers and BH_IMMEDIATE produce net softirqs.
It is not important any more.

> good :)

Listen, let us try to make this right yet.

Essentially, if we invent some real condition under which the softirq loop
must be stopped, we win. Now it is "stop after all events pending at start
are processed". OK, it is wrong. You propose: "10 rounds". It is even worse,
because it is against plain logic. (pardon :-))

If we invent the right condition, we may improve things really significantly,
almost without changes to the rest. (F.e. killing the internal restart
in net_rx_action() would be a great win; it is in the same vein as your
event queue, by the way. An ideal scheme must not fail even if
each net_rx_action() processes a single packet, leaving the rest queued.
Each skb is an event. :-))

I am sure taming is very easy, if we strain our brains a bit.
Essentially, your 10 must be replaced with some simple time characteristic.
It seems possible to do this correctly, f.e. using a small deadline
scheduler with two participants: softirq and all the rest.
A policy along the lines of: softirqs may execute for 1/HZ, if they insist
on this, but if they force some processes to starve, they lose the
low-latency guarantee, yet get latency _not_ worse than 10 msec and do not
eat more than some share (half) of the cpu in this case. At least for UP
I know how to do this exactly, forming a trivial deadline scheduler
with two participants: the run_queue and pending softirqs.
With SMP this can be the same... only the run_queue is shared, and a dumb
extension has bad effects, sort of throttling softirqs on cpu0
when a process is really scheduled on cpu0. No doubt, it should be easy.
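As a toy model of the two-participant policy sketched above - where the "half share" cap is one reading of the proposal, and all names and numbers are illustrative, not kernel code:

```c
#include <assert.h>

/*
 * Two-participant deadline policy, UP version: per tick, softirq work
 * may run freely while nothing else is runnable; once processes are
 * waiting, softirqs are capped at half of the elapsed ticks, bounding
 * both process starvation and softirq latency.
 */
static int softirq_may_run(int softirq_ticks, int total_ticks,
			   int runqueue_busy)
{
	if (!runqueue_busy)
		return 1;		/* nothing to starve */
	return 2 * softirq_ticks < total_ticks;
}
```

The single comparison replaces the fixed "10 rounds" constant with a time-based share, which is the substance of the suggestion.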

Alexey

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-28 19:23             ` kuznet
  2001-09-28 19:48               ` Ingo Molnar
@ 2001-09-30  9:01               ` Ingo Molnar
  1 sibling, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2001-09-30  9:01 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Linus Torvalds, linux-kernel, Alan Cox, bcrl, Andrea Arcangeli


On Fri, 28 Sep 2001 kuznet@ms2.inr.ac.ru wrote:

> Argh... I see, what you mean. It is not priority, it is bug introduced
> in 2.4.7. (By you? :-)) [...]

not by me: the 2.4.6 softirq changes/cleanups (softirq_pending cleanups,
restarting softirqs in various places, etc. - the loop-until-done stuff)
are the ones from me; 2.4.7 contains the ksoftirqd-related changes/cleanups.

	Ingo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch] softirq performance fixes, cleanups, 2.4.10.
  2001-09-29 16:35                 ` kuznet
@ 2001-09-30  9:37                   ` Kai Henningsen
  0 siblings, 0 replies; 44+ messages in thread
From: Kai Henningsen @ 2001-09-30  9:37 UTC (permalink / raw)
  To: kuznet; +Cc: linux-kernel, mingo

kuznet@ms2.inr.ac.ru  wrote on 29.09.01 in <200109291635.UAA17377@ms2.inr.ac.ru>:

> Essentially, if we invent some real condition when softirq loop must be
> stopped, we win. Now it is "stop after all events pending at start are
> processed". OK, it is wrong. You propose: "10 rounds". It is even worse
> because it is against plain logic. (pardon :-)).

The way I understand it from this thread, the condition is *not* "10
rounds"; it is "no more new softirqs are pending when we want to leave" -
except it does a sort of emergency abort if 10 rounds weren't enough to
reach that condition.

MfG Kai

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2001-09-30 13:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-26 16:44 [patch] softirq performance fixes, cleanups, 2.4.10 Ingo Molnar
2001-09-26 17:48 ` Mike Kravetz
2001-09-26 18:48   ` Ingo Molnar
2001-09-26 18:55 ` Russell King
2001-09-26 19:14   ` Ingo Molnar
2001-09-27 23:31 ` Andrea Arcangeli
2001-09-28  3:20   ` Andrea Arcangeli
2001-09-28  7:34     ` Ingo Molnar
2001-09-28 15:17       ` Andrea Arcangeli
2001-09-28  7:18   ` [patch] softirq-2.4.10-B2 Ingo Molnar
2001-09-28 15:58     ` Andrea Arcangeli
2001-09-28 18:36     ` Simon Kirby
2001-09-28 18:47       ` Ingo Molnar
2001-09-28 19:31         ` kuznet
2001-09-28 16:18 ` [patch] softirq performance fixes, cleanups, 2.4.10 kuznet
2001-09-28 16:31   ` Ingo Molnar
2001-09-28 17:04     ` kuznet
2001-09-28 17:21       ` Rik van Riel
2001-09-28 17:31         ` Andrea Arcangeli
2001-09-28 17:41         ` kuznet
2001-09-28 17:46           ` Ingo Molnar
2001-09-28 18:39         ` Josh MacDonald
2001-09-28 17:31       ` Ingo Molnar
2001-09-28 17:56         ` kuznet
2001-09-28 18:28           ` Ingo Molnar
2001-09-28 19:23             ` kuznet
2001-09-28 19:48               ` Ingo Molnar
2001-09-29 16:35                 ` kuznet
2001-09-30  9:37                   ` Kai Henningsen
2001-09-30  9:01               ` Ingo Molnar
2001-09-28 19:39             ` kuznet
2001-09-28 20:03               ` Ingo Molnar
2001-09-28 18:51           ` Benjamin LaHaise
2001-09-28 16:32   ` Andrea Arcangeli
2001-09-28 16:46     ` Ingo Molnar
2001-09-28 16:58       ` Andrea Arcangeli
2001-09-28 16:35   ` [patch] softirq-2.4.10-B3 Ingo Molnar
2001-09-29  0:40     ` J . A . Magallon
2001-09-29 11:03   ` [patch] softirq performance fixes, cleanups, 2.4.10 Rusty Russell
  -- strict thread matches above, loose matches on Subject: below --
2001-09-27 23:29 Oleg Nesterov
2001-09-28  0:03 ` Andrea Arcangeli
2001-09-28  6:57 ` Ingo Molnar
2001-09-28  2:50 Oleg Nesterov
2001-09-28  7:56 ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox