public inbox for linux-kernel@vger.kernel.org
* [RFC] BH removal text
@ 2002-07-01  4:05 Matthew Wilcox
  2002-07-01 13:41 ` Arnd Bergmann
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Matthew Wilcox @ 2002-07-01  4:05 UTC (permalink / raw)
  To: Janitors; +Cc: linux-kernel


I'm soliciting comments before people start implementing these things.
Please, do NOT start changing anything based on the instructions given
below.  I do intend to update the floppy.c patch to fix the problems
I mentioned below, but I'm going to sleep first.

PRERELEASE VERSION 2002-06-30-01

A janitor's guide to removing bottom halves
===========================================

First, ignore the serial devices.  They're being taken care of
independently.

Apart from these, we currently use 3 bottom halves: IMMEDIATE_BH,
TIMER_BH and TQUEUE_BH.  There is a spinlock (global_bh_lock) which
is held when running any of these three bottom halves, so none of them
can run at the same time.  IMMEDIATE_BH runs the immediate task queue
(tq_immediate).  TQUEUE_BH runs the timer task queue (tq_timer).
TIMER_BH first calls update_times(), then runs the timer list.

What does all that mean?
------------------------

Right now, the kernel guarantees it will only enter your driver (or
indeed any user, but we're mostly concerned with drivers) through one of
these entry points at a time.  If we get rid of bottom halves, we will be
able to enter a driver simultaneously through any active timer routine,
any active immediate task routine and any active timer task routine.

So how do we modify drivers?
----------------------------

I am of the opinion that we should remove tq_immediate entirely.
Every current user of it should be converted to use a private tasklet.
Example code for floppy.c to show how to do this can be found at:
http://ftp.linux.org.uk/pub/linux/willy/patches/floppy.diff
Note that this patch is BROKEN.  There's no locking to prevent any of our
timers from being called at the same time as our tasklet.  See below ...

Some of the users of tq_timer should probably be converted to
schedule_task so they run in a user context rather than interrupt context.
But there will always be a need for a task queue to be run in interrupt
context after a fixed period of time has elapsed in order to allow
for interrupt mitigation.  I think a better interface should be used
for tq_timer anyway -- I will be proposing a queue_timer_task() macro.
We can use conversion to this interface as a flag to indicate that a
driver has been checked for SMP locking.

The same thing goes for add_timer users, except that there's no better
interface that I want to convert drivers to.  So a comment beside the
add_timer usage indicating that you've checked the locking and it's OK
is helpful.

So how should we do the locking?
--------------------------------

Notice that right now we use a spinlock whenever we call any of these
entry points.  So it should be safe to declare a new spinlock within
this driver and acquire/release it at entry & exit to each of these
types of functions.  It's easier than converting drivers which use the
BKL because they might sleep or acquire it twice.  Be wary of reusing an
existing spinlock because it might be acquired from interrupt context,
so you'd have to use spin_lock_irq to acquire it in other contexts.
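
As a rough illustration of the "lazy way" described above, here is a
userspace sketch, not kernel code.  It rests on several assumptions: a
pthread mutex stands in for the driver-private spinlock, two threads
stand in for two CPUs entering the driver at once, and every name
(mydrv_lock, mydrv_timer_fn, mydrv_tasklet_fn, mydrv_run_demo) is
invented for the example.  Each entry point takes the private lock on
entry and drops it on exit, so the shared state is never touched
concurrently.

```c
/* Userspace analogy only: pthreads stand in for kernel entry points. */
#include <assert.h>
#include <pthread.h>

#define ITERATIONS 100000

/* Hypothetical driver state; every name here is invented for illustration. */
static pthread_mutex_t mydrv_lock = PTHREAD_MUTEX_INITIALIZER;
static long mydrv_events;

/* Stand-in for a timer routine: lock at entry, unlock at exit. */
static void mydrv_timer_fn(void)
{
	pthread_mutex_lock(&mydrv_lock);
	mydrv_events++;
	pthread_mutex_unlock(&mydrv_lock);
}

/* Stand-in for a private tasklet routine, using the same private lock. */
static void mydrv_tasklet_fn(void)
{
	pthread_mutex_lock(&mydrv_lock);
	mydrv_events++;
	pthread_mutex_unlock(&mydrv_lock);
}

/* One "CPU" that keeps firing the timer routine. */
static void *timer_cpu(void *unused)
{
	int i;

	(void)unused;
	for (i = 0; i < ITERATIONS; i++)
		mydrv_timer_fn();
	return NULL;
}

/* Another "CPU" that keeps running the tasklet routine. */
static void *tasklet_cpu(void *unused)
{
	int i;

	(void)unused;
	for (i = 0; i < ITERATIONS; i++)
		mydrv_tasklet_fn();
	return NULL;
}

/* Run both entry points concurrently and return the final event count. */
long mydrv_run_demo(void)
{
	pthread_t a, b;
	int err;

	mydrv_events = 0;
	err = pthread_create(&a, NULL, timer_cpu, NULL);
	assert(err == 0);
	err = pthread_create(&b, NULL, tasklet_cpu, NULL);
	assert(err == 0);
	err = pthread_join(a, NULL);
	assert(err == 0);
	err = pthread_join(b, NULL);
	assert(err == 0);
	return mydrv_events;
}
```

With the lock in place, mydrv_run_demo() always counts every event from
both threads.  In a real driver the lock would be a spinlock_t, and the
choice between spin_lock, spin_lock_bh and spin_lock_irq depends on
which contexts can take it, as noted above.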

Of course, that's the lazy way of doing it.  What I'm hoping is that each
Janitor will take a driver and spend a week checking over its locking.
There's only 80 files in the kernel which use tq_immediate; with 10
Janitors involved, that's 8 drivers each -- that's only 2 months and we
have 4.

That doesn't mean that we shouldn't worry about the 38 files which use
tq_timer, but they are almost all tty related and are therefore Hard ;-)

-- 
Revolutions do not require corporate support.

* Re: [RFC] BH removal text
  2002-07-01  4:05 [RFC] BH removal text Matthew Wilcox
@ 2002-07-01 13:41 ` Arnd Bergmann
  2002-07-03  7:21 ` george anzinger
  2002-07-14  1:05 ` William Lee Irwin III
  2 siblings, 0 replies; 13+ messages in thread
From: Arnd Bergmann @ 2002-07-01 13:41 UTC (permalink / raw)
  To: Matthew Wilcox, Janitors; +Cc: linux-kernel

On Monday 01 July 2002 06:05, Matthew Wilcox wrote:

> Of course, that's the lazy way of doing it.  What I'm hoping is that each
> Janitor will take a driver and spend a week checking over its locking.
> There's only 80 files in the kernel which use tq_immediate; with 10
> Janitors involved, that's 8 drivers each -- that's only 2 months and we
> have 4.

I suppose mine are the 8 drivers in drivers/s390 then :-). 
I'm already working on them to support the new LDM and I know the
maintainers.

	Arnd <><

* Re: [RFC] BH removal text
  2002-07-01  4:05 [RFC] BH removal text Matthew Wilcox
  2002-07-01 13:41 ` Arnd Bergmann
@ 2002-07-03  7:21 ` george anzinger
  2002-07-03 11:15   ` Matthew Wilcox
  2002-07-14  1:05 ` William Lee Irwin III
  2 siblings, 1 reply; 13+ messages in thread
From: george anzinger @ 2002-07-03  7:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Janitors, linux-kernel

Matthew Wilcox wrote:
> 
> I'm soliciting comments before people start implementing these things.
> Please, do NOT start changing anything based on the instructions given
> below.  I do intend to update the floppy.c patch to fix the problems
> I mentioned below, but I'm going to sleep first.
> 
> PRERELEASE VERSION 2002-06-30-01
> 
> A janitor's guide to removing bottom halves
> ===========================================
> 
> First, ignore the serial devices.  They're being taken care of
> independently.
> 
> Apart from these, we currently use 3 bottom halves: IMMEDIATE_BH,
> TIMER_BH and TQUEUE_BH.  There is a spinlock (global_bh_lock) which
> is held when running any of these three bottom halves, so none of them
> can run at the same time.  IMMEDIATE_BH runs the immediate task queue
> (tq_immediate).  TQUEUE_BH runs the timer task queue (tq_timer).
> TIMER_BH first calls update_times(), then runs the timer list.

It should also be noted that none of these is entered if a
cli is in effect.  This is the global cli and inhibits BHs
on all cpus.

-g

> 
> What does all that mean?
> ------------------------
> 
> Right now, the kernel guarantees it will only enter your driver (or
> indeed any user, but we're mostly concerned with drivers) through one of
> these entry points at a time.  If we get rid of bottom halves, we will be
> able to enter a driver simultaneously through any active timer routine,
> any active immediate task routine and any active timer task routine.
> 
> So how do we modify drivers?
> ----------------------------
> 
> I am of the opinion that we should remove tq_immediate entirely.
> Every current user of it should be converted to use a private tasklet.
> Example code for floppy.c to show how to do this can be found at:
> http://ftp.linux.org.uk/pub/linux/willy/patches/floppy.diff
> Note that this patch is BROKEN.  There's no locking to prevent any of our
> timers from being called at the same time as our tasklet.  See below ...
> 
> Some of the users of tq_timer should probably be converted to
> schedule_task so they run in a user context rather than interrupt context.
> But there will always be a need for a task queue to be run in interrupt
> context after a fixed period of time has elapsed in order to allow
> for interrupt mitigation.  I think a better interface should be used
> for tq_timer anyway -- I will be proposing a queue_timer_task() macro.
> We can use conversion to this interface as a flag to indicate that a
> driver has been checked for SMP locking.
> 
> The same thing goes for add_timer users, except that there's no better
> interface that I want to convert drivers to.  So a comment beside the
> add_timer usage indicating that you've checked the locking and it's OK
> is helpful.
> 
> So how should we do the locking?
> --------------------------------
> 
> Notice that right now we use a spinlock whenever we call any of these
> entry points.  So it should be safe to declare a new spinlock within
> this driver and acquire/release it at entry & exit to each of these
> types of functions.  It's easier than converting drivers which use the
> BKL because they might sleep or acquire it twice.  Be wary of reusing an
> existing spinlock because it might be acquired from interrupt context,
> so you'd have to use spin_lock_irq to acquire it in other contexts.
> 
> Of course, that's the lazy way of doing it.  What I'm hoping is that each
> Janitor will take a driver and spend a week checking over its locking.
> There's only 80 files in the kernel which use tq_immediate; with 10
> Janitors involved, that's 8 drivers each -- that's only 2 months and we
> have 4.
> 
> That doesn't mean that we shouldn't worry about the 38 files which use
> tq_timer, but they are almost all tty related and are therefore Hard ;-)
> 
> --
> Revolutions do not require corporate support.

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

* Re: [RFC] BH removal text
  2002-07-03  7:21 ` george anzinger
@ 2002-07-03 11:15   ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2002-07-03 11:15 UTC (permalink / raw)
  To: george anzinger; +Cc: Matthew Wilcox, Janitors, linux-kernel

On Wed, Jul 03, 2002 at 12:21:26AM -0700, george anzinger wrote:
> It should also be noted that none of these is entered if a
> cli is in effect.  This is the global cli and inhibits BHs
> on all cpus.

global cli() is so heavily deprecated, it isn't even funny.  i know
mingo has a patch to remove it entirely.  if you want to ensure that
no softirq/tasklet/timer/... is run, use spin_lock_bh().  I'll add this
information, thanks!

-- 
Revolutions do not require corporate support.

* Re: [RFC] BH removal text
  2002-07-01  4:05 [RFC] BH removal text Matthew Wilcox
  2002-07-01 13:41 ` Arnd Bergmann
  2002-07-03  7:21 ` george anzinger
@ 2002-07-14  1:05 ` William Lee Irwin III
  2002-07-14  4:52   ` Dipankar Sarma
  2 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-14  1:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Janitors, linux-kernel

On Mon, Jul 01, 2002 at 05:05:55AM +0100, Matthew Wilcox wrote:
> That doesn't mean that we shouldn't worry about the 38 files which use
> tq_timer, but they are almost all tty related and are therefore Hard ;-)

__global_cli(), timer_bh(), and bh_action() are crippling my machines.

Where do I start?


Cheers,
Bill

* Re: [RFC] BH removal text
  2002-07-14  1:05 ` William Lee Irwin III
@ 2002-07-14  4:52   ` Dipankar Sarma
  2002-07-14 10:17     ` William Lee Irwin III
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Dipankar Sarma @ 2002-07-14  4:52 UTC (permalink / raw)
  To: William Lee Irwin III, Matthew Wilcox, Janitors, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1118 bytes --]

On Sat, Jul 13, 2002 at 06:05:06PM -0700, William Lee Irwin III wrote:
> On Mon, Jul 01, 2002 at 05:05:55AM +0100, Matthew Wilcox wrote:
> > That doesn't mean that we shouldn't worry about the 38 files which use
> > tq_timer, but they are almost all tty related and are therefore Hard ;-)
> 
> __global_cli(), timer_bh(), and bh_action() are crippling my machines.
> 
> Where do I start?

Even if you replace timer_bh() with a tasklet, you still need
to take the global_bh_lock to ensure that timers don't race with
single-threaded BH processing in drivers. I wrote this patch [included]
to get rid of timer_bh in Ingo's smptimers, but it acquires
global_bh_lock as well as net_bh_lock, the latter to ensure
that some older protocol code that expected serialization of
NET_BH and timers works correctly (see deliver_to_old_ones()).
They need to be cleaned up too.

My patch of course was experimental to see what is needed to
get rid of timer_bh. It needs some cleanup itself ;-)

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

[-- Attachment #2: smptimers_X1-2.5.24-1.patch --]
[-- Type: text/plain, Size: 38932 bytes --]

diff -urN linux-2.5.24-base/arch/alpha/kernel/smp.c linux-2.5.24-smptimers_X1/arch/alpha/kernel/smp.c
--- linux-2.5.24-base/arch/alpha/kernel/smp.c	Fri Jun 21 04:23:48 2002
+++ linux-2.5.24-smptimers_X1/arch/alpha/kernel/smp.c	Fri Jul  5 14:40:29 2002
@@ -24,6 +24,7 @@
 #include <linux/spinlock.h>
 #include <linux/irq.h>
 #include <linux/cache.h>
+#include <linux/timer.h>
 
 #include <asm/hwrpb.h>
 #include <asm/ptrace.h>
@@ -654,6 +655,7 @@
 		if (softirq_pending(cpu))
 			do_softirq();
 	}
+	run_local_timers();
 }
 
 int __init
diff -urN linux-2.5.24-base/arch/i386/kernel/apic.c linux-2.5.24-smptimers_X1/arch/i386/kernel/apic.c
--- linux-2.5.24-base/arch/i386/kernel/apic.c	Fri Jun 21 04:23:52 2002
+++ linux-2.5.24-smptimers_X1/arch/i386/kernel/apic.c	Fri Jul  5 14:40:29 2002
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/mc146818rtc.h>
 #include <linux/kernel_stat.h>
+#include <linux/timer.h>
 
 #include <asm/atomic.h>
 #include <asm/smp.h>
@@ -1087,7 +1088,7 @@
 	irq_enter(cpu, 0);
 	smp_local_timer_interrupt(&regs);
 	irq_exit(cpu, 0);
-
+	run_local_timers();
 	if (softirq_pending(cpu))
 		do_softirq();
 }
diff -urN linux-2.5.24-base/arch/i386/mm/fault.c linux-2.5.24-smptimers_X1/arch/i386/mm/fault.c
--- linux-2.5.24-base/arch/i386/mm/fault.c	Fri Jun 21 04:23:41 2002
+++ linux-2.5.24-smptimers_X1/arch/i386/mm/fault.c	Fri Jul  5 14:40:29 2002
@@ -94,16 +94,12 @@
 	goto bad_area;
 }
 
-extern spinlock_t timerlist_lock;
-
 /*
  * Unlock any spinlocks which will prevent us from getting the
- * message out (timerlist_lock is acquired through the
- * console unblank code)
+ * message out
  */
 void bust_spinlocks(int yes)
 {
-	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 #ifdef CONFIG_SMP
diff -urN linux-2.5.24-base/arch/ia64/kernel/smp.c linux-2.5.24-smptimers_X1/arch/ia64/kernel/smp.c
--- linux-2.5.24-base/arch/ia64/kernel/smp.c	Fri Jun 21 04:23:57 2002
+++ linux-2.5.24-smptimers_X1/arch/ia64/kernel/smp.c	Fri Jul  5 14:40:29 2002
@@ -32,6 +32,7 @@
 #include <linux/cache.h>
 #include <linux/delay.h>
 #include <linux/cache.h>
+#include <linux/timer.h>
 
 #include <asm/atomic.h>
 #include <asm/bitops.h>
@@ -331,6 +332,7 @@
 		local_cpu_data->prof_counter = local_cpu_data->prof_multiplier;
 		update_process_times(user);
 	}
+	run_local_timers();
 }
 
 /*
diff -urN linux-2.5.24-base/arch/ia64/kernel/traps.c linux-2.5.24-smptimers_X1/arch/ia64/kernel/traps.c
--- linux-2.5.24-base/arch/ia64/kernel/traps.c	Fri Jun 21 04:23:57 2002
+++ linux-2.5.24-smptimers_X1/arch/ia64/kernel/traps.c	Fri Jul  5 14:40:29 2002
@@ -42,7 +42,6 @@
 
 #include <asm/fpswa.h>
 
-extern spinlock_t timerlist_lock;
 
 static fpswa_interface_t *fpswa_interface;
 
@@ -56,13 +55,12 @@
 }
 
 /*
- * Unlock any spinlocks which will prevent us from getting the message out (timerlist_lock
+ * Unlock any spinlocks which will prevent us from getting the message out 
  * is acquired through the console unblank code)
  */
 void
 bust_spinlocks (int yes)
 {
-	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 #ifdef CONFIG_SMP
diff -urN linux-2.5.24-base/arch/mips64/mm/fault.c linux-2.5.24-smptimers_X1/arch/mips64/mm/fault.c
--- linux-2.5.24-base/arch/mips64/mm/fault.c	Fri Jun 21 04:23:54 2002
+++ linux-2.5.24-smptimers_X1/arch/mips64/mm/fault.c	Fri Jul  5 14:40:29 2002
@@ -58,16 +58,13 @@
 	printk("Got exception 0x%lx at 0x%lx\n", retaddr, regs.cp0_epc);
 }
 
-extern spinlock_t timerlist_lock;
 
 /*
  * Unlock any spinlocks which will prevent us from getting the
- * message out (timerlist_lock is acquired through the
- * console unblank code)
+ * message out 
  */
 void bust_spinlocks(int yes)
 {
-	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 	} else {
diff -urN linux-2.5.24-base/arch/mips64/sgi-ip27/ip27-timer.c linux-2.5.24-smptimers_X1/arch/mips64/sgi-ip27/ip27-timer.c
--- linux-2.5.24-base/arch/mips64/sgi-ip27/ip27-timer.c	Fri Jun 21 04:23:50 2002
+++ linux-2.5.24-smptimers_X1/arch/mips64/sgi-ip27/ip27-timer.c	Fri Jul  5 14:40:29 2002
@@ -11,6 +11,7 @@
 #include <linux/param.h>
 #include <linux/timex.h>
 #include <linux/mm.h>		
+#include <linux/timer.h>
 
 #include <asm/pgtable.h>
 #include <asm/sgialib.h>
@@ -123,6 +124,7 @@
 		irq_exit(cpu, 0);
 	}
 #endif /* CONFIG_SMP */
+	run_local_timers();
 	
 	/*
 	 * If we have an externally synchronized Linux clock, then update
diff -urN linux-2.5.24-base/arch/ppc/kernel/time.c linux-2.5.24-smptimers_X1/arch/ppc/kernel/time.c
--- linux-2.5.24-base/arch/ppc/kernel/time.c	Fri Jun 21 04:23:47 2002
+++ linux-2.5.24-smptimers_X1/arch/ppc/kernel/time.c	Fri Jul  5 14:40:29 2002
@@ -59,6 +59,7 @@
 #include <linux/mc146818rtc.h>
 #include <linux/time.h>
 #include <linux/init.h>
+#include <linux/timer.h>
 
 #include <asm/segment.h>
 #include <asm/io.h>
@@ -216,6 +217,7 @@
 
 	hardirq_exit(cpu);
 
+	run_local_timers();
 	if (softirq_pending(cpu))
 		do_softirq();
 
diff -urN linux-2.5.24-base/arch/ppc64/kernel/time.c linux-2.5.24-smptimers_X1/arch/ppc64/kernel/time.c
--- linux-2.5.24-base/arch/ppc64/kernel/time.c	Fri Jun 21 04:23:56 2002
+++ linux-2.5.24-smptimers_X1/arch/ppc64/kernel/time.c	Fri Jul  5 14:40:29 2002
@@ -46,6 +46,7 @@
 #include <linux/mc146818rtc.h>
 #include <linux/time.h>
 #include <linux/init.h>
+#include <linux/timer.h>
 
 #include <asm/segment.h>
 #include <asm/io.h>
@@ -294,6 +295,7 @@
 
 	irq_exit(cpu);
 
+	run_local_timers();
 	if (softirq_pending(cpu))
 		do_softirq();
 	
diff -urN linux-2.5.24-base/arch/s390/kernel/time.c linux-2.5.24-smptimers_X1/arch/s390/kernel/time.c
--- linux-2.5.24-base/arch/s390/kernel/time.c	Fri Jun 21 04:23:42 2002
+++ linux-2.5.24-smptimers_X1/arch/s390/kernel/time.c	Fri Jul  5 14:40:29 2002
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/types.h>
+#include <linux/timer.h>
 
 #include <asm/uaccess.h>
 #include <asm/delay.h>
@@ -175,6 +176,7 @@
 #endif
 
 	irq_exit(cpu, 0);
+	run_local_timers();
 }
 
 /*
diff -urN linux-2.5.24-base/arch/s390/mm/fault.c linux-2.5.24-smptimers_X1/arch/s390/mm/fault.c
--- linux-2.5.24-base/arch/s390/mm/fault.c	Fri Jun 21 04:23:56 2002
+++ linux-2.5.24-smptimers_X1/arch/s390/mm/fault.c	Fri Jul  5 14:40:29 2002
@@ -37,16 +37,13 @@
 
 extern void die(const char *,struct pt_regs *,long);
 
-extern spinlock_t timerlist_lock;
 
 /*
  * Unlock any spinlocks which will prevent us from getting the
- * message out (timerlist_lock is acquired through the
- * console unblank code)
+ * message out 
  */
 void bust_spinlocks(int yes)
 {
-	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 	} else {
diff -urN linux-2.5.24-base/arch/s390x/kernel/time.c linux-2.5.24-smptimers_X1/arch/s390x/kernel/time.c
--- linux-2.5.24-base/arch/s390x/kernel/time.c	Fri Jun 21 04:23:48 2002
+++ linux-2.5.24-smptimers_X1/arch/s390x/kernel/time.c	Fri Jul  5 14:40:29 2002
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/types.h>
+#include <linux/timer.h>
 
 #include <asm/uaccess.h>
 #include <asm/delay.h>
@@ -148,6 +149,7 @@
 #endif
 
 	irq_exit(cpu, 0);
+	run_local_timers();
 }
 
 /*
diff -urN linux-2.5.24-base/arch/s390x/mm/fault.c linux-2.5.24-smptimers_X1/arch/s390x/mm/fault.c
--- linux-2.5.24-base/arch/s390x/mm/fault.c	Fri Jun 21 04:23:49 2002
+++ linux-2.5.24-smptimers_X1/arch/s390x/mm/fault.c	Fri Jul  5 14:40:29 2002
@@ -36,16 +36,13 @@
 
 extern void die(const char *,struct pt_regs *,long);
 
-extern spinlock_t timerlist_lock;
 
 /*
  * Unlock any spinlocks which will prevent us from getting the
- * message out (timerlist_lock is acquired through the
- * console unblank code)
+ * message out 
  */
 void bust_spinlocks(int yes)
 {
-	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 	} else {
diff -urN linux-2.5.24-base/arch/sparc/kernel/irq.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/irq.c
--- linux-2.5.24-base/arch/sparc/kernel/irq.c	Fri Jun 21 04:23:43 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/irq.c	Fri Jul  5 14:40:29 2002
@@ -73,7 +73,7 @@
     prom_halt();
 }
 
-void (*init_timers)(void (*)(int, void *,struct pt_regs *)) =
+void (*sparc_init_timers)(void (*)(int, void *,struct pt_regs *)) =
     (void (*)(void (*)(int, void *,struct pt_regs *))) irq_panic;
 
 /*
diff -urN linux-2.5.24-base/arch/sparc/kernel/sun4c_irq.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4c_irq.c
--- linux-2.5.24-base/arch/sparc/kernel/sun4c_irq.c	Fri Jun 21 04:23:44 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4c_irq.c	Fri Jul  5 14:40:29 2002
@@ -143,7 +143,7 @@
 	/* Errm.. not sure how to do this.. */
 }
 
-static void __init sun4c_init_timers(void (*counter_fn)(int, void *, struct pt_regs *))
+static void __init sun4c_sparc_init_timers(void (*counter_fn)(int, void *, struct pt_regs *))
 {
 	int irq;
 
@@ -221,7 +221,7 @@
 	BTFIXUPSET_CALL(clear_profile_irq, sun4c_clear_profile_irq, BTFIXUPCALL_NOP);
 	BTFIXUPSET_CALL(load_profile_irq, sun4c_load_profile_irq, BTFIXUPCALL_NOP);
 	BTFIXUPSET_CALL(__irq_itoa, sun4m_irq_itoa, BTFIXUPCALL_NORM);
-	init_timers = sun4c_init_timers;
+	sparc_init_timers = sun4c_sparc_init_timers;
 #ifdef CONFIG_SMP
 	BTFIXUPSET_CALL(set_cpu_int, sun4c_nop, BTFIXUPCALL_NOP);
 	BTFIXUPSET_CALL(clear_cpu_int, sun4c_nop, BTFIXUPCALL_NOP);
diff -urN linux-2.5.24-base/arch/sparc/kernel/sun4d_irq.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4d_irq.c
--- linux-2.5.24-base/arch/sparc/kernel/sun4d_irq.c	Fri Jun 21 04:23:40 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4d_irq.c	Fri Jul  5 14:40:29 2002
@@ -436,7 +436,7 @@
 	bw_set_prof_limit(cpu, limit);
 }
 
-static void __init sun4d_init_timers(void (*counter_fn)(int, void *, struct pt_regs *))
+static void __init sun4d_sparc_init_timers(void (*counter_fn)(int, void *, struct pt_regs *))
 {
 	int irq;
 	extern struct prom_cpuinfo linux_cpus[NR_CPUS];
@@ -547,7 +547,7 @@
 	BTFIXUPSET_CALL(clear_profile_irq, sun4d_clear_profile_irq, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(load_profile_irq, sun4d_load_profile_irq, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(__irq_itoa, sun4d_irq_itoa, BTFIXUPCALL_NORM);
-	init_timers = sun4d_init_timers;
+	sparc_init_timers = sun4d_sparc_init_timers;
 #ifdef CONFIG_SMP
 	BTFIXUPSET_CALL(set_cpu_int, sun4d_set_cpu_int, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(clear_cpu_int, sun4d_clear_ipi, BTFIXUPCALL_NOP);
diff -urN linux-2.5.24-base/arch/sparc/kernel/sun4d_smp.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4d_smp.c
--- linux-2.5.24-base/arch/sparc/kernel/sun4d_smp.c	Fri Jun 21 04:23:45 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4d_smp.c	Fri Jul  5 14:40:29 2002
@@ -18,6 +18,7 @@
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
+#include <linux/timer.h>
 
 #include <asm/ptrace.h>
 #include <asm/atomic.h>
@@ -465,6 +466,7 @@
 
 		prof_counter[cpu] = prof_multiplier[cpu];
 	}
+	run_local_timers();
 }
 
 extern unsigned int lvl14_resolution;
diff -urN linux-2.5.24-base/arch/sparc/kernel/sun4m_irq.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4m_irq.c
--- linux-2.5.24-base/arch/sparc/kernel/sun4m_irq.c	Fri Jun 21 04:23:53 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4m_irq.c	Fri Jul  5 14:40:29 2002
@@ -223,7 +223,7 @@
 	return buff;
 }
 
-static void __init sun4m_init_timers(void (*counter_fn)(int, void *, struct pt_regs *))
+static void __init sun4m_sparc_init_timers(void (*counter_fn)(int, void *, struct pt_regs *))
 {
 	int reg_count, irq, cpu;
 	struct linux_prom_registers cnt_regs[PROMREG_MAX];
@@ -374,7 +374,7 @@
 	BTFIXUPSET_CALL(clear_profile_irq, sun4m_clear_profile_irq, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(load_profile_irq, sun4m_load_profile_irq, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(__irq_itoa, sun4m_irq_itoa, BTFIXUPCALL_NORM);
-	init_timers = sun4m_init_timers;
+	sparc_init_timers = sun4m_sparc_init_timers;
 #ifdef CONFIG_SMP
 	BTFIXUPSET_CALL(set_cpu_int, sun4m_send_ipi, BTFIXUPCALL_NORM);
 	BTFIXUPSET_CALL(clear_cpu_int, sun4m_clear_ipi, BTFIXUPCALL_NORM);
diff -urN linux-2.5.24-base/arch/sparc/kernel/sun4m_smp.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4m_smp.c
--- linux-2.5.24-base/arch/sparc/kernel/sun4m_smp.c	Fri Jun 21 04:23:49 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/sun4m_smp.c	Fri Jul  5 14:40:29 2002
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
+#include <linux/timer.h>
 
 #include <asm/ptrace.h>
 #include <asm/atomic.h>
@@ -452,6 +453,7 @@
 
 		prof_counter[cpu] = prof_multiplier[cpu];
 	}
+	run_local_timers();
 }
 
 extern unsigned int lvl14_resolution;
diff -urN linux-2.5.24-base/arch/sparc/kernel/time.c linux-2.5.24-smptimers_X1/arch/sparc/kernel/time.c
--- linux-2.5.24-base/arch/sparc/kernel/time.c	Fri Jun 21 04:23:54 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc/kernel/time.c	Fri Jul  5 14:40:29 2002
@@ -381,7 +381,7 @@
 	else
 		clock_probe();
 
-	init_timers(timer_interrupt);
+	sparc_init_timers(timer_interrupt);
 	
 #ifdef CONFIG_SUN4
 	if(idprom->id_machtype == (SM_SUN4 | SM_4_330)) {
diff -urN linux-2.5.24-base/arch/sparc64/kernel/irq.c linux-2.5.24-smptimers_X1/arch/sparc64/kernel/irq.c
--- linux-2.5.24-base/arch/sparc64/kernel/irq.c	Fri Jun 21 04:23:50 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc64/kernel/irq.c	Fri Jul  5 14:40:29 2002
@@ -1032,7 +1032,7 @@
 }
 
 /* This is gets the master TICK_INT timer going. */
-void init_timers(void (*cfunc)(int, void *, struct pt_regs *),
+void sparc_init_timers(void (*cfunc)(int, void *, struct pt_regs *),
 		 unsigned long *clock)
 {
 	unsigned long pstate;
diff -urN linux-2.5.24-base/arch/sparc64/kernel/smp.c linux-2.5.24-smptimers_X1/arch/sparc64/kernel/smp.c
--- linux-2.5.24-base/arch/sparc64/kernel/smp.c	Fri Jun 21 04:23:41 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc64/kernel/smp.c	Fri Jul  5 14:40:29 2002
@@ -1061,6 +1061,7 @@
 
 			prof_counter(cpu) = prof_multiplier(cpu);
 		}
+		run_local_timers();
 
 		/* Guarentee that the following sequences execute
 		 * uninterrupted.
diff -urN linux-2.5.24-base/arch/sparc64/kernel/time.c linux-2.5.24-smptimers_X1/arch/sparc64/kernel/time.c
--- linux-2.5.24-base/arch/sparc64/kernel/time.c	Fri Jun 21 04:23:53 2002
+++ linux-2.5.24-smptimers_X1/arch/sparc64/kernel/time.c	Fri Jul  5 14:40:29 2002
@@ -613,7 +613,7 @@
 	__restore_flags(flags);
 }
 
-extern void init_timers(void (*func)(int, void *, struct pt_regs *),
+extern void sparc_init_timers(void (*func)(int, void *, struct pt_regs *),
 			unsigned long *);
 
 void __init time_init(void)
@@ -624,7 +624,7 @@
 	 */
 	unsigned long clock;
 
-	init_timers(timer_interrupt, &clock);
+	sparc_init_timers(timer_interrupt, &clock);
 	timer_ticks_per_usec_quotient = ((1UL<<32) / (clock / 1000020));
 }
 
diff -urN linux-2.5.24-base/arch/x86_64/kernel/apic.c linux-2.5.24-smptimers_X1/arch/x86_64/kernel/apic.c
--- linux-2.5.24-base/arch/x86_64/kernel/apic.c	Fri Jun 21 04:23:45 2002
+++ linux-2.5.24-smptimers_X1/arch/x86_64/kernel/apic.c	Fri Jul  5 14:40:29 2002
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/mc146818rtc.h>
 #include <linux/kernel_stat.h>
+#include <linux/timer.h>
 
 #include <asm/atomic.h>
 #include <asm/smp.h>
@@ -1069,6 +1070,7 @@
 	smp_local_timer_interrupt(regs);
 	irq_exit(cpu, 0);
 
+	run_local_timers();
 	if (softirq_pending(cpu))
 		do_softirq();
 }
diff -urN linux-2.5.24-base/arch/x86_64/mm/fault.c linux-2.5.24-smptimers_X1/arch/x86_64/mm/fault.c
--- linux-2.5.24-base/arch/x86_64/mm/fault.c	Fri Jun 21 04:23:42 2002
+++ linux-2.5.24-smptimers_X1/arch/x86_64/mm/fault.c	Fri Jul  5 14:40:29 2002
@@ -32,11 +32,10 @@
 
 extern void die(const char *,struct pt_regs *,long);
 
-extern spinlock_t console_lock, timerlist_lock;
+extern spinlock_t console_lock;
 
 void bust_spinlocks(int yes)
 {
- 	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 #ifdef CONFIG_SMP
diff -urN linux-2.5.24-base/drivers/net/eepro100.c linux-2.5.24-smptimers_X1/drivers/net/eepro100.c
--- linux-2.5.24-base/drivers/net/eepro100.c	Fri Jun 21 04:23:54 2002
+++ linux-2.5.24-smptimers_X1/drivers/net/eepro100.c	Fri Jul  5 14:40:29 2002
@@ -1169,9 +1169,6 @@
 	/* We must continue to monitor the media. */
 	sp->timer.expires = RUN_AT(2*HZ); 			/* 2.0 sec. */
 	add_timer(&sp->timer);
-#if defined(timer_exit)
-	timer_exit(&sp->timer);
-#endif
 }
 
 static void speedo_show_state(struct net_device *dev)
diff -urN linux-2.5.24-base/include/asm-sparc/irq.h linux-2.5.24-smptimers_X1/include/asm-sparc/irq.h
--- linux-2.5.24-base/include/asm-sparc/irq.h	Fri Jun 21 04:23:42 2002
+++ linux-2.5.24-smptimers_X1/include/asm-sparc/irq.h	Fri Jul  5 14:40:29 2002
@@ -45,7 +45,7 @@
 #define clear_profile_irq(cpu) BTFIXUP_CALL(clear_profile_irq)(cpu)
 #define load_profile_irq(cpu,limit) BTFIXUP_CALL(load_profile_irq)(cpu,limit)
 
-extern void (*init_timers)(void (*lvl10_irq)(int, void *, struct pt_regs *));
+extern void (*sparc_init_timers)(void (*lvl10_irq)(int, void *, struct pt_regs *));
 extern void claim_ticker14(void (*irq_handler)(int, void *, struct pt_regs *),
 			   int irq,
 			   unsigned int timeout);
diff -urN linux-2.5.24-base/include/asm-sparc64/irq.h linux-2.5.24-smptimers_X1/include/asm-sparc64/irq.h
--- linux-2.5.24-base/include/asm-sparc64/irq.h	Fri Jun 21 04:23:53 2002
+++ linux-2.5.24-smptimers_X1/include/asm-sparc64/irq.h	Fri Jul  5 14:40:29 2002
@@ -114,7 +114,7 @@
 extern void disable_irq(unsigned int);
 #define disable_irq_nosync disable_irq
 extern void enable_irq(unsigned int);
-extern void init_timers(void (*lvl10_irq)(int, void *, struct pt_regs *),
+extern void sparc_init_timers(void (*lvl10_irq)(int, void *, struct pt_regs *),
 			unsigned long *);
 extern unsigned int build_irq(int pil, int inofixup, unsigned long iclr, unsigned long imap);
 extern unsigned int sbus_build_irq(void *sbus, unsigned int ino);
diff -urN linux-2.5.24-base/include/linux/interrupt.h linux-2.5.24-smptimers_X1/include/linux/interrupt.h
--- linux-2.5.24-base/include/linux/interrupt.h	Fri Jun 21 04:23:41 2002
+++ linux-2.5.24-smptimers_X1/include/linux/interrupt.h	Fri Jul  5 14:40:29 2002
@@ -214,6 +214,7 @@
 
 /* It is exported _ONLY_ for wait_on_irq(). */
 extern spinlock_t global_bh_lock;
+extern spinlock_t net_bh_lock;
 
 static inline void mark_bh(int nr)
 {
diff -urN linux-2.5.24-base/include/linux/timer.h linux-2.5.24-smptimers_X1/include/linux/timer.h
--- linux-2.5.24-base/include/linux/timer.h	Fri Jun 21 04:23:46 2002
+++ linux-2.5.24-smptimers_X1/include/linux/timer.h	Fri Jul  5 14:46:52 2002
@@ -2,7 +2,47 @@
 #define _LINUX_TIMER_H
 
 #include <linux/config.h>
+#include <linux/smp.h>
 #include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/cache.h>
+
+/*
+ * Event timer code
+ */
+#define TVN_BITS 6
+#define TVR_BITS 8
+#define TVN_SIZE (1 << TVN_BITS)
+#define TVR_SIZE (1 << TVR_BITS)
+#define TVN_MASK (TVN_SIZE - 1)
+#define TVR_MASK (TVR_SIZE - 1)
+
+typedef struct tvec_s {
+	int index;
+	struct list_head vec[TVN_SIZE];
+} tvec_t;
+
+typedef struct tvec_root_s {
+	int index;
+	struct list_head vec[TVR_SIZE];
+} tvec_root_t;
+
+#define NOOF_TVECS 5
+
+typedef struct timer_list timer_t;
+
+struct tvec_t_base_s {
+	spinlock_t lock;
+	unsigned long timer_jiffies;
+	volatile timer_t * volatile running_timer;
+	tvec_root_t tv1;
+	tvec_t tv2;
+	tvec_t tv3;
+	tvec_t tv4;
+	tvec_t tv5;
+} ____cacheline_aligned_in_smp;
+
+typedef struct tvec_t_base_s tvec_base_t;
 
 /*
  * In Linux 2.4, static timers have been removed from the kernel.
@@ -18,17 +58,26 @@
 	unsigned long expires;
 	unsigned long data;
 	void (*function)(unsigned long);
+	tvec_base_t *base;
 };
 
-extern void add_timer(struct timer_list * timer);
-extern int del_timer(struct timer_list * timer);
-
+extern void add_timer(timer_t * timer);
+extern int del_timer(timer_t * timer);
+  
 #ifdef CONFIG_SMP
-extern int del_timer_sync(struct timer_list * timer);
+extern int del_timer_sync(timer_t * timer);
+extern void sync_timers(void);
+#define timer_enter(base, t) do { base->running_timer = t; mb(); } while (0)
+#define timer_exit(base) do { base->running_timer = NULL; } while (0)
+#define timer_is_running(base,t) (base->running_timer == t)
+#define timer_synchronize(base,t) while (timer_is_running(base,t)) barrier()
 #else
 #define del_timer_sync(t)	del_timer(t)
+#define sync_timers()		do { } while (0)
+#define timer_enter(base,t)          do { } while (0)
+#define timer_exit(base)            do { } while (0)
 #endif
-
+  
 /*
  * mod_timer is a more efficient way to update the expire field of an
  * active timer (if the timer is inactive it will be activated)
@@ -36,17 +85,38 @@
  * If the timer is known to be not pending (ie, in the handler), mod_timer
  * is less efficient than a->expires = b; add_timer(a).
  */
-int mod_timer(struct timer_list *timer, unsigned long expires);
+int mod_timer(timer_t *timer, unsigned long expires);
 
 extern void it_real_fn(unsigned long);
 
-static inline void init_timer(struct timer_list * timer)
+extern void init_timers(void);
+
+#ifdef CONFIG_SMP
+extern void run_local_timers(void);
+#else
+#define run_local_timers() do {} while(0)
+#endif
+
+extern tvec_base_t tvec_bases[NR_CPUS];
+
+static inline void init_timer(timer_t * timer)
 {
 	timer->list.next = timer->list.prev = NULL;
+	timer->base = tvec_bases + 0;
 }
 
-static inline int timer_pending (const struct timer_list * timer)
+#define TIMER_DEBUG 0
+#if TIMER_DEBUG
+# define CHECK_BASE(base) \
+	if (base && ((base < tvec_bases) || (base >= tvec_bases + NR_CPUS))) \
+		BUG()
+#else
+# define CHECK_BASE(base)
+#endif
+
+static inline int timer_pending(const timer_t * timer)
 {
+	CHECK_BASE(timer->base);
 	return timer->list.next != NULL;
 }
 
diff -urN linux-2.5.24-base/kernel/ksyms.c linux-2.5.24-smptimers_X1/kernel/ksyms.c
--- linux-2.5.24-base/kernel/ksyms.c	Fri Jun 21 04:23:42 2002
+++ linux-2.5.24-smptimers_X1/kernel/ksyms.c	Fri Jul  5 14:40:29 2002
@@ -409,6 +409,7 @@
 EXPORT_SYMBOL(del_timer_sync);
 #endif
 EXPORT_SYMBOL(mod_timer);
+EXPORT_SYMBOL(tvec_bases);
 EXPORT_SYMBOL(tq_timer);
 EXPORT_SYMBOL(tq_immediate);
 
diff -urN linux-2.5.24-base/kernel/sched.c linux-2.5.24-smptimers_X1/kernel/sched.c
--- linux-2.5.24-base/kernel/sched.c	Fri Jun 21 04:23:47 2002
+++ linux-2.5.24-smptimers_X1/kernel/sched.c	Fri Jul  5 14:40:29 2002
@@ -1651,7 +1651,7 @@
 	idle->thread_info->preempt_count = (idle->lock_depth >= 0);
 }
 
-extern void init_timervecs(void);
+extern void init_timers(void);
 extern void timer_bh(void);
 extern void tqueue_bh(void);
 extern void immediate_bh(void);
@@ -1689,8 +1689,7 @@
 	rq->idle = current;
 	wake_up_process(current);
 
-	init_timervecs();
-	init_bh(TIMER_BH, timer_bh);
+	init_timers();
 	init_bh(TQUEUE_BH, tqueue_bh);
 	init_bh(IMMEDIATE_BH, immediate_bh);
 
diff -urN linux-2.5.24-base/kernel/softirq.c linux-2.5.24-smptimers_X1/kernel/softirq.c
--- linux-2.5.24-base/kernel/softirq.c	Fri Jun 21 04:23:47 2002
+++ linux-2.5.24-smptimers_X1/kernel/softirq.c	Fri Jul  5 14:40:29 2002
@@ -44,6 +44,12 @@
 irq_cpustat_t irq_stat[NR_CPUS];
 
 static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp;
+/*
+ * This is required only to support single threaded network protocol
+ * code. (see net/core/dev.c) This should GO AWAY when those protocols
+ * are fixed.
+ */
+spinlock_t net_bh_lock = SPIN_LOCK_UNLOCKED;
 
 /*
  * we cannot loop indefinitely here to avoid userspace starvation,
diff -urN linux-2.5.24-base/kernel/timer.c linux-2.5.24-smptimers_X1/kernel/timer.c
--- linux-2.5.24-base/kernel/timer.c	Fri Jun 21 04:23:48 2002
+++ linux-2.5.24-smptimers_X1/kernel/timer.c	Fri Jul  5 15:47:15 2002
@@ -14,9 +14,13 @@
  *                              Copyright (C) 1998  Andrea Arcangeli
  *  1999-03-10  Improved NTP compatibility by Ulrich Windl
  *  2002-05-31	Move sys_sysinfo here and make its locking sane, Robert Love
+ *  2000-10-05  Implemented scalable SMP per-CPU timer handling.
+ *                              Copyright (C) 2000  Ingo Molnar
+ *              Designed by David S. Miller, Alexey Kuznetsov and Ingo Molnar
  */
 
 #include <linux/config.h>
+#include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/timex.h>
 #include <linux/delay.h>
@@ -24,6 +28,7 @@
 #include <linux/interrupt.h>
 #include <linux/tqueue.h>
 #include <linux/kernel_stat.h>
+#include <linux/percpu.h>
 
 #include <asm/uaccess.h>
 
@@ -80,83 +85,52 @@
 unsigned long prof_len;
 unsigned long prof_shift;
 
-/*
- * Event timer code
- */
-#define TVN_BITS 6
-#define TVR_BITS 8
-#define TVN_SIZE (1 << TVN_BITS)
-#define TVR_SIZE (1 << TVR_BITS)
-#define TVN_MASK (TVN_SIZE - 1)
-#define TVR_MASK (TVR_SIZE - 1)
-
-struct timer_vec {
-	int index;
-	struct list_head vec[TVN_SIZE];
-};
-
-struct timer_vec_root {
-	int index;
-	struct list_head vec[TVR_SIZE];
-};
-
-static struct timer_vec tv5;
-static struct timer_vec tv4;
-static struct timer_vec tv3;
-static struct timer_vec tv2;
-static struct timer_vec_root tv1;
-
-static struct timer_vec * const tvecs[] = {
-	(struct timer_vec *)&tv1, &tv2, &tv3, &tv4, &tv5
-};
+tvec_base_t tvec_bases[NR_CPUS] __cacheline_aligned;
 
-#define NOOF_TVECS (sizeof(tvecs) / sizeof(tvecs[0]))
+static struct tasklet_struct timer_tasklet __per_cpu_data;
 
-void init_timervecs (void)
-{
-	int i;
-
-	for (i = 0; i < TVN_SIZE; i++) {
-		INIT_LIST_HEAD(tv5.vec + i);
-		INIT_LIST_HEAD(tv4.vec + i);
-		INIT_LIST_HEAD(tv3.vec + i);
-		INIT_LIST_HEAD(tv2.vec + i);
-	}
-	for (i = 0; i < TVR_SIZE; i++)
-		INIT_LIST_HEAD(tv1.vec + i);
-}
+/* jiffies at the most recent update of wall time */
+unsigned long wall_jiffies;
 
-static unsigned long timer_jiffies;
+/*
+ * This spinlock protect us from races in SMP while playing with xtime. -arca
+ */
+rwlock_t xtime_lock = RW_LOCK_UNLOCKED;
+unsigned long last_time_offset;
 
-static inline void internal_add_timer(struct timer_list *timer)
+/*
+ * This is the 'global' timer BH. This gets called only if one of
+ * the local timer interrupts couldnt run timers.
+ */
+static inline void internal_add_timer(tvec_base_t *base, timer_t *timer)
 {
 	/*
 	 * must be cli-ed when calling this
 	 */
 	unsigned long expires = timer->expires;
-	unsigned long idx = expires - timer_jiffies;
+	unsigned long idx = expires - base->timer_jiffies;
 	struct list_head * vec;
 
 	if (idx < TVR_SIZE) {
 		int i = expires & TVR_MASK;
-		vec = tv1.vec + i;
+		vec = base->tv1.vec + i;
 	} else if (idx < 1 << (TVR_BITS + TVN_BITS)) {
 		int i = (expires >> TVR_BITS) & TVN_MASK;
-		vec = tv2.vec + i;
+		vec = base->tv2.vec + i;
 	} else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
 		int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
-		vec =  tv3.vec + i;
+		vec = base->tv3.vec + i;
 	} else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
 		int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
-		vec = tv4.vec + i;
+		vec = base->tv4.vec + i;
 	} else if ((signed long) idx < 0) {
 		/* can happen if you add a timer with expires == jiffies,
 		 * or you set a timer to go off in the past
 		 */
-		vec = tv1.vec + tv1.index;
+		vec = base->tv1.vec + base->tv1.index;
 	} else if (idx <= 0xffffffffUL) {
 		int i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
-		vec = tv5.vec + i;
+		vec = base->tv5.vec + i;
 	} else {
 		/* Can only get here on architectures with 64-bit jiffies */
 		INIT_LIST_HEAD(&timer->list);
@@ -168,37 +142,27 @@
 	list_add(&timer->list, vec->prev);
 }
 
-/* Initialize both explicitly - let's try to have them in the same cache line */
-spinlock_t timerlist_lock = SPIN_LOCK_UNLOCKED;
-
-#ifdef CONFIG_SMP
-volatile struct timer_list * volatile running_timer;
-#define timer_enter(t) do { running_timer = t; mb(); } while (0)
-#define timer_exit() do { running_timer = NULL; } while (0)
-#define timer_is_running(t) (running_timer == t)
-#define timer_synchronize(t) while (timer_is_running(t)) barrier()
-#else
-#define timer_enter(t)		do { } while (0)
-#define timer_exit()		do { } while (0)
-#endif
-
-void add_timer(struct timer_list *timer)
+void add_timer(timer_t *timer)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&timerlist_lock, flags);
-	if (unlikely(timer_pending(timer)))
-		goto bug;
-	internal_add_timer(timer);
-	spin_unlock_irqrestore(&timerlist_lock, flags);
-	return;
+	tvec_base_t * base = tvec_bases + smp_processor_id();
+  	unsigned long flags;
+  
+	CHECK_BASE(base);
+	CHECK_BASE(timer->base);
+	spin_lock_irqsave(&base->lock, flags);
+  	if (unlikely(timer_pending(timer)))
+  		goto bug;
+	internal_add_timer(base, timer);
+	timer->base = base;
+	spin_unlock_irqrestore(&base->lock, flags);
+  	return;
 bug:
-	spin_unlock_irqrestore(&timerlist_lock, flags);
-	printk("bug: kernel timer added twice at %p.\n",
-			__builtin_return_address(0));
+	spin_unlock_irqrestore(&base->lock, flags);
+  	printk("bug: kernel timer added twice at %p.\n",
+  			__builtin_return_address(0));
 }
-
-static inline int detach_timer (struct timer_list *timer)
+  
+static inline int detach_timer(timer_t *timer)
 {
 	if (!timer_pending(timer))
 		return 0;
@@ -206,28 +170,81 @@
 	return 1;
 }
 
-int mod_timer(struct timer_list *timer, unsigned long expires)
+/*
+ * mod_timer() has subtle locking semantics because parallel
+ * calls to it must happen serialized.
+ */
+int mod_timer(timer_t *timer, unsigned long expires)
 {
-	int ret;
+	tvec_base_t *old_base, *new_base;
 	unsigned long flags;
+	int ret;
+
+	new_base = tvec_bases + smp_processor_id();
+	CHECK_BASE(new_base);
+
+	__save_flags(flags);
+	__cli();
+repeat:
+	old_base = timer->base;
+	CHECK_BASE(old_base);
+
+	/*
+	 * Prevent deadlocks via ordering by old_base < new_base.
+	 */
+	if (old_base && (new_base != old_base)) {
+		if (old_base < new_base) {
+			spin_lock(&new_base->lock);
+			spin_lock(&old_base->lock);
+		} else {
+			spin_lock(&old_base->lock);
+			spin_lock(&new_base->lock);
+		}
+		/*
+		 * Subtle, we rely on timer->base being always
+		 * valid and being updated atomically.
+		 */
+		if (timer->base != old_base) {
+			spin_unlock(&new_base->lock);
+			spin_unlock(&old_base->lock);
+			goto repeat;
+		}
+	} else
+		spin_lock(&new_base->lock);
 
-	spin_lock_irqsave(&timerlist_lock, flags);
 	timer->expires = expires;
 	ret = detach_timer(timer);
-	internal_add_timer(timer);
-	spin_unlock_irqrestore(&timerlist_lock, flags);
+	internal_add_timer(new_base, timer);
+	timer->base = new_base;
+
+
+	if (old_base && (new_base != old_base))
+		spin_unlock(&old_base->lock);
+	spin_unlock_irqrestore(&new_base->lock, flags);
+
 	return ret;
 }
 
-int del_timer(struct timer_list * timer)
+int del_timer(timer_t * timer)
 {
-	int ret;
 	unsigned long flags;
+	tvec_base_t * base;
+	int ret;
 
-	spin_lock_irqsave(&timerlist_lock, flags);
+	CHECK_BASE(timer->base);
+	if (!timer->base)
+		return 0;
+repeat:
+ 	base = timer->base;
+	spin_lock_irqsave(&base->lock, flags);
+	if (base != timer->base) {
+		spin_unlock_irqrestore(&base->lock, flags);
+		goto repeat;
+	}
 	ret = detach_timer(timer);
 	timer->list.next = timer->list.prev = NULL;
-	spin_unlock_irqrestore(&timerlist_lock, flags);
+	spin_unlock_irqrestore(&base->lock, flags);
+
 	return ret;
 }
 
@@ -240,24 +257,34 @@
  * (for reference counting).
  */
 
-int del_timer_sync(struct timer_list * timer)
+int del_timer_sync(timer_t * timer)
 {
+	tvec_base_t * base;
 	int ret = 0;
 
+	CHECK_BASE(timer->base);
+	if (!timer->base)
+		return 0;
 	for (;;) {
 		unsigned long flags;
 		int running;
 
-		spin_lock_irqsave(&timerlist_lock, flags);
+repeat:
+	 	base = timer->base;
+		spin_lock_irqsave(&base->lock, flags);
+		if (base != timer->base) {
+			spin_unlock_irqrestore(&base->lock, flags);
+			goto repeat;
+		}
 		ret += detach_timer(timer);
 		timer->list.next = timer->list.prev = 0;
-		running = timer_is_running(timer);
-		spin_unlock_irqrestore(&timerlist_lock, flags);
+		running = timer_is_running(base, timer);
+		spin_unlock_irqrestore(&base->lock, flags);
 
 		if (!running)
 			break;
 
-		timer_synchronize(timer);
+		timer_synchronize(base, timer);
 	}
 
 	return ret;
@@ -265,7 +292,7 @@
 #endif
 
 
-static inline void cascade_timers(struct timer_vec *tv)
+static void cascade(tvec_base_t *base, tvec_t *tv)
 {
 	/* cascade all the timers from tv up one level */
 	struct list_head *head, *curr, *next;
@@ -277,54 +304,68 @@
 	 * detach them individually, just clear the list afterwards.
 	 */
 	while (curr != head) {
-		struct timer_list *tmp;
+		timer_t *tmp;
 
-		tmp = list_entry(curr, struct timer_list, list);
+		tmp = list_entry(curr, timer_t, list);
+		CHECK_BASE(tmp->base);
+		if (tmp->base != base)
+			BUG();
 		next = curr->next;
 		list_del(curr); // not needed
-		internal_add_timer(tmp);
+		internal_add_timer(base, tmp);
 		curr = next;
 	}
 	INIT_LIST_HEAD(head);
 	tv->index = (tv->index + 1) & TVN_MASK;
 }
 
-static inline void run_timer_list(void)
+static void __run_timers(tvec_base_t *base)
 {
-	spin_lock_irq(&timerlist_lock);
-	while ((long)(jiffies - timer_jiffies) >= 0) {
+	unsigned long flags;
+
+	spin_lock_irqsave(&base->lock, flags);
+	while ((long)(jiffies - base->timer_jiffies) >= 0) {
 		struct list_head *head, *curr;
-		if (!tv1.index) {
-			int n = 1;
-			do {
-				cascade_timers(tvecs[n]);
-			} while (tvecs[n]->index == 1 && ++n < NOOF_TVECS);
+
+		/*
+		 * Cascade timers:
+		 */
+		if (!base->tv1.index) {
+			cascade(base, &base->tv2);
+			if (base->tv2.index == 1) {
+				cascade(base, &base->tv3);
+				if (base->tv3.index == 1) {
+					cascade(base, &base->tv4);
+					if (base->tv4.index == 1)
+						cascade(base, &base->tv5);
+				}
+			}
 		}
 repeat:
-		head = tv1.vec + tv1.index;
+		head = base->tv1.vec + base->tv1.index;
 		curr = head->next;
 		if (curr != head) {
-			struct timer_list *timer;
 			void (*fn)(unsigned long);
 			unsigned long data;
+			timer_t *timer;
 
-			timer = list_entry(curr, struct timer_list, list);
+			timer = list_entry(curr, timer_t, list);
  			fn = timer->function;
- 			data= timer->data;
+ 			data = timer->data;
 
 			detach_timer(timer);
 			timer->list.next = timer->list.prev = NULL;
-			timer_enter(timer);
-			spin_unlock_irq(&timerlist_lock);
+			timer_enter(base, timer);
+			spin_unlock_irq(&base->lock);
 			fn(data);
-			spin_lock_irq(&timerlist_lock);
-			timer_exit();
+			spin_lock_irq(&base->lock);
+			timer_exit(base);
 			goto repeat;
 		}
-		++timer_jiffies; 
-		tv1.index = (tv1.index + 1) & TVR_MASK;
+		++base->timer_jiffies; 
+		base->tv1.index = (base->tv1.index + 1) & TVR_MASK;
 	}
-	spin_unlock_irq(&timerlist_lock);
+	spin_unlock_irqrestore(&base->lock, flags);
 }
 
 spinlock_t tqueue_lock = SPIN_LOCK_UNLOCKED;
@@ -626,27 +667,66 @@
 	}
 }
 
-/* jiffies at the most recent update of wall time */
-unsigned long wall_jiffies;
+#ifdef CONFIG_SMP
+
+static void run_timer_tasklet(unsigned long data)
+{
+        int cpu = smp_processor_id();
+	tvec_base_t *base = tvec_bases + cpu;
+
+        if (!spin_trylock(&global_bh_lock))
+                goto resched;
 
+        if (!spin_trylock(&net_bh_lock))
+                goto resched_unlock;
+
+	if (!hardirq_trylock(cpu))
+		goto resched_unlock_net;
+
+	if ((long)(jiffies - base->timer_jiffies) >= 0)
+		__run_timers(base);
+
+	hardirq_endlock(cpu);
+        spin_unlock(&net_bh_lock);
+        spin_unlock(&global_bh_lock);
+        return;
+resched_unlock_net:
+        spin_unlock(&net_bh_lock);
+resched_unlock:
+        spin_unlock(&global_bh_lock);
+resched:
+	tasklet_hi_schedule(&per_cpu(timer_tasklet, cpu));
+}
+  
 /*
- * This read-write spinlock protects us from races in SMP while
- * playing with xtime and avenrun.
+ * Called by the local, per-CPU timer interrupt on SMP.
+ *
+ * This function has to do all sorts of locking to make legacy
+ * cli()-users and BH-disablers work. If locking doesnt succeed
+ * now then we fall back to TIMER_BH.
  */
-rwlock_t xtime_lock = RW_LOCK_UNLOCKED;
-unsigned long last_time_offset;
-
-static inline void update_times(void)
+void run_local_timers(void)
+{
+        int cpu = smp_processor_id();
+	tasklet_hi_schedule(&per_cpu(timer_tasklet, cpu));
+}
+#else
+static void run_timer_tasklet(unsigned long data)
+{
+	tvec_base_t *base = tvec_bases + smp_processor_id();
+	if ((long)(jiffies - base->timer_jiffies) >= 0)
+		__run_timers(base);
+}
+#endif
+  
+/*
+ * Called by the timer interrupt. xtime_lock must already be taken
+ * by the timer IRQ!
+ */
+static void update_times(void)
 {
 	unsigned long ticks;
-
-	/*
-	 * update_times() is run from the raw timer_bh handler so we
-	 * just know that the irqs are locally enabled and so we don't
-	 * need to save/restore the flags of the local CPU here. -arca
-	 */
-	write_lock_irq(&xtime_lock);
-
+ 
 	ticks = jiffies - wall_jiffies;
 	if (ticks) {
 		wall_jiffies += ticks;
@@ -654,15 +734,8 @@
 	}
 	last_time_offset = 0;
 	calc_load(ticks);
-	write_unlock_irq(&xtime_lock);
-}
-
-void timer_bh(void)
-{
-	update_times();
-	run_timer_list();
 }
-
+  
 void do_timer(struct pt_regs *regs)
 {
 	jiffies_64++;
@@ -670,8 +743,9 @@
 	/* SMP process accounting uses the local APIC timer */
 
 	update_process_times(user_mode(regs));
+	tasklet_hi_schedule(&this_cpu(timer_tasklet));
 #endif
-	mark_bh(TIMER_BH);
+	update_times();
 	if (TQ_ACTIVE(tq_timer))
 		mark_bh(TQUEUE_BH);
 }
@@ -986,3 +1060,23 @@
 
 	return 0;
 }
+
+void __init init_timers(void)
+{
+	int i, j;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		tvec_base_t *base = tvec_bases + i;
+
+		spin_lock_init(&base->lock);
+		for (j = 0; j < TVN_SIZE; j++) {
+			INIT_LIST_HEAD(base->tv5.vec + j);
+			INIT_LIST_HEAD(base->tv4.vec + j);
+			INIT_LIST_HEAD(base->tv3.vec + j);
+			INIT_LIST_HEAD(base->tv2.vec + j);
+		}
+		for (j = 0; j < TVR_SIZE; j++)
+			INIT_LIST_HEAD(base->tv1.vec + j);
+		tasklet_init(&per_cpu(timer_tasklet, i), run_timer_tasklet, 0);
+	}
+}
diff -urN linux-2.5.24-base/lib/bust_spinlocks.c linux-2.5.24-smptimers_X1/lib/bust_spinlocks.c
--- linux-2.5.24-base/lib/bust_spinlocks.c	Fri Jun 21 04:23:47 2002
+++ linux-2.5.24-smptimers_X1/lib/bust_spinlocks.c	Fri Jul  5 14:40:29 2002
@@ -14,11 +14,9 @@
 #include <linux/wait.h>
 #include <linux/vt_kern.h>
 
-extern spinlock_t timerlist_lock;
 
 void bust_spinlocks(int yes)
 {
-	spin_lock_init(&timerlist_lock);
 	if (yes) {
 		oops_in_progress = 1;
 	} else {
diff -urN linux-2.5.24-base/net/core/dev.c linux-2.5.24-smptimers_X1/net/core/dev.c
--- linux-2.5.24-base/net/core/dev.c	Fri Jun 21 04:23:53 2002
+++ linux-2.5.24-smptimers_X1/net/core/dev.c	Fri Jul  5 14:40:30 2002
@@ -1293,7 +1293,6 @@
 static int deliver_to_old_ones(struct packet_type *pt,
 			       struct sk_buff *skb, int last)
 {
-	static spinlock_t net_bh_lock = SPIN_LOCK_UNLOCKED;
 	int ret = NET_RX_DROP;
 
 	if (!last) {
@@ -1310,13 +1309,7 @@
 
 	/* Emulate NET_BH with special spinlock */
 	spin_lock(&net_bh_lock);
-
-	/* Disable timers and wait for all timers completion */
-	tasklet_disable(bh_task_vec+TIMER_BH);
-
 	ret = pt->func(skb, skb->dev, pt);
-
-	tasklet_hi_enable(bh_task_vec+TIMER_BH);
 	spin_unlock(&net_bh_lock);
 out:
 	return ret;
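
For readers following the internal_add_timer() hunk above, here is a minimal userspace sketch of how the wheel picks a bucket for a given expiry. It is not part of the patch: struct wheel_slot and wheel_slot_for() are hypothetical names invented for illustration, and level 0 is an assumed marker for the 64-bit case where the kernel code does not queue the timer at all. The constants and the branch order are copied from the patch.

```c
#include <assert.h>

/* Same constants as the patch's include/linux/timer.h. */
#define TVN_BITS 6
#define TVR_BITS 8
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

/* Hypothetical result type, for illustration only. */
struct wheel_slot {
	int level;	/* 1..5 for tv1..tv5, 0 if the timer is not queued */
	int index;	/* bucket within that vector */
};

/*
 * Mirror of the branch order in internal_add_timer().
 * tv1_index stands in for base->tv1.index.
 */
static struct wheel_slot wheel_slot_for(unsigned long expires,
					unsigned long timer_jiffies,
					int tv1_index)
{
	unsigned long idx = expires - timer_jiffies;
	struct wheel_slot s = { 0, 0 };

	assert(tv1_index >= 0 && tv1_index < TVR_SIZE);

	if (idx < TVR_SIZE) {
		s.level = 1;
		s.index = (int)(expires & TVR_MASK);
	} else if (idx < 1UL << (TVR_BITS + TVN_BITS)) {
		s.level = 2;
		s.index = (int)((expires >> TVR_BITS) & TVN_MASK);
	} else if (idx < 1UL << (TVR_BITS + 2 * TVN_BITS)) {
		s.level = 3;
		s.index = (int)((expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK);
	} else if (idx < 1UL << (TVR_BITS + 3 * TVN_BITS)) {
		s.level = 4;
		s.index = (int)((expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK);
	} else if ((long)idx < 0) {
		/* Expiry already in the past: use the bucket being processed. */
		s.level = 1;
		s.index = tv1_index;
	} else if (idx <= 0xffffffffUL) {
		s.level = 5;
		s.index = (int)((expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK);
	}
	/* Otherwise (only possible with 64-bit jiffies): left at level 0. */
	return s;
}
```

With timer_jiffies at 0, an expiry of 100 lands in tv1, 1000 in tv2 and 20000 in tv3; an expiry that is already behind timer_jiffies goes into the current tv1 bucket so that it fires on the next pass.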


* Re: [RFC] BH removal text
  2002-07-14  4:52   ` Dipankar Sarma
@ 2002-07-14 10:17     ` William Lee Irwin III
  2002-07-15  9:25       ` Dipankar Sarma
  2002-07-17 23:57     ` William Lee Irwin III
  2002-07-18  8:22     ` William Lee Irwin III
  2 siblings, 1 reply; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-14 10:17 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Matthew Wilcox, Janitors, linux-kernel

On Mon, Jul 01, 2002 at 05:05:55AM +0100, Matthew Wilcox wrote:
>>> That doesn't mean that we shouldn't worry about the 38 files which use
>>> tq_timer, but they are almost all tty related and are therefore Hard ;-)

On Sat, Jul 13, 2002 at 06:05:06PM -0700, William Lee Irwin III wrote:
>> __global_cli(), timer_bh(), and bh_action() are crippling my machines.
>> Where do I start?

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> Even if you replace timer_bh() with a tasklet, you still need
> to take the global_bh_lock to ensure that timers don't race with
> single-threaded BH processing in drivers. I wrote this patch [included]
> to get rid of timer_bh in Ingo's smptimers, but it acquires
> global_bh_lock as well as net_bh_lock, the latter to ensure
> that some older protocol code that expected serialization of
> NET_BH and timers works correctly (see deliver_to_old_ones()).
> They need to be cleaned up too.

This is great stuff. I'll definitely try it out in an hour or two. I'd
be interested in helping with the cleanup of the code that assumes BHs
still exist, but might need a wee bit of hand-holding to get
through it. I'll go around flagging people down who might be able to
help me with it as I go.

I actually suspect tty-related things are a likely culprit as
significant use of the serial console occurs.


On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> My patch of course was experimental to see what is needed to
> get rid of timer_bh. It needs some cleanup itself ;-)


I can at least try it out. The BH stuff is legacy, so killing it
entirely at some point makes sense. I did volunteer to help with
this at OLS, so I'll be delivering code at some point.


Cheers,
Bill


* Re: [RFC] BH removal text
  2002-07-14 10:17     ` William Lee Irwin III
@ 2002-07-15  9:25       ` Dipankar Sarma
  2002-07-15 10:17         ` William Lee Irwin III
  0 siblings, 1 reply; 13+ messages in thread
From: Dipankar Sarma @ 2002-07-15  9:25 UTC (permalink / raw)
  To: William Lee Irwin III, Matthew Wilcox, Janitors, linux-kernel

On Sun, Jul 14, 2002 at 03:17:30AM -0700, William Lee Irwin III wrote:
> On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> > Even if you replace timer_bh() with a tasklet, you still need
> > to take the global_bh_lock to ensure that timers don't race with
> > single-threaded BH processing in drivers. I wrote this patch [included]
> > to get rid of timer_bh in Ingo's smptimers, but it acquires
> > global_bh_lock as well as net_bh_lock, the latter to ensure
> > that some older protocol code that expected serialization of
> > NET_BH and timers works correctly (see deliver_to_old_ones()).
> > They need to be cleaned up too.
> 
> This is great stuff. I'll definitely try it out in an hour or two. I'd
> be interested in helping with the cleanup of the code that assumes BHs
> still exist, but might need a wee bit of hand-holding to get
> through it. I'll go around flagging people down who might be able to
> help me with it as I go.

I did a quick and dirty search on packet_type.data == NULL protocols.
Here is a list -

802/psnap.c
appletalk/ddp.c
ax25/af_ax25.c
core/ext8022.c
econet/af_econet.c
irda/irsyms.c
x25/af_x25.c

These need to be made safe for a non-BH based timer. I guess
the current code assumes serialization between timer and
BH context code due to the use of now-defunct NET_BH.

> 
> I actually suspect tty-related things are a likely culprit as
> significant use of the serial console occurs.

It should also be possible to make a minimal non-smptimers
bhless_timer patch - just in case smptimers isn't going in
any time soon. It will run a timer tasklet off of do_timer().
The tasklet handler still has to grab global_bh_lock and
the like to keep the tty and other drivers that expect
serialization of BH and timers, or use __global_cli, happy.
Will such a patch be useful ?
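
To make the idea concrete, here is a rough userspace sketch of the try-both-locks-or-reschedule shape such a tasklet handler would have, following run_timer_tasklet() in the smptimers patch. It is only an illustration under stated assumptions: C11 atomic flags stand in for the kernel spinlocks, resched_count stands in for tasklet_hi_schedule(), the hardirq_trylock() step is left out, and sketch_trylock(), sketch_unlock(), count_timers_run() and timer_tasklet_sketch() are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel spinlocks named in the thread. */
static atomic_flag global_bh_lock = ATOMIC_FLAG_INIT;
static atomic_flag net_bh_lock = ATOMIC_FLAG_INIT;

/* Hypothetical counters: one for "timers ran", one for "rescheduled". */
static int timers_run;
static int resched_count;

/* Stub standing in for __run_timers(). */
static void count_timers_run(void)
{
	timers_run++;
}

static bool sketch_trylock(atomic_flag *lock)
{
	/* test_and_set returns the old value: false means we took the lock. */
	return !atomic_flag_test_and_set(lock);
}

static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear(lock);
}

/*
 * Run the timers only if both legacy locks can be taken without
 * spinning; otherwise back out and ask to be run again later.
 * Returns true if the timers ran.
 */
static bool timer_tasklet_sketch(void (*run_timers)(void))
{
	assert(run_timers != NULL);

	if (!sketch_trylock(&global_bh_lock))
		goto resched;
	if (!sketch_trylock(&net_bh_lock))
		goto resched_unlock;

	run_timers();

	sketch_unlock(&net_bh_lock);
	sketch_unlock(&global_bh_lock);
	return true;

resched_unlock:
	sketch_unlock(&global_bh_lock);
resched:
	resched_count++;	/* kernel code would call tasklet_hi_schedule() */
	return false;
}
```

The point of trylock rather than a plain lock is that the handler never spins against a BH or a cli() user on another CPU; it gives up, drops whatever it already holds, and gets run again.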

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


* Re: [RFC] BH removal text
  2002-07-15  9:25       ` Dipankar Sarma
@ 2002-07-15 10:17         ` William Lee Irwin III
  0 siblings, 0 replies; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-15 10:17 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Matthew Wilcox, Janitors, linux-kernel

On Sun, Jul 14, 2002 at 03:17:30AM -0700, William Lee Irwin III wrote:
>> I actually suspect tty-related things are a likely culprit as
>> significant use of the serial console occurs.

On Mon, Jul 15, 2002 at 02:55:21PM +0530, Dipankar Sarma wrote:
> It should also be possible to make a minimal non-smptimers
> bhless_timer patch - just in case smptimers isn't going in
> any time soon. It will run a timer tasklet off of do_timer().
> The tasklet handler still has to grab global_bh_lock and
> the like to keep the tty and other drivers that expect
> serialization of BH and timers, or use __global_cli, happy.
> Will such a patch be useful ?

The temporary "hangs" are so bad any way to mitigate this horrible
problem will be useful. The machine is stuck so long in this stuff
literal network timeouts occur. It's insanely bad, this is really
beyond the scope of a performance problem and into the realm of an
out-and-out bug. A machine stuck for that long in this code is
effectively dead. I'm just slightly more patient than average users.


Cheers,
Bill


* Re: [RFC] BH removal text
  2002-07-14  4:52   ` Dipankar Sarma
  2002-07-14 10:17     ` William Lee Irwin III
@ 2002-07-17 23:57     ` William Lee Irwin III
  2002-07-18  8:22     ` William Lee Irwin III
  2 siblings, 0 replies; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-17 23:57 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Matthew Wilcox, Janitors, linux-kernel

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> Even if you replace timer_bh() with a tasklet, you still need
> to take the global_bh_lock to ensure that timers don't race with
> single-threaded BH processing in drivers. I wrote this patch [included]
> to get rid of timer_bh in Ingo's smptimers, but it acquires
> global_bh_lock as well as net_bh_lock, the latter to ensure
> that some older protocol code that expected serialization of
> NET_BH and timers works correctly (see deliver_to_old_ones()).
> They need to be cleaned up too.
> My patch of course was experimental to see what is needed to
> get rid of timer_bh. It needs some cleanup itself ;-)

It runs here. New profile (hopefully I'll get some fixed-up stuff like
oprofile, kernprof, & lockmeter to play with at some point):


14465232 total                                  114.2269
10694436 mod_timer                            33420.1125
1089589 __global_cli                           4005.8419
961598 timer_bh                                1059.0286
453404 do_gettimeofday                         3333.8529
440086 __wake_up                               2340.8830
298729 schedule                                 268.6412
294945 default_idle                            5672.0192
155762 do_softirq                               708.0091
 43256 tasklet_hi_action                        220.6939
 12724 system_call                              289.1818
 
mod_timer is about 74%, __global_cli() appears to be 7.5%, and timer_bh()
is 6.6%... I wonder what happened to the plot for lockless gettimeofday(),
esp as that accounts for 3.1% here...
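
For anyone checking the arithmetic, the shares are just each symbol's tick count divided by the "total" line of the profile. A small sketch, with the counts copied from the listing above; share_percent(), struct profile_row and print_shares() are names made up for this example:

```c
#include <assert.h>
#include <stdio.h>

/* Tick counts copied from the profile listing in this message. */
struct profile_row {
	const char *symbol;
	unsigned long hits;
};

static const unsigned long profile_total = 14465232UL;

static const struct profile_row profile_rows[] = {
	{ "mod_timer",       10694436UL },
	{ "__global_cli",     1089589UL },
	{ "timer_bh",          961598UL },
	{ "do_gettimeofday",   453404UL },
};

/* Share of the total, as a percentage. */
static double share_percent(unsigned long hits, unsigned long total)
{
	assert(total != 0);
	return 100.0 * (double)hits / (double)total;
}

/* Print one line per symbol with its share of all ticks. */
static void print_shares(void)
{
	size_t i;

	for (i = 0; i < sizeof(profile_rows) / sizeof(profile_rows[0]); i++)
		printf("%-16s %5.1f%%\n", profile_rows[i].symbol,
		       share_percent(profile_rows[i].hits, profile_total));
}
```

Calling print_shares() from a main() lists the four symbols; mod_timer comes out a little under 74% of all ticks.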

It's still spinning with interrupts off for several minutes at a time.


Cheers,
Bill


* Re: [RFC] BH removal text
  2002-07-14  4:52   ` Dipankar Sarma
  2002-07-14 10:17     ` William Lee Irwin III
  2002-07-17 23:57     ` William Lee Irwin III
@ 2002-07-18  8:22     ` William Lee Irwin III
  2002-07-18 10:29       ` William Lee Irwin III
  2002-07-18 10:43       ` William Lee Irwin III
  2 siblings, 2 replies; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-18  8:22 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: Matthew Wilcox, Janitors, linux-kernel

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> Even if you replace timer_bh() with a tasklet, you still need
> to take the global_bh_lock to ensure that timers don't race with
> single-threaded BH processing in drivers. I wrote this patch [included]
> to get rid of timer_bh in Ingo's smptimers, but it acquires
> global_bh_lock as well as net_bh_lock, the latter to ensure
> that some older protocol code that expected serialization of
> NET_BH and timers works correctly (see deliver_to_old_ones()).
> They need to be cleaned up too.
> My patch of course was experimental to see what is needed to
> get rid of timer_bh. It needs some cleanup itself ;-)

It turns out those profiling results are total garbage. oprofile
hit counts during the tbench 1024 run with smptimers-X1 on the 16-way
16GB NUMA-Q follow:

c020249d 43051806 73.9493     .text.lock.dev
c0196750 2138900  3.67395     csum_partial_copy_generic
c020090c 1454023  2.49755     netif_rx
c0200e78 1237550  2.12572     process_backlog
c0200480 1083695  1.86144     dev_queue_xmit
c0120bf8 1013839  1.74145     run_timer_tasklet
c0228c8c 946933   1.62653     tcp_v4_rcv
c0196920 773495   1.32862     __generic_copy_to_user
c012009c 605591   1.04021     mod_timer
c01fbe98 477906   0.820891    sock_wfree
c0218e14 392831   0.674759    tcp_recvmsg
c0112648 362804   0.623182    try_to_wake_up
c01132fc 278976   0.479192    schedule
c0136087 251550   0.432083    .text.lock.page_alloc
c0211d14 215139   0.36954     ip_queue_xmit
c021759c 205078   0.352259    tcp_sendmsg
c0220ad4 203218   0.349064    tcp_rcv_established
c0112f64 189216   0.325013    scheduler_tick
c02221a4 187313   0.321744    tcp_transmit_skb
c021f5fc 184471   0.316863    tcp_data_queue
c02189a8 163828   0.281404    tcp_data_wait
c01dd820 139310   0.23929     loopback_xmit
c01fcfcc 137241   0.235736    skb_release_data


I'll follow up with the "before"  profile next.


Cheers,
Bill


* Re: [RFC] BH removal text
  2002-07-18  8:22     ` William Lee Irwin III
@ 2002-07-18 10:29       ` William Lee Irwin III
  2002-07-18 10:43       ` William Lee Irwin III
  1 sibling, 0 replies; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-18 10:29 UTC (permalink / raw)
  To: Dipankar Sarma, Matthew Wilcox, Janitors, linux-kernel

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
>> Even if you replace timer_bh() with a tasklet, you still need
>> to take the global_bh_lock to ensure that timers don't race with
>> single-threaded BH processing in drivers. I wrote this patch [included]
>> to get rid of timer_bh in Ingo's smptimers, but it acquires
>> global_bh_lock as well as net_bh_lock, the latter to ensure
>> that some older protocol code that expected serialization of
>> NET_BH and timers works correctly (see deliver_to_old_ones()).
>> They need to be cleaned up too.
>> My patch of course was experimental to see what is needed to
>> get rid of timer_bh. It needs some cleanup itself ;-)

On Thu, Jul 18, 2002 at 01:22:38AM -0700, William Lee Irwin III wrote:
> I'll follow up with the "before"  profile next.

By the way, since the patch applies to 2.5.26 with just offsets, I did my
testing on that. Here is the "before" profile:

c01210c3 15914360 73.4974     .text.lock.timer
c0120114 1740662  8.03891     mod_timer
c0196480 533190   2.46243     csum_partial_copy_generic
c0196650 409733   1.89227     __generic_copy_to_user
c0112658 271923   1.25582     try_to_wake_up
c022893c 227856   1.05231     tcp_v4_rcv
c021ba91 219423   1.01336     .text.lock.tcp
c0107bb4 216722   1.00089     apic_timer_interrupt
c02118a4 160277   0.740208    ip_output
c011330c 123467   0.570208    schedule
c021727c 121239   0.559919    tcp_sendmsg
c0200170 121187   0.559679    dev_queue_xmit
c02021ad 83061    0.383601    .text.lock.dev
c02119f4 80365    0.37115     ip_queue_xmit
c0112f74 77876    0.359655    scheduler_tick
c01fcd98 75855    0.350322    __kfree_skb
c010fb30 73084    0.337524    smp_apic_timer_interrupt
c0218688 68854    0.317989    tcp_data_wait
c0218af4 66201    0.305736    tcp_recvmsg
c01207f8 64625    0.298458    timer_bh
c021e448 55844    0.257905    tcp_ack
c010cd60 52670    0.243246    do_gettimeofday
c02207b4 51631    0.238448    tcp_rcv_established



* Re: [RFC] BH removal text
  2002-07-18  8:22     ` William Lee Irwin III
  2002-07-18 10:29       ` William Lee Irwin III
@ 2002-07-18 10:43       ` William Lee Irwin III
  1 sibling, 0 replies; 13+ messages in thread
From: William Lee Irwin III @ 2002-07-18 10:43 UTC (permalink / raw)
  To: Dipankar Sarma, Matthew Wilcox, Janitors, linux-kernel

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
>> Even if you replace timer_bh() with a tasklet, you still need
>> to take the global_bh_lock to ensure that timers don't race with
>> single-threaded BH processing in drivers. I wrote this patch [included]
>> to get rid of timer_bh in Ingo's smptimers, but it acquires
>> global_bh_lock as well as net_bh_lock, the latter to ensure
>> that some older protocol code that expected serialization of
>> NET_BH and timers works correctly (see deliver_to_old_ones()).
>> They need to be cleaned up too.
>> My patch of course was experimental to see what is needed to
>> get rid of timer_bh. It needs some cleanup itself ;-)

On Thu, Jul 18, 2002 at 01:22:38AM -0700, William Lee Irwin III wrote:
> It turns out those profiling results are total garbage. oprofile
> hit counts during the tbench 1024 run with smptimers-X1 on the 16-way
> 16GB NUMA-Q follow:

Oh yes, bandwidth was increased from 23MB/s to 37MB/s.

And the rundown on .text.lock.dev:

c020249d 43051806 73.9493     .text.lock.dev
 c02024f9 10357    0.0240571
 c02024fc 121387   0.281956
 c02024fe 10282    0.0238829
 c0202515 5777619  13.4202
 c0202518 31534891 73.2487
 c020251a 5596759  13.0001
 c020251c 10       2.32278e-05
 c0202521 11       2.55506e-05
 c0202522 158      0.000367
 c0202523 34       7.89746e-05
 c0202524 34       7.89746e-05
 c0202529 61       0.00014169
 c020252a 125      0.000290348
 c020252b 36       8.36202e-05
 c020252c 42       9.75569e-05


c0202518:       f3 90                   repz nop 
c020251a:       7e f9                   jle    c0202515 <.text.lock.dev+0x78>
c020251c:       e9 83 e1 ff ff          jmp    c02006a4 <dev_queue_xmit+0x224>

[...]

c0200694:       e8 eb 78 f1 ff          call   c0117f84 <printk>
c0200699:       0f 0b                   ud2a   
c020069b:       7b 00                   jnp    c020069d <dev_queue_xmit+0x21d>
c020069d:       40                      inc    %eax
c020069e:       5c                      pop    %esp
c020069f:       29 c0                   sub    %eax,%eax
c02006a1:       83 c4 08                add    $0x8,%esp
c02006a4:       f0 fe 0f                lock decb (%edi)
c02006a7:       0f 88 68 1e 00 00       js     c0202515 <.text.lock.dev+0x78>
c02006ad:       8b 44 24 10             mov    0x10(%esp,1),%eax
c02006b1:       89 86 e8 00 00 00       mov    %eax,0xe8(%esi)
c02006b7:       8b 46 24                mov    0x24(%esi),%eax
c02006ba:       a8 01                   test   $0x1,%al
c02006bc:       0f 85 a1 00 00 00       jne    c0200763 <dev_queue_xmit+0x2e3>
c02006c2:       83 3d 60 5f 3b c0 00    cmpl   $0x0,0xc03b5f60
c02006c9:       74 0a                   je     c02006d5 <dev_queue_xmit+0x255>
c02006cb:       56                      push   %esi
c02006cc:       55                      push   %ebp
c02006cd:       e8 42 fc ff ff          call   c0200314 <dev_queue_xmit_nit>


This leads me to believe it's the dev->xmit_lock, as that protects the
critical section in which dev_queue_xmit_nit() is called.
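
For readers not used to reading .text.lock.* entries: the "lock decb" /
"js" pair in dev_queue_xmit and the "repz nop" / "jle" loop in
.text.lock.dev are the fast path and the out-of-line slow path of one
spinlock acquisition. A rough user-space model of that pattern, assuming
the usual i386 convention that 1 means free and anything <= 0 means held,
might look like the following. The sketch_* names are made up for
illustration and this is not the kernel's implementation.

```c
#include <stdatomic.h>

/* Hypothetical model of a byte spinlock: 1 = free, <= 0 = held. */
typedef struct {
	atomic_schar slock;
} sketch_spinlock_t;

#define SKETCH_SPIN_UNLOCKED { 1 }

static void sketch_spin_lock(sketch_spinlock_t *l)
{
	for (;;) {
		/* Fast path, like "lock decb": atomically decrement.  If the
		 * result is not negative (the "js" is not taken), the lock
		 * was free and is now ours. */
		signed char old = atomic_fetch_sub_explicit(&l->slock, 1,
							    memory_order_acquire);
		if (old - 1 >= 0)
			return;

		/* Slow path, like the loop in .text.lock.dev: keep reading
		 * the byte until it looks free, then retry the decrement.
		 * Profile samples pile up here when the lock is contended. */
		while (atomic_load_explicit(&l->slock,
					    memory_order_relaxed) <= 0)
			; /* the kernel executes "rep; nop" here */
	}
}

static void sketch_spin_unlock(sketch_spinlock_t *l)
{
	/* Release by storing 1 back, which lets one spinner through. */
	atomic_store_explicit(&l->slock, 1, memory_order_release);
}
```

Under that reading, the samples at c0202515..c020251a are CPUs sitting in
the inner while loop waiting for the lock holder to store 1 back.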


Cheers,
Bill


Thread overview: 13+ messages
2002-07-01  4:05 [RFC] BH removal text Matthew Wilcox
2002-07-01 13:41 ` Arnd Bergmann
2002-07-03  7:21 ` george anzinger
2002-07-03 11:15   ` Matthew Wilcox
2002-07-14  1:05 ` William Lee Irwin III
2002-07-14  4:52   ` Dipankar Sarma
2002-07-14 10:17     ` William Lee Irwin III
2002-07-15  9:25       ` Dipankar Sarma
2002-07-15 10:17         ` William Lee Irwin III
2002-07-17 23:57     ` William Lee Irwin III
2002-07-18  8:22     ` William Lee Irwin III
2002-07-18 10:29       ` William Lee Irwin III
2002-07-18 10:43       ` William Lee Irwin III