public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [patch] del_timer_sync scalability patch
@ 2005-03-11 18:54 Oleg Nesterov
  2005-03-11 20:57 ` Christoph Lameter
  2005-03-13 13:13 ` [patch] del_timer_sync scalability patch Oleg Nesterov
  0 siblings, 2 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-11 18:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: Shai Fultheim, Christoph Lameter, Andrew Morton, Ingo Molnar

Hello.

I am not sure, but I think this patch incorrect.

> @@ -466,6 +482,7 @@ repeat:
>  			set_running_timer(base, timer);
>  			smp_wmb();
>  			timer->base = NULL;
------>	WINDOW <------
> +			set_last_running(timer, base);
>  			spin_unlock_irq(&base->lock);

Suppose it is the first invocation of timer, so
timer->last_running == NULL.

What if del_timer_sync() happens in that window?

del_timer_sync:
	del_timer();	// timer->base == NULL, returns

	base = timer->last_running;
	if (base)	// no, it is still NULL
		...

	if (timer->base != NULL || timer->last_running != base)
		goto del_again;	// not taken

	return;

I think it is not enough to exchange these 2 lines in
__run_timers, we also need barriers.

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-11 18:54 [patch] del_timer_sync scalability patch Oleg Nesterov
@ 2005-03-11 20:57 ` Christoph Lameter
  2005-03-15 17:19   ` [PATCH 0/2] del_timer_sync: proof of concept Oleg Nesterov
                     ` (2 more replies)
  2005-03-13 13:13 ` [patch] del_timer_sync scalability patch Oleg Nesterov
  1 sibling, 3 replies; 18+ messages in thread
From: Christoph Lameter @ 2005-03-11 20:57 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

On Fri, 11 Mar 2005, Oleg Nesterov wrote:

> I think it is not enough to exchange these 2 lines in
> __run_timers, we also need barriers.

Maybe its best to drop last_running_timer as Ingo suggested.

Replace the magic with a flag that can be set to stop scheduling a timer
again.

Then del_timer_sync may

1. Set the flag not to reschedule
2. delete the timer
3. wait till the timer function is complete
4. (eventually verify that the timer is really gone)
5. reset the no reschedule flag


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-11 18:54 [patch] del_timer_sync scalability patch Oleg Nesterov
  2005-03-11 20:57 ` Christoph Lameter
@ 2005-03-13 13:13 ` Oleg Nesterov
  2005-03-14 19:40   ` Christoph Lameter
  2005-03-15  8:06   ` Christoph Lameter
  1 sibling, 2 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-13 13:13 UTC (permalink / raw)
  To: linux-kernel, Shai Fultheim, Christoph Lameter, Andrew Morton,
	Ingo Molnar

I suspect that del_timer_sync() in its current form is racy.

CPU 0						CPU 1

__run_timers() sets timer->base = NULL

						del_timer_sync() starts, calls del_timer(), it returns
						because timer->base == NULL.

						waits until the run is complete:
							while (base->running_timer == timer)
								preempt_check_resched();
						
						calls schedule(), or long interrupt happens.

timer reschedules itself, __run_timers()
exits.

						base->running_timer == NULL, end of loop.

next timer interrupt, __run_timers() picks
this timer again, sets timer->base = NULL

						if (timer_pending(timer))	// no, timer->base == NULL
							goto del_again;		// not taken

						del_timer_sync() returns

timer runs.

No?

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-13 13:13 ` [patch] del_timer_sync scalability patch Oleg Nesterov
@ 2005-03-14 19:40   ` Christoph Lameter
  2005-03-15  9:12     ` Oleg Nesterov
  2005-03-15  8:06   ` Christoph Lameter
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2005-03-14 19:40 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar


On Sun, 13 Mar 2005, Oleg Nesterov wrote:

> I suspect that del_timer_sync() in its current form is racy.
>
> CPU 0						CPU 1
>
> __run_timers() sets timer->base = NULL
>
> 						del_timer_sync() starts, calls del_timer(), it returns
> 						because timer->base == NULL.
>
> 						waits until the run is complete:
> 							while (base->running_timer == timer)
> 								preempt_check_resched();
>
> 						calls schedule(), or long interrupt happens.
>
> timer reschedules itself, __run_timers()
> exits.
>
> 						base->running_timer == NULL, end of loop.
>
> next timer interrupt, __run_timers() picks
> this timer again, sets timer->base = NULL
>
> 						if (timer_pending(timer))	// no, timer->base == NULL

timer->base is != NULL because the timer has rescheduled itself.
__mod_timer sets timer->bBase

> 							goto del_again;		// not taken

Yes it is taken!

>
> 						del_timer_sync() returns
>
> timer runs.

Nope.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-13 13:13 ` [patch] del_timer_sync scalability patch Oleg Nesterov
  2005-03-14 19:40   ` Christoph Lameter
@ 2005-03-15  8:06   ` Christoph Lameter
  2005-03-15  9:28     ` Ingo Molnar
  2005-03-15 10:28     ` Oleg Nesterov
  1 sibling, 2 replies; 18+ messages in thread
From: Christoph Lameter @ 2005-03-15  8:06 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov; +Cc: linux-kernel, Shai Fultheim, Ingo Molnar

How about this take on the problem?

When a potential periodic timer is deleted through timer_del_sync, all cpus
are scanned to determine if the timer is running on that cpu. In a NUMA
configuration doing so will cause NUMA interlink traffic which limits the
scalability of timers.

The following patch removes the magic in the timer_list structure (Andrew
suggested that we may not need it anymore) and replaces it with two u8
variables that give us some additional state of the timer

running		-> Set if the timer function is currently executing
shutdown	-> Set if del_timer_sync is running and the timer
		   should not be rescheduled.

Signed-off-by: Christoph Lameter <christoph@lameter.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>

Index: linux-2.6.11/include/linux/timer.h
===================================================================
--- linux-2.6.11.orig/include/linux/timer.h	2005-03-14 14:17:51.671533904 -0800
+++ linux-2.6.11/include/linux/timer.h	2005-03-14 14:41:49.640929328 -0800
@@ -13,22 +13,22 @@ struct timer_list {
 	unsigned long expires;

 	spinlock_t lock;
-	unsigned long magic;

 	void (*function)(unsigned long);
 	unsigned long data;

-	struct tvec_t_base_s *base;
+	struct tvec_t_base_s *base;	/* ==NULL if timer not scheduled */
+	u8 running;		/* function is currently executing */
+	u8 shutdown;		/* do not schedule this timer */
 };

-#define TIMER_MAGIC	0x4b87ad6e
-
 #define TIMER_INITIALIZER(_function, _expires, _data) {		\
 		.function = (_function),			\
 		.expires = (_expires),				\
 		.data = (_data),				\
 		.base = NULL,					\
-		.magic = TIMER_MAGIC,				\
+		.running = 0,					\
+		.shutdown = 0,					\
 		.lock = SPIN_LOCK_UNLOCKED,			\
 	}

@@ -42,7 +42,8 @@ struct timer_list {
 static inline void init_timer(struct timer_list * timer)
 {
 	timer->base = NULL;
-	timer->magic = TIMER_MAGIC;
+	timer->running = 0;
+	timer->shutdown = 0;
 	spin_lock_init(&timer->lock);
 }

Index: linux-2.6.11/kernel/timer.c
===================================================================
--- linux-2.6.11.orig/kernel/timer.c	2005-03-14 14:17:51.671533904 -0800
+++ linux-2.6.11/kernel/timer.c	2005-03-14 14:57:24.613791848 -0800
@@ -89,31 +89,6 @@ static inline void set_running_timer(tve
 /* Fake initialization */
 static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };

-static void check_timer_failed(struct timer_list *timer)
-{
-	static int whine_count;
-	if (whine_count < 16) {
-		whine_count++;
-		printk("Uninitialised timer!\n");
-		printk("This is just a warning.  Your computer is OK\n");
-		printk("function=0x%p, data=0x%lx\n",
-			timer->function, timer->data);
-		dump_stack();
-	}
-	/*
-	 * Now fix it up
-	 */
-	spin_lock_init(&timer->lock);
-	timer->magic = TIMER_MAGIC;
-}
-
-static inline void check_timer(struct timer_list *timer)
-{
-	if (timer->magic != TIMER_MAGIC)
-		check_timer_failed(timer);
-}
-
-
 static void internal_add_timer(tvec_base_t *base, struct timer_list *timer)
 {
 	unsigned long expires = timer->expires;
@@ -164,8 +139,6 @@ int __mod_timer(struct timer_list *timer

 	BUG_ON(!timer->function);

-	check_timer(timer);
-
 	spin_lock_irqsave(&timer->lock, flags);
 	new_base = &__get_cpu_var(tvec_bases);
 repeat:
@@ -208,8 +181,10 @@ repeat:
 		ret = 1;
 	}
 	timer->expires = expires;
-	internal_add_timer(new_base, timer);
-	timer->base = new_base;
+	if (!timer->shutdown) {
+		internal_add_timer(new_base, timer);
+		timer->base = new_base;
+	}

 	if (old_base && (new_base != old_base))
 		spin_unlock(&old_base->lock);
@@ -235,7 +210,8 @@ void add_timer_on(struct timer_list *tim

   	BUG_ON(timer_pending(timer) || !timer->function);

-	check_timer(timer);
+	if (timer->shutdown)
+		return;

 	spin_lock_irqsave(&base->lock, flags);
 	internal_add_timer(base, timer);
@@ -267,8 +243,6 @@ int mod_timer(struct timer_list *timer,
 {
 	BUG_ON(!timer->function);

-	check_timer(timer);
-
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -298,8 +272,6 @@ int del_timer(struct timer_list *timer)
 	unsigned long flags;
 	tvec_base_t *base;

-	check_timer(timer);
-
 repeat:
  	base = timer->base;
 	if (!base)
@@ -337,35 +309,40 @@ EXPORT_SYMBOL(del_timer);
  *
  * The function returns whether it has deactivated a pending timer or not.
  *
- * del_timer_sync() is slow and complicated because it copes with timer
- * handlers which re-arm the timer (periodic timers).  If the timer handler
- * is known to not do this (a single shot timer) then use
- * del_singleshot_timer_sync() instead.
+ * del_timer_sync() copes with time handlers which re-arm the timer (periodic
+ * timers).  If the timer handler is known to not do this (a single shot
+ * timer) then use del_singleshot_timer_sync() instead.
  */
 int del_timer_sync(struct timer_list *timer)
 {
-	tvec_base_t *base;
-	int i, ret = 0;
+	int ret = 0;

-	check_timer(timer);
+	/* Forbid additional scheduling of this timer */
+	timer->shutdown = 1;

 del_again:
 	ret += del_timer(timer);

-	for_each_online_cpu(i) {
-		base = &per_cpu(tvec_bases, i);
-		if (base->running_timer == timer) {
-			while (base->running_timer == timer) {
-				cpu_relax();
-				preempt_check_resched();
-			}
-			break;
-		}
+	/*
+	 * If the timer is still executing then wait until the
+	 * run is complete.
+	 */
+	while (timer->running) {
+			cpu_relax();
+			preempt_check_resched();
 	}
 	smp_rmb();
+	/*
+	 * If the timer is no longer pending then the timer
+	 * is truly off.
+	 * The timer may have been started on some other
+	 * cpu in the meantime (due to a race condition)
+	 */
 	if (timer_pending(timer))
 		goto del_again;

+	/* Reenable timer shutdown */
+	timer->shutdown = 0;
 	return ret;
 }
 EXPORT_SYMBOL(del_timer_sync);
@@ -380,7 +357,7 @@ EXPORT_SYMBOL(del_timer_sync);
  *
  * Synchronization rules: callers must prevent restarting of the timer,
  * otherwise this function is meaningless. It must not be called from
- * interrupt contexts. The caller must not hold locks which wold prevent
+ * interrupt contexts. The caller must not hold locks which would prevent
  * completion of the timer's handler.  Upon exit the timer is not queued and
  * the handler is not running on any CPU.
  *
@@ -464,6 +441,7 @@ repeat:

 			list_del(&timer->entry);
 			set_running_timer(base, timer);
+			timer->running = 1;
 			smp_wmb();
 			timer->base = NULL;
 			spin_unlock_irq(&base->lock);
@@ -476,6 +454,7 @@ repeat:
 				}
 			}
 			spin_lock_irq(&base->lock);
+			timer->running = 0;
 			goto repeat;
 		}
 	}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-14 19:40   ` Christoph Lameter
@ 2005-03-15  9:12     ` Oleg Nesterov
  0 siblings, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-15  9:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

Christoph Lameter wrote:
>
> On Sun, 13 Mar 2005, Oleg Nesterov wrote:
>
> > I suspect that del_timer_sync() in its current form is racy.
> >
...snip...
> > next timer interrupt, __run_timers() picks
> > this timer again, sets timer->base = NULL
                      ^^^^^^^^^^^^^^^^^^^^^^^
> >
> > 						if (timer_pending(timer))	// no, timer->base == NULL
>
> timer->base is != NULL because the timer has rescheduled itself.
> __mod_timer sets timer->bBase

Christoph, please look again. Yes, __mod_timer sets timer->base,
but it is cleared in the _next_ timer interrupt on CPU 0.

Andrew, Ingo, what do you think?

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-15  8:06   ` Christoph Lameter
@ 2005-03-15  9:28     ` Ingo Molnar
  2005-03-15 10:28     ` Oleg Nesterov
  1 sibling, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2005-03-15  9:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Oleg Nesterov, linux-kernel, Shai Fultheim


* Christoph Lameter <christoph@lameter.com> wrote:

> The following patch removes the magic in the timer_list structure
> (Andrew suggested that we may not need it anymore) and replaces it
> with two u8 variables that give us some additional state of the timer

The 'remove the magic' observation is not a 'backdoor' to introduce new
fields 'for free'. So please dont mix the two things into one patch.

> +       u8 running;             /* function is currently executing */
> +       u8 shutdown;            /* do not schedule this timer */

it may as well be cleaner to do the timer->base_running thing after all,
and to do this in __run_timers():

	timer->base = NULL;
	timer->base_running = base;

note that we cannot clear ->base_running in __run_timers() [we dont own
any access to the timer anymore, it may be kfree()d, etc.], so
del_timer_sync() has to make sure that the timer->base_running info is
not stale - it is a hint only. But it looks doable, and it should solve
the NUMA scalability problem.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] del_timer_sync scalability patch
  2005-03-15  8:06   ` Christoph Lameter
  2005-03-15  9:28     ` Ingo Molnar
@ 2005-03-15 10:28     ` Oleg Nesterov
  1 sibling, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-15 10:28 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, Shai Fultheim, Ingo Molnar

Christoph Lameter wrote:
>
> @@ -476,6 +454,7 @@ repeat:
>  				}
>  			}
>  			spin_lock_irq(&base->lock);
> +			timer->running = 0;
			^^^^^^^^^^^^^^^^^^
>  			goto repeat;
>  		}
>  	}

This is imho wrong. The timer probably don't exist when
timer_list->function returns.

I mean, timer_list->function could deallocate timer's memory.

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 0/2] del_timer_sync: proof of concept
  2005-03-11 20:57 ` Christoph Lameter
@ 2005-03-15 17:19   ` Oleg Nesterov
  2005-03-15 18:15     ` Christoph Lameter
  2005-03-16 16:55     ` Oleg Nesterov
  2005-03-15 17:19   ` [PATCH 1/2] " Oleg Nesterov
  2005-03-15 17:20   ` [PATCH 2/2] " Oleg Nesterov
  2 siblings, 2 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-15 17:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

Andrew Morton wrote:
>
> If we're prepared to rule that a timer handler is not allowed to do
> add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't
> it?
>
> That should make things simpler?

In that case I think that both problems (race and scalability)
can be solved without increasing sizeof(timer_list).

What if timer_list had ->pending field? Then we can do:

timer_pending:
	return timer->pending;

__mod_timer:
	internal_add_timer(new_base, timer);
	timer->base = new_base;
	timer->pending = 1;

__run_timers:
	list_del(&timer->entry);
	set_running_timer(base, timer);

	/* do not change timer->base */
	timer->pending = 0;

	spin_unlock(base->lock);
	timer->function();

del_timer:
	if (!timer->pending)
		return 0;
	base = timer->base;
	...

del_timer_sync:
	base = timer->base;
	if (!base)
		return 0;
	spin_lock(base->lock);

	if (base != timer->base)
		goto del_again;
	if (base->running_timer == timer)
		goto del_again;

	if (timer->pending)
		list_del(&timer->entry);

	timer->pending = 0;
	timer->base = NULL;

The ->pending flag could live in the least significant bit of
timer->base, this way we:
	1. do not waste the space
	2. can read/write base+pending atomically

These patches are incomplete/suboptimal, just a proof of concept.

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] del_timer_sync: proof of concept
  2005-03-11 20:57 ` Christoph Lameter
  2005-03-15 17:19   ` [PATCH 0/2] del_timer_sync: proof of concept Oleg Nesterov
@ 2005-03-15 17:19   ` Oleg Nesterov
  2005-03-15 17:20   ` [PATCH 2/2] " Oleg Nesterov
  2 siblings, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-15 17:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

This patch renames ->base to _base. Now this field
is used only in __get_base/__set_base helpers.

It is ugly, just to reduce the size of patch.

No changes in kernel/timer.o, so it is correct.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- 2.6.11/include/linux/timer.h~1_BASE	2005-03-14 00:22:43.000000000 +0300
+++ 2.6.11/include/linux/timer.h	2005-03-15 17:49:19.000000000 +0300
@@ -18,16 +18,21 @@ struct timer_list {
 	void (*function)(unsigned long);
 	unsigned long data;
 
-	struct tvec_t_base_s *base;
+	struct tvec_t_base_s *_base;
 };
 
+static inline int __timer_pending(struct tvec_t_base_s *base)
+{
+	return base != NULL;
+}
+
 #define TIMER_MAGIC	0x4b87ad6e
 
 #define TIMER_INITIALIZER(_function, _expires, _data) {		\
 		.function = (_function),			\
 		.expires = (_expires),				\
 		.data = (_data),				\
-		.base = NULL,					\
+		._base = NULL,					\
 		.magic = TIMER_MAGIC,				\
 		.lock = SPIN_LOCK_UNLOCKED,			\
 	}
@@ -41,7 +46,7 @@ struct timer_list {
  */
 static inline void init_timer(struct timer_list * timer)
 {
-	timer->base = NULL;
+	timer->_base = NULL;
 	timer->magic = TIMER_MAGIC;
 	spin_lock_init(&timer->lock);
 }
@@ -58,7 +63,7 @@ static inline void init_timer(struct tim
  */
 static inline int timer_pending(const struct timer_list * timer)
 {
-	return timer->base != NULL;
+	return __timer_pending(timer->_base);
 }
 
 extern void add_timer_on(struct timer_list *timer, int cpu);
--- 2.6.11/kernel/timer.c~1_BASE	2005-03-14 00:21:13.000000000 +0300
+++ 2.6.11/kernel/timer.c	2005-03-15 17:55:00.000000000 +0300
@@ -84,6 +84,20 @@ static inline void set_running_timer(tve
 #endif
 }
 
+static inline tvec_base_t *__get_base(struct timer_list *timer)
+{
+	return timer->_base;
+}
+
+static inline void __set_base(struct timer_list *timer,
+				tvec_base_t *base, int pending)
+{
+	if (pending)
+		timer->_base = base;
+	else
+		timer->_base = NULL;
+}
+
 /* Fake initialization */
 static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };
 
@@ -167,7 +181,7 @@ int __mod_timer(struct timer_list *timer
 	spin_lock_irqsave(&timer->lock, flags);
 	new_base = &__get_cpu_var(tvec_bases);
 repeat:
-	old_base = timer->base;
+	old_base = __get_base(timer);
 
 	/*
 	 * Prevent deadlocks via ordering by old_base < new_base.
@@ -184,14 +198,14 @@ repeat:
 		 * The timer base might have been cancelled while we were
 		 * trying to take the lock(s):
 		 */
-		if (timer->base != old_base) {
+		if (__get_base(timer) != old_base) {
 			spin_unlock(&new_base->lock);
 			spin_unlock(&old_base->lock);
 			goto repeat;
 		}
 	} else {
 		spin_lock(&new_base->lock);
-		if (timer->base != old_base) {
+		if (__get_base(timer) != old_base) {
 			spin_unlock(&new_base->lock);
 			goto repeat;
 		}
@@ -207,7 +221,7 @@ repeat:
 	}
 	timer->expires = expires;
 	internal_add_timer(new_base, timer);
-	timer->base = new_base;
+	__set_base(timer, new_base, 1);
 
 	if (old_base && (new_base != old_base))
 		spin_unlock(&old_base->lock);
@@ -237,7 +251,7 @@ void add_timer_on(struct timer_list *tim
 
 	spin_lock_irqsave(&base->lock, flags);
 	internal_add_timer(base, timer);
-	timer->base = base;
+	__set_base(timer, base, 1);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 
@@ -299,18 +313,18 @@ int del_timer(struct timer_list *timer)
 	check_timer(timer);
 
 repeat:
- 	base = timer->base;
+ 	base = __get_base(timer);
 	if (!base)
 		return 0;
 	spin_lock_irqsave(&base->lock, flags);
-	if (base != timer->base) {
+	if (base != __get_base(timer)) {
 		spin_unlock_irqrestore(&base->lock, flags);
 		goto repeat;
 	}
 	list_del(&timer->entry);
 	/* Need to make sure that anybody who sees a NULL base also sees the list ops */
 	smp_wmb();
-	timer->base = NULL;
+	__set_base(timer, base, 0);
 	spin_unlock_irqrestore(&base->lock, flags);
 
 	return 1;
@@ -413,7 +427,7 @@ static int cascade(tvec_base_t *base, tv
 		struct timer_list *tmp;
 
 		tmp = list_entry(curr, struct timer_list, entry);
-		BUG_ON(tmp->base != base);
+		BUG_ON(__get_base(tmp) != base);
 		curr = curr->next;
 		internal_add_timer(base, tmp);
 	}
@@ -463,7 +477,7 @@ repeat:
 			list_del(&timer->entry);
 			set_running_timer(base, timer);
 			smp_wmb();
-			timer->base = NULL;
+			__set_base(timer, base, 0);
 			spin_unlock_irq(&base->lock);
 			{
 				u32 preempt_count = preempt_count();
@@ -1309,7 +1323,7 @@ static int migrate_timer_list(tvec_base_
 			return 0;
 		list_del(&timer->entry);
 		internal_add_timer(new_base, timer);
-		timer->base = new_base;
+		__set_base(timer, new_base, 1);
 		spin_unlock(&timer->lock);
 	}
 	return 1;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/2] del_timer_sync: proof of concept
  2005-03-11 20:57 ` Christoph Lameter
  2005-03-15 17:19   ` [PATCH 0/2] del_timer_sync: proof of concept Oleg Nesterov
  2005-03-15 17:19   ` [PATCH 1/2] " Oleg Nesterov
@ 2005-03-15 17:20   ` Oleg Nesterov
  2005-03-16  9:00     ` Ingo Molnar
  2 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-15 17:20 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

New rules:
	->_base &  1	: is timer pending
	->_base & ~1	: timer's base

If ->_base == NULL, this timer is not running on any cpu.
Cleared only in del_timer_sync().

Note that now we don't need del_singleshot_timer_sync().

Here is the code of del_timer_sync:

int del_timer_sync(struct timer_list *timer)
{
	int ret = 0;

	for (;;) {
		unsigned long flags;
		tvec_base_t *_base, *base;

		_base = timer->_base;
		if (!_base)
			return ret;

		base = __timer_base(_base);
		spin_lock_irqsave(&base->lock, flags);

		if (timer->_base != _base)
			goto unlock;

		if (base->running_timer == timer)
			goto unlock;

		if (__timer_pending(_base)) {
			list_del(&timer->entry);
			ret = 1;
		}
		smp_wmb();
		timer->_base = NULL;
unlock:
		spin_unlock_irqrestore(&base->lock, flags);
	}
}

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- 2.6.11/include/linux/timer.h~2_PEND	2005-03-15 17:49:19.000000000 +0300
+++ 2.6.11/include/linux/timer.h	2005-03-15 18:09:06.000000000 +0300
@@ -21,9 +21,11 @@ struct timer_list {
 	struct tvec_t_base_s *_base;
 };
 
+#define	__TIMER_PENDING	1
+
 static inline int __timer_pending(struct tvec_t_base_s *base)
 {
-	return base != NULL;
+	return ((unsigned long)base & __TIMER_PENDING) != 0;
 }
 
 #define TIMER_MAGIC	0x4b87ad6e
--- 2.6.11/kernel/timer.c~2_PEND	2005-03-15 17:55:00.000000000 +0300
+++ 2.6.11/kernel/timer.c	2005-03-15 19:52:48.000000000 +0300
@@ -86,16 +86,26 @@ static inline void set_running_timer(tve
 
 static inline tvec_base_t *__get_base(struct timer_list *timer)
 {
-	return timer->_base;
+	tvec_base_t *base = timer->_base;
+
+	if (__timer_pending(base))
+		return (void*)base - __TIMER_PENDING;
+	else
+		return NULL;
 }
 
 static inline void __set_base(struct timer_list *timer,
 				tvec_base_t *base, int pending)
 {
 	if (pending)
-		timer->_base = base;
+		timer->_base = (void*)base + __TIMER_PENDING;
 	else
-		timer->_base = NULL;
+		timer->_base = base;
+}
+
+static inline tvec_base_t *__timer_base(tvec_base_t *base)
+{
+	return (tvec_base_t*)((unsigned long)base & ~__TIMER_PENDING);
 }
 
 /* Fake initialization */
@@ -356,29 +366,39 @@ EXPORT_SYMBOL(del_timer);
  */
 int del_timer_sync(struct timer_list *timer)
 {
-	tvec_base_t *base;
-	int i, ret = 0;
+	int ret;
 
 	check_timer(timer);
 
-del_again:
-	ret += del_timer(timer);
+	ret = 0;
+	for (;;) {
+		unsigned long flags;
+		tvec_base_t *_base, *base;
+
+		_base = timer->_base;
+		if (!_base)
+			return ret;
 
-	for_each_online_cpu(i) {
-		base = &per_cpu(tvec_bases, i);
-		if (base->running_timer == timer) {
-			while (base->running_timer == timer) {
-				cpu_relax();
-				preempt_check_resched();
-			}
-			break;
+		base = __timer_base(_base);
+		spin_lock_irqsave(&base->lock, flags);
+
+		if (timer->_base != _base)
+			goto unlock;
+
+		if (base->running_timer == timer)
+			goto unlock;
+
+		if (__timer_pending(_base)) {
+			list_del(&timer->entry);
+			ret = 1;
 		}
+		/* Need to make sure that anybody who sees a NULL base
+		 * also sees the list ops */
+		smp_wmb();
+		timer->_base = NULL;
+unlock:
+		spin_unlock_irqrestore(&base->lock, flags);
 	}
-	smp_rmb();
-	if (timer_pending(timer))
-		goto del_again;
-
-	return ret;
 }
 EXPORT_SYMBOL(del_timer_sync);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] del_timer_sync: proof of concept
  2005-03-15 17:19   ` [PATCH 0/2] del_timer_sync: proof of concept Oleg Nesterov
@ 2005-03-15 18:15     ` Christoph Lameter
  2005-03-15 19:41       ` Oleg Nesterov
  2005-03-16 16:55     ` Oleg Nesterov
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Lameter @ 2005-03-15 18:15 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

I like the idea and it would solve the concerns that we had. The encoding
of a bit in a pointer is weird but we have done that before in the
page_struct.

However, this also means that __run_timers will not free up the timer and
it has to be explicitly freed with del_timer_??. There may be code
that relies on not having to delete a single shot timer after it has been
fired.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] del_timer_sync: proof of concept
  2005-03-15 19:41       ` Oleg Nesterov
@ 2005-03-15 19:02         ` Christoph Lameter
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Lameter @ 2005-03-15 19:02 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

On Tue, 15 Mar 2005, Oleg Nesterov wrote:

> Christoph Lameter wrote:
> >
> > However, this also means that __run_timers will not free up the timer and
> > it has to be explicitly freed with del_timer_??.
>
> I am not sure I understand you but no, del_timer{,_sync} is not needed.
>
> __run_timer deletes timer from base->tv? list and clears 'pending flag'.
>
> __del_timer_sync sets ->_base = NULL, but it is merely optimization.
> It could set ->_base = base, but in that case next del_timer_sync()
> call will need spin_lock(base->lock) again.

For some reason I thought that ->base == NULL would have special
significance outside of the function you discussed. Looks fine to me now.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] del_timer_sync: proof of concept
  2005-03-15 18:15     ` Christoph Lameter
@ 2005-03-15 19:41       ` Oleg Nesterov
  2005-03-15 19:02         ` Christoph Lameter
  0 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-15 19:41 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Shai Fultheim, Andrew Morton, Ingo Molnar

Christoph Lameter wrote:
>
> However, this also means that __run_timers will not free up the timer and
> it has to be explicitly freed with del_timer_??.

I am not sure I understand you but no, del_timer{,_sync} is not needed.

__run_timer deletes timer from base->tv? list and clears 'pending flag'.

__del_timer_sync sets ->_base = NULL, but it is merely optimization.
It could set ->_base = base, but in that case next del_timer_sync()
call will need spin_lock(base->lock) again.

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] del_timer_sync: proof of concept
  2005-03-15 17:20   ` [PATCH 2/2] " Oleg Nesterov
@ 2005-03-16  9:00     ` Ingo Molnar
  2005-03-16 12:09       ` Oleg Nesterov
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2005-03-16  9:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, linux-kernel, Shai Fultheim, Andrew Morton


* Oleg Nesterov <oleg@tv-sign.ru> wrote:

> New rules:
> 	->_base &  1	: is timer pending
> 	->_base & ~1	: timer's base

how would it look like if we had a separate timer->pending field after
all? Would it be faster/cleaner?

(we dont need to keep them small _that_ bad - if there's a good reason
we should rather add a clean new field than to encode two fields into
one field and complicate the code.)

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] del_timer_sync: proof of concept
  2005-03-16  9:00     ` Ingo Molnar
@ 2005-03-16 12:09       ` Oleg Nesterov
  2005-03-16 13:52         ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-16 12:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Christoph Lameter, linux-kernel, Shai Fultheim, Andrew Morton

Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@tv-sign.ru> wrote:
>
> > New rules:
> > 	->_base &  1	: is timer pending
> > 	->_base & ~1	: timer's base
>
> how would it look like if we had a separate timer->pending field after
> all? Would it be faster/cleaner?

The only change visible outside kernel/timer.c is:

 static inline int timer_pending(const struct timer_list * timer)
 {
-	return timer->base != NULL;
+	return timer->base & 1;
 }

Currently __get_base() usage in the kernel/time.c suboptimal and
should be cleanuped, I see no other problems with performance.

> (we dont need to keep them small _that_ bad - if there's a good reason
> we should rather add a clean new field than to encode two fields into
> one field and complicate the code.)

I think that separate timer->pending field will require more changes,
because we can't read/write base+pending atomically.

int del_timer()
{
again:
	if (!timer->pending)	// not strictly necessary, but it is
		return 0;	// nice to avoid locking
	base = timer->base;
	if (!base)
		return 0;

	spin_lock(base->lock);

	if (!timer->pending) {
		spin_unlock();
		goto again;
	}
	if (timer->base != base) {
		spin_unlock();
		goto again;
	}
	....	
}

Note also, that we have to audit every timer->base usage anyway,
because currently it mix base and pending.

But may be you are right, the encoding of a bit in a pointer is
indeed weird.

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] del_timer_sync: proof of concept
  2005-03-16 12:09       ` Oleg Nesterov
@ 2005-03-16 13:52         ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2005-03-16 13:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christoph Lameter, linux-kernel, Shai Fultheim, Andrew Morton


* Oleg Nesterov <oleg@tv-sign.ru> wrote:

> I think that separate timer->pending field will require more changes,
> because we can't read/write base+pending atomically.

i think that's the killer argument in favor of the bit-trick. Being able
to read/write base+pending atomically is a good excuse. Using less RAM
alone is not necessarily a good excuse.

Andrew, could we give Oleg's 2 patches a whirl in -mm?

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] del_timer_sync: proof of concept
  2005-03-15 17:19   ` [PATCH 0/2] del_timer_sync: proof of concept Oleg Nesterov
  2005-03-15 18:15     ` Christoph Lameter
@ 2005-03-16 16:55     ` Oleg Nesterov
  1 sibling, 0 replies; 18+ messages in thread
From: Oleg Nesterov @ 2005-03-16 16:55 UTC (permalink / raw)
  To: Christoph Lameter, linux-kernel, Shai Fultheim, Andrew Morton,
	Ingo Molnar

Andrew Morton wrote:
>
> If we're prepared to rule that a timer handler is not allowed to do
> add_timer_on() then a recurring timer is permanently pinned to a CPU, isn't
> it?
>
> That should make things simpler?

I think that current inplementation of del_timer_sync() don't like
add_timer_on() too.

Consider the timer running on CPU_0. It sets timer->expires = jiffies,
and calls add_timer_on(1). Now it is possible that local timer interrupt
on CPU_1 happens and starts that timer before timer->function returns on
CPU_0.

del_timer_sync() detects that timer is running on CPU_0, waits while
->running_timer == timer, and returns. The timer still runs on CPU_1.

Oleg.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2005-03-16 15:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-03-11 18:54 [patch] del_timer_sync scalability patch Oleg Nesterov
2005-03-11 20:57 ` Christoph Lameter
2005-03-15 17:19   ` [PATCH 0/2] del_timer_sync: proof of concept Oleg Nesterov
2005-03-15 18:15     ` Christoph Lameter
2005-03-15 19:41       ` Oleg Nesterov
2005-03-15 19:02         ` Christoph Lameter
2005-03-16 16:55     ` Oleg Nesterov
2005-03-15 17:19   ` [PATCH 1/2] " Oleg Nesterov
2005-03-15 17:20   ` [PATCH 2/2] " Oleg Nesterov
2005-03-16  9:00     ` Ingo Molnar
2005-03-16 12:09       ` Oleg Nesterov
2005-03-16 13:52         ` Ingo Molnar
2005-03-13 13:13 ` [patch] del_timer_sync scalability patch Oleg Nesterov
2005-03-14 19:40   ` Christoph Lameter
2005-03-15  9:12     ` Oleg Nesterov
2005-03-15  8:06   ` Christoph Lameter
2005-03-15  9:28     ` Ingo Molnar
2005-03-15 10:28     ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox