linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [v3 PATCH 0/4] timers: Framework for migration of timers
@ 2009-03-16 11:10 Arun R Bharadwaj
  2009-03-16 11:12 ` [v3 PATCH 1/4] timers: Framework for identifying pinned timers Arun R Bharadwaj
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-16 11:10 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, arun

Hi,


In an SMP system, tasks are scheduled on different CPUs by the
scheduler, interrupts are managed by irqbalancer daemon, but timers
are still stuck to the CPUs that they have been initialised.  Timers
queued by tasks gets re-queued on the CPU where the task gets to run
next, but timers from IRQ context like the ones in device drivers are
still stuck on the CPU they were initialised.  This framework will
help move all 'movable timers' using a sysctl interface.

Iteration v2 of this patch can be found at
http://lkml.org/lkml/2009/3/4/130


Changelog: v2->v3

-removed the sysfs interface for enabling timer migration
and instead created a /proc/sys sysctl, which can turn off
timer_migration when CONFIG_SCHED_DEBUG =y.(as suggested by Ingo)

-modified the interface to identify pinned regular timers,
made it cleaner and simpler to understand.(as suggested by Oleg)


The following patches are included:
PATCH 1/4 - framework to identify pinned timers.
PATCH 2/4 - identifying the existing pinned hrtimers.
PATCH 3/4 - /proc/sys sysctl hook to enable timer migration.
PATCH 4/4 - logic to enable timer migration.

The patchset is based on the latest tip/master.

Timer migration is enabled by default as suggested by Ingo.
It can be turned off when CONFIG_SCHED_DEBUG=y by

echo 0 > /proc/sys/kernel/timer_migration


Benchmark tested: Kernbench

Kernbench results on an 2-socket, quad-core machine
Averaged over 3 iterations.
sched_mc_power_savings = 2

------------------------------------------------------------------------
|No. of	    Time taken(s)      Time Taken(s)	      Time Taken(s)    |
|Threads  Patch Not Applied (Migration Disabled)   (Migration Enabled) |
|								       |
|  32	     64.02		64.05			63.73	       |
|  16	     63.31		63.71			62.91	       |
|   8	     67.20		67.38			67.32	       |
|   4	    115.67	       114.72		       114.54	       |
|   2	    222.21	       220.63		       218.13	       |
------------------------------------------------------------------------


TODO:

-Yet to implement Oleg's comment on my previous post PATCH 4/4.
Current code seems to be working fine. Working on it to implement
what has been suggested.


Please let me know your comments.

--arun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [v3 PATCH 1/4] timers: Framework for identifying pinned timers
  2009-03-16 11:10 [v3 PATCH 0/4] timers: Framework for migration of timers Arun R Bharadwaj
@ 2009-03-16 11:12 ` Arun R Bharadwaj
  2009-03-16 11:13 ` [v3 PATCH 2/4] timers: Identifying the existing " Arun R Bharadwaj
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-16 11:12 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-16 16:40:45]:

This patch creates a new framework for identifying cpu-pinned timers
and hrtimers.


This framework is needed because pinned timers are expected to fire on
the same CPU on which they are queued. So it is essential to identify
these and not migrate them, in case there are any.

For regular timers, the currently existing add_timer_on() can be used
queue pinned timers and subsequently mod_timer_pinned() can be used
to modify the 'expires' field.

For hrtimers, a new interface hrtimer_start_pinned() is created,
which can be used to queue cpu-pinned hrtimer.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/hrtimer.h |   24 ++++++++++++++++++++----
 include/linux/timer.h   |    3 +++
 kernel/hrtimer.c        |   34 ++++++++++++++++++++++++++++------
 kernel/timer.c          |   31 +++++++++++++++++++++++++++----
 4 files changed, 78 insertions(+), 14 deletions(-)

Index: linux.trees.git/include/linux/hrtimer.h
===================================================================
--- linux.trees.git.orig/include/linux/hrtimer.h
+++ linux.trees.git/include/linux/hrtimer.h
@@ -331,23 +331,39 @@ static inline void hrtimer_init_on_stack
 static inline void destroy_hrtimer_on_stack(struct hrtimer *timer) { }
 #endif
 
+#define HRTIMER_NOT_PINNED	0
+#define HRTIMER_PINNED		1
 /* Basic timer operations: */
 extern int hrtimer_start(struct hrtimer *timer, ktime_t tim,
 			 const enum hrtimer_mode mode);
+extern int hrtimer_start_pinned(struct hrtimer *timer, ktime_t tim,
+			const enum hrtimer_mode mode);
 extern int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
-			unsigned long range_ns, const enum hrtimer_mode mode);
+	unsigned long range_ns, const enum hrtimer_mode mode, int pinned);
 extern int hrtimer_cancel(struct hrtimer *timer);
 extern int hrtimer_try_to_cancel(struct hrtimer *timer);
 
-static inline int hrtimer_start_expires(struct hrtimer *timer,
-						enum hrtimer_mode mode)
+static inline int __hrtimer_start_expires(struct hrtimer *timer,
+					enum hrtimer_mode mode, int pinned)
 {
 	unsigned long delta;
 	ktime_t soft, hard;
 	soft = hrtimer_get_softexpires(timer);
 	hard = hrtimer_get_expires(timer);
 	delta = ktime_to_ns(ktime_sub(hard, soft));
-	return hrtimer_start_range_ns(timer, soft, delta, mode);
+	return hrtimer_start_range_ns(timer, soft, delta, mode, pinned);
+}
+
+static inline int hrtimer_start_expires(struct hrtimer *timer,
+						enum hrtimer_mode mode)
+{
+	return __hrtimer_start_expires(timer, mode, HRTIMER_NOT_PINNED);
+}
+
+static inline int hrtimer_start_expires_pinned(struct hrtimer *timer,
+						enum hrtimer_mode mode)
+{
+	return __hrtimer_start_expires(timer, mode, HRTIMER_PINNED);
 }
 
 static inline int hrtimer_restart(struct hrtimer *timer)
Index: linux.trees.git/kernel/hrtimer.c
===================================================================
--- linux.trees.git.orig/kernel/hrtimer.c
+++ linux.trees.git/kernel/hrtimer.c
@@ -193,7 +193,8 @@ struct hrtimer_clock_base *lock_hrtimer_
  * Switch the timer base to the current CPU when possible.
  */
 static inline struct hrtimer_clock_base *
-switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
+switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
+	int pinned)
 {
 	struct hrtimer_clock_base *new_base;
 	struct hrtimer_cpu_base *new_cpu_base;
@@ -897,9 +898,8 @@ remove_hrtimer(struct hrtimer *timer, st
  *  0 on success
  *  1 when the timer was active
  */
-int
-hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, unsigned long delta_ns,
-			const enum hrtimer_mode mode)
+int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
+	unsigned long delta_ns, const enum hrtimer_mode mode, int pinned)
 {
 	struct hrtimer_clock_base *base, *new_base;
 	unsigned long flags;
@@ -911,7 +911,7 @@ hrtimer_start_range_ns(struct hrtimer *t
 	ret = remove_hrtimer(timer, base);
 
 	/* Switch the timer base, if necessary: */
-	new_base = switch_hrtimer_base(timer, base);
+	new_base = switch_hrtimer_base(timer, base, pinned);
 
 	if (mode == HRTIMER_MODE_REL) {
 		tim = ktime_add_safe(tim, new_base->get_time());
@@ -948,6 +948,12 @@ hrtimer_start_range_ns(struct hrtimer *t
 }
 EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
 
+int __hrtimer_start(struct hrtimer *timer, ktime_t tim,
+	const enum hrtimer_mode mode, int pinned)
+{
+	return hrtimer_start_range_ns(timer, tim, 0, mode, pinned);
+}
+
 /**
  * hrtimer_start - (re)start an hrtimer on the current CPU
  * @timer:	the timer to be added
@@ -961,10 +967,26 @@ EXPORT_SYMBOL_GPL(hrtimer_start_range_ns
 int
 hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
 {
-	return hrtimer_start_range_ns(timer, tim, 0, mode);
+	return __hrtimer_start(timer, tim, mode, HRTIMER_NOT_PINNED);
 }
 EXPORT_SYMBOL_GPL(hrtimer_start);
 
+/**
+ * hrtimer_start_pinned - start a CPU-pinned hrtimer
+ * @timer:      the timer to be added
+ * @tim:        expiry time
+ * @mode:       expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
+ *
+ * Returns:
+ *  0 on success
+ *  1 when the timer was active
+ */
+int hrtimer_start_pinned(struct hrtimer *timer,
+	ktime_t tim, const enum hrtimer_mode mode)
+{
+	return __hrtimer_start(timer, tim, mode, HRTIMER_PINNED);
+}
+EXPORT_SYMBOL_GPL(hrtimer_start_pinned);
 
 /**
  * hrtimer_try_to_cancel - try to deactivate a timer
Index: linux.trees.git/include/linux/timer.h
===================================================================
--- linux.trees.git.orig/include/linux/timer.h
+++ linux.trees.git/include/linux/timer.h
@@ -163,7 +163,10 @@ extern void add_timer_on(struct timer_li
 extern int del_timer(struct timer_list * timer);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
 extern int mod_timer_pending(struct timer_list *timer, unsigned long expires);
+extern int mod_timer_pinned(struct timer_list *timer, unsigned long expires);
 
+#define TIMER_NOT_PINNED	0
+#define TIMER_PINNED		1
 /*
  * The jiffies value which is added to now, when there is no timer
  * in the timer wheel:
Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c
+++ linux.trees.git/kernel/timer.c
@@ -601,7 +601,8 @@ static struct tvec_base *lock_timer_base
 }
 
 static inline int
-__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
+__mod_timer(struct timer_list *timer, unsigned long expires,
+						bool pending_only, int pinned)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
@@ -665,7 +666,7 @@ out_unlock:
  */
 int mod_timer_pending(struct timer_list *timer, unsigned long expires)
 {
-	return __mod_timer(timer, expires, true);
+	return __mod_timer(timer, expires, true, TIMER_NOT_PINNED);
 }
 EXPORT_SYMBOL(mod_timer_pending);
 
@@ -699,11 +700,33 @@ int mod_timer(struct timer_list *timer, 
 	if (timer->expires == expires && timer_pending(timer))
 		return 1;
 
-	return __mod_timer(timer, expires, false);
+	return __mod_timer(timer, expires, false, TIMER_NOT_PINNED);
 }
 EXPORT_SYMBOL(mod_timer);
 
 /**
+ * mod_timer_pinned - modify a timer's timeout
+ * @timer: the timer to be modified
+ * @expires: new timeout in jiffies
+ *
+ * mod_timer_pinned() is a way to update the expire field of an
+ * active timer (if the timer is inactive it will be activated)
+ * and not allow the timer to be migrated to a different CPU.
+ *
+ * mod_timer_pinned(timer, expires) is equivalent to:
+ *
+ *     del_timer(timer); timer->expires = expires; add_timer(timer);
+ */
+int mod_timer_pinned(struct timer_list *timer, unsigned long expires)
+{
+	if (timer->expires == expires && timer_pending(timer))
+		return 1;
+
+	return __mod_timer(timer, expires, false, TIMER_PINNED);
+}
+EXPORT_SYMBOL(mod_timer_pinned);
+
+/**
  * add_timer - start a timer
  * @timer: the timer to be added
  *
@@ -1350,7 +1373,7 @@ signed long __sched schedule_timeout(sig
 	expire = timeout + jiffies;
 
 	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
-	__mod_timer(&timer, expire, false);
+	__mod_timer(&timer, expire, false, TIMER_NOT_PINNED);
 	schedule();
 	del_singleshot_timer_sync(&timer);
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [v3 PATCH 2/4] timers: Identifying the existing pinned timers
  2009-03-16 11:10 [v3 PATCH 0/4] timers: Framework for migration of timers Arun R Bharadwaj
  2009-03-16 11:12 ` [v3 PATCH 1/4] timers: Framework for identifying pinned timers Arun R Bharadwaj
@ 2009-03-16 11:13 ` Arun R Bharadwaj
  2009-03-17 13:07   ` Thomas Gleixner
  2009-03-16 11:14 ` [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration Arun R Bharadwaj
  2009-03-16 11:15 ` [v3 PATCH 4/4] timers: logic to move non pinned timers Arun R Bharadwaj
  3 siblings, 1 reply; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-16 11:13 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-16 16:40:45]:

The following pinned hrtimers have been identified and marked:
1)sched_rt_period_timer
2)tick_sched_timer
3)stack_trace_timer_fn

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/x86/kernel/apic/x2apic_uv_x.c |    2 +-
 kernel/sched.c                     |    5 +++--
 kernel/time/tick-sched.c           |    7 ++++---
 kernel/trace/trace_sysprof.c       |    3 ++-
 4 files changed, 10 insertions(+), 7 deletions(-)

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -236,7 +236,7 @@ static void start_rt_bandwidth(struct rt
 
 		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
 		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-		hrtimer_start_expires(&rt_b->rt_period_timer,
+		hrtimer_start_expires_pinned(&rt_b->rt_period_timer,
 				HRTIMER_MODE_ABS);
 	}
 	spin_unlock(&rt_b->rt_runtime_lock);
@@ -1156,7 +1156,8 @@ static __init void init_hrtick(void)
  */
 static void hrtick_start(struct rq *rq, u64 delay)
 {
-	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL);
+	hrtimer_start_pinned(&rq->hrtick_timer, ns_to_ktime(delay),
+				HRTIMER_MODE_REL);
 }
 
 static inline void init_hrtick(void)
Index: linux.trees.git/kernel/time/tick-sched.c
===================================================================
--- linux.trees.git.orig/kernel/time/tick-sched.c
+++ linux.trees.git/kernel/time/tick-sched.c
@@ -348,7 +348,7 @@ void tick_nohz_stop_sched_tick(int inidl
 		ts->idle_expires = expires;
 
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-			hrtimer_start(&ts->sched_timer, expires,
+			hrtimer_start_pinned(&ts->sched_timer, expires,
 				      HRTIMER_MODE_ABS);
 			/* Check, if the timer was already in the past */
 			if (hrtimer_active(&ts->sched_timer))
@@ -394,7 +394,7 @@ static void tick_nohz_restart(struct tic
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
 
 		if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
-			hrtimer_start_expires(&ts->sched_timer,
+			hrtimer_start_expires_pinned(&ts->sched_timer,
 				      HRTIMER_MODE_ABS);
 			/* Check, if the timer was already in the past */
 			if (hrtimer_active(&ts->sched_timer))
@@ -698,7 +698,8 @@ void tick_setup_sched_timer(void)
 
 	for (;;) {
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
-		hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS);
+		hrtimer_start_expires_pinned(&ts->sched_timer,
+						HRTIMER_MODE_ABS);
 		/* Check, if the timer was already in the past */
 		if (hrtimer_active(&ts->sched_timer))
 			break;
Index: linux.trees.git/kernel/trace/trace_sysprof.c
===================================================================
--- linux.trees.git.orig/kernel/trace/trace_sysprof.c
+++ linux.trees.git/kernel/trace/trace_sysprof.c
@@ -203,7 +203,8 @@ static void start_stack_timer(void *unus
 	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	hrtimer->function = stack_trace_timer_fn;
 
-	hrtimer_start(hrtimer, ns_to_ktime(sample_period), HRTIMER_MODE_REL);
+	hrtimer_start_pinned(hrtimer, ns_to_ktime(sample_period),
+				HRTIMER_MODE_REL);
 }
 
 static void start_stack_timers(void)
Index: linux.trees.git/arch/x86/kernel/apic/x2apic_uv_x.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/apic/x2apic_uv_x.c
+++ linux.trees.git/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -455,7 +455,7 @@ static void uv_heartbeat(unsigned long i
 	uv_set_scir_bits(bits);
 
 	/* enable next timer period */
-	mod_timer(timer, jiffies + SCIR_CPU_HB_INTERVAL);
+	mod_timer_pinned(timer, jiffies + SCIR_CPU_HB_INTERVAL);
 }
 
 static void __cpuinit uv_heartbeat_enable(int cpu)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration
  2009-03-16 11:10 [v3 PATCH 0/4] timers: Framework for migration of timers Arun R Bharadwaj
  2009-03-16 11:12 ` [v3 PATCH 1/4] timers: Framework for identifying pinned timers Arun R Bharadwaj
  2009-03-16 11:13 ` [v3 PATCH 2/4] timers: Identifying the existing " Arun R Bharadwaj
@ 2009-03-16 11:14 ` Arun R Bharadwaj
  2009-03-16 15:29   ` Alexey Dobriyan
  2009-03-16 11:15 ` [v3 PATCH 4/4] timers: logic to move non pinned timers Arun R Bharadwaj
  3 siblings, 1 reply; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-16 11:14 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-16 16:40:45]:

This patch creates the /proc/sys sysctl interface at
/proc/sys/kernel/timer_migration

Timer migration is enabled by default.

To disable timer migration, when CONFIG_SCHED_DEBUG = y,

echo 0 > /proc/sys/kernel/timer_migration

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |    1 +
 kernel/sysctl.c       |    8 ++++++++
 3 files changed, 10 insertions(+)

Index: linux.trees.git/include/linux/sched.h
===================================================================
--- linux.trees.git.orig/include/linux/sched.h
+++ linux.trees.git/include/linux/sched.h
@@ -1763,6 +1763,7 @@ extern unsigned int sysctl_sched_child_r
 extern unsigned int sysctl_sched_features;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
+extern unsigned int sysctl_timer_migration;
 
 int sched_nr_latency_handler(struct ctl_table *table, int write,
 		struct file *file, void __user *buffer, size_t *length,
Index: linux.trees.git/kernel/sysctl.c
===================================================================
--- linux.trees.git.orig/kernel/sysctl.c
+++ linux.trees.git/kernel/sysctl.c
@@ -328,6 +328,14 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "timer_migration",
+		.data		= &sysctl_timer_migration,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #endif
 	{
 		.ctl_name	= CTL_UNNUMBERED,
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -7445,6 +7445,7 @@ static void sched_domain_node_span(int n
 #endif /* CONFIG_NUMA */
 
 int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+const_debug unsigned int sysctl_timer_migration = 1;
 
 /*
  * The cpus mask in sched_group and sched_domain hangs off the end.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [v3 PATCH 4/4] timers: logic to move non pinned timers
  2009-03-16 11:10 [v3 PATCH 0/4] timers: Framework for migration of timers Arun R Bharadwaj
                   ` (2 preceding siblings ...)
  2009-03-16 11:14 ` [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration Arun R Bharadwaj
@ 2009-03-16 11:15 ` Arun R Bharadwaj
  2009-03-17 10:22   ` Thomas Gleixner
  3 siblings, 1 reply; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-16 11:15 UTC (permalink / raw)
  To: linux-kernel, linux-pm
  Cc: a.p.zijlstra, ego, tglx, mingo, andi, venkatesh.pallipadi, vatsa,
	arjan, svaidy, Arun Bharadwaj

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-16 16:40:45]:

This patch migrates all non pinned timers and hrtimers to the current
idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
are not migrated.

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 include/linux/sched.h |    9 +++++++++
 kernel/hrtimer.c      |   12 +++++++++++-
 kernel/sched.c        |    5 +++++
 kernel/timer.c        |   13 ++++++++++++-
 4 files changed, 37 insertions(+), 2 deletions(-)

Index: linux.trees.git/kernel/timer.c
===================================================================
--- linux.trees.git.orig/kernel/timer.c
+++ linux.trees.git/kernel/timer.c
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/sched.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -606,7 +607,7 @@ __mod_timer(struct timer_list *timer, un
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret;
+	int ret, current_cpu, preferred_cpu;
 
 	ret = 0;
 
@@ -627,6 +628,16 @@ __mod_timer(struct timer_list *timer, un
 
 	new_base = __get_cpu_var(tvec_bases);
 
+	current_cpu = smp_processor_id();
+	preferred_cpu = get_nohz_load_balancer();
+	if (get_sysctl_timer_migration() && idle_cpu(current_cpu) &&
+			!pinned && preferred_cpu != -1) {
+		new_base = per_cpu(tvec_bases, preferred_cpu);
+		timer_set_base(timer, new_base);
+		timer->expires = expires;
+		internal_add_timer(new_base, timer);
+		goto out_unlock;
+	}
 	if (base != new_base) {
 		/*
 		 * We are trying to schedule the timer on the local CPU.
Index: linux.trees.git/kernel/hrtimer.c
===================================================================
--- linux.trees.git.orig/kernel/hrtimer.c
+++ linux.trees.git/kernel/hrtimer.c
@@ -43,6 +43,8 @@
 #include <linux/seq_file.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
 
 #include <asm/uaccess.h>
 
@@ -198,8 +200,16 @@ switch_hrtimer_base(struct hrtimer *time
 {
 	struct hrtimer_clock_base *new_base;
 	struct hrtimer_cpu_base *new_cpu_base;
+	int current_cpu, preferred_cpu;
+
+	current_cpu = smp_processor_id();
+	preferred_cpu = get_nohz_load_balancer();
+	if (get_sysctl_timer_migration() && !pinned && preferred_cpu != -1
+			&& idle_cpu(current_cpu))
+		new_cpu_base = &per_cpu(hrtimer_bases, preferred_cpu);
+	else
+		new_cpu_base = &__get_cpu_var(hrtimer_bases);
 
-	new_cpu_base = &__get_cpu_var(hrtimer_bases);
 	new_base = &new_cpu_base->clock_base[base->index];
 
 	if (base != new_base) {
Index: linux.trees.git/include/linux/sched.h
===================================================================
--- linux.trees.git.orig/include/linux/sched.h
+++ linux.trees.git/include/linux/sched.h
@@ -265,6 +265,7 @@ static inline int select_nohz_load_balan
 }
 #endif
 
+extern int get_nohz_load_balancer(void);
 /*
  * Only dump TASK_* tasks. (0 for all tasks)
  */
@@ -1769,6 +1770,14 @@ int sched_nr_latency_handler(struct ctl_
 		struct file *file, void __user *buffer, size_t *length,
 		loff_t *ppos);
 #endif
+static inline int get_sysctl_timer_migration(void)
+{
+#ifdef CONFIG_SCHED_DEBUG
+	return sysctl_timer_migration;
+#else
+	return 1;
+#endif
+}
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c
+++ linux.trees.git/kernel/sched.c
@@ -4009,6 +4009,11 @@ static struct {
 	.load_balancer = ATOMIC_INIT(-1),
 };
 
+int get_nohz_load_balancer(void)
+{
+        return atomic_read(&nohz.load_balancer);
+}
+
 /*
  * This routine will try to nominate the ilb (idle load balancing)
  * owner among the cpus whose ticks are stopped. ilb owner will do the idle

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration
  2009-03-16 11:14 ` [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration Arun R Bharadwaj
@ 2009-03-16 15:29   ` Alexey Dobriyan
  2009-03-17 10:13     ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Alexey Dobriyan @ 2009-03-16 15:29 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, tglx, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On Mon, Mar 16, 2009 at 04:44:44PM +0530, Arun R Bharadwaj wrote:
> echo 0 > /proc/sys/kernel/timer_migration

There should be no sysctl for this.

Timer migration should be good enough to be always enabled.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration
  2009-03-16 15:29   ` Alexey Dobriyan
@ 2009-03-17 10:13     ` Ingo Molnar
  0 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2009-03-17 10:13 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Arun R Bharadwaj, linux-kernel, linux-pm, a.p.zijlstra, ego, tglx,
	andi, venkatesh.pallipadi, vatsa, arjan, svaidy


* Alexey Dobriyan <adobriyan@gmail.com> wrote:

> On Mon, Mar 16, 2009 at 04:44:44PM +0530, Arun R Bharadwaj wrote:
> > echo 0 > /proc/sys/kernel/timer_migration
> 
> There should be no sysctl for this.
> 
> Timer migration should be good enough to be always enabled.

It's in the CONFIG_SCHED_DEBUG section - i.e. only available 
under scheduler debugging. (and it's a fair flag for that case)

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 4/4] timers: logic to move non pinned timers
  2009-03-16 11:15 ` [v3 PATCH 4/4] timers: logic to move non pinned timers Arun R Bharadwaj
@ 2009-03-17 10:22   ` Thomas Gleixner
  2009-03-17 11:45     ` Arun R Bharadwaj
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2009-03-17 10:22 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On Mon, 16 Mar 2009, Arun R Bharadwaj wrote:
> @@ -627,6 +628,16 @@ __mod_timer(struct timer_list *timer, un
>  
>  	new_base = __get_cpu_var(tvec_bases);
>  
> +	current_cpu = smp_processor_id();
> +	preferred_cpu = get_nohz_load_balancer();
> +	if (get_sysctl_timer_migration() && idle_cpu(current_cpu) &&
> +			!pinned && preferred_cpu != -1) {
> +		new_base = per_cpu(tvec_bases, preferred_cpu);
> +		timer_set_base(timer, new_base);
> +		timer->expires = expires;
> +		internal_add_timer(new_base, timer);
> +		goto out_unlock;
> +	}

  Err. This change breaks the timer->base logic. Why can't it just
  select the base and use the existing code ?

> @@ -198,8 +200,16 @@ switch_hrtimer_base(struct hrtimer *time
>  {
>  	struct hrtimer_clock_base *new_base;
>  	struct hrtimer_cpu_base *new_cpu_base;
> +	int current_cpu, preferred_cpu;
> +
> +	current_cpu = smp_processor_id();
> +	preferred_cpu = get_nohz_load_balancer();
> +	if (get_sysctl_timer_migration() && !pinned && preferred_cpu != -1
> +			&& idle_cpu(current_cpu))
> +		new_cpu_base = &per_cpu(hrtimer_bases, preferred_cpu);
> +	else
> +		new_cpu_base = &__get_cpu_var(hrtimer_bases);
>  
> -	new_cpu_base = &__get_cpu_var(hrtimer_bases);
>  	new_base = &new_cpu_base->clock_base[base->index];

  Hmm. This can lead to high latencies when you enqueue the timer on
  the other CPU simply because we can not reprogram the timer hardware
  on the other CPU in the CONFIG_HIGH_RES=y case.

  Let's assume we are on CPU0 and try to enqueue the timer on CPU1,
  where the next timer expiry is 5ms away. The timer which we enqueue
  is due in 500us. So you introduce 4.5ms latency.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 4/4] timers: logic to move non pinned timers
  2009-03-17 10:22   ` Thomas Gleixner
@ 2009-03-17 11:45     ` Arun R Bharadwaj
  2009-03-17 13:01       ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-17 11:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

* Thomas Gleixner <tglx@linutronix.de> [2009-03-17 11:22:24]:

> On Mon, 16 Mar 2009, Arun R Bharadwaj wrote:
> > @@ -627,6 +628,16 @@ __mod_timer(struct timer_list *timer, un
> >  
> >  	new_base = __get_cpu_var(tvec_bases);
> >  
> > +	current_cpu = smp_processor_id();
> > +	preferred_cpu = get_nohz_load_balancer();
> > +	if (get_sysctl_timer_migration() && idle_cpu(current_cpu) &&
> > +			!pinned && preferred_cpu != -1) {
> > +		new_base = per_cpu(tvec_bases, preferred_cpu);
> > +		timer_set_base(timer, new_base);
> > +		timer->expires = expires;
> > +		internal_add_timer(new_base, timer);
> > +		goto out_unlock;
> > +	}
> 
>   Err. This change breaks the timer->base logic. Why can't it just
>   select the base and use the existing code ?
>

Sure, I'll take care of this.

> > @@ -198,8 +200,16 @@ switch_hrtimer_base(struct hrtimer *time
> >  {
> >  	struct hrtimer_clock_base *new_base;
> >  	struct hrtimer_cpu_base *new_cpu_base;
> > +	int current_cpu, preferred_cpu;
> > +
> > +	current_cpu = smp_processor_id();
> > +	preferred_cpu = get_nohz_load_balancer();
> > +	if (get_sysctl_timer_migration() && !pinned && preferred_cpu != -1
> > +			&& idle_cpu(current_cpu))
> > +		new_cpu_base = &per_cpu(hrtimer_bases, preferred_cpu);
> > +	else
> > +		new_cpu_base = &__get_cpu_var(hrtimer_bases);
> >  
> > -	new_cpu_base = &__get_cpu_var(hrtimer_bases);
> >  	new_base = &new_cpu_base->clock_base[base->index];
> 
>   Hmm. This can lead to high latencies when you enqueue the timer on
>   the other CPU simply because we can not reprogram the timer hardware
>   on the other CPU in the CONFIG_HIGH_RES=y case.
> 
>   Let's assume we are on CPU0 and try to enqueue the timer on CPU1,
>   where the next timer expiry is 5ms away. The timer which we enqueue
>   is due in 500us. So you introduce 4.5ms latency.
>

We are moving timers to the ilb which wakes up every jiffy.
So we can move the timer to the ilb only if it's expiry
time is greater than 1 jiffy. Else we can fire on the same CPU.
This would prevent any latency from being added, right?

--arun

> Thanks,
> 
> 	tglx
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 4/4] timers: logic to move non pinned timers
  2009-03-17 11:45     ` Arun R Bharadwaj
@ 2009-03-17 13:01       ` Thomas Gleixner
  2009-03-30  5:00         ` Arun R Bharadwaj
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2009-03-17 13:01 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On Tue, 17 Mar 2009, Arun R Bharadwaj wrote:
> * Thomas Gleixner <tglx@linutronix.de> [2009-03-17 11:22:24]:
> >   Let's assume we are on CPU0 and try to enqueue the timer on CPU1,
> >   where the next timer expiry is 5ms away. The timer which we enqueue
> >   is due in 500us. So you introduce 4.5ms latency.
> >
> We are moving timers to the ilb which wakes up every jiffy.
> So we can move the timer to the ilb only if it's expiry
> time is greater than 1 jiffy. Else we can fire on the same CPU.

Please do not start to add some obscure jiffies magic. The correct
check is whether the new timers expiry time is before the first timers
expiry time on the target CPU.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 2/4] timers: Identifying the existing pinned timers
  2009-03-16 11:13 ` [v3 PATCH 2/4] timers: Identifying the existing " Arun R Bharadwaj
@ 2009-03-17 13:07   ` Thomas Gleixner
  2009-03-18  4:36     ` Arun R Bharadwaj
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2009-03-17 13:07 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

On Mon, 16 Mar 2009, Arun R Bharadwaj wrote:
> * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-16 16:40:45]:
> 
> The following pinned hrtimers have been identified and marked:
> 1)sched_rt_period_timer
> 2)tick_sched_timer
> 3)stack_trace_timer_fn

How did you verify that these are the only timers which need to be
pinned ?

Can we be sure that there is no code which relies on the current
behaviour to keep functionality tied together ? Networking and block
layer comes to mind.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 2/4] timers: Identifying the existing pinned timers
  2009-03-17 13:07   ` Thomas Gleixner
@ 2009-03-18  4:36     ` Arun R Bharadwaj
  0 siblings, 0 replies; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-18  4:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

* Thomas Gleixner <tglx@linutronix.de> [2009-03-17 14:07:44]:

> On Mon, 16 Mar 2009, Arun R Bharadwaj wrote:
> > * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-03-16 16:40:45]:
> > 
> > The following pinned hrtimers have been identified and marked:
> > 1)sched_rt_period_timer
> > 2)tick_sched_timer
> > 3)stack_trace_timer_fn
> 
> How did you verify that these are the only timers which need to be
> pinned ?
>

Until 2.6.28, there was a callback mode called HRTIMER_CB_IRQSAFE_PERCPU
which did not allow timers to be migrated, if set.
These were the only 3 hrtimers which were using that callback mode.

--arun

> Can we be sure that there is no code which relies on the current
> behaviour to keep functionality tied together ? Networking and block
> layer comes to mind.
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [v3 PATCH 4/4] timers: logic to move non pinned timers
  2009-03-17 13:01       ` Thomas Gleixner
@ 2009-03-30  5:00         ` Arun R Bharadwaj
  0 siblings, 0 replies; 13+ messages in thread
From: Arun R Bharadwaj @ 2009-03-30  5:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-pm, a.p.zijlstra, ego, mingo, andi,
	venkatesh.pallipadi, vatsa, arjan, svaidy

* Thomas Gleixner <tglx@linutronix.de> [2009-03-17 14:01:00]:

> On Tue, 17 Mar 2009, Arun R Bharadwaj wrote:
> > * Thomas Gleixner <tglx@linutronix.de> [2009-03-17 11:22:24]:
> > >   Let's assume we are on CPU0 and try to enqueue the timer on CPU1,
> > >   where the next timer expiry is 5ms away. The timer which we enqueue
> > >   is due in 500us. So you introduce 4.5ms latency.
> > >
> > We are moving timers to the ilb which wakes up every jiffy.
> > So we can move the timer to the ilb only if it's expiry
> > time is greater than 1 jiffy. Else we can fire on the same CPU.
> 
> Please do not start to add some obscure jiffies magic. The correct
> check is whether the new timers expiry time is before the first timers
> expiry time on the target CPU.
>

Hi Thomas,

Currently, there is a function get_next_timer_interrupt() which
returns the next timer's expiry on the CPU. This is used by
tick_nohz_stop_sched_tick() in order to reprogram the timer device
before stopping the tick.

get_next_timer_interrupt() returns the earliest between the
next timer expiry and the next hrtimer expiry.

In case of the hrtimer expiry, the ktime is rounded off to the next
jiffie and returned.

So using the current infrastructure, there is no benefit in finding
out when the next timer is going to fire on the target cpu over just
checking if the current hrtimer which we are migrating is firing
before the next jiffie or not. If it is firing after 1 jiffie, it can
be migrated and the target cpu (idle load balancer) will reprogram
its timer device accordingly.

--arun
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-03-30  5:01 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-16 11:10 [v3 PATCH 0/4] timers: Framework for migration of timers Arun R Bharadwaj
2009-03-16 11:12 ` [v3 PATCH 1/4] timers: Framework for identifying pinned timers Arun R Bharadwaj
2009-03-16 11:13 ` [v3 PATCH 2/4] timers: Identifying the existing " Arun R Bharadwaj
2009-03-17 13:07   ` Thomas Gleixner
2009-03-18  4:36     ` Arun R Bharadwaj
2009-03-16 11:14 ` [v3 PATCH 3/4] timers: /proc/sys sysctl hook to enable timer migration Arun R Bharadwaj
2009-03-16 15:29   ` Alexey Dobriyan
2009-03-17 10:13     ` Ingo Molnar
2009-03-16 11:15 ` [v3 PATCH 4/4] timers: logic to move non pinned timers Arun R Bharadwaj
2009-03-17 10:22   ` Thomas Gleixner
2009-03-17 11:45     ` Arun R Bharadwaj
2009-03-17 13:01       ` Thomas Gleixner
2009-03-30  5:00         ` Arun R Bharadwaj

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).