* [PATCH 0/6] sched: Cleanup and improve polling idle loops
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

This series reduces the number of IPIs on my workload by something like
99%.  It's down from many hundreds per second to very few.

The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
reliable indication that the idle task is polling.  Once that's done,
the rest is reasonably straightforward.
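
To illustrate the core trick, here's a minimal userspace sketch (C11
atomics; the names are stand-ins for the kernel's thread_info flag
machinery, not the real code): the waker atomically sets NEED_RESCHED
and only pays for a real IPI when the target CPU was not polling.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define TIF_NEED_RESCHED (1u << 0)
  #define TIF_POLLING      (1u << 1)  /* stand-in for TIF_POLLING_NRFLAG */

  static atomic_uint flags;           /* stand-in for thread_info->flags */

  /* Waker side: true means the caller still has to send a real IPI. */
  static bool set_nr_and_not_polling(void)
  {
          unsigned int old = atomic_fetch_or(&flags, TIF_NEED_RESCHED);
          return !(old & TIF_POLLING);
  }

  int main(void)
  {
          atomic_fetch_or(&flags, TIF_POLLING);  /* idle CPU starts polling */
          puts(set_nr_and_not_polling() ? "send IPI" : "IPI avoided");
          return 0;
  }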

Patches 1 and 2 are related improvements: patch 1 teaches the cpuidle
polling loop to tell the scheduler that it's polling, and patch 2 adds
tracepoints so that avoided IPIs are visible.  Patch 3 is a pure
cleanup, patch 4 is the main semantic change, patch 5 is cleanup, and
patch 6 is Peter Zijlstra's code, rebased on top of my stuff and fixed
up a bit.

Andy Lutomirski (5):
  cpuidle: Set polling in poll_idle
  sched,trace: Add a tracepoint for remote wakeups via polling
  sched,idle: Clarify where TIF_POLLING_NRFLAG is set
  sched,idle: Clear polling before descheduling the idle thread
  sched,idle: Simplify wake_up_idle_cpu

Peter Zijlstra (1):
  sched: Optimize ttwu IPI

 drivers/cpuidle/driver.c     |  7 ++--
 include/trace/events/sched.h | 20 +++++++++++
 kernel/sched/core.c          | 79 +++++++++++++++++++++++++++++---------------
 kernel/sched/idle.c          | 22 +++++++++++-
 kernel/sched/sched.h         |  6 ++++
 5 files changed, 105 insertions(+), 29 deletions(-)

-- 
1.9.3



* [PATCH 1/6] cpuidle: Set polling in poll_idle
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

poll_idle is the archetypal polling idle loop; tell the core idle
code about it.

This avoids pointless IPIs when all of the other cpuidle states are
disabled.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/cpuidle/driver.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index 136d6a2..9634f20 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -187,8 +187,11 @@ static int poll_idle(struct cpuidle_device *dev,
 
 	t1 = ktime_get();
 	local_irq_enable();
-	while (!need_resched())
-		cpu_relax();
+	if (!current_set_polling_and_test()) {
+		while (!need_resched())
+			cpu_relax();
+	}
+	current_clr_polling();
 
 	t2 = ktime_get();
 	diff = ktime_to_us(ktime_sub(t2, t1));
-- 
1.9.3
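
For reference, the helper this relies on, current_set_polling_and_test(),
looks roughly like this in this era's include/linux/sched.h (a simplified
sketch; see the tree for the exact definition):

  static inline bool __must_check current_set_polling_and_test(void)
  {
          __current_set_polling();        /* set TIF_POLLING_NRFLAG */

          /*
           * The polling state must be visible before re-testing
           * need_resched; this pairs with the barrier in the waker.
           */
          smp_mb();

          return unlikely(tif_need_resched());
  }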



* [PATCH 2/6] sched,trace: Add a tracepoint for remote wakeups via polling
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

Remote wakeups of polling CPUs are a valuable performance
improvement; add a tracepoint to make it much easier to verify that
they're working.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/trace/events/sched.h | 20 ++++++++++++++++++++
 kernel/sched/core.c          |  4 ++++
 2 files changed, 24 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..08f8632 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,26 @@ TRACE_EVENT(sched_swap_numa,
 			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
 			__entry->dst_cpu, __entry->dst_nid)
 );
+
+/*
+ * Tracepoint for waking a polling cpu without an IPI.
+ */
+TRACE_EVENT(sched_wake_polling_cpu,
+
+	TP_PROTO(int cpu),
+
+	TP_ARGS(cpu),
+
+	TP_STRUCT__entry(
+		__field(	int,	cpu	)
+	),
+
+	TP_fast_assign(
+		__entry->cpu	= cpu;
+	),
+
+	TP_printk("cpu=%d", __entry->cpu)
+);
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 321d800..18bebfc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -564,6 +564,8 @@ void resched_task(struct task_struct *p)
 
 	if (set_nr_and_not_polling(p))
 		smp_send_reschedule(cpu);
+	else
+		trace_sched_wake_polling_cpu(cpu);
 }
 
 void resched_cpu(int cpu)
@@ -647,6 +649,8 @@ static void wake_up_idle_cpu(int cpu)
 	smp_mb();
 	if (!tsk_is_polling(rq->idle))
 		smp_send_reschedule(cpu);
+	else
+		trace_sched_wake_polling_cpu(cpu);
 }
 
 static bool wake_up_full_nohz_cpu(int cpu)
-- 
1.9.3
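
One hedged way to watch the new tracepoint from userspace, assuming a
2014-era debugfs mount at /sys/kernel/debug and root privileges (the
paths may differ on other systems):

  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/sys/kernel/debug/tracing/events/sched/"
                          "sched_wake_polling_cpu/enable", "w");
          if (!f) { perror("enable"); return 1; }
          fputs("1", f);
          fclose(f);

          /* Stream the trace; one line per avoided IPI. */
          f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
          if (!f) { perror("trace_pipe"); return 1; }
          for (int c; (c = fgetc(f)) != EOF; )
                  putchar(c);
          return 0;
  }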



* [PATCH 3/6] sched,idle: Clarify where TIF_POLLING_NRFLAG is set
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

This doesn't change functionality at all, but I've misread this code
so many times that I want to make it a bit more obvious.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/idle.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 25b9423..2ec9f47 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -67,6 +67,10 @@ void __weak arch_cpu_idle(void)
  * cpuidle_idle_call - the main idle function
  *
  * NOTE: no locks or semaphores should be used here
+ *
+ * On archs that support TIF_POLLING_NRFLAG, this function is called
+ * with polling set, and it returns with polling set.  If it ever
+ * stops polling, it must clear the polling bit.
  */
 static void cpuidle_idle_call(void)
 {
@@ -178,7 +182,14 @@ exit_idle:
  */
 static void cpu_idle_loop(void)
 {
+	__current_set_polling();
+
 	while (1) {
+		/*
+		 * Invariant: polling is set here (assuming that the arch
+		 * has a polling bit).
+		 */
+
 		tick_nohz_idle_enter();
 
 		while (!need_resched()) {
@@ -239,7 +250,6 @@ void cpu_startup_entry(enum cpuhp_state state)
 	 */
 	boot_init_stack_canary();
 #endif
-	__current_set_polling();
 	arch_cpu_idle_prepare();
 	cpu_idle_loop();
 }
-- 
1.9.3
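
Schematically, the loop after this patch looks like the following (a
paraphrase of kernel/sched/idle.c with details elided, not the verbatim
source):

  static void cpu_idle_loop(void)
  {
          __current_set_polling();  /* moved here from cpu_startup_entry() */

          while (1) {
                  /* invariant: polling is set (if the arch has the bit) */
                  tick_nohz_idle_enter();

                  while (!need_resched())
                          cpuidle_idle_call();  /* may clear, then re-set */

                  preempt_set_need_resched();
                  tick_nohz_idle_exit();
                  schedule_preempt_disabled();
          }
  }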



* [PATCH 4/6] sched,idle: Clear polling before descheduling the idle thread
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

Currently, the only real guarantee provided by the polling bit is
that, if you hold rq->lock and the polling bit is set, then you can
set need_resched to force a reschedule.

The only reason the lock is needed is that the idle thread might not
be running at all when setting its need_resched bit, and rq->lock
keeps it pinned.

This is easy to fix: just clear the polling bit before scheduling.
Now the polling bit is only ever set when rq->curr == rq->idle.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/idle.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2ec9f47..4de3a24 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -179,17 +179,22 @@ exit_idle:
 
 /*
  * Generic idle loop implementation
+ *
+ * Called with polling cleared.
  */
 static void cpu_idle_loop(void)
 {
-	__current_set_polling();
-
 	while (1) {
 		/*
-		 * Invariant: polling is set here (assuming that the arch
-		 * has a polling bit.
+		 * If the arch has a polling bit, we maintain an invariant:
+		 *
+		 * The polling bit is clear if we're not scheduled (i.e. if
+		 * rq->curr != rq->idle).  This means that, if rq->idle has
+		 * the polling bit set, then setting need_resched is
+		 * guaranteed to cause the cpu to reschedule.
 		 */
 
+		__current_set_polling();
 		tick_nohz_idle_enter();
 
 		while (!need_resched()) {
@@ -229,6 +234,8 @@ static void cpu_idle_loop(void)
 		 */
 		preempt_set_need_resched();
 		tick_nohz_idle_exit();
+		__current_clr_polling();
+		smp_mb__after_clear_bit();
 		schedule_preempt_disabled();
 	}
 }
-- 
1.9.3
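
To see why this matters, here is the interleaving (a hedged
reconstruction) that made the polling bit untrustworthy without
rq->lock before this change:

  idle CPU                              remote waker
  --------                              ------------
  idle thread calls schedule(),
  polling bit still set
  another task is now running           sees rq->idle's polling bit
                                        sets its need_resched, skips IPI
  nothing is watching the idle task's flags, so the kick is lost
  until something else reschedules the CPU.

With the bit cleared before descheduling, a set polling bit implies
rq->curr == rq->idle, so setting need_resched is enough.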



* [PATCH 5/6] sched,idle: Simplify wake_up_idle_cpu
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

Now that rq->idle's polling bit is a reliable indication that the cpu is
polling, use it.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/core.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 18bebfc..dfc3e23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -628,26 +628,7 @@ static void wake_up_idle_cpu(int cpu)
 	if (cpu == smp_processor_id())
 		return;
 
-	/*
-	 * This is safe, as this function is called with the timer
-	 * wheel base lock of (cpu) held. When the CPU is on the way
-	 * to idle and has not yet set rq->curr to idle then it will
-	 * be serialized on the timer wheel base lock and take the new
-	 * timer into account automatically.
-	 */
-	if (rq->curr != rq->idle)
-		return;
-
-	/*
-	 * We can set TIF_RESCHED on the idle task of the other CPU
-	 * lockless. The worst case is that the other CPU runs the
-	 * idle task through an additional NOOP schedule()
-	 */
-	set_tsk_need_resched(rq->idle);
-
-	/* NEED_RESCHED must be visible before we test polling */
-	smp_mb();
-	if (!tsk_is_polling(rq->idle))
+	if (set_nr_and_not_polling(rq->idle))
 		smp_send_reschedule(cpu);
 	else
 		trace_sched_wake_polling_cpu(cpu);
-- 
1.9.3
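
The helper it now shares with resched_task() is, as visible in the
context of patch 6, essentially:

  static bool set_nr_and_not_polling(struct task_struct *p)
  {
          struct thread_info *ti = task_thread_info(p);
          return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) &
                   _TIF_POLLING_NRFLAG);
  }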



* [PATCH 6/6] sched: Optimize ttwu IPI
From: Andy Lutomirski @ 2014-06-04  0:29 UTC
  To: Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, daniel.lezcano, linux-kernel,
	Andy Lutomirski

From: Peter Zijlstra <peterz@infradead.org>

When enqueueing tasks on remote LLC domains, we send an IPI to do the
work 'locally' and avoid bouncing all the cachelines over.

However, when the remote CPU is idle (and polling, say x86 mwait), we
don't need to send an IPI; we can simply kick the TIF word to wake it
up and have the 'idle' loop do the work.

So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
set, set _TIF_NEED_RESCHED and avoid sending the IPI.

[Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Much-requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 kernel/sched/core.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/idle.c  |  3 +++
 kernel/sched/sched.h |  6 ++++++
 3 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dfc3e23..af52702 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -519,7 +519,7 @@ static inline void init_hrtick(void)
  	__old;								\
 })
 
-#ifdef TIF_POLLING_NRFLAG
+#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
 /*
  * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
  * this avoids any races wrt polling state changes and thereby avoids
@@ -530,12 +530,44 @@ static bool set_nr_and_not_polling(struct task_struct *p)
 	struct thread_info *ti = task_thread_info(p);
 	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
 }
+
+/*
+ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
+ *
+ * If this returns true, then the idle task promises to call
+ * sched_ttwu_pending() and reschedule soon.
+ */
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	struct thread_info *ti = task_thread_info(p);
+	typeof(ti->flags) old, val = ACCESS_ONCE(ti->flags);
+
+	for (;;) {
+		if (!(val & _TIF_POLLING_NRFLAG))
+			return false;
+		if (val & _TIF_NEED_RESCHED)
+			return true;
+		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
+		if (old == val)
+			break;
+		val = old;
+	}
+	return true;
+}
+
 #else
 static bool set_nr_and_not_polling(struct task_struct *p)
 {
 	set_tsk_need_resched(p);
 	return true;
 }
+
+#ifdef CONFIG_SMP
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	return false;
+}
+#endif
 #endif
 
 /*
@@ -1490,13 +1522,17 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
 {
 	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
+	unsigned long flags;
 
-	raw_spin_lock(&rq->lock);
+	if (!llist)
+		return;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	while (llist) {
 		p = llist_entry(llist, struct task_struct, wake_entry);
@@ -1504,7 +1540,7 @@ static void sched_ttwu_pending(void)
 		ttwu_do_activate(rq, p, 0);
 	}
 
-	raw_spin_unlock(&rq->lock);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 void scheduler_ipi(void)
@@ -1550,8 +1586,14 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
+		if (!set_nr_if_polling(rq->idle))
+			smp_send_reschedule(cpu);
+		else
+			trace_sched_wake_polling_cpu(cpu);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 4de3a24..9d90909 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -12,6 +12,8 @@
 
 #include <trace/events/power.h>
 
+#include "sched.h"
+
 static int __read_mostly cpu_idle_force_poll;
 
 void cpu_idle_poll_ctrl(bool enable)
@@ -236,6 +238,7 @@ static void cpu_idle_loop(void)
 		tick_nohz_idle_exit();
 		__current_clr_polling();
 		smp_mb__after_clear_bit();
+		sched_ttwu_pending();
 		schedule_preempt_disabled();
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 600e229..99d9e81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -670,6 +670,8 @@ extern int migrate_swap(struct task_struct *, struct task_struct *);
 
 #ifdef CONFIG_SMP
 
+extern void sched_ttwu_pending(void);
+
 #define rcu_dereference_check_sched_domain(p) \
 	rcu_dereference_check((p), \
 			      lockdep_is_held(&sched_domains_mutex))
@@ -787,6 +789,10 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+#else
+
+static inline void sched_ttwu_pending(void) { }
+
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
1.9.3
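
The cmpxchg loop in set_nr_if_polling() above is the standard
read-modify-write retry pattern; a userspace C11 analogue (illustrative
names and simplified flag values, not the kernel code) looks like:

  #include <stdatomic.h>
  #include <stdbool.h>

  #define POLLING      (1u << 0)
  #define NEED_RESCHED (1u << 1)

  /* Set NEED_RESCHED only if POLLING is (still) set. */
  static bool set_nr_if_polling(atomic_uint *flags)
  {
          unsigned int val = atomic_load(flags);

          for (;;) {
                  if (!(val & POLLING))
                          return false;  /* not polling: caller must IPI */
                  if (val & NEED_RESCHED)
                          return true;   /* already set; nothing to do */
                  /* On failure this reloads val and the loop retries. */
                  if (atomic_compare_exchange_weak(flags, &val,
                                                   val | NEED_RESCHED))
                          return true;
          }
  }

  int main(void)
  {
          atomic_uint flags = POLLING;
          return set_nr_if_polling(&flags) ? 0 : 1;  /* 0: IPI avoided */
  }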



* Re: [PATCH 4/6] sched,idle: Clear polling before descheduling the idle thread
From: Peter Zijlstra @ 2014-06-04  7:44 UTC
  To: Andy Lutomirski
  Cc: umgwanakikbuti, mingo, tglx, nicolas.pitre, daniel.lezcano,
	linux-kernel

On Tue, Jun 03, 2014 at 05:29:50PM -0700, Andy Lutomirski wrote:
> Currently, the only real guarantee provided by the polling bit is
> that, if you hold rq->lock and the polling bit is set, then you can
> set need_resched to force a reschedule.
> 
> The only reason the lock is needed is that the idle thread might not
> be running at all when setting its need_resched bit, and rq->lock
> keeps it pinned.
> 
> This is easy to fix: just clear the polling bit before scheduling.
> Now the polling bit is only ever set when rq->curr == rq->idle.

Yah, except of course:

  lkml.kernel.org/r/20131120162736.508462614@infradead.org

which I really need to rebase and post again :/

In any case, this is useful even with that, although then we really must
do something like:

  rcu_read_lock();
  if (!set_nr_if_polling(rq->curr))
  	smp_send_reschedule(rq->cpu);
  rcu_read_unlock();

Because there's other tasks than rq->idle which might be 'idle', joy!


* Re: [PATCH 4/6] sched,idle: Clear polling before descheduling the idle thread
From: Peter Zijlstra @ 2014-06-04  7:53 UTC
  To: Andy Lutomirski
  Cc: umgwanakikbuti, mingo, tglx, nicolas.pitre, daniel.lezcano,
	linux-kernel

On Tue, Jun 03, 2014 at 05:29:50PM -0700, Andy Lutomirski wrote:
> @@ -229,6 +234,8 @@ static void cpu_idle_loop(void)
>  		 */
>  		preempt_set_need_resched();
>  		tick_nohz_idle_exit();
> +		__current_clr_polling();
> +		smp_mb__after_clear_bit();

barriers always need a comment, and I'm not entirely sure why you put
this one here.

Merging the NR set and POLLING tests into a single atomic made the
entire barrier situation somewhat more complicated and I really need to
rethink that.

>  		schedule_preempt_disabled();
>  	}
>  }
> -- 
> 1.9.3
> 


* Re: [PATCH 2/6] sched,trace: Add a tracepoint for remote wakeups via polling
From: Daniel Lezcano @ 2014-06-04 13:00 UTC
  To: Andy Lutomirski, Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, linux-kernel

On 06/04/2014 02:29 AM, Andy Lutomirski wrote:
> Remote wakeups of polling CPUs are a valuable performance
> improvement; add a tracepoint to make it much easier to verify that
> they're working.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>

I don't think this trace makes sense. The polling state is x86 only 
and this trace is in the generic code.

Furthermore, you may not be in the polling state but in the idle main 
loop before or after the idle state, so the trace will be wrong for the 
purpose you are aiming at.

IMO, the right place would be in 'poll_idle', but why add such a trace?

If, on x86, we exit poll_idle, we have the idle state exit trace. The 
missing information would be the origin of the 'wakeup' (irq or ipi or 
nothing). What's missing are the IPI traces [1].

And as a side note, the polling state could be rare on a system with a 
cpuidle driver; it should be much easier to restrict the idle states 
to 'poll' only and check that there are no IPI_WAKEUPs, no?

   -- Daniel

[1] 
http://lists.linaro.org/pipermail/linaro-kernel/2013-November/008581.html

> ---
>   include/trace/events/sched.h | 20 ++++++++++++++++++++
>   kernel/sched/core.c          |  4 ++++
>   2 files changed, 24 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 67e1bbf..08f8632 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -530,6 +530,26 @@ TRACE_EVENT(sched_swap_numa,
>   			__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
>   			__entry->dst_cpu, __entry->dst_nid)
>   );
> +
> +/*
> + * Tracepoint for waking a polling cpu without an IPI.
> + */
> +TRACE_EVENT(sched_wake_polling_cpu,
> +
> +	TP_PROTO(int cpu),
> +
> +	TP_ARGS(cpu),
> +
> +	TP_STRUCT__entry(
> +		__field(	int,	cpu	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->cpu	= cpu;
> +	),
> +
> +	TP_printk("cpu=%d", __entry->cpu)
> +);
>   #endif /* _TRACE_SCHED_H */
>
>   /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 321d800..18bebfc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -564,6 +564,8 @@ void resched_task(struct task_struct *p)
>
>   	if (set_nr_and_not_polling(p))
>   		smp_send_reschedule(cpu);
> +	else
> +		trace_sched_wake_polling_cpu(cpu);
>   }
>
>   void resched_cpu(int cpu)
> @@ -647,6 +649,8 @@ static void wake_up_idle_cpu(int cpu)
>   	smp_mb();
>   	if (!tsk_is_polling(rq->idle))
>   		smp_send_reschedule(cpu);
> +	else
> +		trace_sched_wake_polling_cpu(cpu);
>   }
>
>   static bool wake_up_full_nohz_cpu(int cpu)
>





* Re: [PATCH 3/6] sched,idle: Clarify where TIF_POLLING_NRFLAG is set
From: Daniel Lezcano @ 2014-06-04 13:03 UTC
  To: Andy Lutomirski, Peter Zijlstra, umgwanakikbuti
  Cc: mingo, tglx, nicolas.pitre, linux-kernel

On 06/04/2014 02:29 AM, Andy Lutomirski wrote:
> This doesn't change functionality at all, but I've misread this code
> so many times that I want to make it a bit more obvious.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>   kernel/sched/idle.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 25b9423..2ec9f47 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -67,6 +67,10 @@ void __weak arch_cpu_idle(void)
>    * cpuidle_idle_call - the main idle function
>    *
>    * NOTE: no locks or semaphores should be used here
> + *
> + * On archs that support TIF_POLLING_NRFLAG, this function is called
> + * with polling set, and it returns with polling set.  If it ever
> + * stops polling, it must clear the polling bit.
>    */
>   static void cpuidle_idle_call(void)
>   {
> @@ -178,7 +182,14 @@ exit_idle:
>    */
>   static void cpu_idle_loop(void)
>   {
> +	__current_set_polling();
> +
>   	while (1) {
> +		/*
> +		 * Invariant: polling is set here (assuming that the arch
> +		 * has a polling bit).
> +		 */
> +

nit : extra line

>   		tick_nohz_idle_enter();
>
>   		while (!need_resched()) {
> @@ -239,7 +250,6 @@ void cpu_startup_entry(enum cpuhp_state state)
>   	 */
>   	boot_init_stack_canary();
>   #endif
> -	__current_set_polling();

I don't get the connection with the patch description.

>   	arch_cpu_idle_prepare();
>   	cpu_idle_loop();
>   }
>





* Re: [PATCH 2/6] sched,trace: Add a tracepoint for remote wakeups via polling
From: Peter Zijlstra @ 2014-06-04 13:54 UTC
  To: Daniel Lezcano
  Cc: Andy Lutomirski, umgwanakikbuti, mingo, tglx, nicolas.pitre,
	linux-kernel

On Wed, Jun 04, 2014 at 03:00:26PM +0200, Daniel Lezcano wrote:
> On 06/04/2014 02:29 AM, Andy Lutomirski wrote:
> >Remote wakeups of polling CPUs are a valuable performance
> >improvement; add a tracepoint to make it much easier to verify that
> >they're working.
> >
> >Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> 
> I don't think this trace makes sense. The polling state is x86 only and this
> trace is in the generic code.

Not so; there's PPC using it someplace (ISTR it's doing a polling
amortization of the idle hypercall or somesuch), and there are archs
that have no idle means other than polling, as well as some for which
it's optional.

There are indeed archs, like arm, that only do the WFI thing, and for
those this is indeed all a NOP.




* Re: [PATCH 2/6] sched,trace: Add a tracepoint for remote wakeups via polling
From: Andy Lutomirski @ 2014-06-04 14:43 UTC
  To: Daniel Lezcano
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Thomas Gleixner,
	nicolas.pitre, linux-kernel@vger.kernel.org

On Wed, Jun 4, 2014 at 6:00 AM, Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
> On 06/04/2014 02:29 AM, Andy Lutomirski wrote:
>>
>> Remote wakeups of polling CPUs are a valuable performance
>> improvement; add a tracepoint to make it much easier to verify that
>> they're working.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>
>
> I don't think this trace makes sense. The polling state is x86 only and this
> trace is in the generic code.
>
> Furthermore, you may not be in the polling state but in the idle main loop
> before or after the idle state, so the trace will be wrong for the purpose
> you are aiming at.
>
> IMO, the right place would be in 'poll_idle', but why add such a trace?
>
> If, on x86, we exit poll_idle, we have the idle state exit trace. The
> missing information would be the origin of the 'wakeup' (irq or ipi or
> nothing). What's missing are the IPI traces [1].
>
> And as a side note, the polling state could be rare on a system with a
> cpuidle driver; it should be much easier to restrict the idle states to
> 'poll' only and check that there are no IPI_WAKEUPs, no?

While developing and testing this, I found it quite useful to measure
the ratio of IPIs to non-IPIs.  I think most of your objection stems
from the fact that I named this tracepoint rather poorly.  How about
"sched_wake_idle_without_ipi"?  I don't think there's anywhere good to
put it in arch / cpuidle driver code (neither the true polling loop
nor (AFAIK) x86 mwait gives the sleeping cpu any straightforward
indication of why it woke up).

On archs without TIF_POLLING_NRFLAG, the compiler should optimize this
away completely.  On archs with TIF_POLLING_NRFLAG but no actual
polling, it could be interesting to see if this tracepoint ever hits
and, if not, consider removing TIF_POLLING_NRFLAG.

--Andy


* Re: [PATCH 4/6] sched,idle: Clear polling before descheduling the idle thread
From: Andy Lutomirski @ 2014-06-04 14:46 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Thomas Gleixner, nicolas.pitre,
	Daniel Lezcano, linux-kernel@vger.kernel.org

On Wed, Jun 4, 2014 at 12:53 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 03, 2014 at 05:29:50PM -0700, Andy Lutomirski wrote:
>> @@ -229,6 +234,8 @@ static void cpu_idle_loop(void)
>>                */
>>               preempt_set_need_resched();
>>               tick_nohz_idle_exit();
>> +             __current_clr_polling();
>> +             smp_mb__after_clear_bit();
>
> barriers always need a comment, and I'm not entirely sure why you put
> this one here.

Will do.

I actually suspect that the barrier isn't needed in this patch, but it
will be needed one patch later when there's a sched_ttwu_pending after
the barrier.  I'll move this to the next patch and add a comment.

The reason that the barrier is needed is that polling needs to be
visibly cleared before checking for queued ttwu tasks, because
otherwise another CPU could queue a ttwu task, do set_nr_if_polling,
and get a true return value *after* sched_ttwu_pending because the CPU
decided to defer the clear_bit.
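
In timeline form (a hedged reconstruction; this is the classic
store-buffer pattern):

  idle CPU                              waking CPU
  --------                              ----------
  clear TIF_POLLING_NRFLAG              llist_add() to rq->wake_list
  (store not yet visible remotely)      set_nr_if_polling() still sees
  sched_ttwu_pending() sees an          POLLING set: returns true, no IPI
  empty wake_list
  schedule()                            the wakeup is stranded on wake_list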

--Andy


* Re: [PATCH 4/6] sched,idle: Clear polling before descheduling the idle thread
From: Andy Lutomirski @ 2014-06-04 14:57 UTC
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Thomas Gleixner, nicolas.pitre,
	Daniel Lezcano, linux-kernel@vger.kernel.org

On Wed, Jun 4, 2014 at 12:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 03, 2014 at 05:29:50PM -0700, Andy Lutomirski wrote:
>> Currently, the only real guarantee provided by the polling bit is
>> that, if you hold rq->lock and the polling bit is set, then you can
>> set need_resched to force a reschedule.
>>
>> The only reason the lock is needed is that the idle thread might not
>> be running at all when setting its need_resched bit, and rq->lock
>> keeps it pinned.
>>
>> This is easy to fix: just clear the polling bit before scheduling.
>> Now the polling bit is only ever set when rq->curr == rq->idle.
>
> Yah, except of course:
>
>   lkml.kernel.org/r/20131120162736.508462614@infradead.org

Wow, that code was ugly.

Can you be persuaded to hold off on that patch until after this series
is done?  I think the cpu_startup_entry change will just muddy the
waters for now.

>
> which I really need to rebase and post again :/
>
> In any case, this is useful even with that, although then we really must
> do something like:
>
>   rcu_read_lock();
>   if (!set_nr_if_polling(rq->curr))
>         smp_send_reschedule(rq->cpu);
>   rcu_read_unlock();
>
> Because there's other tasks than rq->idle which might be 'idle', joy!

Is this really a problem?

It's certainly the case that a non-idle (in the sense of != rq->idle)
task can have the polling bit set, but so what?  I'm perfectly happy
to wake these ersatz idle things via IPI instead of via
TIF_NEED_RESCHED.  I think that all of the code that plays with the
polling bit after all these patches are applied either holds rq->lock
(which prevents the task from going away) or acts directly on
rq->idle, which really has no business being deallocated.

I think that the RCU solution is actually racy, too.  Consider:

If rq->curr is some thermal sort-of-idle thing that has polling
set, and no one has made those tasks respect the new polling semantics
from this patch, then this could happen:

CPU A:
rcu_read_lock();
load rq->curr

                           CPU B:
                           deschedule rq->curr and schedule something else

CPU A:
set_nr_if_polling(rq->curr) returns true
no IPI sent
rcu_read_unlock()

RCU prevents a use-after-free, but we can still fail to send the
required IPI.

Using rq->idle in combination with this patch should prevent that
race.  It's possible for there to be weird timing things that cause
unnecessary IPIs to be sent, but my workload (which, I suspect, is
unusually heavy on remote wakeups from idle) sees something like 99%
of them avoiding the IPI.

--Andy


* Re: [PATCH 3/6] sched,idle: Clarify where TIF_POLLING_NRFLAG is set
From: Andy Lutomirski @ 2014-06-04 14:58 UTC
  To: Daniel Lezcano
  Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Thomas Gleixner,
	nicolas.pitre, linux-kernel@vger.kernel.org

On Wed, Jun 4, 2014 at 6:03 AM, Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
> On 06/04/2014 02:29 AM, Andy Lutomirski wrote:
>>
>> This doesn't change functionality at all, but I've misread this code
>> so many times that I want to make it a bit more obvious.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>   kernel/sched/idle.c | 12 +++++++++++-
>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 25b9423..2ec9f47 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -67,6 +67,10 @@ void __weak arch_cpu_idle(void)
>>    * cpuidle_idle_call - the main idle function
>>    *
>>    * NOTE: no locks or semaphores should be used here
>> + *
>> + * On archs that support TIF_POLLING_NRFLAG, this function is called
>> + * with polling set, and it returns with polling set.  If it ever
>> + * stops polling, it must clear the polling bit.
>>    */
>>   static void cpuidle_idle_call(void)
>>   {
>> @@ -178,7 +182,14 @@ exit_idle:
>>    */
>>   static void cpu_idle_loop(void)
>>   {
>> +       __current_set_polling();
>> +
>>         while (1) {
>> +               /*
>> +                * Invariant: polling is set here (assuming that the arch
>> +                * has a polling bit).
>> +                */
>> +
>
>
> nit : extra line
>

It was intentional: the comment is about this point in the loop, not
about the tick_nohz_idle_enter call.  I'm not attached to it, though.

>
>>                 tick_nohz_idle_enter();
>>
>>                 while (!need_resched()) {
>> @@ -239,7 +250,6 @@ void cpu_startup_entry(enum cpuhp_state state)
>>          */
>>         boot_init_stack_canary();
>>   #endif
>> -       __current_set_polling();
>
>
> I don't get the connection with the patch description.
>

In retrospect, I think I can just fold this patch into the one after
it.  I wrote this because I could never remember who is responsible
for setting polling in the first place.

--Andy


* Re: [PATCH 4/6] sched,idle: Clear polling before descheduling the idle thread
From: Peter Zijlstra @ 2014-06-04 16:03 UTC
  To: Andy Lutomirski
  Cc: Mike Galbraith, Ingo Molnar, Thomas Gleixner, nicolas.pitre,
	Daniel Lezcano, linux-kernel@vger.kernel.org

On Wed, Jun 04, 2014 at 07:57:07AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 4, 2014 at 12:44 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jun 03, 2014 at 05:29:50PM -0700, Andy Lutomirski wrote:
> >> Currently, the only real guarantee provided by the polling bit is
> >> that, if you hold rq->lock and the polling bit is set, then you can
> >> set need_resched to force a reschedule.
> >>
> >> The only reason the lock is needed is that the idle thread might not
> >> be running at all when setting its need_resched bit, and rq->lock
> >> keeps it pinned.
> >>
> >> This is easy to fix: just clear the polling bit before scheduling.
> >> Now the polling bit is only ever set when rq->curr == rq->idle.
> >
> > Yah, except of course:
> >
> >   lkml.kernel.org/r/20131120162736.508462614@infradead.org
> 
> Wow, that code was ugly.
> 
> Can you be persuaded to hold off on that patch until after this series
> is done?  I think the cpu_startup_entry change will just muddy the
> waters for now.

Sure, no problem with that.

> > which I really need to rebase and post again :/
> >
> > In any case, this is useful even with that, although then we really must
> > do something like:
> >
> >   rcu_read_lock();
> >   if (!set_nr_if_polling(rq->curr))
> >         smp_send_reschedule(rq->cpu);
> >   rcu_read_unlock();
> >
> > Because there's other tasks than rq->idle which might be 'idle', joy!
> 
> Is this really a problem?
> 
> It's certainly the case that a non-idle (in the sense of != rq->idle)
> task can have the polling bit set, but so what?  I'm perfectly happy
> to wake these ersatz idle things via IPI instead of via
> TIF_NEED_RESCHED.  I think that all of the code that plays with the
> polling bit after all these patches are applied either holds rq->lock
> (which prevents the task from going away) or acts directly on
> rq->idle, which really has no business being deallocated.
> 
> I think that the RCU solution is actually racy, too.  Consider:

Indeed, ok so since I really hate the idle injection crap anyhow, I
don't particularly care if we send them extra IPIs if that means we
avoid extra pain for the sane case.

So I'll drop that part of the rebased patch and we'll focus on rq->idle.



