* [PATCH 0/8] Cure faux idle wreckage
@ 2013-11-26 15:57 Peter Zijlstra
2013-11-26 15:57 ` [PATCH 1/8] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
` (8 more replies)
0 siblings, 9 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
paulmck, Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm
Respin of the earlier series that tries to cure the 2 idle injection drivers
and cleans up some of the preempt_enable_no_resched() mess.
The intel_powerclamp driver has been tested by Jacob Pan and needs one more patch to
cpuidle to work as before. I'll let him provide this patch, since he actually
has it and has tested it.
Jacob also said he'll try and work with the QoS people to sort out the conflict
of interest between the idle injectors and the QoS framework.
Can someone please test acpi_pad? Rafael, since the original author seems MIA
and you're the overall ACPI maintainer, can you appoint a person who knows
what he's doing? Alternatively, Jacob, would you be willing to have a look at
that thing? Better still, rm drivers/acpi/acpi_pad.c?
Thomas, can you pick this series up and merge it into -tip provided acpi_pad
works?
---
Changes since the earlier version:
- fixed a few build issues; thanks Jacob for spotting them
- Added PF_IDLE so that is_idle_task() can work for the faux idle
tasks, which in turn is required for RCU-idle support.
- added an rcu_sleep_check() to play_idle() to ensure we don't try
to play idle while holding rcu_read_lock(), which would defeat
the previous point.
- changed the net busy_poll over to local_clock().
* [PATCH 1/8] x86, acpi, idle: Restructure the mwait idle routines
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 15:57 ` [PATCH 2/8] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
` (7 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm, Rafael J. Wysocki
[-- Attachment #1: peter_zijlstra-x86_acpi_idle-restructure_the_mwait_idle_routines.patch --]
[-- Type: text/plain, Size: 6940 bytes --]
People seem to delight in writing wrong and broken mwait idle routines;
collapse the lot.
This leaves mwait_play_dead() as the sole remaining user of __mwait();
new __mwait() users are probably doing it wrong.
Also remove __sti_mwait() as it's unused.
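To illustrate the consolidation, a rough before/after sketch of a driver call
site (eax/ecx stand for whatever hint values the driver already computed for
its target C-state):

	/* before: every driver open-coded the MONITOR/MWAIT dance */
	__monitor((void *)&current_thread_info()->flags, 0, 0);
	smp_mb();
	if (!need_resched())
		__mwait(eax, ecx);

	/* after: one shared helper, which also handles the polling flag */
	mwait_idle_with_hints(eax, ecx);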
Cc: arjan@linux.intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: lenb@kernel.org
Cc: rui.zhang@intel.com
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/include/asm/mwait.h | 40 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/processor.h | 23 ---------------------
arch/x86/kernel/acpi/cstate.c | 23 ---------------------
drivers/acpi/acpi_pad.c | 5 ----
drivers/acpi/processor_idle.c | 15 -------------
drivers/idle/intel_idle.c | 8 -------
drivers/thermal/intel_powerclamp.c | 4 ---
7 files changed, 43 insertions(+), 75 deletions(-)
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -1,6 +1,8 @@
#ifndef _ASM_X86_MWAIT_H
#define _ASM_X86_MWAIT_H
+#include <linux/sched.h>
+
#define MWAIT_SUBSTATE_MASK 0xf
#define MWAIT_CSTATE_MASK 0xf
#define MWAIT_SUBSTATE_SIZE 4
@@ -13,4 +15,42 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
+static inline void __monitor(const void *eax, unsigned long ecx,
+ unsigned long edx)
+{
+ /* "monitor %eax, %ecx, %edx;" */
+ asm volatile(".byte 0x0f, 0x01, 0xc8;"
+ :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+ /* "mwait %eax, %ecx;" */
+ asm volatile(".byte 0x0f, 0x01, 0xc9;"
+ :: "a" (eax), "c" (ecx));
+}
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
+{
+ if (!current_set_polling_and_test()) {
+ if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+ clflush((void *)&current_thread_info()->flags);
+
+ __monitor((void *)&current_thread_info()->flags, 0, 0);
+ if (!need_resched())
+ __mwait(eax, ecx);
+ }
+ __current_clr_polling();
+}
+
#endif /* _ASM_X86_MWAIT_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -700,29 +700,6 @@ static inline void sync_core(void)
#endif
}
-static inline void __monitor(const void *eax, unsigned long ecx,
- unsigned long edx)
-{
- /* "monitor %eax, %ecx, %edx;" */
- asm volatile(".byte 0x0f, 0x01, 0xc8;"
- :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
- /* "mwait %eax, %ecx;" */
- asm volatile(".byte 0x0f, 0x01, 0xc9;"
- :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
- trace_hardirqs_on();
- /* "mwait %eax, %ecx;" */
- asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
- :: "a" (eax), "c" (ecx));
-}
-
extern void select_idle_routine(const struct cpuinfo_x86 *c);
extern void init_amd_e400_c1e_mask(void);
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,29 +150,6 @@ int acpi_processor_ffh_cstate_probe(unsi
}
EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
- if (!need_resched()) {
- if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
- clflush((void *)&current_thread_info()->flags);
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __mwait(ax, cx);
- }
-}
-
void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
{
unsigned int cpu = smp_processor_id();
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,10 +193,7 @@ static int power_saving_thread(void *dat
CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
stop_critical_timings();
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __mwait(power_saving_mwait_eax, 1);
+ mwait_idle_with_hints(power_saving_mwait_eax, 1);
start_critical_timings();
if (lapic_marked_unstable)
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -727,11 +727,6 @@ static int acpi_idle_enter_c1(struct cpu
if (unlikely(!pr))
return -EINVAL;
- if (cx->entry_method == ACPI_CSTATE_FFH) {
- if (current_set_polling_and_test())
- return -EINVAL;
- }
-
lapic_timer_state_broadcast(pr, cx, 1);
acpi_idle_do_entry(cx);
@@ -785,11 +780,6 @@ static int acpi_idle_enter_simple(struct
if (unlikely(!pr))
return -EINVAL;
- if (cx->entry_method == ACPI_CSTATE_FFH) {
- if (current_set_polling_and_test())
- return -EINVAL;
- }
-
/*
* Must be done before busmaster disable as we might need to
* access HPET !
@@ -841,11 +831,6 @@ static int acpi_idle_enter_bm(struct cpu
}
}
- if (cx->entry_method == ACPI_CSTATE_FFH) {
- if (current_set_polling_and_test())
- return -EINVAL;
- }
-
acpi_unlazy_tlb(smp_processor_id());
/* Tell the scheduler that we are going deep-idle: */
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,13 +359,7 @@ static int intel_idle(struct cpuidle_dev
if (!(lapic_timer_reliable_states & (1 << (cstate))))
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
- if (!current_set_polling_and_test()) {
-
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __mwait(eax, ecx);
- }
+ mwait_idle_with_hints(eax, ecx);
if (!(lapic_timer_reliable_states & (1 << (cstate))))
clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,9 +438,7 @@ static int clamp_thread(void *arg)
*/
local_touch_nmi();
stop_critical_timings();
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- cpu_relax(); /* allow HT sibling to run */
- __mwait(eax, ecx);
+ mwait_idle_with_hints(eax, ecx);
start_critical_timings();
atomic_inc(&idle_wakeup_counter);
}
* [PATCH 2/8] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
2013-11-26 15:57 ` [PATCH 1/8] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 15:57 ` [PATCH 3/8] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
` (6 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm
[-- Attachment #1: peterz-preempt_fold_need_resched.patch --]
[-- Type: text/plain, Size: 3572 bytes --]
With various drivers wanting to inject idle time, we get people
calling idle routines outside of the idle loop proper.
Therefore we need to be extra careful about not missing
TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.
While looking at this, I also realized there's a small window in the
existing idle loop where we can miss TIF_NEED_RESCHED; when it hits
right after the tif_need_resched() test at the end of the loop but
right before the need_resched() test at the start of the loop.
So move preempt_fold_need_resched() out of the loop where we're
guaranteed to have TIF_NEED_RESCHED set.
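For reference, a sketch (not part of the diff below) of the reworked exit
path in kernel/cpu/idle.c; the folding now happens unconditionally once we
have left the inner loop:

	while (1) {
		tick_nohz_idle_enter();

		while (!need_resched()) {
			/* ... the usual cpuidle / polling idle ... */
		}

		/*
		 * We fell out of the inner loop, so TIF_NEED_RESCHED is set;
		 * propagate it into PREEMPT_NEED_RESCHED -- a polling idle
		 * loop never got an IPI to fold it for us.
		 */
		preempt_set_need_resched();
		tick_nohz_idle_exit();
		schedule_preempt_disabled();
	}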
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/include/asm/mwait.h | 2 +-
include/linux/preempt.h | 15 +++++++++++++++
include/linux/sched.h | 15 +++++++++++++++
kernel/cpu/idle.c | 17 ++++++++++-------
kernel/sched/core.c | 3 +--
5 files changed, 42 insertions(+), 10 deletions(-)
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -50,7 +50,7 @@ static inline void mwait_idle_with_hints
if (!need_resched())
__mwait(eax, ecx);
}
- __current_clr_polling();
+ current_clr_polling();
}
#endif /* _ASM_X86_MWAIT_H */
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -116,6 +116,21 @@ do { \
#endif /* CONFIG_PREEMPT_COUNT */
+#ifdef CONFIG_PREEMPT
+#define preempt_set_need_resched() \
+do { \
+ set_preempt_need_resched(); \
+} while (0)
+#define preempt_fold_need_resched() \
+do { \
+ if (tif_need_resched()) \
+ set_preempt_need_resched(); \
+} while (0)
+#else
+#define preempt_set_need_resched() do { } while (0)
+#define preempt_fold_need_resched() do { } while (0)
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
struct preempt_notifier;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2628,6 +2628,21 @@ static inline bool __must_check current_
}
#endif
+static inline void current_clr_polling(void)
+{
+ __current_clr_polling();
+
+ /*
+ * Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
+ * Once the bit is cleared, we'll get IPIs with every new
+ * TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
+ * fold.
+ */
+ smp_mb(); /* paired with resched_task() */
+
+ preempt_fold_need_resched();
+}
+
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -105,14 +105,17 @@ static void cpu_idle_loop(void)
__current_set_polling();
}
arch_cpu_idle_exit();
- /*
- * We need to test and propagate the TIF_NEED_RESCHED
- * bit here because we might not have send the
- * reschedule IPI to idle tasks.
- */
- if (tif_need_resched())
- set_preempt_need_resched();
}
+
+ /*
+ * Since we fell out of the loop above, we know
+ * TIF_NEED_RESCHED must be set, propagate it into
+ * PREEMPT_NEED_RESCHED.
+ *
+ * This is required because for polling idle loops we will
+ * not have had an IPI to fold the state for us.
+ */
+ preempt_set_need_resched();
tick_nohz_idle_exit();
schedule_preempt_disabled();
}
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1499,8 +1499,7 @@ void scheduler_ipi(void)
* TIF_NEED_RESCHED remotely (for the first time) will also send
* this IPI.
*/
- if (tif_need_resched())
- set_preempt_need_resched();
+ preempt_fold_need_resched();
if (llist_empty(&this_rq()->wake_list)
&& !tick_nohz_full_cpu(smp_processor_id())
* [PATCH 3/8] idle, thermal, acpi: Remove home grown idle implementations
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
2013-11-26 15:57 ` [PATCH 1/8] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
2013-11-26 15:57 ` [PATCH 2/8] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 15:57 ` [PATCH 4/8] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
` (5 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
paulmck, Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm, Rafael J. Wysocki
[-- Attachment #1: peterz-fixup-intel_clamp-mess.patch --]
[-- Type: text/plain, Size: 11074 bytes --]
People are starting to grow their own idle implementations in various
disgusting ways. Collapse the lot and use the generic idle code to
provide a proper idle cycle implementation.
This does not fully preserve existing behaviour, in that the generic
idle cycle function calls into the normal cpuidle-governed idle
routines and should thus respect things like QoS parameters.
If people want to override the idle state they should talk to the
cpuidle folks about extending the interface and attempt to preserve
QoS guarantees, instead of jumping straight to the deepest possible
C-state -- Jacob Pan said he was going to do this.
This is reported to work for intel_powerclamp by Jacob Pan; the
acpi_pad driver is untested.
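For reference, a minimal sketch of the shape an idle-injection driver ends up
with after this patch. Only play_idle() (taking a duration in jiffies, added
below) is from this series; the thread name, the 100ms figure and the sleep
handling are made up, and the per-CPU kthread creation/binding boilerplate is
elided:

static int idle_inject_thread(void *arg)
{
	static const struct sched_param param = {
		.sched_priority = MAX_USER_RT_PRIO / 2,
	};
	unsigned long duration_jiffies = msecs_to_jiffies(100);

	sched_setscheduler(current, SCHED_FIFO, &param);

	while (!kthread_should_stop()) {
		/* forced idle period; runs the generic, cpuidle-governed idle cycle */
		play_idle(duration_jiffies);

		/* non-idle part of the injection period */
		schedule_timeout_interruptible(duration_jiffies);
	}
	return 0;
}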
Cc: lenb@kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: arjan@linux.intel.com
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
drivers/acpi/acpi_pad.c | 41 -----------
drivers/thermal/intel_powerclamp.c | 38 ----------
include/linux/cpu.h | 2
include/linux/sched.h | 3
kernel/cpu/idle.c | 131 ++++++++++++++++++++++---------------
kernel/sched/core.c | 1
kernel/time/tick-sched.c | 2
7 files changed, 91 insertions(+), 127 deletions(-)
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -41,9 +41,7 @@ static DEFINE_MUTEX(round_robin_lock);
static unsigned long power_saving_mwait_eax;
static unsigned char tsc_detected_unstable;
-static unsigned char tsc_marked_unstable;
static unsigned char lapic_detected_unstable;
-static unsigned char lapic_marked_unstable;
static void power_saving_mwait_init(void)
{
@@ -153,10 +151,9 @@ static int power_saving_thread(void *dat
unsigned int tsk_index = (unsigned long)data;
u64 last_jiffies = 0;
- sched_setscheduler(current, SCHED_RR, &param);
+ sched_setscheduler(current, SCHED_FIFO, &param);
while (!kthread_should_stop()) {
- int cpu;
u64 expire_time;
try_to_freeze();
@@ -171,41 +168,7 @@ static int power_saving_thread(void *dat
expire_time = jiffies + HZ * (100 - idle_pct) / 100;
- while (!need_resched()) {
- if (tsc_detected_unstable && !tsc_marked_unstable) {
- /* TSC could halt in idle, so notify users */
- mark_tsc_unstable("TSC halts in idle");
- tsc_marked_unstable = 1;
- }
- if (lapic_detected_unstable && !lapic_marked_unstable) {
- int i;
- /* LAPIC could halt in idle, so notify users */
- for_each_online_cpu(i)
- clockevents_notify(
- CLOCK_EVT_NOTIFY_BROADCAST_ON,
- &i);
- lapic_marked_unstable = 1;
- }
- local_irq_disable();
- cpu = smp_processor_id();
- if (lapic_marked_unstable)
- clockevents_notify(
- CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
- stop_critical_timings();
-
- mwait_idle_with_hints(power_saving_mwait_eax, 1);
-
- start_critical_timings();
- if (lapic_marked_unstable)
- clockevents_notify(
- CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
- local_irq_enable();
-
- if (jiffies > expire_time) {
- do_sleep = 1;
- break;
- }
- }
+ play_idle(expire_time);
/*
* current sched_rt has threshold for rt task running time.
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -247,11 +247,6 @@ static u64 pkg_state_counter(void)
return count;
}
-static void noop_timer(unsigned long foo)
-{
- /* empty... just the fact that we get the interrupt wakes us up */
-}
-
static unsigned int get_compensation(int ratio)
{
unsigned int comp = 0;
@@ -356,7 +351,6 @@ static bool powerclamp_adjust_controls(u
static int clamp_thread(void *arg)
{
int cpunr = (unsigned long)arg;
- DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
static const struct sched_param param = {
.sched_priority = MAX_USER_RT_PRIO/2,
};
@@ -365,11 +359,9 @@ static int clamp_thread(void *arg)
set_bit(cpunr, cpu_clamping_mask);
set_freezable();
- init_timer_on_stack(&wakeup_timer);
sched_setscheduler(current, SCHED_FIFO, &param);
- while (true == clamping && !kthread_should_stop() &&
- cpu_online(cpunr)) {
+ while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
int sleeptime;
unsigned long target_jiffies;
unsigned int guard;
@@ -417,35 +409,11 @@ static int clamp_thread(void *arg)
if (should_skip)
continue;
- target_jiffies = jiffies + duration_jiffies;
- mod_timer(&wakeup_timer, target_jiffies);
if (unlikely(local_softirq_pending()))
continue;
- /*
- * stop tick sched during idle time, interrupts are still
- * allowed. thus jiffies are updated properly.
- */
- preempt_disable();
- tick_nohz_idle_enter();
- /* mwait until target jiffies is reached */
- while (time_before(jiffies, target_jiffies)) {
- unsigned long ecx = 1;
- unsigned long eax = target_mwait;
-
- /*
- * REVISIT: may call enter_idle() to notify drivers who
- * can save power during cpu idle. same for exit_idle()
- */
- local_touch_nmi();
- stop_critical_timings();
- mwait_idle_with_hints(eax, ecx);
- start_critical_timings();
- atomic_inc(&idle_wakeup_counter);
- }
- tick_nohz_idle_exit();
- preempt_enable_no_resched();
+
+ play_idle(duration_jiffies);
}
- del_timer_sync(&wakeup_timer);
clear_bit(cpunr, cpu_clamping_mask);
return 0;
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -215,6 +215,8 @@ enum cpuhp_state {
CPUHP_ONLINE,
};
+void play_idle(unsigned long jiffies);
+
void cpu_startup_entry(enum cpuhp_state state);
void cpu_idle(void);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1669,6 +1669,7 @@ extern void thread_group_cputime_adjuste
/*
* Per process flags
*/
+#define PF_IDLE 0x00000002 /* I am an IDLE thread */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */
#define PF_VCPU 0x00000010 /* I'm a virtual CPU */
@@ -1969,7 +1970,7 @@ extern struct task_struct *idle_task(int
*/
static inline bool is_idle_task(const struct task_struct *p)
{
- return p->pid == 0;
+ return !!(p->flags & PF_IDLE);
}
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -63,65 +63,95 @@ void __weak arch_cpu_idle(void)
}
/*
- * Generic idle loop implementation
+ * Generic idle cycle.
*/
-static void cpu_idle_loop(void)
+static void do_idle(void)
{
- while (1) {
- tick_nohz_idle_enter();
+ tick_nohz_idle_enter();
- while (!need_resched()) {
- check_pgt_cache();
- rmb();
-
- if (cpu_is_offline(smp_processor_id()))
- arch_cpu_idle_dead();
-
- local_irq_disable();
- arch_cpu_idle_enter();
-
- /*
- * In poll mode we reenable interrupts and spin.
- *
- * Also if we detected in the wakeup from idle
- * path that the tick broadcast device expired
- * for us, we don't want to go deep idle as we
- * know that the IPI is going to arrive right
- * away
- */
- if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
- cpu_idle_poll();
- } else {
- if (!current_clr_polling_and_test()) {
- stop_critical_timings();
- rcu_idle_enter();
- arch_cpu_idle();
- WARN_ON_ONCE(irqs_disabled());
- rcu_idle_exit();
- start_critical_timings();
- } else {
- local_irq_enable();
- }
- __current_set_polling();
- }
- arch_cpu_idle_exit();
- }
+ while (!need_resched()) {
+ check_pgt_cache();
+ rmb();
+
+ if (cpu_is_offline(smp_processor_id()))
+ arch_cpu_idle_dead();
+
+ local_irq_disable();
+ arch_cpu_idle_enter();
/*
- * Since we fell out of the loop above, we know
- * TIF_NEED_RESCHED must be set, propagate it into
- * PREEMPT_NEED_RESCHED.
+ * In poll mode we reenable interrupts and spin.
*
- * This is required because for polling idle loops we will
- * not have had an IPI to fold the state for us.
+ * Also if we detected in the wakeup from idle
+ * path that the tick broadcast device expired
+ * for us, we don't want to go deep idle as we
+ * know that the IPI is going to arrive right
+ * away
*/
- preempt_set_need_resched();
- tick_nohz_idle_exit();
- schedule_preempt_disabled();
+ if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+ cpu_idle_poll();
+ } else {
+ if (!current_clr_polling_and_test()) {
+ stop_critical_timings();
+ rcu_idle_enter();
+ arch_cpu_idle();
+ WARN_ON_ONCE(irqs_disabled());
+ rcu_idle_exit();
+ start_critical_timings();
+ } else {
+ local_irq_enable();
+ }
+ __current_set_polling();
+ }
+ arch_cpu_idle_exit();
}
+
+ /*
+ * Since we fell out of the loop above, we know
+ * TIF_NEED_RESCHED must be set, propagate it into
+ * PREEMPT_NEED_RESCHED.
+ *
+ * This is required because for polling idle loops we will
+ * not have had an IPI to fold the state for us.
+ */
+ preempt_set_need_resched();
+ tick_nohz_idle_exit();
+ schedule_preempt_disabled();
+}
+
+static void play_idle_timer(unsigned long foo)
+{
+ set_tsk_need_resched(current);
+}
+
+void play_idle(unsigned long duration)
+{
+ DEFINE_TIMER(wakeup_timer, play_idle_timer, 0, 0);
+
+ /*
+ * Only FIFO tasks can disable the tick since they don't need the forced
+ * preemption.
+ */
+ WARN_ON_ONCE(current->policy != SCHED_FIFO);
+ WARN_ON_ONCE(current->nr_cpus_allowed != 1);
+ WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
+ WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
+ rcu_sleep_check();
+
+ init_timer_on_stack(&wakeup_timer);
+ mod_timer_pinned(&wakeup_timer, jiffies + duration);
+
+ preempt_disable();
+ current->flags |= PF_IDLE;
+ do_idle();
+ current->flags &= ~PF_IDLE;
+ del_timer_sync(&wakeup_timer);
+ preempt_fold_need_resched();
+ preempt_enable();
}
+EXPORT_SYMBOL_GPL(play_idle);
-void cpu_startup_entry(enum cpuhp_state state)
+__noreturn void cpu_startup_entry(enum cpuhp_state state)
{
/*
* This #ifdef needs to die, but it's too late in the cycle to
@@ -140,5 +170,6 @@ void cpu_startup_entry(enum cpuhp_state
#endif
__current_set_polling();
arch_cpu_idle_prepare();
- cpu_idle_loop();
+ while (1)
+ do_idle();
}
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3918,6 +3918,7 @@ void init_idle(struct task_struct *idle,
__sched_fork(0, idle);
idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();
+ idle->flags |= PF_IDLE;
do_set_cpus_allowed(idle, cpumask_of(cpu));
/*
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -804,7 +804,6 @@ void tick_nohz_idle_enter(void)
local_irq_enable();
}
-EXPORT_SYMBOL_GPL(tick_nohz_idle_enter);
/**
* tick_nohz_irq_exit - update next tick event from interrupt exit
@@ -932,7 +931,6 @@ void tick_nohz_idle_exit(void)
local_irq_enable();
}
-EXPORT_SYMBOL_GPL(tick_nohz_idle_exit);
static int tick_nohz_reprogram(struct tick_sched *ts, ktime_t now)
{
* [PATCH 4/8] preempt, locking: Rework local_bh_{dis,en}able()
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
` (2 preceding siblings ...)
2013-11-26 15:57 ` [PATCH 3/8] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 15:57 ` [PATCH 5/8] locking: Optimize lock_bh functions Peter Zijlstra
` (4 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm
[-- Attachment #1: peter_zijlstra-inline-local_bh_disable.patch --]
[-- Type: text/plain, Size: 4887 bytes --]
Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (a few extra loads and stores
on load/store archs).
Also expose two new local_bh functions:
__local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);
which implement the actual local_bh_{dis,en}able() behaviour.
The next patch uses the exposed @cnt argument to optimize bh lock
functions.
With build fixes from Jacob Pan.
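A usage sketch of the @cnt argument (not part of this patch): a caller that
wants softirqs and preemption disabled with a single preempt_count operation
can pass the summed offsets, which is exactly what the next patch does for
the _bh_ lock functions:

	/* one preempt_count add covers both the softirq and the preempt part */
	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET);
	/* ... softirq- and preemption-safe section ... */
	__local_bh_enable_ip(_RET_IP_, SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET);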
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/linux/bottom_half.h | 32 +++++++++++++++++++++++++++++---
include/linux/hardirq.h | 1 +
include/linux/preempt_mask.h | 1 -
kernel/softirq.c | 35 ++++++-----------------------------
4 files changed, 36 insertions(+), 33 deletions(-)
--- a/include/linux/bottom_half.h
+++ b/include/linux/bottom_half.h
@@ -1,9 +1,35 @@
#ifndef _LINUX_BH_H
#define _LINUX_BH_H
-extern void local_bh_disable(void);
+#include <linux/preempt.h>
+#include <linux/preempt_mask.h>
+
+#ifdef CONFIG_TRACE_IRQFLAGS
+extern void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
+#else
+static __always_inline void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
+{
+ preempt_count_add(cnt);
+ barrier();
+}
+#endif
+
+static inline void local_bh_disable(void)
+{
+ __local_bh_disable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
+
extern void _local_bh_enable(void);
-extern void local_bh_enable(void);
-extern void local_bh_enable_ip(unsigned long ip);
+extern void __local_bh_enable_ip(unsigned long ip, unsigned int cnt);
+
+static inline void local_bh_enable_ip(unsigned long ip)
+{
+ __local_bh_enable_ip(ip, SOFTIRQ_DISABLE_OFFSET);
+}
+
+static inline void local_bh_enable(void)
+{
+ __local_bh_enable_ip(_THIS_IP_, SOFTIRQ_DISABLE_OFFSET);
+}
#endif /* _LINUX_BH_H */
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -5,6 +5,7 @@
#include <linux/lockdep.h>
#include <linux/ftrace_irq.h>
#include <linux/vtime.h>
+#include <asm/hardirq.h>
extern void synchronize_irq(unsigned int irq);
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -2,7 +2,6 @@
#define LINUX_PREEMPT_MASK_H
#include <linux/preempt.h>
-#include <asm/hardirq.h>
/*
* We put the hardirq and softirq counter into the preemption
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,7 @@ static void wakeup_softirqd(void)
* where hardirqs are disabled legitimately:
*/
#ifdef CONFIG_TRACE_IRQFLAGS
-static void __local_bh_disable(unsigned long ip, unsigned int cnt)
+void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
{
unsigned long flags;
@@ -114,21 +114,9 @@ static void __local_bh_disable(unsigned
if (preempt_count() == cnt)
trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
}
-#else /* !CONFIG_TRACE_IRQFLAGS */
-static inline void __local_bh_disable(unsigned long ip, unsigned int cnt)
-{
- preempt_count_add(cnt);
- barrier();
-}
+EXPORT_SYMBOL(__local_bh_disable_ip);
#endif /* CONFIG_TRACE_IRQFLAGS */
-void local_bh_disable(void)
-{
- __local_bh_disable(_RET_IP_, SOFTIRQ_DISABLE_OFFSET);
-}
-
-EXPORT_SYMBOL(local_bh_disable);
-
static void __local_bh_enable(unsigned int cnt)
{
WARN_ON_ONCE(!irqs_disabled());
@@ -151,7 +139,7 @@ void _local_bh_enable(void)
EXPORT_SYMBOL(_local_bh_enable);
-static inline void _local_bh_enable_ip(unsigned long ip)
+void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
{
WARN_ON_ONCE(in_irq() || irqs_disabled());
#ifdef CONFIG_TRACE_IRQFLAGS
@@ -166,7 +154,7 @@ static inline void _local_bh_enable_ip(u
* Keep preemption disabled until we are done with
* softirq processing:
*/
- preempt_count_sub(SOFTIRQ_DISABLE_OFFSET - 1);
+ preempt_count_sub(cnt - 1);
if (unlikely(!in_interrupt() && local_softirq_pending())) {
/*
@@ -182,18 +170,7 @@ static inline void _local_bh_enable_ip(u
#endif
preempt_check_resched();
}
-
-void local_bh_enable(void)
-{
- _local_bh_enable_ip(_RET_IP_);
-}
-EXPORT_SYMBOL(local_bh_enable);
-
-void local_bh_enable_ip(unsigned long ip)
-{
- _local_bh_enable_ip(ip);
-}
-EXPORT_SYMBOL(local_bh_enable_ip);
+EXPORT_SYMBOL(__local_bh_enable_ip);
/*
* We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
@@ -268,7 +245,7 @@ asmlinkage void __do_softirq(void)
pending = local_softirq_pending();
account_irq_enter_time(current);
- __local_bh_disable(_RET_IP_, SOFTIRQ_OFFSET);
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
lockdep_softirq_start();
cpu = smp_processor_id();
* [PATCH 5/8] locking: Optimize lock_bh functions
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
` (3 preceding siblings ...)
2013-11-26 15:57 ` [PATCH 4/8] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 15:57 ` [PATCH 6/8] sched, net: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
` (3 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm
[-- Attachment #1: peter_zijlstra-remove_preempt_enable_no_sched-from-locks.patch --]
[-- Type: text/plain, Size: 5839 bytes --]
Currently all _bh_ lock functions do two preempt_count operations:
local_bh_disable();
preempt_disable();
and for the unlock:
preempt_enable_no_resched();
local_bh_enable();
Since it's a waste of perfectly good cycles to modify the same variable
twice when you can do it in one go, use the new
__local_bh_{dis,en}able_ip() functions, which allow us to provide a
preempt_count value to add/sub.
So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to
add/sub to be done in one go.
As a bonus it gets rid of the preempt_enable_no_resched() usage.
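As an illustration (not in the diff below), the _bh_ lock fast path goes from
two preempt_count operations to one:

	/* before: two operations on preempt_count */
	local_bh_disable();		/* += SOFTIRQ_DISABLE_OFFSET */
	preempt_disable();		/* += 1 */

	/* after: a single operation with the combined offset */
	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);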
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/linux/preempt_mask.h | 15 +++++++++++++++
include/linux/rwlock_api_smp.h | 12 ++++--------
include/linux/spinlock_api_smp.h | 12 ++++--------
include/linux/spinlock_api_up.h | 16 +++++++++++-----
4 files changed, 34 insertions(+), 21 deletions(-)
--- a/include/linux/preempt_mask.h
+++ b/include/linux/preempt_mask.h
@@ -78,6 +78,21 @@
#endif
/*
+ * The preempt_count offset needed for things like:
+ *
+ * spin_lock_bh()
+ *
+ * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and
+ * softirqs, such that unlock sequences of:
+ *
+ * spin_unlock();
+ * local_bh_enable();
+ *
+ * Work as expected.
+ */
+#define SOFTIRQ_LOCK_OFFSET (SOFTIRQ_DISABLE_OFFSET + PREEMPT_CHECK_OFFSET)
+
+/*
* Are we running in atomic context? WARNING: this macro cannot
* always detect atomic context; in particular, it cannot know about
* held spinlocks in non-preemptible kernels. Thus it should not be
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -172,8 +172,7 @@ static inline void __raw_read_lock_irq(r
static inline void __raw_read_lock_bh(rwlock_t *lock)
{
- local_bh_disable();
- preempt_disable();
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_read_trylock, do_raw_read_lock);
}
@@ -200,8 +199,7 @@ static inline void __raw_write_lock_irq(
static inline void __raw_write_lock_bh(rwlock_t *lock)
{
- local_bh_disable();
- preempt_disable();
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
}
@@ -250,8 +248,7 @@ static inline void __raw_read_unlock_bh(
{
rwlock_release(&lock->dep_map, 1, _RET_IP_);
do_raw_read_unlock(lock);
- preempt_enable_no_resched();
- local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+ __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
}
static inline void __raw_write_unlock_irqrestore(rwlock_t *lock,
@@ -275,8 +272,7 @@ static inline void __raw_write_unlock_bh
{
rwlock_release(&lock->dep_map, 1, _RET_IP_);
do_raw_write_unlock(lock);
- preempt_enable_no_resched();
- local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+ __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
}
#endif /* __LINUX_RWLOCK_API_SMP_H */
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -131,8 +131,7 @@ static inline void __raw_spin_lock_irq(r
static inline void __raw_spin_lock_bh(raw_spinlock_t *lock)
{
- local_bh_disable();
- preempt_disable();
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
@@ -174,20 +173,17 @@ static inline void __raw_spin_unlock_bh(
{
spin_release(&lock->dep_map, 1, _RET_IP_);
do_raw_spin_unlock(lock);
- preempt_enable_no_resched();
- local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+ __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
}
static inline int __raw_spin_trylock_bh(raw_spinlock_t *lock)
{
- local_bh_disable();
- preempt_disable();
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
if (do_raw_spin_trylock(lock)) {
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
return 1;
}
- preempt_enable_no_resched();
- local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+ __local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
return 0;
}
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -24,11 +24,14 @@
* flags straight, to suppress compiler warnings of unused lock
* variables, and to add the proper checker annotations:
*/
+#define ___LOCK(lock) \
+ do { __acquire(lock); (void)(lock); } while (0)
+
#define __LOCK(lock) \
- do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)
+ do { preempt_disable(); ___LOCK(lock); } while (0);
#define __LOCK_BH(lock) \
- do { local_bh_disable(); __LOCK(lock); } while (0)
+ do { __local_bh_disable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); ___LOCK(lock); } while (0)
#define __LOCK_IRQ(lock) \
do { local_irq_disable(); __LOCK(lock); } while (0)
@@ -36,12 +39,15 @@
#define __LOCK_IRQSAVE(lock, flags) \
do { local_irq_save(flags); __LOCK(lock); } while (0)
+#define ___UNLOCK(lock) \
+ do { __release(lock); (void)(lock); } while (0)
+
#define __UNLOCK(lock) \
- do { preempt_enable(); __release(lock); (void)(lock); } while (0)
+ do { preempt_enable(); ___UNLOCK(lock); } while (0)
#define __UNLOCK_BH(lock) \
- do { preempt_enable_no_resched(); local_bh_enable(); \
- __release(lock); (void)(lock); } while (0)
+ do { __local_bh_enable_ip(_THIS_IP_, SOFTIRQ_LOCK_OFFSET); \
+ ___UNLOCK(lock); } while (0)
#define __UNLOCK_IRQ(lock) \
do { local_irq_enable(); __UNLOCK(lock); } while (0)
* [PATCH 6/8] sched, net: Clean up preempt_enable_no_resched() abuse
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
` (4 preceding siblings ...)
2013-11-26 15:57 ` [PATCH 5/8] locking: Optimize lock_bh functions Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 15:57 ` [PATCH 7/8] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
` (2 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm
[-- Attachment #1: peterz-fixup-weird-preempt_enable_no_resched-usage.patch --]
[-- Type: text/plain, Size: 1171 bytes --]
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement because of additional, known preempt_count 'refs'.
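A minimal sketch of the distinction; do_other_work() is a stand-in for
anything that is not a scheduling point:

	/* valid: the very next statement is schedule() */
	preempt_enable_no_resched();
	schedule();

	/*
	 * invalid: a pending reschedule can now be delayed arbitrarily,
	 * since nothing between here and the next preemption point looks
	 * at NEED_RESCHED.
	 */
	preempt_enable_no_resched();
	do_other_work();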
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
net/ipv4/tcp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1623,11 +1623,11 @@ int tcp_recvmsg(struct kiocb *iocb, stru
(len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
!sysctl_tcp_low_latency &&
net_dma_find_channel()) {
- preempt_enable_no_resched();
+ preempt_enable();
tp->ucopy.pinned_list =
dma_pin_iovec_pages(msg->msg_iov, len);
} else {
- preempt_enable_no_resched();
+ preempt_enable();
}
}
#endif
* [PATCH 7/8] sched, net: Fixup busy_loop_us_clock()
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
` (5 preceding siblings ...)
2013-11-26 15:57 ` [PATCH 6/8] sched, net: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-28 16:49 ` Eliezer Tamir
2013-11-26 15:57 ` [PATCH 8/8] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
2013-11-26 23:23 ` [PATCH 0/8] Cure faux idle wreckage Jacob Pan
8 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm
[-- Attachment #1: peterz-fixup-busy_poll.patch --]
[-- Type: text/plain, Size: 2278 bytes --]
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement because of additional, known preempt_count 'refs'.
This busy_poll stuff looks to be completely and utterly broken:
sched_clock() can return utter garbage with interrupts enabled (rare,
but still) and it can drift unbounded between CPUs.
This means that if you get preempted/migrated and your new CPU is
years behind the previous CPU, we get to busy-spin for a _very_ long
time.
There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.
Replace sched_clock() usage with local_clock() which has a bounded
drift between CPUs (<2 jiffies).
There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.
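A sketch of the drift failure mode; poll_us stands in for the configured
busy-poll budget and the loop shape is simplified from the real busy-poll
loop:

	u64 poll_us = 50;				/* example budget, in us */
	u64 end_time = busy_loop_us_clock() + poll_us;	/* read on CPU A */

	/* ... preempted; task migrates to CPU B whose sched_clock() is far
	   behind CPU A's ... */

	while (busy_loop_us_clock() < end_time)		/* now read on CPU B */
		cpu_relax();		/* spins until CPU B's clock catches up */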
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/net/busy_poll.h | 19 +------------------
1 file changed, 1 insertion(+), 18 deletions(-)
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -42,27 +42,10 @@ static inline bool net_busy_loop_on(void
return sysctl_net_busy_poll;
}
-/* a wrapper to make debug_smp_processor_id() happy
- * we can use sched_clock() because we don't care much about precision
- * we only care that the average is bounded
- */
-#ifdef CONFIG_DEBUG_PREEMPT
static inline u64 busy_loop_us_clock(void)
{
- u64 rc;
-
- preempt_disable_notrace();
- rc = sched_clock();
- preempt_enable_no_resched_notrace();
-
- return rc >> 10;
-}
-#else /* CONFIG_DEBUG_PREEMPT */
-static inline u64 busy_loop_us_clock(void)
-{
- return sched_clock() >> 10;
+ return local_clock() >> 10;
}
-#endif /* CONFIG_DEBUG_PREEMPT */
static inline unsigned long sk_busy_loop_end_time(struct sock *sk)
{
* [PATCH 8/8] preempt: Take away preempt_enable_no_resched() from modules
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
` (6 preceding siblings ...)
2013-11-26 15:57 ` [PATCH 7/8] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
@ 2013-11-26 15:57 ` Peter Zijlstra
2013-11-26 23:23 ` [PATCH 0/8] Cure faux idle wreckage Jacob Pan
8 siblings, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-26 15:57 UTC (permalink / raw)
To: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner, Peter Zijlstra
Cc: linux-kernel, linux-pm, Rusty Russell
[-- Attachment #1: peterz-hide-preempt_enable_no_resched-modules.patch --]
[-- Type: text/plain, Size: 2247 bytes --]
Discourage drivers/modules from being creative with preemption.
Sadly it is all implemented in macros and inlines, so if they want to do
evil they still can, but at least try to discourage some.
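A sketch of the mechanism: because the helpers are macros, #undef-ing them
for MODULE builds turns any modular user into a build failure, while in-tree
core code is unaffected. The module below is hypothetical:

/* evil_module.c -- hypothetical out-of-tree module, built with -DMODULE */
#include <linux/module.h>
#include <linux/preempt.h>

static void evil(void)
{
	preempt_disable();
	/* ... */
	preempt_enable_no_resched();	/* no longer defined: build breaks here */
}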
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: hpa@zytor.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
include/linux/preempt.h | 22 ++++++++++++++++++++--
include/linux/uaccess.h | 5 ++++-
2 files changed, 24 insertions(+), 3 deletions(-)
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -64,7 +64,11 @@ do { \
} while (0)
#else
-#define preempt_enable() preempt_enable_no_resched()
+#define preempt_enable() \
+do { \
+ barrier(); \
+ preempt_count_dec(); \
+} while (0)
#define preempt_check_resched() do { } while (0)
#endif
@@ -93,7 +97,11 @@ do { \
__preempt_schedule_context(); \
} while (0)
#else
-#define preempt_enable_notrace() preempt_enable_no_resched_notrace()
+#define preempt_enable_notrace() \
+do { \
+ barrier(); \
+ __preempt_count_dec(); \
+} while (0)
#endif
#else /* !CONFIG_PREEMPT_COUNT */
@@ -126,6 +134,16 @@ do { \
#define preempt_fold_need_resched() do { } while (0)
#endif
+#ifdef MODULE
+/*
+ * Modules have no business playing preemption tricks.
+ */
+#undef sched_preempt_enable_no_resched
+#undef preempt_enable_no_resched
+#undef preempt_enable_no_resched_notrace
+#undef preempt_check_resched
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
struct preempt_notifier;
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -25,13 +25,16 @@ static inline void pagefault_disable(voi
static inline void pagefault_enable(void)
{
+#ifndef CONFIG_PREEMPT
/*
* make sure to issue those last loads/stores before enabling
* the pagefault handler again.
*/
barrier();
preempt_count_dec();
- preempt_check_resched();
+#else
+ preempt_enable();
+#endif
}
#ifndef ARCH_HAS_NOCACHE_UACCESS
* Re: [PATCH 0/8] Cure faux idle wreckage
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
` (7 preceding siblings ...)
2013-11-26 15:57 ` [PATCH 8/8] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
@ 2013-11-26 23:23 ` Jacob Pan
8 siblings, 0 replies; 14+ messages in thread
From: Jacob Pan @ 2013-11-26 23:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Arjan van de Ven, lenb, rjw, Eliezer Tamir, David Miller,
rui.zhang, Mike Galbraith, Ingo Molnar, hpa, paulmck,
Thomas Gleixner, linux-kernel, linux-pm
On Tue, 26 Nov 2013 16:57:43 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> Respin of the earlier series that tries to cure the 2 idle injection
> drivers and cleans up some of the preempt_enable_no_resched() mess.
>
> The intel_powerclamp driver is tested by Jacob Pan and needs one more
> patch to cpuidle to work as before. I'll let him provide this patch;
> since he actually has it and tested it.
>
> Jacob also said he'll try and work with the QoS people to sort out
> the conflict of interest between the idle injectors and the QoS
> framework.
[Jacob Pan] I have sent out a patch to hook up powerclamp with qos.
Please review. Ideally, this should be in one patchset to avoid
performance regression in powerclamp. If the qos hook is acceptable, it
can be easily used by ACPI PAD, I think.
Jacob
* Re: [PATCH 7/8] sched, net: Fixup busy_loop_us_clock()
2013-11-26 15:57 ` [PATCH 7/8] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
@ 2013-11-28 16:49 ` Eliezer Tamir
2013-11-28 17:40 ` Peter Zijlstra
0 siblings, 1 reply; 14+ messages in thread
From: Eliezer Tamir @ 2013-11-28 16:49 UTC (permalink / raw)
To: Peter Zijlstra, Arjan van de Ven, lenb, rjw, David Miller,
rui.zhang, jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa,
Thomas Gleixner
Cc: linux-kernel, linux-pm
On 26/11/2013 17:57, Peter Zijlstra wrote:
>
> Replace sched_clock() usage with local_clock() which has a bounded
> drift between CPUs (<2 jiffies).
>
Peter,
I have tested this patch and I see a performance regression of about
1.5%.
Maybe it would be better, rather than testing in the fast path, to
simply disallow busy polling altogether when sched_clock_stable is
not true?
Thanks,
Eliezer
* Re: [PATCH 7/8] sched, net: Fixup busy_loop_us_clock()
2013-11-28 16:49 ` Eliezer Tamir
@ 2013-11-28 17:40 ` Peter Zijlstra
2013-11-28 18:50 ` Peter Zijlstra
2013-11-29 13:52 ` Eliezer Tamir
0 siblings, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-28 17:40 UTC (permalink / raw)
To: Eliezer Tamir
Cc: Arjan van de Ven, lenb, rjw, David Miller, rui.zhang,
jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, Thomas Gleixner,
linux-kernel, linux-pm
On Thu, Nov 28, 2013 at 06:49:00PM +0200, Eliezer Tamir wrote:
> I have tested this patch and I see a performance regression of about
> 1.5%.
Cute, can you qualify your metric? Since this is a poll loop the only
metric that would be interesting is the response latency. Is that what's
increased by 1.5%? Also, what's the standard deviation of your result?
Also, can you provide relevant perf results for this? Is it really the
sti;cli pair that's degrading your latency?
Better yet, can you provide us with a simple test-case that we can run
locally (preferably a single-machine setup, using localnet or somesuch)?
> Maybe it would be better, rather then testing in the fast path, to
> simply disallow busy polling altogether when sched_clock_stable is
> not true?
Sadly that doesn't work; sched_clock_stable can become false at any time
after boot (and does, even on recent machines).
That said; let me see if I can come up with a few patches to optimize
the entire thing; that'd be something we all benefit from.
* Re: [PATCH 7/8] sched, net: Fixup busy_loop_us_clock()
2013-11-28 17:40 ` Peter Zijlstra
@ 2013-11-28 18:50 ` Peter Zijlstra
2013-11-29 13:52 ` Eliezer Tamir
1 sibling, 0 replies; 14+ messages in thread
From: Peter Zijlstra @ 2013-11-28 18:50 UTC (permalink / raw)
To: Eliezer Tamir
Cc: Arjan van de Ven, lenb, rjw, David Miller, rui.zhang,
jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, Thomas Gleixner,
linux-kernel, linux-pm
On Thu, Nov 28, 2013 at 06:40:01PM +0100, Peter Zijlstra wrote:
> That said; let me see if I can come up with a few patches to optimize
> the entire thing; that'd be something we all benefit from.
OK, so the below compiles; I currently haven't got time to see whether it
runs or not.
I've got it as a series of 6 patches, but for convenience I'll just put the
entire folded diff below.
Obviously I still need to fix the #if 0 bits and do ia64, which would add
another few patches.
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/timer.h | 64 +---------------
arch/x86/kernel/cpu/amd.c | 2 +-
arch/x86/kernel/cpu/intel.c | 2 +-
arch/x86/kernel/cpu/perf_event.c | 4 +-
arch/x86/kernel/tsc.c | 153 +++++++++++++++++++++++++--------------
include/linux/math64.h | 30 ++++++++
include/linux/sched.h | 4 +-
init/Kconfig | 6 ++
kernel/sched/clock.c | 63 ++++++++--------
kernel/sched/debug.c | 2 +-
kernel/time/tick-sched.c | 2 +-
kernel/trace/ring_buffer.c | 2 +-
13 files changed, 181 insertions(+), 154 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c84cf90ca693..bd1f30159689 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
select ARCH_SUPPORTS_NUMA_BALANCING
+ select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
select HAVE_IDE
select HAVE_OPROFILE
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index 34baa0eb5d0c..125cdd1371da 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -4,6 +4,7 @@
#include <linux/pm.h>
#include <linux/percpu.h>
#include <linux/interrupt.h>
+#include <linux/math64.h>
#define TICK_SIZE (tick_nsec / 1000)
@@ -12,68 +13,5 @@ extern int recalibrate_cpu_khz(void);
extern int no_timer_check;
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- * basic equation:
- * ns = cycles / (freq / ns_per_sec)
- * ns = cycles * (ns_per_sec / freq)
- * ns = cycles * (10^9 / (cpu_khz * 10^3))
- * ns = cycles * (10^6 / cpu_khz)
- *
- * Then we use scaling math (suggested by george@mvista.com) to get:
- * ns = cycles * (10^6 * SC / cpu_khz) / SC
- * ns = cycles * cyc2ns_scale / SC
- *
- * And since SC is a constant power of two, we can convert the div
- * into a shift.
- *
- * We can use khz divisor instead of mhz to keep a better precision, since
- * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- * (mathieu.desnoyers@polymtl.ca)
- *
- * -johnstul@us.ibm.com "math is hard, lets go shopping!"
- *
- * In:
- *
- * ns = cycles * cyc2ns_scale / SC
- *
- * Although we may still have enough bits to store the value of ns,
- * in some cases, we may not have enough bits to store cycles * cyc2ns_scale,
- * leading to an incorrect result.
- *
- * To avoid this, we can decompose 'cycles' into quotient and remainder
- * of division by SC. Then,
- *
- * ns = (quot * SC + rem) * cyc2ns_scale / SC
- * = quot * cyc2ns_scale + (rem * cyc2ns_scale) / SC
- *
- * - sqazi@google.com
- */
-
-DECLARE_PER_CPU(unsigned long, cyc2ns);
-DECLARE_PER_CPU(unsigned long long, cyc2ns_offset);
-
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
-{
- int cpu = smp_processor_id();
- unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
- ns += mult_frac(cyc, per_cpu(cyc2ns, cpu),
- (1UL << CYC2NS_SCALE_FACTOR));
- return ns;
-}
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
- unsigned long long ns;
- unsigned long flags;
-
- local_irq_save(flags);
- ns = __cycles_2_ns(cyc);
- local_irq_restore(flags);
-
- return ns;
-}
#endif /* _ASM_X86_TIMER_H */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index bca023bdd6b2..8bc79cddd9a2 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -487,7 +487,7 @@ static void early_init_amd(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
if (!check_tsc_unstable())
- sched_clock_stable = 1;
+ set_sched_clock_stable();
}
#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0dff939..d6a93c1f64db 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -93,7 +93,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
if (!check_tsc_unstable())
- sched_clock_stable = 1;
+ set_sched_clock_stable();
}
/* Penwell and Cloverview have the TSC which doesn't sleep on S3 */
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 98f845bdee5a..0b214d398c81 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1895,7 +1895,8 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
userpg->cap_user_rdpmc = x86_pmu.attr_rdpmc;
userpg->pmc_width = x86_pmu.cntval_bits;
- if (!sched_clock_stable)
+#if 0
+ if (!sched_clock_stable())
return;
userpg->cap_user_time = 1;
@@ -1905,6 +1906,7 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
userpg->cap_user_time_zero = 1;
userpg->time_zero = this_cpu_read(cyc2ns_offset);
+#endif
}
/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 930e5d48f560..68c84d7b7658 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -11,6 +11,7 @@
#include <linux/clocksource.h>
#include <linux/percpu.h>
#include <linux/timex.h>
+#include <linux/static_key.h>
#include <asm/hpet.h>
#include <asm/timer.h>
@@ -37,7 +38,95 @@ static int __read_mostly tsc_unstable;
erroneous rdtsc usage on !cpu_has_tsc processors */
static int __read_mostly tsc_disabled = -1;
+static struct static_key __use_tsc = STATIC_KEY_INIT;
+
int tsc_clocksource_reliable;
+
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ * basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by george@mvista.com) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ * into a shift.
+ *
+ * We can use khz divisor instead of mhz to keep a better precision, since
+ * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ * (mathieu.desnoyers@polymtl.ca)
+ *
+ * -johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+
+struct cyc2ns_data {
+ unsigned long cyc2ns_mul;
+ unsigned long long cyc2ns_offset;
+};
+
+struct cyc2ns_latch {
+ unsigned int head, tail;
+ struct cyc2ns_data data[2];
+};
+
+static DEFINE_PER_CPU(struct cyc2ns_latch, cyc2ns);
+
+#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+ unsigned long long ns;
+ unsigned int tail, idx;
+
+again:
+ tail = this_cpu_read(cyc2ns.tail);
+ smp_rmb();
+ idx = tail & 1;
+ ns = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
+ ns += mul_u64_u32_shr(cyc, this_cpu_read(cyc2ns.data[idx].cyc2ns_mul),
+ CYC2NS_SCALE_FACTOR);
+ smp_rmb();
+ if (unlikely(this_cpu_read(cyc2ns.head) - tail >= 2))
+ goto again;
+
+ return ns;
+}
+
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+{
+ unsigned long long tsc_now, ns_now;
+ struct cyc2ns_latch *latch = &per_cpu(cyc2ns, cpu);
+ struct cyc2ns_data *data;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (!cpu_khz)
+ goto done;
+
+ latch->head++;
+ smp_wmb();
+ data = latch->data + (latch->head & 1);
+
+ rdtscll(tsc_now);
+ ns_now = cycles_2_ns(tsc_now);
+
+ data->cyc2ns_mul = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) + cpu_khz / 2) / cpu_khz;
+ data->cyc2ns_offset = ns_now - mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
+
+ smp_wmb();
+ latch->tail++;
+
+done:
+ sched_clock_idle_wakeup_event(0);
+ local_irq_restore(flags);
+}
/*
* Scheduler clock - returns current time in nanosec units.
*/
@@ -53,7 +142,7 @@ u64 native_sched_clock(void)
* very important for it to be as fast as the platform
* can achieve it. )
*/
- if (unlikely(tsc_disabled)) {
+ if (static_key_false(&__use_tsc)) {
/* No locking but a rare wrong value is not a big deal: */
return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
}
@@ -62,7 +151,7 @@ u64 native_sched_clock(void)
rdtscll(this_offset);
/* return the value in ns */
- return __cycles_2_ns(this_offset);
+ return cycles_2_ns(this_offset);
}
/* We need to define a real function for sched_clock, to override the
@@ -589,61 +678,11 @@ int recalibrate_cpu_khz(void)
EXPORT_SYMBOL(recalibrate_cpu_khz);
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- * basic equation:
- * ns = cycles / (freq / ns_per_sec)
- * ns = cycles * (ns_per_sec / freq)
- * ns = cycles * (10^9 / (cpu_khz * 10^3))
- * ns = cycles * (10^6 / cpu_khz)
- *
- * Then we use scaling math (suggested by george@mvista.com) to get:
- * ns = cycles * (10^6 * SC / cpu_khz) / SC
- * ns = cycles * cyc2ns_scale / SC
- *
- * And since SC is a constant power of two, we can convert the div
- * into a shift.
- *
- * We can use khz divisor instead of mhz to keep a better precision, since
- * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- * (mathieu.desnoyers@polymtl.ca)
- *
- * -johnstul@us.ibm.com "math is hard, lets go shopping!"
- */
-
-DEFINE_PER_CPU(unsigned long, cyc2ns);
-DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
-
-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
-{
- unsigned long long tsc_now, ns_now, *offset;
- unsigned long flags, *scale;
-
- local_irq_save(flags);
- sched_clock_idle_sleep_event();
-
- scale = &per_cpu(cyc2ns, cpu);
- offset = &per_cpu(cyc2ns_offset, cpu);
-
- rdtscll(tsc_now);
- ns_now = __cycles_2_ns(tsc_now);
-
- if (cpu_khz) {
- *scale = ((NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) +
- cpu_khz / 2) / cpu_khz;
- *offset = ns_now - mult_frac(tsc_now, *scale,
- (1UL << CYC2NS_SCALE_FACTOR));
- }
-
- sched_clock_idle_wakeup_event(0);
- local_irq_restore(flags);
-}
-
static unsigned long long cyc2ns_suspend;
void tsc_save_sched_clock_state(void)
{
- if (!sched_clock_stable)
+ if (!sched_clock_stable())
return;
cyc2ns_suspend = sched_clock();
@@ -659,11 +698,12 @@ void tsc_save_sched_clock_state(void)
*/
void tsc_restore_sched_clock_state(void)
{
+#if 0
unsigned long long offset;
unsigned long flags;
int cpu;
- if (!sched_clock_stable)
+ if (!sched_clock_stable())
return;
local_irq_save(flags);
@@ -675,6 +715,7 @@ void tsc_restore_sched_clock_state(void)
per_cpu(cyc2ns_offset, cpu) = offset;
local_irq_restore(flags);
+#endif
}
#ifdef CONFIG_CPU_FREQ
@@ -795,7 +836,7 @@ void mark_tsc_unstable(char *reason)
{
if (!tsc_unstable) {
tsc_unstable = 1;
- sched_clock_stable = 0;
+ clear_sched_clock_stable();
disable_sched_clock_irqtime();
pr_info("Marking TSC unstable due to %s\n", reason);
/* Change only the rating, when not registered */
@@ -1002,7 +1043,9 @@ void __init tsc_init(void)
return;
/* now allow native_sched_clock() to use rdtsc */
+
tsc_disabled = 0;
+ static_key_slow_inc(&__use_tsc);
if (!no_sched_irq_time)
enable_sched_clock_irqtime();
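
Aside (editorial, not part of the patch): the cycles_2_ns()/set_cyc2ns_scale() pair introduced above implements a small per-CPU latch -- the writer bumps head, fills the slot selected by head & 1, then bumps tail; readers use the slot selected by tail & 1 and retry if head has advanced by two or more past the tail they read, since their slot may have been overwritten. Below is a minimal user-space sketch of the same scheme, using GCC/Clang builtins in place of smp_wmb()/smp_rmb(), assuming a single writer and a 64-bit compiler with __int128; all names and the 2 GHz example are illustrative.

/* User-space sketch of the cyc2ns latch; illustrative, not kernel code. */
#include <stdint.h>
#include <stdio.h>

#define SCALE_SHIFT 10				/* CYC2NS_SCALE_FACTOR */

struct cyc2ns_data {
	uint32_t mul;
	uint64_t offset;
};

struct cyc2ns_latch {
	unsigned int head, tail;
	struct cyc2ns_data data[2];
};

/* Writer side, mirroring set_cyc2ns_scale(); single writer assumed. */
static void latch_write(struct cyc2ns_latch *l, uint32_t mul, uint64_t offset)
{
	l->head++;
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* kernel: smp_wmb() */
	l->data[l->head & 1].mul = mul;
	l->data[l->head & 1].offset = offset;
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* kernel: smp_wmb() */
	l->tail++;
}

/* Reader side, mirroring cycles_2_ns(); retries if the slot was reused. */
static uint64_t latch_read(struct cyc2ns_latch *l, uint64_t cyc)
{
	unsigned int tail, idx;
	uint64_t ns;

	do {
		tail = l->tail;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);	/* kernel: smp_rmb() */
		idx = tail & 1;
		ns = l->data[idx].offset +
		     (uint64_t)(((unsigned __int128)cyc * l->data[idx].mul) >> SCALE_SHIFT);
		__atomic_thread_fence(__ATOMIC_ACQUIRE);	/* kernel: smp_rmb() */
	} while (l->head - tail >= 2);		/* slot may be stale: retry */

	return ns;
}

int main(void)
{
	struct cyc2ns_latch l = { 0 };
	uint32_t cpu_khz = 2000000;		/* a 2 GHz part */
	/* mul = ((10^6 << 10) + khz/2) / khz = 512, i.e. ns = cyc / 2 */
	uint32_t mul = ((1000000u << SCALE_SHIFT) + cpu_khz / 2) / cpu_khz;

	latch_write(&l, mul, 0);
	printf("mul=%u, 4e9 cycles = %llu ns\n", mul,
	       (unsigned long long)latch_read(&l, 4000000000ULL));
	return 0;
}
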
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 69ed5f5e9f6e..c45c089bfdac 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -133,4 +133,34 @@ __iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)
return ret;
}
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+
+#ifndef mul_u64_u32_shr
+static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
+{
+ return (u64)(((unsigned __int128)a * mul) >> shift);
+}
+#endif /* mul_u64_u32_shr */
+
+#else
+
+#ifndef mul_u64_u32_shr
+static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
+{
+ u32 ah, al;
+ u64 ret;
+
+ al = a;
+ ah = a >> 32;
+
+ ret = ((u64)al * mul) >> shift;
+ if (ah)
+ ret += ((u64)ah * mul) << (32 - shift);
+
+ return ret;
+}
+#endif /* mul_u64_u32_shr */
+
+#endif
+
#endif /* _LINUX_MATH64_H */
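
Another aside, not part of the patch: the !__int128 fallback above works because a = ah * 2^32 + al, and for shift <= 32 the high half can be shifted left by (32 - shift) without truncation, so (a * mul) >> shift == ((al * mul) >> shift) + ((ah * mul) << (32 - shift)) exactly, provided the true result still fits in 64 bits. A throwaway cross-check of the two variants, on a compiler that does have __int128; the input values are arbitrary.

/* Cross-check of the two mul_u64_u32_shr() flavours; illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t mul_shr_int128(uint64_t a, uint32_t mul, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}

static uint64_t mul_shr_32x32(uint64_t a, uint32_t mul, unsigned int shift)
{
	uint32_t ah = a >> 32, al = (uint32_t)a;
	uint64_t ret = ((uint64_t)al * mul) >> shift;

	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}

int main(void)
{
	uint64_t cyc = 123456789012345ULL;	/* arbitrary cycle count */
	uint32_t mul = 512;			/* the 2 GHz cyc2ns multiplier */

	assert(mul_shr_int128(cyc, mul, 10) == mul_shr_32x32(cyc, mul, 10));
	printf("%llu ns\n", (unsigned long long)mul_shr_int128(cyc, mul, 10));
	return 0;
}
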
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bf14c215af1e..44fbcbff8dde 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1880,7 +1880,9 @@ static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
* but then during bootup it turns out that sched_clock()
* is reliable after all:
*/
-extern int sched_clock_stable;
+extern int sched_clock_stable(void);
+extern void set_sched_clock_stable(void);
+extern void clear_sched_clock_stable(void);
extern void sched_clock_tick(void);
extern void sched_clock_idle_sleep_event(void);
diff --git a/init/Kconfig b/init/Kconfig
index 79383d3aa5dc..4e5d96ab2034 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -809,6 +809,12 @@ config GENERIC_SCHED_CLOCK
config ARCH_SUPPORTS_NUMA_BALANCING
bool
+#
+# For architectures that know their GCC __int128 support is sound
+#
+config ARCH_SUPPORTS_INT128
+ bool
+
# For architectures that (ab)use NUMA to represent different memory regions
# all cpu-local but of different latencies, such as SuperH.
#
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c3ae1446461c..35a14f76d633 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -26,9 +26,10 @@
* at 0 on boot (but people really shouldn't rely on that).
*
* cpu_clock(i) -- can be used from any context, including NMI.
- * sched_clock_cpu(i) -- must be used with local IRQs disabled (implied by NMI)
* local_clock() -- is cpu_clock() on the current cpu.
*
+ * sched_clock_cpu(i)
+ *
* How:
*
* The implementation either uses sched_clock() when
@@ -50,15 +51,6 @@
* Furthermore, explicit sleep and wakeup hooks allow us to account for time
* that is otherwise invisible (TSC gets stopped).
*
- *
- * Notes:
- *
- * The !IRQ-safetly of sched_clock() and sched_clock_cpu() comes from things
- * like cpufreq interrupts that can change the base clock (TSC) multiplier
- * and cause funny jumps in time -- although the filtering provided by
- * sched_clock_cpu() should mitigate serious artifacts we cannot rely on it
- * in general since for !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK we fully rely on
- * sched_clock().
*/
#include <linux/spinlock.h>
#include <linux/hardirq.h>
@@ -66,6 +58,7 @@
#include <linux/percpu.h>
#include <linux/ktime.h>
#include <linux/sched.h>
+#include <linux/static_key.h>
/*
* Scheduler clock - returns current time in nanosec units.
@@ -82,7 +75,27 @@ EXPORT_SYMBOL_GPL(sched_clock);
__read_mostly int sched_clock_running;
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-__read_mostly int sched_clock_stable;
+static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+
+int sched_clock_stable(void)
+{
+ if (static_key_false(&__sched_clock_stable))
+ return false;
+ return true;
+}
+
+void set_sched_clock_stable(void)
+{
+ if (!sched_clock_stable())
+ static_key_slow_inc(&__sched_clock_stable);
+}
+
+void clear_sched_clock_stable(void)
+{
+ /* XXX worry about clock continuity */
+ if (sched_clock_stable())
+ static_key_slow_dec(&__sched_clock_stable);
+}
struct sched_clock_data {
u64 tick_raw;
@@ -244,7 +257,7 @@ u64 sched_clock_cpu(int cpu)
WARN_ON_ONCE(!irqs_disabled());
- if (sched_clock_stable)
+ if (sched_clock_stable())
return sched_clock();
if (unlikely(!sched_clock_running))
@@ -265,7 +278,7 @@ void sched_clock_tick(void)
struct sched_clock_data *scd;
u64 now, now_gtod;
- if (sched_clock_stable)
+ if (sched_clock_stable())
return;
if (unlikely(!sched_clock_running))
@@ -316,14 +329,10 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
*/
u64 cpu_clock(int cpu)
{
- u64 clock;
- unsigned long flags;
-
- local_irq_save(flags);
- clock = sched_clock_cpu(cpu);
- local_irq_restore(flags);
+ if (static_key_false(&__sched_clock_stable))
+ return sched_clock_cpu(cpu);
- return clock;
+ return sched_clock();
}
/*
@@ -335,14 +344,10 @@ u64 cpu_clock(int cpu)
*/
u64 local_clock(void)
{
- u64 clock;
- unsigned long flags;
+ if (static_key_false(&__sched_clock_stable))
+ return sched_clock_cpu(smp_processor_id());
- local_irq_save(flags);
- clock = sched_clock_cpu(smp_processor_id());
- local_irq_restore(flags);
-
- return clock;
+ return sched_clock();
}
#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
@@ -362,12 +367,12 @@ u64 sched_clock_cpu(int cpu)
u64 cpu_clock(int cpu)
{
- return sched_clock_cpu(cpu);
+ return sched_clock();
}
u64 local_clock(void)
{
- return sched_clock_cpu(0);
+ return sched_clock();
}
#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
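
Editorial sketch, not part of the patch: the accessor conversion above leans on the static_key API as it existed around v3.13 -- the read side costs a branch that jump-label patching turns into a nop or an unconditional jump, while static_key_slow_inc()/static_key_slow_dec() rewrite every such branch site at runtime. A generic kernel-side illustration of that pattern follows; the names are hypothetical and this is not the exact code that landed upstream.

/* Kernel-style sketch of the v3.13-era static_key pattern; not standalone. */
#include <linux/static_key.h>

/* Defaults to the "false" side; that is the cheap, patched-out branch. */
static struct static_key my_flag = STATIC_KEY_INIT;

/* Read side: with jump labels, a patched nop/branch rather than a load+test. */
static inline bool my_flag_enabled(void)
{
	return static_key_false(&my_flag);
}

/* Write side: expensive, rewrites every my_flag_enabled() call site. */
static void my_flag_set(bool on)
{
	if (on && !my_flag_enabled())
		static_key_slow_inc(&my_flag);
	else if (!on && my_flag_enabled())
		static_key_slow_dec(&my_flag);
}
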
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 5c34d1817e8f..71934842baaf 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -371,7 +371,7 @@ static void sched_debug_header(struct seq_file *m)
PN(cpu_clk);
P(jiffies);
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
- P(sched_clock_stable);
+ P(sched_clock_stable());
#endif
#undef PN
#undef P
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a12df5abde0b..8be2dca1e1d7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -177,7 +177,7 @@ static bool can_stop_full_tick(void)
* TODO: kick full dynticks CPUs when
* sched_clock_stable is set.
*/
- if (!sched_clock_stable) {
+ if (!sched_clock_stable()) {
trace_tick_stop(0, "unstable sched clock\n");
/*
* Don't allow the user to think they can get
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc2f66f68dc5..294b8a271a04 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2558,7 +2558,7 @@ rb_reserve_next_event(struct ring_buffer *buffer,
if (unlikely(test_time_stamp(delta))) {
int local_clock_stable = 1;
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
- local_clock_stable = sched_clock_stable;
+ local_clock_stable = sched_clock_stable();
#endif
WARN_ONCE(delta > (1ULL << 59),
KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
* Re: [PATCH 7/8] sched, net: Fixup busy_loop_us_clock()
2013-11-28 17:40 ` Peter Zijlstra
2013-11-28 18:50 ` Peter Zijlstra
@ 2013-11-29 13:52 ` Eliezer Tamir
1 sibling, 0 replies; 14+ messages in thread
From: Eliezer Tamir @ 2013-11-29 13:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Arjan van de Ven, lenb, rjw, David Miller, rui.zhang,
jacob.jun.pan, Mike Galbraith, Ingo Molnar, hpa, Thomas Gleixner,
linux-kernel, linux-pm
On 28/11/2013 19:40, Peter Zijlstra wrote:
> On Thu, Nov 28, 2013 at 06:49:00PM +0200, Eliezer Tamir wrote:
>> I have tested this patch and I see a performance regression of about
>> 1.5%.
>
> Cute, can you qualify your metric? Since this is a poll loop the only
> metric that would be interesting is the response latency. Is that what's
> increased by 1.5%? Also, what's the standard deviation of your result?
Sorry, I should have been more specific.
I use netperf TCP_RR, with all settings except the test time (30s) left at
their defaults. The setup is exactly the same as in the commit message of the
original patch set.
I get 91.5 KRR/s vs. 90.0 KRR/s.
Unfortunately you need two machines, both of which need NICs whose drivers
support busy poll. Currently, AFAIK, bnx2x, ixgbe, mlx4 and myri10ge are the
only ones, but it's not that hard to add to most NAPI-based drivers.
I will try to test your latest patches and hopefully also get some perf
numbers on Sunday.
Thanks,
Eliezer
Thread overview: 14+ messages
2013-11-26 15:57 [PATCH 0/8] Cure faux idle wreckage Peter Zijlstra
2013-11-26 15:57 ` [PATCH 1/8] x86, acpi, idle: Restructure the mwait idle routines Peter Zijlstra
2013-11-26 15:57 ` [PATCH 2/8] sched, preempt: Fixup missed PREEMPT_NEED_RESCHED folding Peter Zijlstra
2013-11-26 15:57 ` [PATCH 3/8] idle, thermal, acpi: Remove home grown idle implementations Peter Zijlstra
2013-11-26 15:57 ` [PATCH 4/8] preempt, locking: Rework local_bh_{dis,en}able() Peter Zijlstra
2013-11-26 15:57 ` [PATCH 5/8] locking: Optimize lock_bh functions Peter Zijlstra
2013-11-26 15:57 ` [PATCH 6/8] sched, net: Clean up preempt_enable_no_resched() abuse Peter Zijlstra
2013-11-26 15:57 ` [PATCH 7/8] sched, net: Fixup busy_loop_us_clock() Peter Zijlstra
2013-11-28 16:49 ` Eliezer Tamir
2013-11-28 17:40 ` Peter Zijlstra
2013-11-28 18:50 ` Peter Zijlstra
2013-11-29 13:52 ` Eliezer Tamir
2013-11-26 15:57 ` [PATCH 8/8] preempt: Take away preempt_enable_no_resched() from modules Peter Zijlstra
2013-11-26 23:23 ` [PATCH 0/8] Cure faux idle wreckage Jacob Pan