* [PATCH 0/3] nohz: Some fixes and updates
@ 2013-05-10 21:12 Steven Rostedt
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Steven Rostedt @ 2013-05-10 21:12 UTC (permalink / raw)
To: linux-kernel
Cc: Frederic Weisbecker, Paul E. McKenney, Ingo Molnar, Andrew Morton,
Thomas Gleixner, H. Peter Anvin
Frederic,
Can you take these patches? The first patch will need an ack
from one of the x86 maintainers. That one adds a new schedule_preempt_user()
that is called before going back to user space in order to safely
switch from user context before calling schedule.
The second patch prevents LOCKUP_DETECTOR from being selected if
NO_HZ_FULL is enabled. That's because if LOCKUP_DETECTOR is selected, it will
prevent NO_HZ_FULL from functioning. I first thought about having
NO_HZ_FULL depend on !LOCKUP_DETECTOR, but that would hide the feature.
As NO_HZ_FULL is a feature and LOCKUP_DETECTOR is for debugging, I
gave precedence to the feature.
The last patch produces a warning if nohz_full is used or NO_HZ_FULL_ALL
is defined, and the system fails to go into nohz_full mode because
the machine's clock is unstable. We don't want users thinking they
have nohz_full when their machine can't handle it.
-- Steve
---
arch/x86/include/asm/context_tracking.h | 10 ---------
b/arch/x86/kernel/entry_64.S | 26 +++++--------------------
b/kernel/sched/core.c | 33 +++++++++++++++++---------------
b/kernel/time/tick-sched.c | 5 ++++
b/lib/Kconfig.debug | 2 +
5 files changed, 31 insertions(+), 45 deletions(-)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-10 21:12 [PATCH 0/3] nohz: Some fixes and updates Steven Rostedt
@ 2013-05-10 21:12 ` Steven Rostedt
2013-05-11 1:36 ` Frederic Weisbecker
` (2 more replies)
2013-05-10 21:12 ` [PATCH 2/3] nohz: Disable LOCKUP_DETECTOR when NO_HZ_FULL is enabled Steven Rostedt
2013-05-10 21:12 ` [PATCH 3/3] nohz: Warn if the machine can not perform nohz_full Steven Rostedt
2 siblings, 3 replies; 10+ messages in thread
From: Steven Rostedt @ 2013-05-10 21:12 UTC (permalink / raw)
To: linux-kernel
Cc: Frederic Weisbecker, Paul E. McKenney, Ingo Molnar, Andrew Morton,
Thomas Gleixner, H. Peter Anvin
[-- Attachment #1: fix-user-exit-preempt.patch --]
[-- Type: text/plain, Size: 8226 bytes --]
I started testing the new NOHZ_FULL in the kernel and had some issues,
so I started function tracing and this bug report came out:
[23446.458073] ------------[ cut here ]------------
[23446.461028] WARNING:
at /home/rostedt/work/git/linux-trace.git/kernel/rcutree.c:388
rcu_eqs_enter+0x4b/0x89()
[23446.466096] Modules linked in: ebtables ipt_MASQUERADE sunrpc bridge
stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
ip6_tables ipv6 uinput snd_hda_codec_id
t snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm
kvm_intel snd_timer snd kvm soundcore shpchp snd_page_alloc microcode
i2c_i801 pata_acpi firewire_ohci firewi
re_core crc_itu_t ata_generic i915 drm_kms_helper drm i2c_algo_bit
i2c_core video
[23446.466096] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-test+ #11
[23446.466096] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[23446.466096] ffffffff814879f0 ffffffff81004fa6 ffffffff810337af
ffff88007d400000
[23446.466096] 0000000000000000 0000000000000000 ffff88007d40d900
ffffffff81a01f00
[23446.466096] ffffffff81acba00 ffff88007d6495c0 ffffffff810a1f36
0000000000000a28
[23446.466096] Call Trace:
[23446.466096] [<ffffffff814879f0>] ? dump_stack+0xd/0x17
[23446.466096] [<ffffffff81004fa6>] ? show_stack+0x5/0x3e
[23446.466096] [<ffffffff810337af>] ? warn_slowpath_common+0x64/0x7c
[23446.466096] [<ffffffff810a1f36>] ? rcu_eqs_enter+0x4b/0x89
[23446.466096] [<ffffffff810a1f8e>] ? rcu_idle_enter+0x1a/0x30
[23446.466096] [<ffffffff81491bc5>] ? ftrace_graph_caller+0x85/0x85
[23446.466096] [<ffffffff810707f3>] cpu_startup_entry+0xb3/0x11c
[23446.466096] [<ffffffff81ae9d2e>] ? start_kernel+0x41a/0x425
[23446.466096] [<ffffffff81ae972d>] ? repair_env_string+0x54/0x54
[23446.466096] ---[ end trace ddbb69ae2a0f6687 ]---
[23446.466096] ------------[ cut here ]------------
This is caused by an imbalance of rcu_eqs_enter() and rcu_eqs_exit().
Adding more tracing, I also discovered that there seemed to be an
imbalance in user_exit() and user_enter(). Specifically, I found that
user_enter() was called, and then a migration happened and user_exit()
was never called before the schedule. Then I noticed this in my
trace:
run-isol-11546 1.N.. 37363.637641: function: schedule_user <-- retint_careful
run-isol-11546 1.N.. 37363.637641: function: __schedule <-- preempt_schedule
The funny part here is that schedule_user does not call
preempt_schedule. But then I also noticed the flags of the call to
schedule_user(): "1.N..". This shows that it was running on cpu 1 with
NEED_RESCHED set (N), but with both interrupts and preemption enabled. If
an interrupt came in here we could schedule out again, but that would call
preempt_schedule_irq(), which has code to check if it's in "user context
(according to dynamic ticks)" and would do the user_exit() too. But
something didn't seem right here.
Then I realized that it was actually the function tracer itself: for
every function it traces, it calls preempt_disable() and
preempt_enable() to record the data in a per-cpu buffer. That
preempt_enable() noticed NEED_RESCHED set, and since preemption was
enabled, it kindly called preempt_schedule() for us!
All this before schedule_user() was able to call user_exit() and take us
out of dynamic tick user context.
I even verified this by adding a:
preempt_disable();
preempt_enable();
just before the user_exit() and ran a test without doing any tracing and
was able to easily hit this bug.
I then looked at the code in entry_64.S and noticed that the calls to
SCHEDULE_USER were all surrounded by ENABLE_INTERRUPTS and
DISABLE_INTERRUPTS before calling schedule. And the SCHEDULE_USER macro
is different depending on whether CONTEXT_TRACKING is enabled or not.
I created a new schedule_preempt_user(), which is similar to
schedule_preempt_disabled() and works like preempt_schedule_irq(): it
includes the exception_enter() and exception_exit() routines that take care
of the accounting of dynamic ticks when interrupting user mode or not. It
also takes care of this before enabling interrupts, so we don't need to
worry about any preemption happening at the wrong time.
Now the code always calls schedule_preempt_user() whether or not
CONTEXT_TRACKING is configured, which cleans up entry_64.S and removes
the need for the context_tracking.h header file in asm.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-trace.git/arch/x86/include/asm/context_tracking.h
===================================================================
--- linux-trace.git.orig/arch/x86/include/asm/context_tracking.h
+++ /dev/null
@@ -1,10 +0,0 @@
-#ifndef _ASM_X86_CONTEXT_TRACKING_H
-#define _ASM_X86_CONTEXT_TRACKING_H
-
-#ifdef CONFIG_CONTEXT_TRACKING
-# define SCHEDULE_USER call schedule_user
-#else
-# define SCHEDULE_USER call schedule
-#endif
-
-#endif
Index: linux-trace.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-trace.git.orig/arch/x86/kernel/entry_64.S
+++ linux-trace.git/arch/x86/kernel/entry_64.S
@@ -56,7 +56,6 @@
#include <asm/ftrace.h>
#include <asm/percpu.h>
#include <asm/asm.h>
-#include <asm/context_tracking.h>
#include <asm/smap.h>
#include <linux/err.h>
@@ -654,6 +653,7 @@ sysret_check:
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
+sysret_check_irqsoff:
movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
andl %edi,%edx
jnz sysret_careful
@@ -675,12 +675,10 @@ sysret_check:
sysret_careful:
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
- SCHEDULE_USER
+ call schedule_preempt_user
popq_cfi %rdi
- jmp sysret_check
+ jmp sysret_check_irqsoff
/* Handle a signal */
sysret_signal:
@@ -788,13 +786,9 @@ GLOBAL(int_with_check)
int_careful:
bt $TIF_NEED_RESCHED,%edx
jnc int_very_careful
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
- SCHEDULE_USER
+ call schedule_preempt_user
popq_cfi %rdi
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
jmp int_with_check
/* handle signals and tracing -- both require a full stack frame */
@@ -1088,14 +1082,10 @@ retint_careful:
CFI_RESTORE_STATE
bt $TIF_NEED_RESCHED,%edx
jnc retint_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi %rdi
- SCHEDULE_USER
+ call schedule_preempt_user
popq_cfi %rdi
GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
jmp retint_check
retint_signal:
@@ -1534,11 +1524,7 @@ paranoid_userspace:
TRACE_IRQS_OFF
jmp paranoid_userspace
paranoid_schedule:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY)
- SCHEDULE_USER
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
+ call schedule_preempt_user
jmp paranoid_userspace
CFI_ENDPROC
END(paranoid_exit)
Index: linux-trace.git/kernel/sched/core.c
===================================================================
--- linux-trace.git.orig/kernel/sched/core.c
+++ linux-trace.git/kernel/sched/core.c
@@ -3034,21 +3034,6 @@ asmlinkage void __sched schedule(void)
}
EXPORT_SYMBOL(schedule);
-#ifdef CONFIG_CONTEXT_TRACKING
-asmlinkage void __sched schedule_user(void)
-{
- /*
- * If we come here after a random call to set_need_resched(),
- * or we have been woken up remotely but the IPI has not yet arrived,
- * we haven't yet exited the RCU idle mode. Do it here manually until
- * we find a better solution.
- */
- user_exit();
- schedule();
- user_enter();
-}
-#endif
-
/**
* schedule_preempt_disabled - called with preemption disabled
*
@@ -3127,6 +3112,24 @@ asmlinkage void __sched preempt_schedule
#endif /* CONFIG_PREEMPT */
+/*
+ * This is an entry point to the scheduler just before going
+ * back to user space. It is called with irqs disabled,
+ * which prevents races with the CONTEXT_TRACKING updates.
+ */
+asmlinkage void __sched schedule_preempt_user(void)
+{
+ enum ctx_state prev_state;
+
+ prev_state = exception_enter();
+
+ local_irq_enable();
+ __schedule();
+ local_irq_disable();
+
+ exception_exit(prev_state);
+}
+
int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
void *key)
{
* [PATCH 2/3] nohz: Disable LOCKUP_DETECTOR when NO_HZ_FULL is enabled
2013-05-10 21:12 [PATCH 0/3] nohz: Some fixes and updates Steven Rostedt
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
@ 2013-05-10 21:12 ` Steven Rostedt
2013-05-10 21:12 ` [PATCH 3/3] nohz: Warn if the machine can not perform nohz_full Steven Rostedt
2 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2013-05-10 21:12 UTC (permalink / raw)
To: linux-kernel
Cc: Frederic Weisbecker, Paul E. McKenney, Ingo Molnar, Andrew Morton,
Thomas Gleixner, H. Peter Anvin
[-- Attachment #1: no-hz-full-prevents-lockup-detector.patch --]
[-- Type: text/plain, Size: 1156 bytes --]
Trying to test the nohz_full code, I was not able to get it to work.
Finally I enabled the tick_stop tracepoint and it showed:
tick_stop: success=no msg=perf events running
I talked to Frederic Weisbecker about this and he informed me that
perf is used by the lockup detector. I checked the code, and sure
enough it is.
As perf is always running when LOCKUP_DETECTOR is enabled, it will
always prevent nohz_full from working. Instead of confusing users,
disable LOCKUP_DETECTOR when NO_HZ_FULL is enabled.
When perf is changed such that it does not prevent nohz_full from
working, we can and should remove this constraint.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 566cf2b..1364d09 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -174,6 +174,8 @@ config DEBUG_SHIRQ
config LOCKUP_DETECTOR
bool "Detect Hard and Soft Lockups"
depends on DEBUG_KERNEL && !S390
+ # Lockup detector currently prevents NO_HZ_FULL from working
+ depends on !NO_HZ_FULL
help
Say Y here to enable the kernel to act as a watchdog to detect
hard and soft lockups.
* [PATCH 3/3] nohz: Warn if the machine can not perform nohz_full
2013-05-10 21:12 [PATCH 0/3] nohz: Some fixes and updates Steven Rostedt
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
2013-05-10 21:12 ` [PATCH 2/3] nohz: Disable LOCKUP_DETECTOR when NO_HZ_FULL is enabled Steven Rostedt
@ 2013-05-10 21:12 ` Steven Rostedt
2 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2013-05-10 21:12 UTC (permalink / raw)
To: linux-kernel
Cc: Frederic Weisbecker, Paul E. McKenney, Ingo Molnar, Andrew Morton,
Thomas Gleixner, H. Peter Anvin
[-- Attachment #1: warn-no-hz.patch --]
[-- Type: text/plain, Size: 907 bytes --]
If the user configures NO_HZ_FULL and defines nohz_full=XXX on the
kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails
due to the machine having an unstable clock, warn about it.
We do not want users thinking that they are getting the benefit
of nohz when their machine cannot support it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-trace.git/kernel/time/tick-sched.c
===================================================================
--- linux-trace.git.orig/kernel/time/tick-sched.c
+++ linux-trace.git/kernel/time/tick-sched.c
@@ -178,6 +178,11 @@ static bool can_stop_full_tick(void)
*/
if (!sched_clock_stable) {
trace_tick_stop(0, "unstable sched clock\n");
+ /*
+ * Don't allow the user to think they can get
+ * full NO_HZ with this machine.
+ */
+ WARN_ONCE(1, "NO_HZ FULL will not work with unstable sched clock");
return false;
}
#endif
* Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
@ 2013-05-11 1:36 ` Frederic Weisbecker
2013-05-13 9:56 ` Li Zhong
2013-05-14 14:13 ` Frederic Weisbecker
2 siblings, 0 replies; 10+ messages in thread
From: Frederic Weisbecker @ 2013-05-11 1:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, Andrew Morton,
Thomas Gleixner, H. Peter Anvin
On Fri, May 10, 2013 at 05:12:26PM -0400, Steven Rostedt wrote:
> +/*
> + * This is a entry point to the scheduler() just before going
> + * back to user space. This is called with irqs disabled
> + * which prevents races with the CONTEXT_TRACKING updates.
> + */
> +asmlinkage void __sched schedule_preempt_user(void)
> +{
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> +
> + local_irq_enable();
> + __schedule();
> + local_irq_disable();
> +
> + exception_exit(prev_state);
So since it's only ever called right before resuming to userspace, and after
the user_exit() call from the end of the syscall/exception/irq code,
you can use user_enter()/user_exit() directly.
I'm also wondering if this assumption that irqs are disabled by the time
we do user preemption is x86-centric or not. Maybe we can wait for complaints
from those who'll port it...
* Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
2013-05-11 1:36 ` Frederic Weisbecker
@ 2013-05-13 9:56 ` Li Zhong
2013-05-13 15:03 ` Steven Rostedt
2013-05-14 14:13 ` Frederic Weisbecker
2 siblings, 1 reply; 10+ messages in thread
From: Li Zhong @ 2013-05-13 9:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Frederic Weisbecker, Paul E. McKenney, Ingo Molnar,
Andrew Morton, Thomas Gleixner, H. Peter Anvin
On Fri, 2013-05-10 at 17:12 -0400, Steven Rostedt wrote:
> plain text document attachment (fix-user-exit-preempt.patch)
> I started testing the new NOHZ_FULL in the kernel and had some issues,
> so I started function tracing and this bug report came out:
>
>
> [23446.458073] ------------[ cut here ]------------
> [23446.461028] WARNING:
> at /home/rostedt/work/git/linux-trace.git/kernel/rcutree.c:388
> rcu_eqs_enter+0x4b/0x89()
> [23446.466096] Modules linked in: ebtables ipt_MASQUERADE sunrpc bridge
> stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables ipv6 uinput snd_hda_codec_id
> t snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm
> kvm_intel snd_timer snd kvm soundcore shpchp snd_page_alloc microcode
> i2c_i801 pata_acpi firewire_ohci firewi
> re_core crc_itu_t ata_generic i915 drm_kms_helper drm i2c_algo_bit
> i2c_core video
> [23446.466096] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-test+ #11
> [23446.466096] Hardware name: To Be Filled By O.E.M. To Be Filled By
> O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
> [23446.466096] ffffffff814879f0 ffffffff81004fa6 ffffffff810337af
> ffff88007d400000
> [23446.466096] 0000000000000000 0000000000000000 ffff88007d40d900
> ffffffff81a01f00
> [23446.466096] ffffffff81acba00 ffff88007d6495c0 ffffffff810a1f36
> 0000000000000a28
> [23446.466096] Call Trace:
> [23446.466096] [<ffffffff814879f0>] ? dump_stack+0xd/0x17
> [23446.466096] [<ffffffff81004fa6>] ? show_stack+0x5/0x3e
> [23446.466096] [<ffffffff810337af>] ? warn_slowpath_common+0x64/0x7c
> [23446.466096] [<ffffffff810a1f36>] ? rcu_eqs_enter+0x4b/0x89
> [23446.466096] [<ffffffff810a1f8e>] ? rcu_idle_enter+0x1a/0x30
> [23446.466096] [<ffffffff81491bc5>] ? ftrace_graph_caller+0x85/0x85
> [23446.466096] [<ffffffff810707f3>] cpu_startup_entry+0xb3/0x11c
> [23446.466096] [<ffffffff81ae9d2e>] ? start_kernel+0x41a/0x425
> [23446.466096] [<ffffffff81ae972d>] ? repair_env_string+0x54/0x54
> [23446.466096] ---[ end trace ddbb69ae2a0f6687 ]---
> [23446.466096] ------------[ cut here ]------------
>
> This is caused by an imbalance of rcu_eqs_enter() and rcu_eqs_exit().
> Adding more tracing, I also discovered that there seemed to be an
> imbalance in user_exit() and user_enter(). Specifically, I found that
> user_enter() was called, and then a migration happened and user_exit()
> was never called before the schedule. Then I had noticed this in my
> trace:
>
> run-isol-11546 1.N.. 37363.637641: function: schedule_user <-- retint_careful
> run-isol-11546 1.N.. 37363.637641: function: __schedule <-- preempt_schedule
>
> The funny part here is that schedule_user does not call
> preempt_schedule. But then I also noticed the flags of the call to
> schedule_user(): "1.N..". This shows that it was running on cpu 1 with
> NEED_RESCHED set (N), but both interrupts and preemption are enabled. If
> an interrupt came in here we can schedule out again, but that would call
> preempt_schedule_irq, which has code to check if its in "user context
> (according to dynamic_ticks)" and would do the user_exit() too. But
> something didn't seem right here.
>
> Then I realized that it was actually the function tracer itself, as for
> every function it traces, it calls preempt_disable() and
> preempt_enable() to record the data on a per_cpu buffer. That
> preempt_enable() noticed the NEED_RESCHED set and since preemption is
> enabled, it kindly called preempt_schedule() for us!
>
> All this before schedule_user() was able to call user_exit() and take us
> out of dynamic tick user context.
Maybe we could just disable function tracing for schedule_user()?
It seems that function tracing might use RCU, at least in __schedule()
when calling preempt_enable(), if preemption is really enabled.
user_exit() is used to allow RCU usage after that (with rcu_user_exit()).
RCU usage is not allowed before that because the cpu is in user eqs. And if
function tracing needs to use RCU, then it seems user_exit() itself or its
callers couldn't be function traced.
If __schedule() in preempt_enable() is the only place function tracing
uses RCU, then after converting to schedule_preempt_user() it is safe,
as irqs are disabled for __schedule() to happen. But user_exit() itself
and some other callers might still need function tracing to be disabled.
Thanks, Zhong
>
> I even verified this by adding a:
>
> preempt_disable();
> preempt_enable();
>
> just before the user_exit() and ran a test without doing any tracing and
> was able to easily hit this bug.
>
> I then looked at the code in entry_64.S, and noticed that the calls to
> SCHEDULE_USER were all encompassed with ENABLE_INTERRUPTS and
> DISABLE_INTERRUPTS before calling schedule. And the SCHEDULE_USER macro
> is different if we have CONTEXT_TRACKING enabled or not.
>
> I created a new schedule_preempt_user() which is similar to
> schedule_preempt_disable() and works like preempt_schedule_irq(), where
> it includes the exception_enter() and exit() routines that take care of
> the accounting of dynamic ticks when interrupting user mode or not. It
> also takes care of it before enabling interrupts, so we don't need to
> worry about any preemption happening at the wrong time.
>
> Now the code always calls schedule_preempt_user() whether or not
> CONTEXT_TRACKING is configured, which cleans up entry_64.S and removes
> the need of the context_tracking.h header file in asm.
>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>
> Index: linux-trace.git/arch/x86/include/asm/context_tracking.h
> ===================================================================
> --- linux-trace.git.orig/arch/x86/include/asm/context_tracking.h
> +++ /dev/null
> @@ -1,10 +0,0 @@
> -#ifndef _ASM_X86_CONTEXT_TRACKING_H
> -#define _ASM_X86_CONTEXT_TRACKING_H
> -
> -#ifdef CONFIG_CONTEXT_TRACKING
> -# define SCHEDULE_USER call schedule_user
> -#else
> -# define SCHEDULE_USER call schedule
> -#endif
> -
> -#endif
> Index: linux-trace.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-trace.git.orig/arch/x86/kernel/entry_64.S
> +++ linux-trace.git/arch/x86/kernel/entry_64.S
> @@ -56,7 +56,6 @@
> #include <asm/ftrace.h>
> #include <asm/percpu.h>
> #include <asm/asm.h>
> -#include <asm/context_tracking.h>
> #include <asm/smap.h>
> #include <linux/err.h>
>
> @@ -654,6 +653,7 @@ sysret_check:
> LOCKDEP_SYS_EXIT
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> +sysret_check_irqsoff:
> movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
> andl %edi,%edx
> jnz sysret_careful
> @@ -675,12 +675,10 @@ sysret_check:
> sysret_careful:
> bt $TIF_NEED_RESCHED,%edx
> jnc sysret_signal
> - TRACE_IRQS_ON
> - ENABLE_INTERRUPTS(CLBR_NONE)
> pushq_cfi %rdi
> - SCHEDULE_USER
> + call schedule_preempt_user
> popq_cfi %rdi
> - jmp sysret_check
> + jmp sysret_check_irqsoff
>
> /* Handle a signal */
> sysret_signal:
> @@ -788,13 +786,9 @@ GLOBAL(int_with_check)
> int_careful:
> bt $TIF_NEED_RESCHED,%edx
> jnc int_very_careful
> - TRACE_IRQS_ON
> - ENABLE_INTERRUPTS(CLBR_NONE)
> pushq_cfi %rdi
> - SCHEDULE_USER
> + call schedule_preempt_user
> popq_cfi %rdi
> - DISABLE_INTERRUPTS(CLBR_NONE)
> - TRACE_IRQS_OFF
> jmp int_with_check
>
> /* handle signals and tracing -- both require a full stack frame */
> @@ -1088,14 +1082,10 @@ retint_careful:
> CFI_RESTORE_STATE
> bt $TIF_NEED_RESCHED,%edx
> jnc retint_signal
> - TRACE_IRQS_ON
> - ENABLE_INTERRUPTS(CLBR_NONE)
> pushq_cfi %rdi
> - SCHEDULE_USER
> + call schedule_preempt_user
> popq_cfi %rdi
> GET_THREAD_INFO(%rcx)
> - DISABLE_INTERRUPTS(CLBR_NONE)
> - TRACE_IRQS_OFF
> jmp retint_check
>
> retint_signal:
> @@ -1534,11 +1524,7 @@ paranoid_userspace:
> TRACE_IRQS_OFF
> jmp paranoid_userspace
> paranoid_schedule:
> - TRACE_IRQS_ON
> - ENABLE_INTERRUPTS(CLBR_ANY)
> - SCHEDULE_USER
> - DISABLE_INTERRUPTS(CLBR_ANY)
> - TRACE_IRQS_OFF
> + call schedule_preempt_user
> jmp paranoid_userspace
> CFI_ENDPROC
> END(paranoid_exit)
> Index: linux-trace.git/kernel/sched/core.c
> ===================================================================
> --- linux-trace.git.orig/kernel/sched/core.c
> +++ linux-trace.git/kernel/sched/core.c
> @@ -3034,21 +3034,6 @@ asmlinkage void __sched schedule(void)
> }
> EXPORT_SYMBOL(schedule);
>
> -#ifdef CONFIG_CONTEXT_TRACKING
> -asmlinkage void __sched schedule_user(void)
> -{
> - /*
> - * If we come here after a random call to set_need_resched(),
> - * or we have been woken up remotely but the IPI has not yet arrived,
> - * we haven't yet exited the RCU idle mode. Do it here manually until
> - * we find a better solution.
> - */
> - user_exit();
> - schedule();
> - user_enter();
> -}
> -#endif
> -
> /**
> * schedule_preempt_disabled - called with preemption disabled
> *
> @@ -3127,6 +3112,24 @@ asmlinkage void __sched preempt_schedule
>
> #endif /* CONFIG_PREEMPT */
>
> +/*
> + * This is a entry point to the scheduler() just before going
> + * back to user space. This is called with irqs disabled
> + * which prevents races with the CONTEXT_TRACKING updates.
> + */
> +asmlinkage void __sched schedule_preempt_user(void)
> +{
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> +
> + local_irq_enable();
> + __schedule();
> + local_irq_disable();
> +
> + exception_exit(prev_state);
> +}
> +
> int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
> void *key)
> {
>
* Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-13 9:56 ` Li Zhong
@ 2013-05-13 15:03 ` Steven Rostedt
2013-05-14 9:44 ` Li Zhong
0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2013-05-13 15:03 UTC (permalink / raw)
To: Li Zhong
Cc: linux-kernel, Frederic Weisbecker, Paul E. McKenney, Ingo Molnar,
Andrew Morton, Thomas Gleixner, H. Peter Anvin
On Mon, 2013-05-13 at 17:56 +0800, Li Zhong wrote:
> > All this before schedule_user() was able to call user_exit() and take us
> > out of dynamic tick user context.
>
> Maybe we could just disable function trace for schedule_user()?
>
> It seems that function trace might use RCU, at least in __schedule()
> when calling preempt_enable(), and if preemption is really enabled.
There is a small use of RCU scheduling (preempt disable/enable) in
function tracing. One is for perf and dynamic call traces and some hash
code (multiple users of the function tracer). But those are the unique
cases, and I really do not want to remove this function from tracing just
for the uncommon case. Function tracing has proven extremely useful in
debugging RCU and context tracking. By adding notrace, it's a lost cause.
Not to mention, adding trace_printk() there would break it too; that does
not use any RCU, but it still disables preemption, because writing to the
ring buffer requires staying on the same CPU.
>
> user_exit() is used to allow RCU usage after that (with rcu_user_exit).
> RCU usage is not allowed before that because cpu is in user eqs. And if
> function trace needs use RCU, then it seems user_exit() itself or its
> callers couldn't be function traced.
And it can't be debugged either.
I can probably blacklist those functions manually, such that only the
main function tracer can trace them; all others will not. In other words, I
can make it such that the function tracer will not trace those functions for
perf, dynamic function tracers, or anything that requires changing the
function list.
I could probably also add a heavier weight synchronize_sched() that even
synchronizes cpus in userspace.
>
> If __schedule() in preempt_enable() is the only place function trace
> uses RCU, then after converting to schedule_preempt_user(), it is safe
> as irqs are disabled for __schedule() to happen. But user_exit() itself
> and some other callers might still need function trace be disabled.
The above makes no sense to me. Function tracing didn't break, but the
user_exit() did, because of a preemption in the wrong place, as there was
no protection there to prevent an unnecessary preemption.
-- Steve
* Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-13 15:03 ` Steven Rostedt
@ 2013-05-14 9:44 ` Li Zhong
0 siblings, 0 replies; 10+ messages in thread
From: Li Zhong @ 2013-05-14 9:44 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Frederic Weisbecker, Paul E. McKenney, Ingo Molnar,
Andrew Morton, Thomas Gleixner, H. Peter Anvin
On Mon, 2013-05-13 at 11:03 -0400, Steven Rostedt wrote:
> On Mon, 2013-05-13 at 17:56 +0800, Li Zhong wrote:
>
> > > All this before schedule_user() was able to call user_exit() and take us
> > > out of dynamic tick user context.
> >
> > Maybe we could just disable function trace for schedule_user()?
> >
> > It seems that function trace might use RCU, at least in __schedule()
> > when calling preempt_enable(), and if preemption is really enabled.
>
> There is a small use of rcu scheduling (preempt disable/enable) in
> function tracing. One is for perf and dynamic call traces and some hash
> code (multiple users of the function tracer). But those are the unique
> cases, and I really do not want to remove this function just for the non
> common case. Function tracing has proven extremely useful in debugging
> RCU and context tracking. By adding notrace, it's a lost cause.
OK, so it seems a bad idea to disable tracing here.
However, as schedule is traced, we might get something like
.schedule <- .schedule_user to know that schedule_user was called?
>
> Not to mention, adding trace_printk() there will break it too, and that
> does not use any RCU but still disables preemption because the writing
> to the ring buffer requires staying on the same CPU.
>
I guess you mean the preemption enable after the buffer is written might
cause a task migration, and break the pairing of user_enter()/user_exit()?
Then how about adding the trace_printk() after user_exit() there?
> >
> > user_exit() is used to allow RCU usage after that (with rcu_user_exit).
> > RCU usage is not allowed before that because cpu is in user eqs. And if
> > function trace needs use RCU, then it seems user_exit() itself or its
> > callers couldn't be function traced.
>
> And it can't be debugged either.
>
> I can probably black list those functions manually, such that only the
> main function tracer can trace it, all others will not. In otherwords, I
> can have it such that function tracer will not trace those functions for
> perf, dynamic function tracers, or anything that requires changing the
> function list.
I'm totally lost. I need some more time to read the trace code (please
forgive me for the silly questions in this mail...)
By "main function tracer", do you mean function_trace_call()? And I
don't understand how to blacklist some functions manually.
If function_trace_call() is the main function tracer you mentioned,
there is a preemption disable/enable in it. A schedule might happen where
preemption is enabled. Or maybe you mean we should make all those
functions be called with preemption disabled (or irqs disabled)?
Then syscall_trace_enter(), syscall_trace_leave(), and do_notify_resume()
would need to be converted in a similar way to schedule_preempt_user()?
Thanks, Zhong
> I could probably also add a heavier weight synchronize_sched() that even
> synchronizes cpus in userspace.
>
> >
> > If __schedule() in preempt_enable() is the only place the function tracer
> > uses RCU, then after converting to schedule_preempt_user() it is safe,
> > as irqs are disabled where __schedule() happens. But user_exit() itself
> > and some other callers might still need function tracing to be disabled.
>
> The above makes no sense to me. function tracing didn't break, but the
> user_exit() did because of a preemption in the wrong place, as there was
> no protection there to prevent an unnecessary preemption.
>
>
> -- Steve
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
2013-05-11 1:36 ` Frederic Weisbecker
2013-05-13 9:56 ` Li Zhong
@ 2013-05-14 14:13 ` Frederic Weisbecker
2013-05-15 1:38 ` Li Zhong
2 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2013-05-14 14:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, Paul E. McKenney, Ingo Molnar, Andrew Morton,
Thomas Gleixner, H. Peter Anvin
On Fri, May 10, 2013 at 05:12:26PM -0400, Steven Rostedt wrote:
> +/*
> + * This is an entry point to the scheduler just before going
> + * back to user space. It is called with irqs disabled,
> + * which prevents races with the CONTEXT_TRACKING updates.
> + */
> +asmlinkage void __sched schedule_preempt_user(void)
> +{
> + enum ctx_state prev_state;
> +
> + prev_state = exception_enter();
> +
> + local_irq_enable();
> + __schedule();
> + local_irq_disable();
> +
> + exception_exit(prev_state);
> +}
Ok, I just had a look at how ARM and PPC64 handle user preemption, and it seems
that irqs are disabled around the call to schedule() on those archs too. Although
do_work_pending() on ARM surprisingly doesn't re-enable irqs before calling schedule?
Anyway, having irqs disabled around user preemption seems to be a requirement to make
sure the TIF_NEED_RESCHED check doesn't race against irqs on the return to userspace.
So I guess we can keep the above function as it is.
But perhaps we should queue this for 3.11, given that it's a bit of a sensitive change
in the x86 user return path.
Look, I'm just going to make a separate pull request with this patch based on 3.10-rc1
and let Ingo choose the target.
(Meanwhile I still think it would be a good idea to keep LOCKDEP_SYS_EXIT in the loop :-)
Thanks!
* Re: [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S
2013-05-14 14:13 ` Frederic Weisbecker
@ 2013-05-15 1:38 ` Li Zhong
0 siblings, 0 replies; 10+ messages in thread
From: Li Zhong @ 2013-05-15 1:38 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Steven Rostedt, linux-kernel, Paul E. McKenney, Ingo Molnar,
Andrew Morton, Thomas Gleixner, H. Peter Anvin
On Tue, 2013-05-14 at 16:13 +0200, Frederic Weisbecker wrote:
> On Fri, May 10, 2013 at 05:12:26PM -0400, Steven Rostedt wrote:
> > +/*
> > + * This is an entry point to the scheduler just before going
> > + * back to user space. It is called with irqs disabled,
> > + * which prevents races with the CONTEXT_TRACKING updates.
> > + */
> > +asmlinkage void __sched schedule_preempt_user(void)
> > +{
> > + enum ctx_state prev_state;
> > +
> > + prev_state = exception_enter();
> > +
> > + local_irq_enable();
> > + __schedule();
> > + local_irq_disable();
> > +
> > + exception_exit(prev_state);
> > +}
>
> Ok, I just had a look at how ARM and PPC64 handle user preemption, and it seems
> that irqs are disabled around the call to schedule() on those archs too. Although
> do_work_pending() on ARM surprisingly doesn't re-enable irqs before calling schedule?
>
> Anyway, having irqs disabled around user preemption seems to be a requirement to make
> sure the TIF_NEED_RESCHED check doesn't race against irqs on the return to userspace.
> So I guess we can keep the above function as it is.
>
> But perhaps we should queue this for 3.11, given that it's a bit of a sensitive change
> in the x86 user return path.
>
> Look, I'm just going to make a separate pull request with this patch based on 3.10-rc1
> and let Ingo choose the target.
Could the deletion of schedule_user() be separated out into another patch?
Then, for other archs such as ppc64, we could have both
schedule_preempt_user() and schedule_user() available during the conversion.
Or are there better ways to avoid the conflict with the arch trees?
Thanks, Zhong
>
> (Meanwhile I still think it would be a good idea to keep LOCKDEP_SYS_EXIT in the loop :-)
>
> Thanks!
end of thread, other threads:[~2013-05-15 1:39 UTC | newest]
Thread overview: 10+ messages
2013-05-10 21:12 [PATCH 0/3] nohz: Some fixes and updates Steven Rostedt
2013-05-10 21:12 ` [PATCH 1/3] x86/sched/context_tracking: Call new schedule_preempt_user() from entry_64.S Steven Rostedt
2013-05-11 1:36 ` Frederic Weisbecker
2013-05-13 9:56 ` Li Zhong
2013-05-13 15:03 ` Steven Rostedt
2013-05-14 9:44 ` Li Zhong
2013-05-14 14:13 ` Frederic Weisbecker
2013-05-15 1:38 ` Li Zhong
2013-05-10 21:12 ` [PATCH 2/3] nohz: Disable LOCKUP_DETECTOR when NO_HZ_FULL is enabled Steven Rostedt
2013-05-10 21:12 ` [PATCH 3/3] nohz: Warn if the machine can not perform nohz_full Steven Rostedt