public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes
@ 2010-03-13  2:56 Steven Rostedt
  2010-03-13  2:56 ` [PATCH 1/5] ring-buffer: Move disabled check into preempt disable section Steven Rostedt
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-13  2:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan


Ingo,

Li Zefan and Lai Jiangshan told me they were finding various bugs
while running a stress test. I asked for this test, and Li sent it
to me. Running the test, I was also able to trigger the same bugs
they were seeing. Unfortunately, these bugs were very hard to figure
out, and it took me most of the week to fix them.

These bugs would most likely not happen in normal use, since they
all require multiple users writing to the control files of ftrace.
This stress test does just that: it creates several processes that
each write to a different part of the ftrace control system. One
resizes the ring buffer, another enables and disables various options,
another changes the ftrace plugins, another enables and disables
various events, another enables and disables the tracer, etc.

Normal use generally has a single user that would only do one of these
actions at a time. But that is no excuse for not fixing the issues
that this test uncovered, so I took this very seriously.

Even the bug addressed by "tracing: Disable buffer switching when
starting or stopping trace" may look like it could happen whenever the
wakeup tracer is running, but since opening the trace file causes the
wakeup tracer to stop, it actually required another task to change the
current tracer to the wakeup tracer while the trace file was being read.

In any event, I've been running Li's ftrace_stress_test for a few
hours now, and it seems pretty stable with these patches. I'll continue
to run it to see if it finds anything else.

I consider all of these patches urgent and have also Cc'd stable on them.

Please pull the latest tip/tracing/urgent tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
tip/tracing/urgent


Lai Jiangshan (1):
      ring-buffer: Move disabled check into preempt disable section

Steven Rostedt (4):
      function-graph: Init curr_ret_stack with ret_stack
      tracing: Use same local variable when resetting the ring buffer
      tracing: Disable buffer switching when starting or stopping trace
      tracing: Do not record user stack trace from NMI context

----
 kernel/trace/ftrace.c      |    2 +-
 kernel/trace/ring_buffer.c |   12 ++++++------
 kernel/trace/trace.c       |   24 ++++++++++++++++++++----
 3 files changed, 27 insertions(+), 11 deletions(-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/5] ring-buffer: Move disabled check into preempt disable section
  2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
@ 2010-03-13  2:56 ` Steven Rostedt
  2010-03-13  2:56 ` [PATCH 2/5] function-graph: Init curr_ret_stack with ret_stack Steven Rostedt
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-13  2:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan, stable

[-- Attachment #1: 0001-ring-buffer-Move-disabled-check-into-preempt-disable.patch --]
[-- Type: text/plain, Size: 1735 bytes --]

From: Lai Jiangshan <laijs@cn.fujitsu.com>

The ring buffer resizing and resetting rely on an RCU-sched grace
period: the buffers are disabled, a synchronize_sched() is called,
and then the resize or reset takes place.

But this only works if the check of the disabled flag happens within
the preempt-disabled section; otherwise a window exists in which the
buffers can be written to while a reset or resize takes place.

Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B949E43.2010906@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8c1b2d2..54191d6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2232,12 +2232,12 @@ ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length)
 	if (ring_buffer_flags != RB_BUFFERS_ON)
 		return NULL;
 
-	if (atomic_read(&buffer->record_disabled))
-		return NULL;
-
 	/* If we are tracing schedule, we don't want to recurse */
 	resched = ftrace_preempt_disable();
 
+	if (atomic_read(&buffer->record_disabled))
+		goto out_nocheck;
+
 	if (trace_recursive_lock())
 		goto out_nocheck;
 
@@ -2469,11 +2469,11 @@ int ring_buffer_write(struct ring_buffer *buffer,
 	if (ring_buffer_flags != RB_BUFFERS_ON)
 		return -EBUSY;
 
-	if (atomic_read(&buffer->record_disabled))
-		return -EBUSY;
-
 	resched = ftrace_preempt_disable();
 
+	if (atomic_read(&buffer->record_disabled))
+		goto out;
+
 	cpu = raw_smp_processor_id();
 
 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
-- 
1.7.0




* [PATCH 2/5] function-graph: Init curr_ret_stack with ret_stack
  2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
  2010-03-13  2:56 ` [PATCH 1/5] ring-buffer: Move disabled check into preempt disable section Steven Rostedt
@ 2010-03-13  2:56 ` Steven Rostedt
  2010-03-14 10:10   ` Frederic Weisbecker
  2010-03-13  2:56 ` [PATCH 3/5] tracing: Use same local variable when resetting the ring buffer Steven Rostedt
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-03-13  2:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan, stable

[-- Attachment #1: 0002-function-graph-Init-curr_ret_stack-with-ret_stack.patch --]
[-- Type: text/plain, Size: 1362 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

If the graph tracer is active and a task is forked but the allocation of
the process's graph stack fails, it can cause a crash later on.

This is due to the temporary stack being NULL while the curr_ret_stack
variable is copied from the parent. If it is not -1, then the following
loop in ftrace_graph_probe_sched_switch():

	for (index = next->curr_ret_stack; index >= 0; index--)
		next->ret_stack[index].calltime += timestamp;

will cause a kernel oops.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ftrace.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index d4d1238..bb53edb 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3349,6 +3349,7 @@ void ftrace_graph_init_task(struct task_struct *t)
 {
 	/* Make sure we do not use the parent ret_stack */
 	t->ret_stack = NULL;
+	t->curr_ret_stack = -1;
 
 	if (ftrace_graph_active) {
 		struct ftrace_ret_stack *ret_stack;
@@ -3358,7 +3359,6 @@ void ftrace_graph_init_task(struct task_struct *t)
 				GFP_KERNEL);
 		if (!ret_stack)
 			return;
-		t->curr_ret_stack = -1;
 		atomic_set(&t->tracing_graph_pause, 0);
 		atomic_set(&t->trace_overrun, 0);
 		t->ftrace_timestamp = 0;
-- 
1.7.0




* [PATCH 3/5] tracing: Use same local variable when resetting the ring buffer
  2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
  2010-03-13  2:56 ` [PATCH 1/5] ring-buffer: Move disabled check into preempt disable section Steven Rostedt
  2010-03-13  2:56 ` [PATCH 2/5] function-graph: Init curr_ret_stack with ret_stack Steven Rostedt
@ 2010-03-13  2:56 ` Steven Rostedt
  2010-03-13  2:56 ` [PATCH 4/5] tracing: Disable buffer switching when starting or stopping trace Steven Rostedt
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-13  2:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan, stable

[-- Attachment #1: 0003-tracing-Use-same-local-variable-when-resetting-the-r.patch --]
[-- Type: text/plain, Size: 1857 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

The ftrace code that resets the ring buffer references the buffer
with a local variable, but then passes tr->buffer as the parameter
to the reset. If the wakeup tracer is running, which can swap
tr->buffer with the max saved buffer, this can break the requirement
that the buffer be disabled before the reset.

   buffer = tr->buffer;
   ring_buffer_record_disable(buffer);
   synchronize_sched();
   __tracing_reset(tr->buffer, cpu);

If tr->buffer is swapped, then the reset does not happen to the
buffer that was disabled. This will cause the ring buffer to fail.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6af8d7b..60de37b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -840,10 +840,10 @@ out:
 	mutex_unlock(&trace_types_lock);
 }
 
-static void __tracing_reset(struct trace_array *tr, int cpu)
+static void __tracing_reset(struct ring_buffer *buffer, int cpu)
 {
 	ftrace_disable_cpu();
-	ring_buffer_reset_cpu(tr->buffer, cpu);
+	ring_buffer_reset_cpu(buffer, cpu);
 	ftrace_enable_cpu();
 }
 
@@ -855,7 +855,7 @@ void tracing_reset(struct trace_array *tr, int cpu)
 
 	/* Make sure all commits have finished */
 	synchronize_sched();
-	__tracing_reset(tr, cpu);
+	__tracing_reset(buffer, cpu);
 
 	ring_buffer_record_enable(buffer);
 }
@@ -873,7 +873,7 @@ void tracing_reset_online_cpus(struct trace_array *tr)
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		__tracing_reset(tr, cpu);
+		__tracing_reset(buffer, cpu);
 
 	ring_buffer_record_enable(buffer);
 }
-- 
1.7.0




* [PATCH 4/5] tracing: Disable buffer switching when starting or stopping trace
  2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
                   ` (2 preceding siblings ...)
  2010-03-13  2:56 ` [PATCH 3/5] tracing: Use same local variable when resetting the ring buffer Steven Rostedt
@ 2010-03-13  2:56 ` Steven Rostedt
  2010-03-13  2:57 ` [PATCH 5/5] tracing: Do not record user stack trace from NMI context Steven Rostedt
  2010-03-13  7:25 ` [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Ingo Molnar
  5 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-13  2:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan, stable

[-- Attachment #1: 0004-tracing-Disable-buffer-switching-when-starting-or-st.patch --]
[-- Type: text/plain, Size: 2271 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When the trace iterator is read, tracing_start() and tracing_stop()
are called to stop tracing while the iterator is processing the trace
output.

These functions disable both the standard buffer and the max latency
buffer. But if the wakeup tracer is running, it can switch these
buffers between the two disables:

  buffer = global_trace.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

      <<<--------- swap happens here

  buffer = max_tr.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

What happens is that the same buffer gets disabled twice; on
tracing_start(), the same buffer can likewise be enabled twice. Every
ring_buffer_record_disable() must be matched with a
ring_buffer_record_enable(), or a buffer can be disabled permanently
or enabled prematurely, causing a bug where a reset happens while a
trace is committing.

This patch protects these two by taking the ftrace_max_lock to prevent
a switch from occurring.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 60de37b..484337d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -950,6 +950,8 @@ void tracing_start(void)
 		goto out;
 	}
 
+	/* Prevent the buffers from switching */
+	arch_spin_lock(&ftrace_max_lock);
 
 	buffer = global_trace.buffer;
 	if (buffer)
@@ -959,6 +961,8 @@ void tracing_start(void)
 	if (buffer)
 		ring_buffer_record_enable(buffer);
 
+	arch_spin_unlock(&ftrace_max_lock);
+
 	ftrace_start();
  out:
 	spin_unlock_irqrestore(&tracing_start_lock, flags);
@@ -980,6 +984,9 @@ void tracing_stop(void)
 	if (trace_stop_count++)
 		goto out;
 
+	/* Prevent the buffers from switching */
+	arch_spin_lock(&ftrace_max_lock);
+
 	buffer = global_trace.buffer;
 	if (buffer)
 		ring_buffer_record_disable(buffer);
@@ -988,6 +995,8 @@ void tracing_stop(void)
 	if (buffer)
 		ring_buffer_record_disable(buffer);
 
+	arch_spin_unlock(&ftrace_max_lock);
+
  out:
 	spin_unlock_irqrestore(&tracing_start_lock, flags);
 }
-- 
1.7.0




* [PATCH 5/5] tracing: Do not record user stack trace from NMI context
  2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
                   ` (3 preceding siblings ...)
  2010-03-13  2:56 ` [PATCH 4/5] tracing: Disable buffer switching when starting or stopping trace Steven Rostedt
@ 2010-03-13  2:57 ` Steven Rostedt
  2010-03-14 10:27   ` Frederic Weisbecker
  2010-03-14 22:05   ` John Kacur
  2010-03-13  7:25 ` [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Ingo Molnar
  5 siblings, 2 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-13  2:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan, stable

[-- Attachment #1: 0005-tracing-Do-not-record-user-stack-trace-from-NMI-cont.patch --]
[-- Type: text/plain, Size: 3509 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening,
I finally noticed that all crashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip and see where the NMI happened. I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would hit at the place where a page
fault had occurred. It would call rcu_read_lock(), which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug was that
each execution of ls would cause lots of pages to be faulted in,
increasing the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 484337d..e52683f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
 	if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
 		return;
 
+	/*
+	 * NMIs can not handle page faults, even with fix ups.
+	 * The save user stack can (and often does) fault.
+	 */
+	if (unlikely(in_nmi()))
+		return;
+
 	event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
 					  sizeof(*entry), flags, pc);
 	if (!event)
-- 
1.7.0




* Re: [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes
  2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
                   ` (4 preceding siblings ...)
  2010-03-13  2:57 ` [PATCH 5/5] tracing: Do not record user stack trace from NMI context Steven Rostedt
@ 2010-03-13  7:25 ` Ingo Molnar
  5 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2010-03-13  7:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Andrew Morton, Frederic Weisbecker, Li Zefan,
	Lai Jiangshan


* Steven Rostedt <rostedt@goodmis.org> wrote:

> Ingo,
> 
> Li Zefan and Lai Jiangshan told me they were finding various bugs
> while running a stress test. I asked for this test, and Li sent it
> to me. Running the test, I was also able to trigger the same bugs
> they were seeing. Unfortunately, these bugs were very hard to figure
> out, and it took me most of the week to fix them.
> 
> These bugs would most likely not happen in normal use, since they
> all require multiple users writing to the control files of ftrace.
> This stress test does just that: it creates several processes that
> each write to a different part of the ftrace control system. One
> resizes the ring buffer, another enables and disables various options,
> another changes the ftrace plugins, another enables and disables
> various events, another enables and disables the tracer, etc.
> 
> Normal use generally has a single user that would only do one of these
> actions at a time. But that is no excuse for not fixing the issues
> that this test uncovered, so I took this very seriously.
> 
> Even the bug addressed by "tracing: Disable buffer switching when
> starting or stopping trace" may look like it could happen whenever the
> wakeup tracer is running, but since opening the trace file causes the
> wakeup tracer to stop, it actually required another task to change the
> current tracer to the wakeup tracer while the trace file was being read.
> 
> In any event, I've been running Li's ftrace_stress_test for a few
> hours now, and it seems pretty stable with these patches. I'll continue
> to run it to see if it finds anything else.
> 
> I consider all of these patches urgent and have also Cc'd stable on them.
> 
> Please pull the latest tip/tracing/urgent tree, which can be found at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> tip/tracing/urgent
> 
> 
> Lai Jiangshan (1):
>       ring-buffer: Move disabled check into preempt disable section
> 
> Steven Rostedt (4):
>       function-graph: Init curr_ret_stack with ret_stack
>       tracing: Use same local variable when resetting the ring buffer
>       tracing: Disable buffer switching when starting or stopping trace
>       tracing: Do not record user stack trace from NMI context
> 
> ----
>  kernel/trace/ftrace.c      |    2 +-
>  kernel/trace/ring_buffer.c |   12 ++++++------
>  kernel/trace/trace.c       |   24 ++++++++++++++++++++----
>  3 files changed, 27 insertions(+), 11 deletions(-)

Pulled, thanks a lot Steve!

	Ingo


* Re: [PATCH 2/5] function-graph: Init curr_ret_stack with ret_stack
  2010-03-13  2:56 ` [PATCH 2/5] function-graph: Init curr_ret_stack with ret_stack Steven Rostedt
@ 2010-03-14 10:10   ` Frederic Weisbecker
  0 siblings, 0 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2010-03-14 10:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Li Zefan, Lai Jiangshan,
	stable

On Fri, Mar 12, 2010 at 09:56:57PM -0500, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@redhat.com>
> 
> If the graph tracer is active and a task is forked but the allocation of
> the process's graph stack fails, it can cause a crash later on.
> 
> This is due to the temporary stack being NULL while the curr_ret_stack
> variable is copied from the parent. If it is not -1, then the following
> loop in ftrace_graph_probe_sched_switch():
> 
> 	for (index = next->curr_ret_stack; index >= 0; index--)
> 		next->ret_stack[index].calltime += timestamp;
> 
> will cause a kernel oops.
> 
> Found with Li Zefan's ftrace_stress_test.
> 
> Cc: stable@kernel.org
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>


Ah, well spotted!



> ---
>  kernel/trace/ftrace.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index d4d1238..bb53edb 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -3349,6 +3349,7 @@ void ftrace_graph_init_task(struct task_struct *t)
>  {
>  	/* Make sure we do not use the parent ret_stack */
>  	t->ret_stack = NULL;
> +	t->curr_ret_stack = -1;
>  
>  	if (ftrace_graph_active) {
>  		struct ftrace_ret_stack *ret_stack;
> @@ -3358,7 +3359,6 @@ void ftrace_graph_init_task(struct task_struct *t)
>  				GFP_KERNEL);
>  		if (!ret_stack)
>  			return;
> -		t->curr_ret_stack = -1;
>  		atomic_set(&t->tracing_graph_pause, 0);
>  		atomic_set(&t->trace_overrun, 0);
>  		t->ftrace_timestamp = 0;
> -- 
> 1.7.0
> 
> 



* Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context
  2010-03-13  2:57 ` [PATCH 5/5] tracing: Do not record user stack trace from NMI context Steven Rostedt
@ 2010-03-14 10:27   ` Frederic Weisbecker
  2010-03-14 15:28     ` Steven Rostedt
  2010-03-14 16:58     ` Steven Rostedt
  2010-03-14 22:05   ` John Kacur
  1 sibling, 2 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2010-03-14 10:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Li Zefan, Lai Jiangshan,
	stable

On Fri, Mar 12, 2010 at 09:57:00PM -0500, Steven Rostedt wrote:
> From: Steven Rostedt <srostedt@redhat.com>
> 
> A bug was found with Li Zefan's ftrace_stress_test that caused applications
> to segfault during the test.
> 
> Placing a tracing_off() in the segfault code, and examining several
> traces, I found that the following was always the case. The lock tracer
> was enabled (lockdep being required) and userstack was enabled. Testing
> this out, I just enabled the two, but that was not good enough. I needed
> to run something else that could trigger it. Running a load like hackbench
> did not work, but executing a new program would. The following would
> trigger the segfault within seconds:
> 
>   # echo 1 > /debug/tracing/options/userstacktrace
>   # echo 1 > /debug/tracing/events/lock/enable
>   # while :; do ls > /dev/null ; done
> 
> Enabling the function graph tracer and looking at what was happening,
> I finally noticed that all crashes happened just after an NMI.
> 
>  1)               |    copy_user_handle_tail() {
>  1)               |      bad_area_nosemaphore() {
>  1)               |        __bad_area_nosemaphore() {
>  1)               |          no_context() {
>  1)               |            fixup_exception() {
>  1)   0.319 us    |              search_exception_tables();
>  1)   0.873 us    |            }
> [...]
>  1)   0.314 us    |  __rcu_read_unlock();
>  1)   0.325 us    |    native_apic_mem_write();
>  1)   0.943 us    |  }
>  1)   0.304 us    |  rcu_nmi_exit();
> [...]
>  1)   0.479 us    |  find_vma();
>  1)               |  bad_area() {
>  1)               |    __bad_area() {
> 
> After capturing several traces of failures, all of them happened
> after an NMI. Curious about this, I added a trace_printk() to the NMI
> handler to read the regs->ip and see where the NMI happened. I
> found out it was here:
> 
> ffffffff8135b660 <page_fault>:
> ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
> ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>
> 
> What was happening is that the NMI would happen at the place that a page
> fault occurred. It would call rcu_read_lock() which was traced by
> the lock events, and the user_stack_trace would run. This would trigger
> a page fault inside the NMI. I do not see where the CR2 register is
> saved or restored in NMI handling. This means that it would corrupt
> the page fault handling that the NMI interrupted.
> 
> The reason the while loop of ls helped trigger the bug was that
> each execution of ls would cause lots of pages to be faulted in,
> increasing the chances of the race happening.
> 
> The simple solution is to not allow user stack traces in NMI context.
> After this patch, I ran the above "ls" test for a couple of hours
> without any issues. Without this patch, the bug would trigger in less
> than a minute.
> 
> Cc: stable@kernel.org
> Reported-by: Li Zefan <lizf@cn.fujitsu.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>



Wow, that's a race :)

In perf this is dealt with a special copy_from_user_nmi()
(see in arch/x86/kernel/cpu/perf_event.c)

Maybe save_stack_trace_user() should use that instead
of a __copy_from_user_inatomic()-based thing, just to
cover such an NMI corner race case.



> ---
>  kernel/trace/trace.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 484337d..e52683f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
>  	if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
>  		return;
>  
> +	/*
> +	 * NMIs can not handle page faults, even with fix ups.
> +	 * The save user stack can (and often does) fault.
> +	 */
> +	if (unlikely(in_nmi()))
> +		return;
> +
>  	event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
>  					  sizeof(*entry), flags, pc);
>  	if (!event)
> -- 
> 1.7.0
> 
> 



* Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context
  2010-03-14 10:27   ` Frederic Weisbecker
@ 2010-03-14 15:28     ` Steven Rostedt
  2010-03-14 16:58     ` Steven Rostedt
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-14 15:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Li Zefan, Lai Jiangshan,
	stable

On Sun, 2010-03-14 at 11:27 +0100, Frederic Weisbecker wrote:

> 
> Wow, that's a race :)
> 
> In perf this is dealt with a special copy_from_user_nmi()
> (see in arch/x86/kernel/cpu/perf_event.c)
> 
> Maybe save_stack_trace_user() should use that instead
> of a __copy_from_user_inatomic()-based thing, just to
> cover such an NMI corner race case.
> 

Yeah, we should move the __copy_from_user_nmi() out of the perf code and
into the normal uaccess code. Then we could do as you suggest, and have
the stack code do:

if (in_nmi())
	__copy_from_user_nmi();
else
	__copy_from_user_inatomic();

Or maybe it would be best to have the __copy_from_user_inatomic() handle
it.

-- Steve



* Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context
  2010-03-14 10:27   ` Frederic Weisbecker
  2010-03-14 15:28     ` Steven Rostedt
@ 2010-03-14 16:58     ` Steven Rostedt
  2010-03-17  2:08       ` Frederic Weisbecker
  1 sibling, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2010-03-14 16:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Li Zefan, Lai Jiangshan,
	stable

On Sun, 2010-03-14 at 11:27 +0100, Frederic Weisbecker wrote:

> 
> Wow, that's a race :)
> 
> In perf this is dealt with a special copy_from_user_nmi()
> (see in arch/x86/kernel/cpu/perf_event.c)
> 
> Maybe save_stack_trace_user() should use that instead
> of a __copy_from_user_inatomic()-based thing, just to
> cover such an NMI corner race case.
> 

Yeah, we should move the __copy_from_user_nmi() out of the perf code and
into the normal uaccess code. Then we could do as you suggest, and have
the stack code do:

if (in_nmi())
	__copy_from_user_nmi();
else
	__copy_from_user_inatomic();

Or maybe it would be best to have the __copy_from_user_inatomic() handle
it.

-- Steve





* Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI  context
  2010-03-13  2:57 ` [PATCH 5/5] tracing: Do not record user stack trace from NMI context Steven Rostedt
  2010-03-14 10:27   ` Frederic Weisbecker
@ 2010-03-14 22:05   ` John Kacur
  2010-03-14 22:29     ` Steven Rostedt
  1 sibling, 1 reply; 14+ messages in thread
From: John Kacur @ 2010-03-14 22:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Li Zefan, Lai Jiangshan, stable

On Sat, Mar 13, 2010 at 3:57 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> From: Steven Rostedt <srostedt@redhat.com>
>
> A bug was found with Li Zefan's ftrace_stress_test that caused applications
> to segfault during the test.
>
> Placing a tracing_off() in the segfault code, and examining several
> traces, I found that the following was always the case. The lock tracer
> was enabled (lockdep being required) and userstack was enabled. Testing
> this out, I just enabled the two, but that was not good enough. I needed
> to run something else that could trigger it. Running a load like hackbench
> did not work, but executing a new program would. The following would
> trigger the segfault within seconds:
>
>  # echo 1 > /debug/tracing/options/userstacktrace
>  # echo 1 > /debug/tracing/events/lock/enable
>  # while :; do ls > /dev/null ; done
>
> Enabling the function graph tracer and looking at what was happening,
> I finally noticed that all crashes happened just after an NMI.
>
>  1)               |    copy_user_handle_tail() {
>  1)               |      bad_area_nosemaphore() {
>  1)               |        __bad_area_nosemaphore() {
>  1)               |          no_context() {
>  1)               |            fixup_exception() {
>  1)   0.319 us    |              search_exception_tables();
>  1)   0.873 us    |            }
> [...]
>  1)   0.314 us    |  __rcu_read_unlock();
>  1)   0.325 us    |    native_apic_mem_write();
>  1)   0.943 us    |  }
>  1)   0.304 us    |  rcu_nmi_exit();
> [...]
>  1)   0.479 us    |  find_vma();
>  1)               |  bad_area() {
>  1)               |    __bad_area() {
>
> After capturing several traces of failures, all of them happened
> after an NMI. Curious about this, I added a trace_printk() to the NMI
> handler to read regs->ip and see where the NMI happened. I found
> that it was here:
>
> ffffffff8135b660 <page_fault>:
> ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
> ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>
>
> What was happening is that the NMI would fire at the point where a page
> fault had just occurred. It would call rcu_read_lock(), which was traced
> by the lock events, and the user stack trace would run. This would
> trigger a page fault inside the NMI. I do not see where the CR2 register
> is saved or restored in NMI handling. This means that the NMI would
> corrupt the page fault handling that it interrupted.
>
> The reason the while loop of ls helped trigger the bug was that
> each execution of ls would fault in lots of pages, increasing the
> chances of the race happening.
>
> The simple solution is to not allow user stack traces in NMI context.
> After this patch, I ran the above "ls" test for a couple of hours
> without any issues. Without this patch, the bug would trigger in less
> than a minute.
>
> Cc: stable@kernel.org
> Reported-by: Li Zefan <lizf@cn.fujitsu.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/trace.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 484337d..e52683f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
>        if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
>                return;
>
> +       /*
> +        * NMIs cannot handle page faults, even with fixups.
> +        * Saving the user stack can (and often does) fault.
> +        */
> +       if (unlikely(in_nmi()))
> +               return;
> +
>        event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
>                                          sizeof(*entry), flags, pc);
>        if (!event)
> --
> 1.7.0
>

That is some awfully cool detective work. Just one question, if it is
so easy to trigger, why did you wrap it in "unlikely"?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context
  2010-03-14 22:05   ` John Kacur
@ 2010-03-14 22:29     ` Steven Rostedt
  0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2010-03-14 22:29 UTC (permalink / raw)
  To: John Kacur
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Li Zefan, Lai Jiangshan, stable

On Sun, 2010-03-14 at 23:05 +0100, John Kacur wrote:
> On Sat, Mar 13, 2010 at 3:57 AM, Steven Rostedt <rostedt@goodmis.org> wrote:

> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 484337d..e52683f 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -1284,6 +1284,13 @@ ftrace_trace_userstack(struct ring_buffer *buffer, unsigned long flags, int pc)
> >        if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
> >                return;
> >
> > +       /*
> > +        * NMIs cannot handle page faults, even with fixups.
> > +        * Saving the user stack can (and often does) fault.
> > +        */
> > +       if (unlikely(in_nmi()))
> > +               return;
> > +
> >        event = trace_buffer_lock_reserve(buffer, TRACE_USER_STACK,
> >                                          sizeof(*entry), flags, pc);
> >        if (!event)
> > --
> > 1.7.0
> >
> 
> That is some awfully cool detective work. Just one question, if it is
> so easy to trigger, why did you wrap it in "unlikely"?

It was not easy to trigger until I found a case that could trigger it.
Currently, the only time this should trigger is for the lockdep tracing,
which only happens when lockdep is enabled. Most of the time, this should
never trigger.

-- Steve

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/5] tracing: Do not record user stack trace from NMI context
  2010-03-14 16:58     ` Steven Rostedt
@ 2010-03-17  2:08       ` Frederic Weisbecker
  0 siblings, 0 replies; 14+ messages in thread
From: Frederic Weisbecker @ 2010-03-17  2:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Li Zefan, Lai Jiangshan,
	stable

On Sun, Mar 14, 2010 at 12:58:40PM -0400, Steven Rostedt wrote:
> On Sun, 2010-03-14 at 11:27 +0100, Frederic Weisbecker wrote:
> 
> > 
> > Wow, that's a race :)
> > 
> > In perf this is dealt with using a special copy_from_user_nmi()
> > (see arch/x86/kernel/cpu/perf_event.c).
> > 
> > Maybe save_stack_trace_user() should use that instead
> > of a __copy_from_user_inatomic() based thing, just to
> > cover such an NMI corner-case race.
> > 
> 
> Yeah, we should move the __copy_from_user_nmi() out of the perf code and
> into the normal uaccess code. Then we could do as you suggest, and have
> the stack code do:
> 
> if (in_nmi())
> 	__copy_from_user_nmi();
> else
> 	__copy_from_user_inatomic();
> 
> Or maybe it would be best to have the __copy_from_user_inatomic() handle
> it.


Yeah, agreed.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-03-17  2:08 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-13  2:56 [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Steven Rostedt
2010-03-13  2:56 ` [PATCH 1/5] ring-buffer: Move disabled check into preempt disable section Steven Rostedt
2010-03-13  2:56 ` [PATCH 2/5] function-graph: Init curr_ret_stack with ret_stack Steven Rostedt
2010-03-14 10:10   ` Frederic Weisbecker
2010-03-13  2:56 ` [PATCH 3/5] tracing: Use same local variable when resetting the ring buffer Steven Rostedt
2010-03-13  2:56 ` [PATCH 4/5] tracing: Disable buffer switching when starting or stopping trace Steven Rostedt
2010-03-13  2:57 ` [PATCH 5/5] tracing: Do not record user stack trace from NMI context Steven Rostedt
2010-03-14 10:27   ` Frederic Weisbecker
2010-03-14 15:28     ` Steven Rostedt
2010-03-14 16:58     ` Steven Rostedt
2010-03-17  2:08       ` Frederic Weisbecker
2010-03-14 22:05   ` John Kacur
2010-03-14 22:29     ` Steven Rostedt
2010-03-13  7:25 ` [PATCH 0/5][GIT PULL][2.6.34] tracing: urgent fixes Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox