From: james.morse@arm.com (James Morse)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH] arm64: fix unwind_frame() for filtered out fn via set_graph_notrace
Date: Tue, 29 Aug 2017 18:03:30 +0100 [thread overview]
Message-ID: <59A59E62.7060009@arm.com> (raw)
In-Reply-To: <4a92b0bf5ee602de6463057e5d98d20b4d4f2bcd.1503924827.git.panand@redhat.com>
Hi Pratyush,
(Nit: get_maintainer.pl will give you the list of people to CC, you're reliant
on the maintainers being eagle-eyed to spot this!)
On 28/08/17 13:53, Pratyush Anand wrote:
> Testcase:
> cd /sys/kernel/debug/tracing/
> echo schedule > set_graph_notrace
> echo 1 > options/display-graph
> echo wakeup > current_tracer
> ps -ef | grep -i agent
>
> Above commands result in
> PANIC: "Unable to handle kernel paging request at virtual address
> ffff801bcbde7000"
This didn't panic for me, it just killed the running task. (I guess you have
panic-on-oops set)
> vmcore analysis:
> 1)
> crash> bt
> PID: 1561 TASK: ffff8003cb7e4000 CPU: 0 COMMAND: "ps"
> #0 [ffff8003c4ff77b0] crash_kexec at ffff00000816b9b8
> #1 [ffff8003c4ff77e0] die at ffff000008088b34
> #2 [ffff8003c4ff7820] __do_kernel_fault at ffff00000809b830
> #3 [ffff8003c4ff7850] do_bad_area at ffff000008098b90
> #4 [ffff8003c4ff7880] do_translation_fault at ffff0000087c6cdc
> #5 [ffff8003c4ff78b0] do_mem_abort at ffff000008081334
> #6 [ffff8003c4ff7ab0] el1_ia at ffff000008082cc0
> PC: ffff00000808811c [unwind_frame+300]
> LR: ffff0000080858a8 [get_wchan+212]
> SP: ffff8003c4ff7ab0 PSTATE: 60000145
> X29: ffff8003c4ff7ab0 X28: 0000000000000001 X27: 0000000000000000
> X26: 0000000000000000 X25: 0000000000000000 X24: 0000000000000000
> X23: ffff8003c1c20000 X22: ffff000008c73000 X21: ffff8003c1c1c000
> X20: 000000000000000f X19: ffff8003c1bc7000 X18: 0000000000000010
> X17: 0000000000000000 X16: 0000000000000001 X15: ffffffffffffffed
> X14: 0000000000000010 X13: ffffffffffffffff X12: 0000000000000004
> X11: 0000000000000000 X10: 0000000002dd14c0 X9: 1999999999999999
> X8: 000000000000003f X7: 00008003f71b0000 X6: 0000000000000018
> X5: 0000000000000000 X4: ffff8003c1c1fd20 X3: ffff8003c1c1fd10
> X2: 00000017ffe80000 X1: ffff8003c4ff7af8 X0: ffff8003cbf67000
> #7 [ffff8003c4ff7b20] do_task_stat at ffff000008304f0c
> #8 [ffff8003c4ff7c60] proc_tgid_stat at ffff000008305b48
> #9 [ffff8003c4ff7ca0] proc_single_show at ffff0000082fdd10
> #10 [ffff8003c4ff7ce0] seq_read at ffff0000082b27bc
> #11 [ffff8003c4ff7d70] __vfs_read at ffff000008289e54
> #12 [ffff8003c4ff7e30] vfs_read at ffff00000828b14c
> #13 [ffff8003c4ff7e70] sys_read at ffff00000828c2d0
> #14 [ffff8003c4ff7ed0] __sys_trace at ffff00000808349c
> PC: 00000006 LR: 00000000 SP: ffffffffffffffed PSTATE: 0000003f
> X12: 00000010 X11: ffffffffffffffff X10: 00000004 X9: ffff7febe8d0
> X8: 00000000 X7: 1999999999999999 X6: 0000003f X5: 0000000c
> X4: ffff7fce9c78 X3: 0000000c X2: 00000000 X1: 00000000
> X0: 00000400
>
> (2) Instruction at ffff00000808811c caused IA/DA.
>
> crash> dis -l ffff000008088108 6
> /usr/src/debug/xxxxxxxxxxxx/xxxxxxxxxx/arch/arm64/kernel/stacktrace.c:
> 84
> 0xffff000008088108 <unwind_frame+280>: ldr w2, [x1,#24]
> 0xffff00000808810c <unwind_frame+284>: sub w6, w2, #0x1
> 0xffff000008088110 <unwind_frame+288>: str w6, [x1,#24]
> 0xffff000008088114 <unwind_frame+292>: mov w6, #0x18 // #24
> 0xffff000008088118 <unwind_frame+296>: umull x2, w2, w6
> 0xffff00000808811c <unwind_frame+300>: ldr x0, [x0,x2]
>
> Corresponding c statement is
> frame->pc = tsk->ret_stack[frame->graph--].ret;
>
> (3) So, it caused data abort while executing
> 0xffff00000808811c <unwind_frame+300>: ldr x0, [x0,x2]
>
> x0 + x2 = ffff8003cbf67000 + 00000017ffe80000 = ffff801bcbde7000
> Access of ffff801bcbde7000 resulted in "Unable to handle kernel paging
> request"
>
> from above data: frame->graph = task->curr_ret_stack which should be,
> x2 / 0x18 = FFFF0000, which is -FTRACE_NOTRACE_DEPTH.
(0x18 is the size of struct ftrace_ret_stack for your config?)
> OK, so problem is here:
> do_task_stat() calls get_wchan(). Here p->curr_ret_stack is
> -FTRACE_NOTRACE_DEPTH in the failed case.
> It means, when do_task_stat()
> has been called for task A (ps in this case) on CPUm,task A was in mid
> of execution on CPUn,
get_wchan() on a running processes can't work: the stack may be modified while
we walk it.
>From arch/arm64/kernel/process.c::get_wchan():
> if (!p || p == current || p->state == TASK_RUNNING)
> return 0;
As far as I can see, if task A is running on CPU-N then CPU-Ms attempt to
get_wchan() it will return 0.
> and was in the mid of mcount() execution where
> curr_ret_stack had been decremented in ftrace_push_return_trace() for
> hitting schedule() function, but it was yet to be incremented in
> ftrace_return_to_handler().
So if the function-graph-tracer has hooked the return values on a stack and hit
a filtered function, (schedule() in your example), we can't unwind it as
ftrace_push_return_trace() has biased the index with a 'huge negative' offset to
flag it as 'this should be filtered out'.
The arm64 stack walker isn't aware of this and interprets it as an unsigned
index. Nasty!
> Similar problem we can have with calling of walk_stackframe() from
> save_stack_trace_tsk() or dump_backtrace().
>
> This patch fixes unwind_frame() to not to manipulate frame->pc for
Nit: strictly this is is 'restore frame->pc from ftrace's ret_stack', but I
think we still need to do this restore...
> function graph tracer if the function has been marked in
> set_graph_notrace.
Nit: ftrace describes these as 'filtered out', set_graph_notrace is the debugfs
interface.
> This patch fixes: 20380bb390a4 (arm64: ftrace: fix a stack tracer's
> output under function graph tracer)
> Signed-off-by: Pratyush Anand <panand@redhat.com>
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 09d37d66b630..e79035d673b3 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -74,7 +74,8 @@ int notrace unwind_frame(struct task_struct *tsk, struct
stackframe *frame)
>
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> if (tsk->ret_stack &&
> - (frame->pc == (unsigned long)return_to_handler)) {
> + (frame->pc == (unsigned long)return_to_handler) &&
> + (frame->graph > -1)) {
This should give you a compiler warning: its declared as an unsigned int in
struct stackframe.
... but with this patch /proc/<pid>/wchan now reports:
> cat /proc/1/wchan
> return_to_handler
So it looks like 'filtered out' doesn't mean 'not hooked'. I think a more
complete fix is to test if frame->graph is negative and add FTRACE_NOTRACE_DEPTH
back to it:
>From kernel/trace/ftrace_functions_graph.c::trace_pop_return_trace():
> index = current->curr_ret_stack;
>
> /*
> * A negative index here means that it's just returned from a
> * notrace'd function. Recover index to get an original
> * return address. See ftrace_push_return_trace().
> *
> * TODO: Need to check whether the stack gets corrupted.
> */
> if (index < 0)
> index += FTRACE_NOTRACE_DEPTH;
(looks like this is the only magic offset)
I don't think we need to preserve the vmcore debugging in the kernel log, could
you cut the commit message down to describe the negative curr_ret_stack being
interpreted as a signed value instead of skipped by unwind_frame(), then have
the reproducer and a chunk of the splat.
Thanks for getting to the bottom of this, it looks like this was a mammoth
debugging session!
James
WARNING: multiple messages have this Message-ID (diff)
From: James Morse <james.morse@arm.com>
To: Pratyush Anand <panand@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: fix unwind_frame() for filtered out fn via set_graph_notrace
Date: Tue, 29 Aug 2017 18:03:30 +0100 [thread overview]
Message-ID: <59A59E62.7060009@arm.com> (raw)
In-Reply-To: <4a92b0bf5ee602de6463057e5d98d20b4d4f2bcd.1503924827.git.panand@redhat.com>
Hi Pratyush,
(Nit: get_maintainer.pl will give you the list of people to CC, you're reliant
on the maintainers being eagle-eyed to spot this!)
On 28/08/17 13:53, Pratyush Anand wrote:
> Testcase:
> cd /sys/kernel/debug/tracing/
> echo schedule > set_graph_notrace
> echo 1 > options/display-graph
> echo wakeup > current_tracer
> ps -ef | grep -i agent
>
> Above commands result in
> PANIC: "Unable to handle kernel paging request at virtual address
> ffff801bcbde7000"
This didn't panic for me, it just killed the running task. (I guess you have
panic-on-oops set)
> vmcore analysis:
> 1)
> crash> bt
> PID: 1561 TASK: ffff8003cb7e4000 CPU: 0 COMMAND: "ps"
> #0 [ffff8003c4ff77b0] crash_kexec at ffff00000816b9b8
> #1 [ffff8003c4ff77e0] die at ffff000008088b34
> #2 [ffff8003c4ff7820] __do_kernel_fault at ffff00000809b830
> #3 [ffff8003c4ff7850] do_bad_area at ffff000008098b90
> #4 [ffff8003c4ff7880] do_translation_fault at ffff0000087c6cdc
> #5 [ffff8003c4ff78b0] do_mem_abort at ffff000008081334
> #6 [ffff8003c4ff7ab0] el1_ia at ffff000008082cc0
> PC: ffff00000808811c [unwind_frame+300]
> LR: ffff0000080858a8 [get_wchan+212]
> SP: ffff8003c4ff7ab0 PSTATE: 60000145
> X29: ffff8003c4ff7ab0 X28: 0000000000000001 X27: 0000000000000000
> X26: 0000000000000000 X25: 0000000000000000 X24: 0000000000000000
> X23: ffff8003c1c20000 X22: ffff000008c73000 X21: ffff8003c1c1c000
> X20: 000000000000000f X19: ffff8003c1bc7000 X18: 0000000000000010
> X17: 0000000000000000 X16: 0000000000000001 X15: ffffffffffffffed
> X14: 0000000000000010 X13: ffffffffffffffff X12: 0000000000000004
> X11: 0000000000000000 X10: 0000000002dd14c0 X9: 1999999999999999
> X8: 000000000000003f X7: 00008003f71b0000 X6: 0000000000000018
> X5: 0000000000000000 X4: ffff8003c1c1fd20 X3: ffff8003c1c1fd10
> X2: 00000017ffe80000 X1: ffff8003c4ff7af8 X0: ffff8003cbf67000
> #7 [ffff8003c4ff7b20] do_task_stat at ffff000008304f0c
> #8 [ffff8003c4ff7c60] proc_tgid_stat at ffff000008305b48
> #9 [ffff8003c4ff7ca0] proc_single_show at ffff0000082fdd10
> #10 [ffff8003c4ff7ce0] seq_read at ffff0000082b27bc
> #11 [ffff8003c4ff7d70] __vfs_read at ffff000008289e54
> #12 [ffff8003c4ff7e30] vfs_read at ffff00000828b14c
> #13 [ffff8003c4ff7e70] sys_read at ffff00000828c2d0
> #14 [ffff8003c4ff7ed0] __sys_trace at ffff00000808349c
> PC: 00000006 LR: 00000000 SP: ffffffffffffffed PSTATE: 0000003f
> X12: 00000010 X11: ffffffffffffffff X10: 00000004 X9: ffff7febe8d0
> X8: 00000000 X7: 1999999999999999 X6: 0000003f X5: 0000000c
> X4: ffff7fce9c78 X3: 0000000c X2: 00000000 X1: 00000000
> X0: 00000400
>
> (2) Instruction at ffff00000808811c caused IA/DA.
>
> crash> dis -l ffff000008088108 6
> /usr/src/debug/xxxxxxxxxxxx/xxxxxxxxxx/arch/arm64/kernel/stacktrace.c:
> 84
> 0xffff000008088108 <unwind_frame+280>: ldr w2, [x1,#24]
> 0xffff00000808810c <unwind_frame+284>: sub w6, w2, #0x1
> 0xffff000008088110 <unwind_frame+288>: str w6, [x1,#24]
> 0xffff000008088114 <unwind_frame+292>: mov w6, #0x18 // #24
> 0xffff000008088118 <unwind_frame+296>: umull x2, w2, w6
> 0xffff00000808811c <unwind_frame+300>: ldr x0, [x0,x2]
>
> Corresponding c statement is
> frame->pc = tsk->ret_stack[frame->graph--].ret;
>
> (3) So, it caused data abort while executing
> 0xffff00000808811c <unwind_frame+300>: ldr x0, [x0,x2]
>
> x0 + x2 = ffff8003cbf67000 + 00000017ffe80000 = ffff801bcbde7000
> Access of ffff801bcbde7000 resulted in "Unable to handle kernel paging
> request"
>
> from above data: frame->graph = task->curr_ret_stack which should be,
> x2 / 0x18 = FFFF0000, which is -FTRACE_NOTRACE_DEPTH.
(0x18 is the size of struct ftrace_ret_stack for your config?)
> OK, so problem is here:
> do_task_stat() calls get_wchan(). Here p->curr_ret_stack is
> -FTRACE_NOTRACE_DEPTH in the failed case.
> It means, when do_task_stat()
> has been called for task A (ps in this case) on CPUm,task A was in mid
> of execution on CPUn,
get_wchan() on a running processes can't work: the stack may be modified while
we walk it.
>From arch/arm64/kernel/process.c::get_wchan():
> if (!p || p == current || p->state == TASK_RUNNING)
> return 0;
As far as I can see, if task A is running on CPU-N then CPU-Ms attempt to
get_wchan() it will return 0.
> and was in the mid of mcount() execution where
> curr_ret_stack had been decremented in ftrace_push_return_trace() for
> hitting schedule() function, but it was yet to be incremented in
> ftrace_return_to_handler().
So if the function-graph-tracer has hooked the return values on a stack and hit
a filtered function, (schedule() in your example), we can't unwind it as
ftrace_push_return_trace() has biased the index with a 'huge negative' offset to
flag it as 'this should be filtered out'.
The arm64 stack walker isn't aware of this and interprets it as an unsigned
index. Nasty!
> Similar problem we can have with calling of walk_stackframe() from
> save_stack_trace_tsk() or dump_backtrace().
>
> This patch fixes unwind_frame() to not to manipulate frame->pc for
Nit: strictly this is is 'restore frame->pc from ftrace's ret_stack', but I
think we still need to do this restore...
> function graph tracer if the function has been marked in
> set_graph_notrace.
Nit: ftrace describes these as 'filtered out', set_graph_notrace is the debugfs
interface.
> This patch fixes: 20380bb390a4 (arm64: ftrace: fix a stack tracer's
> output under function graph tracer)
> Signed-off-by: Pratyush Anand <panand@redhat.com>
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 09d37d66b630..e79035d673b3 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -74,7 +74,8 @@ int notrace unwind_frame(struct task_struct *tsk, struct
stackframe *frame)
>
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> if (tsk->ret_stack &&
> - (frame->pc == (unsigned long)return_to_handler)) {
> + (frame->pc == (unsigned long)return_to_handler) &&
> + (frame->graph > -1)) {
This should give you a compiler warning: its declared as an unsigned int in
struct stackframe.
... but with this patch /proc/<pid>/wchan now reports:
> cat /proc/1/wchan
> return_to_handler
So it looks like 'filtered out' doesn't mean 'not hooked'. I think a more
complete fix is to test if frame->graph is negative and add FTRACE_NOTRACE_DEPTH
back to it:
>From kernel/trace/ftrace_functions_graph.c::trace_pop_return_trace():
> index = current->curr_ret_stack;
>
> /*
> * A negative index here means that it's just returned from a
> * notrace'd function. Recover index to get an original
> * return address. See ftrace_push_return_trace().
> *
> * TODO: Need to check whether the stack gets corrupted.
> */
> if (index < 0)
> index += FTRACE_NOTRACE_DEPTH;
(looks like this is the only magic offset)
I don't think we need to preserve the vmcore debugging in the kernel log, could
you cut the commit message down to describe the negative curr_ret_stack being
interpreted as a signed value instead of skipped by unwind_frame(), then have
the reproducer and a chunk of the splat.
Thanks for getting to the bottom of this, it looks like this was a mammoth
debugging session!
James
next prev parent reply other threads:[~2017-08-29 17:03 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-08-28 12:53 [PATCH] arm64: fix unwind_frame() for filtered out fn via set_graph_notrace Pratyush Anand
2017-08-28 12:53 ` Pratyush Anand
2017-08-29 17:03 ` James Morse [this message]
2017-08-29 17:03 ` James Morse
2017-08-30 12:59 ` Pratyush Anand
2017-08-30 12:59 ` Pratyush Anand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=59A59E62.7060009@arm.com \
--to=james.morse@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.