From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
"dvyukov@google.com" <dvyukov@google.com>
Subject: Re: perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start
Date: Tue, 29 Nov 2016 02:28:18 -0800 [thread overview]
Message-ID: <20161129102817.GM3924@linux.vnet.ibm.com> (raw)
In-Reply-To: <20161129055241.6dy2dt4q4ptazk2s@treble>
On Mon, Nov 28, 2016 at 11:52:41PM -0600, Josh Poimboeuf wrote:
> On Mon, Nov 28, 2016 at 04:40:21PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 28, 2016 at 03:54:11PM -0600, Josh Poimboeuf wrote:
> > > On Thu, Nov 24, 2016 at 12:33:48PM -0500, Vince Weaver wrote:
> > > >
> > > > This is on a skylake machine, linus git as of yesterday after the various
> > > > kasan-related fixes went in. Not sure if there were any that hadn't hit
> > > > upstream yet.
> > > >
> > > > Anyway I can't tell from this one what the actual trigger is. After this
> > > > mess the fuzzer process was locked, udev started complaining, and it
> > > > eventually died completely after a few hours of repeated messages like
> > > > this.
> > > >
> > > > [38898.373183] INFO: rcu_sched self-detected stall on CPU
> > > > [38898.378452] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > > > [38898.381211] INFO: rcu_sched detected stalls on CPUs/tasks:
> > > > [38898.381214] 0-...: (1 GPs behind) idle=05f/140000000000001/2 softirq=3285458/3285459 fqs=2625
> > > > [38898.381217] 7-...: (5249 ticks this GP) idle=727/140000000000001/0 softirq=3141908/3141908 fqs=2625
> > > > [38898.381218] (detected by 1, t=5252 jiffies, g=3685053, c=3685052, q=32)
> > > > [38898.381244] ==================================================================
> > > > [38898.381247] BUG: KASAN: stack-out-of-bounds in __unwind_start+0x1a2/0x1c0 at addr ffff8801e9727c28
> > > > [38898.381248] Read of size 8 by task swapper/1/0
> > > > [38898.381250] page:ffffea0007a5c9c0 count:0 mapcount:0 mapping: (null) index:0x0
> > > > [38898.381251] flags: 0x2ffff8000000000()
> > > > [38898.381251] page dumped because: kasan: bad access detected
> > > > [38898.381328] Memory state around the buggy address:
> > > > [38898.381330] ffff8801e9727b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > [38898.381331] ffff8801e9727b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > [38898.381332] >ffff8801e9727c00: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 f3 f3 f3 f3
> > > > [38898.381333] ^
> > > > [38898.381334] ffff8801e9727c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
> > > > [38898.381335] ffff8801e9727d00: f1 f1 f1 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 f3
> > > > [38898.381335] ==================================================================
> > > > [38898.510702] (t=5284 jiffies g=3685053 c=3685052 q=32)
> > > >
> > > > (That's all, the report above repeats but no useful things like a
> > > > backtrace are ever printed)
> > >
> > > After looking at the RCU stall detection code, I think the KASAN error
> > > and missing stack dump aren't very surprising. RCU calls the scheduler
> > > dump_cpu_task() function, which seems inherently problematic: it tries
> > > to dump the stack of a task while it's running on another CPU.
> > >
> > > There are some issues with that:
> > >
> > > 1) There's no way to find the starting frame of a currently running task
> > > from another CPU.
> > >
> > > In fact, I'm wondering how dump_cpu_task() ever worked at all? It
> > > seems like you'd have to get lucky that the sp/bp registers stored by
> > > the last call to schedule() happen to point to a currently valid
> > > stack frame.
> > >
> > > 2) Even if there were a way to find the starting frame, it's racy
> > > because the target task could be overwriting the stack while we're
> > > reading it.
> > >
> > > 3) IRQ/exception stack dumps would be missing anyway because the stack
> > > dump code only looks at the current CPU's interrupt stacks.
> > >
> > > Maybe dump_cpu_task() should instead run the stack dump directly from
> > > the target CPU, e.g. with trigger_single_cpu_backtrace() or
> > > smp_call_function_single()?
> > >
> > > Paul, Peter, Ingo, any thoughts?
> >
> > We used to do that, but the resulting NMIs were problematic on some
> > platforms. Perhaps things have gotten better?
>
> Did a little digging on git blame and found the following commit (which
> seems to be the cause of the KASAN warning and missing stack dump):
>
> bc1dce514e9b ("rcu: Don't use NMIs to dump other CPUs' stacks")
>
> I presume this commit is still needed because of the NMI printk deadlock
> issues which were discussed at Kernel Summit. I guess those issues need
> to be sorted out before the above commit can be reverted.
Agreed -- it would be very good to revert that commit, but not until
it is safe to do so. ;-)
Thanx, Paul
next prev parent reply other threads:[~2016-11-29 10:28 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-24 17:33 perf: fuzzer BUG: KASAN: stack-out-of-bounds in __unwind_start Vince Weaver
2016-11-28 21:54 ` Josh Poimboeuf
2016-11-29 0:40 ` Paul E. McKenney
2016-11-29 5:52 ` Josh Poimboeuf
2016-11-29 9:16 ` Peter Zijlstra
2016-11-29 14:07 ` Paul E. McKenney
2016-11-29 15:09 ` Josh Poimboeuf
2016-11-29 16:12 ` Petr Mladek
2016-11-29 18:01 ` Paul E. McKenney
2016-11-29 16:51 ` Paul E. McKenney
2016-11-29 17:17 ` Josh Poimboeuf
2016-11-29 17:36 ` Paul E. McKenney
2016-11-30 9:29 ` Petr Mladek
2016-11-29 10:28 ` Paul E. McKenney [this message]
2016-11-29 12:43 ` Peter Zijlstra
2016-11-29 15:10 ` Paul E. McKenney
2016-11-29 16:29 ` Petr Mladek
2016-11-29 17:10 ` Peter Zijlstra
2016-11-29 19:39 ` Paul E. McKenney
2016-11-29 19:52 ` Peter Zijlstra
2016-11-29 20:07 ` Paul E. McKenney
2016-11-29 20:32 ` Paul E. McKenney
2016-11-30 19:13 ` Josh Poimboeuf
2016-11-30 19:49 ` Paul E. McKenney
2016-12-01 5:52 ` Peter Zijlstra
2016-12-01 12:33 ` Paul E. McKenney
2016-12-01 16:41 ` Peter Zijlstra
2016-12-01 17:00 ` Paul E. McKenney
2016-11-30 10:01 ` Petr Mladek
2016-11-30 11:06 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161129102817.GM3924@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=acme@kernel.org \
--cc=dvyukov@google.com \
--cc=jpoimboe@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=vincent.weaver@maine.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.