From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Dmitry Vyukov <dvyukov@google.com>
Cc: josh@joshtriplett.org, Steven Rostedt <rostedt@goodmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
jiangshanlai@gmail.com, LKML <linux-kernel@vger.kernel.org>,
syzkaller <syzkaller@googlegroups.com>
Subject: Re: rcu: WARNING in rcu_seq_end
Date: Tue, 7 Mar 2017 07:16:37 -0800 [thread overview]
Message-ID: <20170307151637.GL30506@linux.vnet.ibm.com> (raw)
In-Reply-To: <CACT4Y+bW13N50TuzCr68VbYRQxmzW0xhi5p0cezit1GQHOWAfQ@mail.gmail.com>
On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote:
> On Tue, Mar 7, 2017 at 12:08 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> >> >> Hello,
> >> >> >> >>
> >> >> >> >> Paul, you wanted bugs in rcu.
> >> >> >> >
> >> >> >> > Well, whether I want them or not, I must deal with them. ;-)
> >> >> >> >
> >> >> >> >> I've got this WARNING while running syzkaller fuzzer on
> >> >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760:
> >> >> >> >>
> >> >> >> >> ------------[ cut here ]------------
> >> >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
> >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
> >> >> >> >> Kernel panic - not syncing: panic_on_warn set ...
> >> >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
> >> >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >> >> >> >> Workqueue: events wait_rcu_exp_gp
> >> >> >> >> Call Trace:
> >> >> >> >> __dump_stack lib/dump_stack.c:15 [inline]
> >> >> >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
> >> >> >> >> panic+0x1fb/0x412 kernel/panic.c:179
> >> >> >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540
> >> >> >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
> >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
> >> >> >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
> >> >> >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
> >> >> >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
> >> >> >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
> >> >> >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
> >> >> >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
> >> >> >> >> kthread+0x326/0x3f0 kernel/kthread.c:227
> >> >> >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
> >> >> >> >> Dumping ftrace buffer:
> >> >> >> >> (ftrace buffer empty)
> >> >> >> >> Kernel Offset: disabled
> >> >> >> >> Rebooting in 86400 seconds..
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Not reproducible. But looking at the code, shouldn't it be:
> >> >> >> >>
> >> >> >> >> static void rcu_seq_end(unsigned long *sp)
> >> >> >> >> {
> >> >> >> >> smp_mb(); /* Ensure update-side operation before counter increment. */
> >> >> >> >> + WARN_ON_ONCE(!(*sp & 0x1));
> >> >> >> >> WRITE_ONCE(*sp, *sp + 1);
> >> >> >> >> - WARN_ON_ONCE(*sp & 0x1);
> >> >> >> >> }
> >> >> >> >>
> >> >> >> >> ?
> >> >> >> >>
> >> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon
> >> >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this
> >> >> >> >> consequently can allow start of next grace periods. Which in turn can
> >> >> >> >> make the warning fire. Am I missing something?
> >> >> >> >>
> >> >> >> >> I don't see any other bad consequences of this. The rest of
> >> >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has
> >> >> >> >> returned and destroyed work on stack and next period has started and
> >> >> >> >> ended, but it seems OK.
> >> >> >> >
> >> >> >> > I believe that this is a heygood change, but I don't see how it will
> >> >> >> > help in this case. BTW, may I have your Signed-off-by?
> >> >> >> >
> >> >> >> > The reason I don't believe that it will help is that the
> >> >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that
> >> >> >> > is invoked holding ->exp_mutex, and this mutex is not released until
> >> >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that
> >> >> >> > scheduled the workqueue handler. So the ordering above should not matter
> >> >> >> > (but I agree that your ordering is cleaner.
> >> >> >> >
> >> >> >> > That said, it looks like I am missing some memory barriers, please
> >> >> >> > see the following patch.
> >> >> >> >
> >> >> >> > But what architecture did you see this on?
> >> >> >>
> >> >> >>
> >> >> >> This is just x86.
> >> >> >>
> >> >> >> You seem to assume that wait_event() waits for the wakeup. It does not
> >> >> >> work this way. It can return as soon as the condition becomes true
> >> >> >> without ever waiting:
> >> >> >>
> >> >> >> 305 #define wait_event(wq, condition) \
> >> >> >> 306 do { \
> >> >> >> 307 might_sleep(); \
> >> >> >> 308 if (condition) \
> >> >> >> 309 break; \
> >> >> >> 310 __wait_event(wq, condition); \
> >> >> >> 311 } while (0)
> >> >> >
> >> >> > Agreed, hence my patch in the previous email. I guess I knew that, but
> >> >>
> >> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done?
> >> >
> >> > No, there is a mutex release and acquisition that do the synchronization,
> >> > but only if the wakeup has appropriate barriers. The issue is that
> >> > part of the mutex's critical section executes in a workqueue, possibly
> >> > on some other CPU.
> >>
> >> What is that mutex? And what locks/unlocks provide synchronization? I
> >> see that one uses exp_mutex and another -- exp_wake_mutex.
> >
> > Both of them.
> >
> > ->exp_mutex is acquired by the task requesting the grace period, and
> > the counter's first increment is done by that task under that mutex.
> > This task then schedules a workqueue, which drives forward the grace
> > period. Upon grace-period completion, the workqueue handler does the
> > second increment (the one that your patch addressed). The workqueue
> > handler then acquires ->exp_wake_mutex and wakes the task that holds
> > ->exp_mutex (along with all other tasks waiting for this grace period),
> > and that task releases ->exp_mutex, which allows the next grace period to
> > start (and the first increment for that next grace period to be carried
> > out under that lock). The workqueue handler releases ->exp_wake_mutex
> > after finishing its wakeups.
First and foremost, thank you for reviewing this stuff!!!
> Then we need the following for the case when task requesting the grace
> period does not block, right?
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d80c2587bed8..aa7ba83f6a56 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp)
> static void rcu_seq_end(unsigned long *sp)
> {
> smp_mb(); /* Ensure update-side operation before counter increment. */
> - WRITE_ONCE(*sp, *sp + 1);
> + smp_store_release(sp, *sp + 1);
> WARN_ON_ONCE(*sp & 0x1);
> }
The above is not needed, as the smp_mb() covers it completely.
> @@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp)
> */
> static bool rcu_seq_done(unsigned long *sp, unsigned long s)
> {
> - return ULONG_CMP_GE(READ_ONCE(*sp), s);
> + return ULONG_CMP_GE(smp_load_acquire(sp), s);
And this one is covered by the smp_mb__before_atomic() inside the "if"
statement in sync_exp_work_done(), via rcu_exp_gp_seq_done(). That is,
sync_exp_work_done() invokes rcu_exp_gp_seq_done() which in turn invokes
rcu_seq_done().
Why smp_mb() instead of release-acquire? Because the grace-period
relationship is quite strong, it cannot be implemented with
release-acquire on all platforms.
Thanx, Paul
prev parent reply other threads:[~2017-03-07 19:18 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-04 16:01 rcu: WARNING in rcu_seq_end Dmitry Vyukov
2017-03-04 20:40 ` Paul E. McKenney
2017-03-05 10:50 ` Dmitry Vyukov
2017-03-05 18:47 ` Paul E. McKenney
2017-03-06 9:24 ` Dmitry Vyukov
2017-03-06 10:07 ` Paul E. McKenney
2017-03-06 10:11 ` Dmitry Vyukov
2017-03-06 23:08 ` Paul E. McKenney
2017-03-07 7:05 ` Dmitry Vyukov
2017-03-07 14:27 ` Boqun Feng
2017-03-07 14:43 ` Dmitry Vyukov
2017-03-07 15:27 ` Paul E. McKenney
2017-03-07 18:37 ` Dmitry Vyukov
2017-03-07 19:09 ` Paul E. McKenney
2017-03-07 23:05 ` Boqun Feng
2017-03-07 23:31 ` Paul E. McKenney
2017-03-08 1:39 ` Boqun Feng
2017-03-08 2:26 ` Paul E. McKenney
2017-03-08 2:44 ` Boqun Feng
2017-03-08 3:08 ` Paul E. McKenney
2017-03-07 15:16 ` Paul E. McKenney [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170307151637.GL30506@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=dvyukov@google.com \
--cc=jiangshanlai@gmail.com \
--cc=josh@joshtriplett.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=rostedt@goodmis.org \
--cc=syzkaller@googlegroups.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.