* rcu: WARNING in rcu_seq_end
@ 2017-03-04 16:01 Dmitry Vyukov
  2017-03-04 20:40 ` Paul E. McKenney
  0 siblings, 1 reply; 21+ messages in thread
From: Dmitry Vyukov @ 2017-03-04 16:01 UTC (permalink / raw)
  To: Paul McKenney, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML
  Cc: syzkaller

Hello,

Paul, you wanted bugs in rcu.

I've got this WARNING while running syzkaller fuzzer on
86292b33d4b79ee03e2f43ea0381ef85f077c760:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: events wait_rcu_exp_gp
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
 panic+0x1fb/0x412 kernel/panic.c:179
 __warn+0x1c4/0x1e0 kernel/panic.c:540
 warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
 rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
 rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
 rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
 rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
 wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
 process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
 worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
 kthread+0x326/0x3f0 kernel/kthread.c:227
 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

Not reproducible. But looking at the code, shouldn't it be:

static void rcu_seq_end(unsigned long *sp)
{
	smp_mb(); /* Ensure update-side operation before counter increment. */
+	WARN_ON_ONCE(!(*sp & 0x1));
	WRITE_ONCE(*sp, *sp + 1);
-	WARN_ON_ONCE(*sp & 0x1);
}

?

Otherwise wait_event in _synchronize_rcu_expedited can return as soon
as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this
consequently can allow start of next grace periods. Which in turn can
make the warning fire. Am I missing something?

I don't see any other bad consequences of this. The rest of
rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has
returned and destroyed work on stack and next period has started and
ended, but it seems OK.

^ permalink raw reply	[flat|nested] 21+ messages in thread
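The interleaving described above is easiest to see in a stand-alone sketch.
The following user-space analogue is illustrative only (the names and the
C11 atomics are stand-ins invented for this sketch, not kernel code): the
main thread plays the workqueue handler running rcu_seq_end(), and the
second thread plays a requester that returns from wait_event() the moment
the counter goes even and immediately starts the next grace period.

/*
 * Illustrative user-space analogue of the race; "seq" stands in for
 * rsp->expedited_sequence and seq_end() for rcu_seq_end().
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong seq;

static void seq_end(void)
{
	/* Counter goes even here; a waiter polling it may proceed now. */
	atomic_fetch_add(&seq, 1);
	/* If the next "grace period" has already started, seq is odd again. */
	if (atomic_load(&seq) & 0x1)
		fprintf(stderr, "WARN: sequence odd after seq_end()\n");
}

static void *waiter(void *arg)
{
	(void)arg;
	/* wait_event(..., rcu_seq_done(...)): return once the count is even. */
	while (atomic_load(&seq) & 0x1)
		;
	/* Next grace period starts: the rcu_seq_start() analogue, odd again. */
	atomic_fetch_add(&seq, 1);
	return NULL;
}

int main(void)
{
	pthread_t t;

	atomic_store(&seq, 1);			/* a grace period in flight */
	pthread_create(&t, NULL, waiter, NULL);
	seq_end();				/* the workqueue handler side */
	pthread_join(t, NULL);
	return 0;
}

If the waiter's increment lands between the main thread's increment and its
check, the warning fires even though the grace-period machinery did nothing
wrong, which is the motivation for moving the WARN_ON_ONCE() before the
increment in the patch above.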
* Re: rcu: WARNING in rcu_seq_end 2017-03-04 16:01 rcu: WARNING in rcu_seq_end Dmitry Vyukov @ 2017-03-04 20:40 ` Paul E. McKenney 2017-03-05 10:50 ` Dmitry Vyukov 0 siblings, 1 reply; 21+ messages in thread From: Paul E. McKenney @ 2017-03-04 20:40 UTC (permalink / raw) To: Dmitry Vyukov Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: > Hello, > > Paul, you wanted bugs in rcu. Well, whether I want them or not, I must deal with them. ;-) > I've got this WARNING while running syzkaller fuzzer on > 86292b33d4b79ee03e2f43ea0381ef85f077c760: > > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 > rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > Kernel panic - not syncing: panic_on_warn set ... > CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > Workqueue: events wait_rcu_exp_gp > Call Trace: > __dump_stack lib/dump_stack.c:15 [inline] > dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 > panic+0x1fb/0x412 kernel/panic.c:179 > __warn+0x1c4/0x1e0 kernel/panic.c:540 > warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 > rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] > rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 > rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] > wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 > process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 > worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 > kthread+0x326/0x3f0 kernel/kthread.c:227 > ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 > Dumping ftrace buffer: > (ftrace buffer empty) > Kernel Offset: disabled > Rebooting in 86400 seconds.. > > > Not reproducible. But looking at the code, shouldn't it be: > > static void rcu_seq_end(unsigned long *sp) > { > smp_mb(); /* Ensure update-side operation before counter increment. */ > + WARN_ON_ONCE(!(*sp & 0x1)); > WRITE_ONCE(*sp, *sp + 1); > - WARN_ON_ONCE(*sp & 0x1); > } > > ? > > Otherwise wait_event in _synchronize_rcu_expedited can return as soon > as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this > consequently can allow start of next grace periods. Which in turn can > make the warning fire. Am I missing something? > > I don't see any other bad consequences of this. The rest of > rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has > returned and destroyed work on stack and next period has started and > ended, but it seems OK. I believe that this is a heygood change, but I don't see how it will help in this case. BTW, may I have your Signed-off-by? The reason I don't believe that it will help is that the rcu_exp_gp_seq_end() function is called from a workqueue handler that is invoked holding ->exp_mutex, and this mutex is not released until after the handler invokes rcu_seq_end() and then wakes up the task that scheduled the workqueue handler. So the ordering above should not matter (but I agree that your ordering is cleaner. That said, it looks like I am missing some memory barriers, please see the following patch. But what architecture did you see this on? Thanx, Paul ------------------------------------------------------------------------ commit 1038d97908afb04750d0775748e2165a6e92d851 Author: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com>
Date:   Sat Mar 4 12:33:53 2017 -0800

    rcu: Expedited wakeups need to be fully ordered

    Expedited grace periods use workqueue handlers that wake up the
    requesters, but there is no lock mediating this wakeup.  Therefore,
    memory barriers are required to ensure that the handler's memory
    references are seen by all to occur before synchronize_*_expedited()
    returns to its caller.  Possibly detected by syzkaller.

    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 51ca287828a2..027e123d93c7 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -533,6 +533,7 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
 			rnp->exp_seq_rq = s;
 			spin_unlock(&rnp->exp_lock);
 		}
+		smp_mb(); /* All above changes before wakeup. */
 		wake_up_all(&rnp->exp_wq[(rsp->expedited_sequence >> 1) & 0x3]);
 	}
 	trace_rcu_exp_grace_period(rsp->name, s, TPS("endwake"));
@@ -614,6 +615,7 @@ static void _synchronize_rcu_expedited(struct rcu_state *rsp,
 	wait_event(rnp->exp_wq[(s >> 1) & 0x3],
 		   sync_exp_work_done(rsp, &rdp->exp_workdone0, s));
+	smp_mb(); /* Workqueue actions happen before return. */

 	/* Let the next expedited grace period start. */
 	mutex_unlock(&rsp->exp_mutex);

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-04 20:40 ` Paul E. McKenney @ 2017-03-05 10:50 ` Dmitry Vyukov 2017-03-05 18:47 ` Paul E. McKenney 0 siblings, 1 reply; 21+ messages in thread From: Dmitry Vyukov @ 2017-03-05 10:50 UTC (permalink / raw) To: Paul McKenney Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: >> Hello, >> >> Paul, you wanted bugs in rcu. > > Well, whether I want them or not, I must deal with them. ;-) > >> I've got this WARNING while running syzkaller fuzzer on >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: >> >> ------------[ cut here ]------------ >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> Kernel panic - not syncing: panic_on_warn set ... >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> Workqueue: events wait_rcu_exp_gp >> Call Trace: >> __dump_stack lib/dump_stack.c:15 [inline] >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 >> panic+0x1fb/0x412 kernel/panic.c:179 >> __warn+0x1c4/0x1e0 kernel/panic.c:540 >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 >> kthread+0x326/0x3f0 kernel/kthread.c:227 >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 >> Dumping ftrace buffer: >> (ftrace buffer empty) >> Kernel Offset: disabled >> Rebooting in 86400 seconds.. >> >> >> Not reproducible. But looking at the code, shouldn't it be: >> >> static void rcu_seq_end(unsigned long *sp) >> { >> smp_mb(); /* Ensure update-side operation before counter increment. */ >> + WARN_ON_ONCE(!(*sp & 0x1)); >> WRITE_ONCE(*sp, *sp + 1); >> - WARN_ON_ONCE(*sp & 0x1); >> } >> >> ? >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this >> consequently can allow start of next grace periods. Which in turn can >> make the warning fire. Am I missing something? >> >> I don't see any other bad consequences of this. The rest of >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has >> returned and destroyed work on stack and next period has started and >> ended, but it seems OK. > > I believe that this is a heygood change, but I don't see how it will > help in this case. BTW, may I have your Signed-off-by? > > The reason I don't believe that it will help is that the > rcu_exp_gp_seq_end() function is called from a workqueue handler that > is invoked holding ->exp_mutex, and this mutex is not released until > after the handler invokes rcu_seq_end() and then wakes up the task that > scheduled the workqueue handler. So the ordering above should not matter > (but I agree that your ordering is cleaner. > > That said, it looks like I am missing some memory barriers, please > see the following patch. > > But what architecture did you see this on? This is just x86. You seem to assume that wait_event() waits for the wakeup. It does not work this way. 
It can return as soon as the condition becomes true
without ever waiting:

305 #define wait_event(wq, condition)				\
306 do {							\
307 	might_sleep();					\
308 	if (condition)					\
309 		break;					\
310 	__wait_event(wq, condition);			\
311 } while (0)

Mailed a signed patch:
https://groups.google.com/d/msg/syzkaller/XzUXuAzKkCw/5054wU9MEAAJ

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-05 10:50 ` Dmitry Vyukov @ 2017-03-05 18:47 ` Paul E. McKenney 2017-03-06 9:24 ` Dmitry Vyukov 0 siblings, 1 reply; 21+ messages in thread From: Paul E. McKenney @ 2017-03-05 18:47 UTC (permalink / raw) To: Dmitry Vyukov Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote: > On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: > >> Hello, > >> > >> Paul, you wanted bugs in rcu. > > > > Well, whether I want them or not, I must deal with them. ;-) > > > >> I've got this WARNING while running syzkaller fuzzer on > >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: > >> > >> ------------[ cut here ]------------ > >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 > >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> Kernel panic - not syncing: panic_on_warn set ... > >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 > >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > >> Workqueue: events wait_rcu_exp_gp > >> Call Trace: > >> __dump_stack lib/dump_stack.c:15 [inline] > >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 > >> panic+0x1fb/0x412 kernel/panic.c:179 > >> __warn+0x1c4/0x1e0 kernel/panic.c:540 > >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 > >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] > >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 > >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] > >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 > >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 > >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 > >> kthread+0x326/0x3f0 kernel/kthread.c:227 > >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 > >> Dumping ftrace buffer: > >> (ftrace buffer empty) > >> Kernel Offset: disabled > >> Rebooting in 86400 seconds.. > >> > >> > >> Not reproducible. But looking at the code, shouldn't it be: > >> > >> static void rcu_seq_end(unsigned long *sp) > >> { > >> smp_mb(); /* Ensure update-side operation before counter increment. */ > >> + WARN_ON_ONCE(!(*sp & 0x1)); > >> WRITE_ONCE(*sp, *sp + 1); > >> - WARN_ON_ONCE(*sp & 0x1); > >> } > >> > >> ? > >> > >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon > >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this > >> consequently can allow start of next grace periods. Which in turn can > >> make the warning fire. Am I missing something? > >> > >> I don't see any other bad consequences of this. The rest of > >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has > >> returned and destroyed work on stack and next period has started and > >> ended, but it seems OK. > > > > I believe that this is a heygood change, but I don't see how it will > > help in this case. BTW, may I have your Signed-off-by? > > > > The reason I don't believe that it will help is that the > > rcu_exp_gp_seq_end() function is called from a workqueue handler that > > is invoked holding ->exp_mutex, and this mutex is not released until > > after the handler invokes rcu_seq_end() and then wakes up the task that > > scheduled the workqueue handler. So the ordering above should not matter > > (but I agree that your ordering is cleaner. 
> > > > That said, it looks like I am missing some memory barriers, please > > see the following patch. > > > > But what architecture did you see this on? > > > This is just x86. > > You seem to assume that wait_event() waits for the wakeup. It does not > work this way. It can return as soon as the condition becomes true > without ever waiting: > > 305 #define wait_event(wq, condition) \ > 306 do { \ > 307 might_sleep(); \ > 308 if (condition) \ > 309 break; \ > 310 __wait_event(wq, condition); \ > 311 } while (0) Agreed, hence my patch in the previous email. I guess I knew that, but on the day I wrote that code, my fingers didn't. Or somew similar lame excuse. ;-) > Mailed a signed patch: > https://groups.google.com/d/msg/syzkaller/XzUXuAzKkCw/5054wU9MEAAJ This is the patch you also sent by email, that moves the WARN_ON_ONCE(), thank you! Thanx, Paul > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-05 18:47 ` Paul E. McKenney @ 2017-03-06 9:24 ` Dmitry Vyukov 2017-03-06 10:07 ` Paul E. McKenney 0 siblings, 1 reply; 21+ messages in thread From: Dmitry Vyukov @ 2017-03-06 9:24 UTC (permalink / raw) To: Paul McKenney Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Sun, Mar 5, 2017 at 7:47 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote: >> On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: >> >> Hello, >> >> >> >> Paul, you wanted bugs in rcu. >> > >> > Well, whether I want them or not, I must deal with them. ;-) >> > >> >> I've got this WARNING while running syzkaller fuzzer on >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: >> >> >> >> ------------[ cut here ]------------ >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> >> Kernel panic - not syncing: panic_on_warn set ... >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> >> Workqueue: events wait_rcu_exp_gp >> >> Call Trace: >> >> __dump_stack lib/dump_stack.c:15 [inline] >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 >> >> panic+0x1fb/0x412 kernel/panic.c:179 >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540 >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 >> >> kthread+0x326/0x3f0 kernel/kthread.c:227 >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 >> >> Dumping ftrace buffer: >> >> (ftrace buffer empty) >> >> Kernel Offset: disabled >> >> Rebooting in 86400 seconds.. >> >> >> >> >> >> Not reproducible. But looking at the code, shouldn't it be: >> >> >> >> static void rcu_seq_end(unsigned long *sp) >> >> { >> >> smp_mb(); /* Ensure update-side operation before counter increment. */ >> >> + WARN_ON_ONCE(!(*sp & 0x1)); >> >> WRITE_ONCE(*sp, *sp + 1); >> >> - WARN_ON_ONCE(*sp & 0x1); >> >> } >> >> >> >> ? >> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this >> >> consequently can allow start of next grace periods. Which in turn can >> >> make the warning fire. Am I missing something? >> >> >> >> I don't see any other bad consequences of this. The rest of >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has >> >> returned and destroyed work on stack and next period has started and >> >> ended, but it seems OK. >> > >> > I believe that this is a heygood change, but I don't see how it will >> > help in this case. BTW, may I have your Signed-off-by? >> > >> > The reason I don't believe that it will help is that the >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that >> > is invoked holding ->exp_mutex, and this mutex is not released until >> > after the handler invokes rcu_seq_end() and then wakes up the task that >> > scheduled the workqueue handler. 
>> > So the ordering above should not matter
>> > (but I agree that your ordering is cleaner.
>> >
>> > That said, it looks like I am missing some memory barriers, please
>> > see the following patch.
>> >
>> > But what architecture did you see this on?
>>
>>
>> This is just x86.
>>
>> You seem to assume that wait_event() waits for the wakeup. It does not
>> work this way. It can return as soon as the condition becomes true
>> without ever waiting:
>>
>> 305 #define wait_event(wq, condition)				\
>> 306 do {							\
>> 307 	might_sleep();					\
>> 308 	if (condition)					\
>> 309 		break;					\
>> 310 	__wait_event(wq, condition);			\
>> 311 } while (0)
>
> Agreed, hence my patch in the previous email.  I guess I knew that, but

Ah, you meant to synchronize rcu_seq_end with rcu_seq_done?
I think you placed the barrier incorrectly for that. rcu_exp_wait_wake
is already too late. The write that unblocks the waiter is in rcu_seq_end,
so you need a release barrier _before_ that write.
Also, can we please start using smp_load_acquire/smp_store_release where
they are just what the doctor ordered. They are faster, more readable,
better for race detectors _and_ would prevent you from introducing this
bug, because you would need to find the exact write that signifies
completion. I.e.:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d80c2587bed8..aa7ba83f6a56 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp)
 static void rcu_seq_end(unsigned long *sp)
 {
 	smp_mb(); /* Ensure update-side operation before counter increment. */
-	WRITE_ONCE(*sp, *sp + 1);
+	smp_store_release(sp, *sp + 1);
 	WARN_ON_ONCE(*sp & 0x1);
 }

@@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp)
  */
 static bool rcu_seq_done(unsigned long *sp, unsigned long s)
 {
-	return ULONG_CMP_GE(READ_ONCE(*sp), s);
+	return ULONG_CMP_GE(smp_load_acquire(sp), s);

 }

> on the day I wrote that code, my fingers didn't.  Or somew similar lame
> excuse.  ;-)
>
>> Mailed a signed patch:
>> https://groups.google.com/d/msg/syzkaller/XzUXuAzKkCw/5054wU9MEAAJ
>
> This is the patch you also sent by email, that moves the WARN_ON_ONCE(),
> thank you!

^ permalink raw reply related	[flat|nested] 21+ messages in thread
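The release/acquire pairing proposed here is the standard message-passing
pattern: the release store in rcu_seq_end() would pair with the acquire
load in rcu_seq_done(), so any update-side work done before the counter
goes even is guaranteed visible to a task that observes the new counter
value. A minimal stand-alone C11 sketch of that pairing (hypothetical
names, not the kernel code) looks like this:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static unsigned long gp_result;	/* written before the "grace period" ends */
static atomic_ulong gp_seq;	/* stand-in for the sequence counter */

static void *updater(void *arg)
{
	(void)arg;
	gp_result = 42;				/* update-side work */
	atomic_store_explicit(&gp_seq, 2,	/* "rcu_seq_end()": release */
			      memory_order_release);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, updater, NULL);
	/* "rcu_seq_done()": the acquire load pairs with the release store. */
	while (atomic_load_explicit(&gp_seq, memory_order_acquire) < 2)
		;
	assert(gp_result == 42);	/* guaranteed by the pairing */
	pthread_join(t, NULL);
	return 0;
}

With plain WRITE_ONCE()/READ_ONCE(), the same guarantee has to come from
explicit smp_mb() calls placed on both sides, which is exactly the
bookkeeping the rest of this thread is trying to get right.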
* Re: rcu: WARNING in rcu_seq_end 2017-03-06 9:24 ` Dmitry Vyukov @ 2017-03-06 10:07 ` Paul E. McKenney 2017-03-06 10:11 ` Dmitry Vyukov 0 siblings, 1 reply; 21+ messages in thread From: Paul E. McKenney @ 2017-03-06 10:07 UTC (permalink / raw) To: Dmitry Vyukov Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Mon, Mar 06, 2017 at 10:24:24AM +0100, Dmitry Vyukov wrote: > On Sun, Mar 5, 2017 at 7:47 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote: > >> On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: > >> >> Hello, > >> >> > >> >> Paul, you wanted bugs in rcu. > >> > > >> > Well, whether I want them or not, I must deal with them. ;-) > >> > > >> >> I've got this WARNING while running syzkaller fuzzer on > >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: > >> >> > >> >> ------------[ cut here ]------------ > >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 > >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> >> Kernel panic - not syncing: panic_on_warn set ... > >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 > >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > >> >> Workqueue: events wait_rcu_exp_gp > >> >> Call Trace: > >> >> __dump_stack lib/dump_stack.c:15 [inline] > >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 > >> >> panic+0x1fb/0x412 kernel/panic.c:179 > >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540 > >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 > >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] > >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 > >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] > >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 > >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 > >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 > >> >> kthread+0x326/0x3f0 kernel/kthread.c:227 > >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 > >> >> Dumping ftrace buffer: > >> >> (ftrace buffer empty) > >> >> Kernel Offset: disabled > >> >> Rebooting in 86400 seconds.. > >> >> > >> >> > >> >> Not reproducible. But looking at the code, shouldn't it be: > >> >> > >> >> static void rcu_seq_end(unsigned long *sp) > >> >> { > >> >> smp_mb(); /* Ensure update-side operation before counter increment. */ > >> >> + WARN_ON_ONCE(!(*sp & 0x1)); > >> >> WRITE_ONCE(*sp, *sp + 1); > >> >> - WARN_ON_ONCE(*sp & 0x1); > >> >> } > >> >> > >> >> ? > >> >> > >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon > >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this > >> >> consequently can allow start of next grace periods. Which in turn can > >> >> make the warning fire. Am I missing something? > >> >> > >> >> I don't see any other bad consequences of this. The rest of > >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has > >> >> returned and destroyed work on stack and next period has started and > >> >> ended, but it seems OK. > >> > > >> > I believe that this is a heygood change, but I don't see how it will > >> > help in this case. BTW, may I have your Signed-off-by? 
> >> > > >> > The reason I don't believe that it will help is that the > >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that > >> > is invoked holding ->exp_mutex, and this mutex is not released until > >> > after the handler invokes rcu_seq_end() and then wakes up the task that > >> > scheduled the workqueue handler. So the ordering above should not matter > >> > (but I agree that your ordering is cleaner. > >> > > >> > That said, it looks like I am missing some memory barriers, please > >> > see the following patch. > >> > > >> > But what architecture did you see this on? > >> > >> > >> This is just x86. > >> > >> You seem to assume that wait_event() waits for the wakeup. It does not > >> work this way. It can return as soon as the condition becomes true > >> without ever waiting: > >> > >> 305 #define wait_event(wq, condition) \ > >> 306 do { \ > >> 307 might_sleep(); \ > >> 308 if (condition) \ > >> 309 break; \ > >> 310 __wait_event(wq, condition); \ > >> 311 } while (0) > > > > Agreed, hence my patch in the previous email. I guess I knew that, but > > Ah, you meant to synchronize rcu_seq_end with rcu_seq_done? No, there is a mutex release and acquisition that do the synchronization, but only if the wakeup has appropriate barriers. The issue is that part of the mutex's critical section executes in a workqueue, possibly on some other CPU. Thanx, Paul > I think you placed the barrier incorrectly for that. rcu_exp_wait_wake > is already too late. The write that unblocks waiter is in rcu_seq_end > so you need a release barrier _before_ that write. > Also can we please start using smp_load_acquire/smp_store_release > where they are what doctor said. They are faster, more readable, > better for race detectors _and_ would prevent you from introducing > this bug, because you would need to find the exact write that > signifies completion. I.e.: > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index d80c2587bed8..aa7ba83f6a56 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp) > static void rcu_seq_end(unsigned long *sp) > { > smp_mb(); /* Ensure update-side operation before counter increment. */ > - WRITE_ONCE(*sp, *sp + 1); > + smp_store_release(sp, *sp + 1); > WARN_ON_ONCE(*sp & 0x1); > } > > @@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp) > */ > static bool rcu_seq_done(unsigned long *sp, unsigned long s) > { > - return ULONG_CMP_GE(READ_ONCE(*sp), s); > + return ULONG_CMP_GE(smp_load_acquire(sp), s); > } > > > > > on the day I wrote that code, my fingers didn't. Or somew similar lame > > excuse. ;-) > > > >> Mailed a signed patch: > >> https://groups.google.com/d/msg/syzkaller/XzUXuAzKkCw/5054wU9MEAAJ > > > > This is the patch you also sent by email, that moves the WARN_ON_ONCE(), > > thank you! > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-06 10:07 ` Paul E. McKenney @ 2017-03-06 10:11 ` Dmitry Vyukov 2017-03-06 23:08 ` Paul E. McKenney 0 siblings, 1 reply; 21+ messages in thread From: Dmitry Vyukov @ 2017-03-06 10:11 UTC (permalink / raw) To: Paul McKenney Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Mon, Mar 6, 2017 at 11:07 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Mon, Mar 06, 2017 at 10:24:24AM +0100, Dmitry Vyukov wrote: >> On Sun, Mar 5, 2017 at 7:47 PM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote: >> >> On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney >> >> <paulmck@linux.vnet.ibm.com> wrote: >> >> > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: >> >> >> Hello, >> >> >> >> >> >> Paul, you wanted bugs in rcu. >> >> > >> >> > Well, whether I want them or not, I must deal with them. ;-) >> >> > >> >> >> I've got this WARNING while running syzkaller fuzzer on >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: >> >> >> >> >> >> ------------[ cut here ]------------ >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> >> >> Kernel panic - not syncing: panic_on_warn set ... >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> >> >> Workqueue: events wait_rcu_exp_gp >> >> >> Call Trace: >> >> >> __dump_stack lib/dump_stack.c:15 [inline] >> >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 >> >> >> panic+0x1fb/0x412 kernel/panic.c:179 >> >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540 >> >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] >> >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 >> >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] >> >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 >> >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 >> >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 >> >> >> kthread+0x326/0x3f0 kernel/kthread.c:227 >> >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 >> >> >> Dumping ftrace buffer: >> >> >> (ftrace buffer empty) >> >> >> Kernel Offset: disabled >> >> >> Rebooting in 86400 seconds.. >> >> >> >> >> >> >> >> >> Not reproducible. But looking at the code, shouldn't it be: >> >> >> >> >> >> static void rcu_seq_end(unsigned long *sp) >> >> >> { >> >> >> smp_mb(); /* Ensure update-side operation before counter increment. */ >> >> >> + WARN_ON_ONCE(!(*sp & 0x1)); >> >> >> WRITE_ONCE(*sp, *sp + 1); >> >> >> - WARN_ON_ONCE(*sp & 0x1); >> >> >> } >> >> >> >> >> >> ? >> >> >> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this >> >> >> consequently can allow start of next grace periods. Which in turn can >> >> >> make the warning fire. Am I missing something? >> >> >> >> >> >> I don't see any other bad consequences of this. The rest of >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has >> >> >> returned and destroyed work on stack and next period has started and >> >> >> ended, but it seems OK. >> >> > >> >> > I believe that this is a heygood change, but I don't see how it will >> >> > help in this case. 
BTW, may I have your Signed-off-by? >> >> > >> >> > The reason I don't believe that it will help is that the >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that >> >> > is invoked holding ->exp_mutex, and this mutex is not released until >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that >> >> > scheduled the workqueue handler. So the ordering above should not matter >> >> > (but I agree that your ordering is cleaner. >> >> > >> >> > That said, it looks like I am missing some memory barriers, please >> >> > see the following patch. >> >> > >> >> > But what architecture did you see this on? >> >> >> >> >> >> This is just x86. >> >> >> >> You seem to assume that wait_event() waits for the wakeup. It does not >> >> work this way. It can return as soon as the condition becomes true >> >> without ever waiting: >> >> >> >> 305 #define wait_event(wq, condition) \ >> >> 306 do { \ >> >> 307 might_sleep(); \ >> >> 308 if (condition) \ >> >> 309 break; \ >> >> 310 __wait_event(wq, condition); \ >> >> 311 } while (0) >> > >> > Agreed, hence my patch in the previous email. I guess I knew that, but >> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done? > > No, there is a mutex release and acquisition that do the synchronization, > but only if the wakeup has appropriate barriers. The issue is that > part of the mutex's critical section executes in a workqueue, possibly > on some other CPU. What is that mutex? And what locks/unlocks provide synchronization? I see that one uses exp_mutex and another -- exp_wake_mutex. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-06 10:11 ` Dmitry Vyukov @ 2017-03-06 23:08 ` Paul E. McKenney 2017-03-07 7:05 ` Dmitry Vyukov 0 siblings, 1 reply; 21+ messages in thread From: Paul E. McKenney @ 2017-03-06 23:08 UTC (permalink / raw) To: Dmitry Vyukov Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Mon, Mar 06, 2017 at 11:11:23AM +0100, Dmitry Vyukov wrote: > On Mon, Mar 6, 2017 at 11:07 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Mon, Mar 06, 2017 at 10:24:24AM +0100, Dmitry Vyukov wrote: > >> On Sun, Mar 5, 2017 at 7:47 PM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > On Sun, Mar 05, 2017 at 11:50:39AM +0100, Dmitry Vyukov wrote: > >> >> On Sat, Mar 4, 2017 at 9:40 PM, Paul E. McKenney > >> >> <paulmck@linux.vnet.ibm.com> wrote: > >> >> > On Sat, Mar 04, 2017 at 05:01:19PM +0100, Dmitry Vyukov wrote: > >> >> >> Hello, > >> >> >> > >> >> >> Paul, you wanted bugs in rcu. > >> >> > > >> >> > Well, whether I want them or not, I must deal with them. ;-) > >> >> > > >> >> >> I've got this WARNING while running syzkaller fuzzer on > >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: > >> >> >> > >> >> >> ------------[ cut here ]------------ > >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 > >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> >> >> Kernel panic - not syncing: panic_on_warn set ... > >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 > >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > >> >> >> Workqueue: events wait_rcu_exp_gp > >> >> >> Call Trace: > >> >> >> __dump_stack lib/dump_stack.c:15 [inline] > >> >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 > >> >> >> panic+0x1fb/0x412 kernel/panic.c:179 > >> >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540 > >> >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 > >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] > >> >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 > >> >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] > >> >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 > >> >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 > >> >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 > >> >> >> kthread+0x326/0x3f0 kernel/kthread.c:227 > >> >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 > >> >> >> Dumping ftrace buffer: > >> >> >> (ftrace buffer empty) > >> >> >> Kernel Offset: disabled > >> >> >> Rebooting in 86400 seconds.. > >> >> >> > >> >> >> > >> >> >> Not reproducible. But looking at the code, shouldn't it be: > >> >> >> > >> >> >> static void rcu_seq_end(unsigned long *sp) > >> >> >> { > >> >> >> smp_mb(); /* Ensure update-side operation before counter increment. */ > >> >> >> + WARN_ON_ONCE(!(*sp & 0x1)); > >> >> >> WRITE_ONCE(*sp, *sp + 1); > >> >> >> - WARN_ON_ONCE(*sp & 0x1); > >> >> >> } > >> >> >> > >> >> >> ? > >> >> >> > >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon > >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this > >> >> >> consequently can allow start of next grace periods. Which in turn can > >> >> >> make the warning fire. Am I missing something? > >> >> >> > >> >> >> I don't see any other bad consequences of this. 
The rest of > >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has > >> >> >> returned and destroyed work on stack and next period has started and > >> >> >> ended, but it seems OK. > >> >> > > >> >> > I believe that this is a heygood change, but I don't see how it will > >> >> > help in this case. BTW, may I have your Signed-off-by? > >> >> > > >> >> > The reason I don't believe that it will help is that the > >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that > >> >> > is invoked holding ->exp_mutex, and this mutex is not released until > >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that > >> >> > scheduled the workqueue handler. So the ordering above should not matter > >> >> > (but I agree that your ordering is cleaner. > >> >> > > >> >> > That said, it looks like I am missing some memory barriers, please > >> >> > see the following patch. > >> >> > > >> >> > But what architecture did you see this on? > >> >> > >> >> > >> >> This is just x86. > >> >> > >> >> You seem to assume that wait_event() waits for the wakeup. It does not > >> >> work this way. It can return as soon as the condition becomes true > >> >> without ever waiting: > >> >> > >> >> 305 #define wait_event(wq, condition) \ > >> >> 306 do { \ > >> >> 307 might_sleep(); \ > >> >> 308 if (condition) \ > >> >> 309 break; \ > >> >> 310 __wait_event(wq, condition); \ > >> >> 311 } while (0) > >> > > >> > Agreed, hence my patch in the previous email. I guess I knew that, but > >> > >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done? > > > > No, there is a mutex release and acquisition that do the synchronization, > > but only if the wakeup has appropriate barriers. The issue is that > > part of the mutex's critical section executes in a workqueue, possibly > > on some other CPU. > > What is that mutex? And what locks/unlocks provide synchronization? I > see that one uses exp_mutex and another -- exp_wake_mutex. Both of them. ->exp_mutex is acquired by the task requesting the grace period, and the counter's first increment is done by that task under that mutex. This task then schedules a workqueue, which drives forward the grace period. Upon grace-period completion, the workqueue handler does the second increment (the one that your patch addressed). The workqueue handler then acquires ->exp_wake_mutex and wakes the task that holds ->exp_mutex (along with all other tasks waiting for this grace period), and that task releases ->exp_mutex, which allows the next grace period to start (and the first increment for that next grace period to be carried out under that lock). The workqueue handler releases ->exp_wake_mutex after finishing its wakeups. Does that make sense? Thanx, Paul ^ permalink raw reply [flat|nested] 21+ messages in thread
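Paul's description maps onto the expedited-grace-period code roughly as
follows. This is only a schematic with elided detail; the real code is in
kernel/rcu/tree_exp.h, and the exact argument lists, the rew.rew_work
field, and the schedule_work() call site are assumptions here rather than
quotes from the source:

/* Task requesting the expedited grace period: */
	mutex_lock(&rsp->exp_mutex);
	rcu_exp_gp_seq_start(rsp);		/* first increment: counter odd */
	schedule_work(&rew.rew_work);		/* handler: wait_rcu_exp_gp() */
	wait_event(rnp->exp_wq[(s >> 1) & 0x3],
		   sync_exp_work_done(rsp, ..., s)); /* may return without sleeping */
	mutex_unlock(&rsp->exp_mutex);		/* lets the next grace period start */

/* Workqueue handler, i.e. rcu_exp_wait_wake(), possibly on another CPU: */
	/* ... wait for the grace period to complete ... */
	rcu_exp_gp_seq_end(rsp);		/* second increment: counter even */
	mutex_lock(&rsp->exp_wake_mutex);
	wake_up_all(&rnp->exp_wq[(rsp->expedited_sequence >> 1) & 0x3]);
	mutex_unlock(&rsp->exp_wake_mutex);

Read this way, the point Dmitry raises next is visible in the wait_event()
line: when sync_exp_work_done() already sees the second increment, the
requester never sleeps and never takes any lock the handler also holds at
that moment, so the two mutexes by themselves impose no ordering on that
fast path.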
* Re: rcu: WARNING in rcu_seq_end 2017-03-06 23:08 ` Paul E. McKenney @ 2017-03-07 7:05 ` Dmitry Vyukov 2017-03-07 14:27 ` Boqun Feng 2017-03-07 15:16 ` Paul E. McKenney 0 siblings, 2 replies; 21+ messages in thread From: Dmitry Vyukov @ 2017-03-07 7:05 UTC (permalink / raw) To: Paul McKenney Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Tue, Mar 7, 2017 at 12:08 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> >> >> Hello, >> >> >> >> >> >> >> >> Paul, you wanted bugs in rcu. >> >> >> > >> >> >> > Well, whether I want them or not, I must deal with them. ;-) >> >> >> > >> >> >> >> I've got this WARNING while running syzkaller fuzzer on >> >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: >> >> >> >> >> >> >> >> ------------[ cut here ]------------ >> >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> >> >> >> Kernel panic - not syncing: panic_on_warn set ... >> >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 >> >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 >> >> >> >> Workqueue: events wait_rcu_exp_gp >> >> >> >> Call Trace: >> >> >> >> __dump_stack lib/dump_stack.c:15 [inline] >> >> >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 >> >> >> >> panic+0x1fb/0x412 kernel/panic.c:179 >> >> >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540 >> >> >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 >> >> >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] >> >> >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 >> >> >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] >> >> >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 >> >> >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 >> >> >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 >> >> >> >> kthread+0x326/0x3f0 kernel/kthread.c:227 >> >> >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 >> >> >> >> Dumping ftrace buffer: >> >> >> >> (ftrace buffer empty) >> >> >> >> Kernel Offset: disabled >> >> >> >> Rebooting in 86400 seconds.. >> >> >> >> >> >> >> >> >> >> >> >> Not reproducible. But looking at the code, shouldn't it be: >> >> >> >> >> >> >> >> static void rcu_seq_end(unsigned long *sp) >> >> >> >> { >> >> >> >> smp_mb(); /* Ensure update-side operation before counter increment. */ >> >> >> >> + WARN_ON_ONCE(!(*sp & 0x1)); >> >> >> >> WRITE_ONCE(*sp, *sp + 1); >> >> >> >> - WARN_ON_ONCE(*sp & 0x1); >> >> >> >> } >> >> >> >> >> >> >> >> ? >> >> >> >> >> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon >> >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this >> >> >> >> consequently can allow start of next grace periods. Which in turn can >> >> >> >> make the warning fire. Am I missing something? >> >> >> >> >> >> >> >> I don't see any other bad consequences of this. The rest of >> >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has >> >> >> >> returned and destroyed work on stack and next period has started and >> >> >> >> ended, but it seems OK. >> >> >> > >> >> >> > I believe that this is a heygood change, but I don't see how it will >> >> >> > help in this case. BTW, may I have your Signed-off-by? 
>> >> >> > >> >> >> > The reason I don't believe that it will help is that the >> >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that >> >> >> > is invoked holding ->exp_mutex, and this mutex is not released until >> >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that >> >> >> > scheduled the workqueue handler. So the ordering above should not matter >> >> >> > (but I agree that your ordering is cleaner. >> >> >> > >> >> >> > That said, it looks like I am missing some memory barriers, please >> >> >> > see the following patch. >> >> >> > >> >> >> > But what architecture did you see this on? >> >> >> >> >> >> >> >> >> This is just x86. >> >> >> >> >> >> You seem to assume that wait_event() waits for the wakeup. It does not >> >> >> work this way. It can return as soon as the condition becomes true >> >> >> without ever waiting: >> >> >> >> >> >> 305 #define wait_event(wq, condition) \ >> >> >> 306 do { \ >> >> >> 307 might_sleep(); \ >> >> >> 308 if (condition) \ >> >> >> 309 break; \ >> >> >> 310 __wait_event(wq, condition); \ >> >> >> 311 } while (0) >> >> > >> >> > Agreed, hence my patch in the previous email. I guess I knew that, but >> >> >> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done? >> > >> > No, there is a mutex release and acquisition that do the synchronization, >> > but only if the wakeup has appropriate barriers. The issue is that >> > part of the mutex's critical section executes in a workqueue, possibly >> > on some other CPU. >> >> What is that mutex? And what locks/unlocks provide synchronization? I >> see that one uses exp_mutex and another -- exp_wake_mutex. > > Both of them. > > ->exp_mutex is acquired by the task requesting the grace period, and > the counter's first increment is done by that task under that mutex. > This task then schedules a workqueue, which drives forward the grace > period. Upon grace-period completion, the workqueue handler does the > second increment (the one that your patch addressed). The workqueue > handler then acquires ->exp_wake_mutex and wakes the task that holds > ->exp_mutex (along with all other tasks waiting for this grace period), > and that task releases ->exp_mutex, which allows the next grace period to > start (and the first increment for that next grace period to be carried > out under that lock). The workqueue handler releases ->exp_wake_mutex > after finishing its wakeups. Then we need the following for the case when task requesting the grace period does not block, right? diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d80c2587bed8..aa7ba83f6a56 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp) static void rcu_seq_end(unsigned long *sp) { smp_mb(); /* Ensure update-side operation before counter increment. */ - WRITE_ONCE(*sp, *sp + 1); + smp_store_release(sp, *sp + 1); WARN_ON_ONCE(*sp & 0x1); } @@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp) */ static bool rcu_seq_done(unsigned long *sp, unsigned long s) { - return ULONG_CMP_GE(READ_ONCE(*sp), s); + return ULONG_CMP_GE(smp_load_acquire(sp), s); } ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 7:05 ` Dmitry Vyukov @ 2017-03-07 14:27 ` Boqun Feng 2017-03-07 14:43 ` Dmitry Vyukov 2017-03-07 15:16 ` Paul E. McKenney 1 sibling, 1 reply; 21+ messages in thread From: Boqun Feng @ 2017-03-07 14:27 UTC (permalink / raw) To: Dmitry Vyukov Cc: Paul McKenney, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller [-- Attachment #1: Type: text/plain, Size: 2097 bytes --] On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: [...] > >> > >> What is that mutex? And what locks/unlocks provide synchronization? I > >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > > Both of them. > > > > ->exp_mutex is acquired by the task requesting the grace period, and > > the counter's first increment is done by that task under that mutex. > > This task then schedules a workqueue, which drives forward the grace > > period. Upon grace-period completion, the workqueue handler does the > > second increment (the one that your patch addressed). The workqueue > > handler then acquires ->exp_wake_mutex and wakes the task that holds > > ->exp_mutex (along with all other tasks waiting for this grace period), > > and that task releases ->exp_mutex, which allows the next grace period to > > start (and the first increment for that next grace period to be carried > > out under that lock). The workqueue handler releases ->exp_wake_mutex > > after finishing its wakeups. > > > Then we need the following for the case when task requesting the grace > period does not block, right? > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the smp_mb__before_atomic() in sync_exp_work_done() already provide the required ordering, no? Regards, Boqun > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index d80c2587bed8..aa7ba83f6a56 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp) > static void rcu_seq_end(unsigned long *sp) > { > smp_mb(); /* Ensure update-side operation before counter increment. */ > - WRITE_ONCE(*sp, *sp + 1); > + smp_store_release(sp, *sp + 1); > WARN_ON_ONCE(*sp & 0x1); > } > > @@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp) > */ > static bool rcu_seq_done(unsigned long *sp, unsigned long s) > { > - return ULONG_CMP_GE(READ_ONCE(*sp), s); > + return ULONG_CMP_GE(smp_load_acquire(sp), s); > > } [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 14:27 ` Boqun Feng @ 2017-03-07 14:43 ` Dmitry Vyukov 2017-03-07 15:27 ` Paul E. McKenney 0 siblings, 1 reply; 21+ messages in thread From: Dmitry Vyukov @ 2017-03-07 14:43 UTC (permalink / raw) To: Boqun Feng Cc: Paul McKenney, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > [...] >> >> >> >> What is that mutex? And what locks/unlocks provide synchronization? I >> >> see that one uses exp_mutex and another -- exp_wake_mutex. >> > >> > Both of them. >> > >> > ->exp_mutex is acquired by the task requesting the grace period, and >> > the counter's first increment is done by that task under that mutex. >> > This task then schedules a workqueue, which drives forward the grace >> > period. Upon grace-period completion, the workqueue handler does the >> > second increment (the one that your patch addressed). The workqueue >> > handler then acquires ->exp_wake_mutex and wakes the task that holds >> > ->exp_mutex (along with all other tasks waiting for this grace period), >> > and that task releases ->exp_mutex, which allows the next grace period to >> > start (and the first increment for that next grace period to be carried >> > out under that lock). The workqueue handler releases ->exp_wake_mutex >> > after finishing its wakeups. >> >> >> Then we need the following for the case when task requesting the grace >> period does not block, right? >> > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > smp_mb__before_atomic() in sync_exp_work_done() already provide the > required ordering, no? smp_mb() is probably fine, but smp_mb__before_atomic() is release not acquire. If we want to play that game, then I guess we also need smp_mb__after_atomic() there. But it would be way easier to understand what's happens there and prove that it's correct, if we use store_release/load_acquire. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end
  2017-03-07 14:43 ` Dmitry Vyukov
@ 2017-03-07 15:27 ` Paul E. McKenney
  2017-03-07 18:37 ` Dmitry Vyukov
  2017-03-07 23:05 ` Boqun Feng
  0 siblings, 2 replies; 21+ messages in thread
From: Paul E. McKenney @ 2017-03-07 15:27 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Boqun Feng, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller

On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote:
> On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote:
> > [...]
> >> >>
> >> >> What is that mutex? And what locks/unlocks provide synchronization? I
> >> >> see that one uses exp_mutex and another -- exp_wake_mutex.
> >> >
> >> > Both of them.
> >> >
> >> > ->exp_mutex is acquired by the task requesting the grace period, and
> >> > the counter's first increment is done by that task under that mutex.
> >> > This task then schedules a workqueue, which drives forward the grace
> >> > period. Upon grace-period completion, the workqueue handler does the
> >> > second increment (the one that your patch addressed). The workqueue
> >> > handler then acquires ->exp_wake_mutex and wakes the task that holds
> >> > ->exp_mutex (along with all other tasks waiting for this grace period),
> >> > and that task releases ->exp_mutex, which allows the next grace period to
> >> > start (and the first increment for that next grace period to be carried
> >> > out under that lock). The workqueue handler releases ->exp_wake_mutex
> >> > after finishing its wakeups.
> >>
> >> Then we need the following for the case when task requesting the grace
> >> period does not block, right?
> >
> > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the
> > smp_mb__before_atomic() in sync_exp_work_done() already provide the
> > required ordering, no?
>
> smp_mb() is probably fine, but smp_mb__before_atomic() is release not
> acquire. If we want to play that game, then I guess we also need
> smp_mb__after_atomic() there. But it would be way easier to understand
> what's happens there and prove that it's correct, if we use
> store_release/load_acquire.

Fair point, how about the following?

							Thanx, Paul

------------------------------------------------------------------------

commit 6fd8074f1976596898e39f5b7ea1755652533906
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Tue Mar 7 07:21:23 2017 -0800

    rcu: Add smp_mb__after_atomic() to sync_exp_work_done()

    The sync_exp_work_done() function needs to fully order the counter-check
    operation against anything happening after the corresponding grace period.
    This is a theoretical bug, as all current architectures either provide
    full ordering for atomic operation on the one hand or implement,
    however, a little future-proofing is a good thing.  This commit
    therefore adds smp_mb__after_atomic() after the atomic_long_inc()
    in sync_exp_work_done().

    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 027e123d93c7..652071abd9b4 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat,
 		/* Ensure test happens before caller kfree(). */
 		smp_mb__before_atomic(); /* ^^^ */
 		atomic_long_inc(stat);
+		smp_mb__after_atomic(); /* ^^^ */
 		return true;
 	}
 	return false;

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 15:27 ` Paul E. McKenney @ 2017-03-07 18:37 ` Dmitry Vyukov 2017-03-07 19:09 ` Paul E. McKenney 2017-03-07 23:05 ` Boqun Feng 1 sibling, 1 reply; 21+ messages in thread From: Dmitry Vyukov @ 2017-03-07 18:37 UTC (permalink / raw) To: Paul McKenney Cc: Boqun Feng, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Tue, Mar 7, 2017 at 4:27 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> > [...] >> >> >> >> >> >> What is that mutex? And what locks/unlocks provide synchronization? I >> >> >> see that one uses exp_mutex and another -- exp_wake_mutex. >> >> > >> >> > Both of them. >> >> > >> >> > ->exp_mutex is acquired by the task requesting the grace period, and >> >> > the counter's first increment is done by that task under that mutex. >> >> > This task then schedules a workqueue, which drives forward the grace >> >> > period. Upon grace-period completion, the workqueue handler does the >> >> > second increment (the one that your patch addressed). The workqueue >> >> > handler then acquires ->exp_wake_mutex and wakes the task that holds >> >> > ->exp_mutex (along with all other tasks waiting for this grace period), >> >> > and that task releases ->exp_mutex, which allows the next grace period to >> >> > start (and the first increment for that next grace period to be carried >> >> > out under that lock). The workqueue handler releases ->exp_wake_mutex >> >> > after finishing its wakeups. >> >> >> >> Then we need the following for the case when task requesting the grace >> >> period does not block, right? >> > >> > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the >> > smp_mb__before_atomic() in sync_exp_work_done() already provide the >> > required ordering, no? >> >> smp_mb() is probably fine, but smp_mb__before_atomic() is release not >> acquire. If we want to play that game, then I guess we also need >> smp_mb__after_atomic() there. But it would be way easier to understand >> what's happens there and prove that it's correct, if we use >> store_release/load_acquire. > > Fair point, how about the following? I am not qualified enough to reason about these smp_mb__after_atomic. >From practical point of view there may be enough barriers in the resulting machine code already, but re formal semantics of adding smp_mb__after_atomic after an unrelated subsequent atomic RMW op I gave up. You must be the best candidate for this now :) > Thanx, Paul > > ------------------------------------------------------------------------ > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Date: Tue Mar 7 07:21:23 2017 -0800 > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > The sync_exp_work_done() function needs to fully order the counter-check > operation against anything happening after the corresponding grace period. > This is a theoretical bug, as all current architectures either provide > full ordering for atomic operation on the one hand or implement, > however, a little future-proofing is a good thing. This commit > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > in sync_exp_work_done(). > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > Signed-off-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 027e123d93c7..652071abd9b4 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > /* Ensure test happens before caller kfree(). */ > smp_mb__before_atomic(); /* ^^^ */ > atomic_long_inc(stat); > + smp_mb__after_atomic(); /* ^^^ */ > return true; > } > return false; ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 18:37 ` Dmitry Vyukov @ 2017-03-07 19:09 ` Paul E. McKenney 0 siblings, 0 replies; 21+ messages in thread From: Paul E. McKenney @ 2017-03-07 19:09 UTC (permalink / raw) To: Dmitry Vyukov Cc: Boqun Feng, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Tue, Mar 07, 2017 at 07:37:57PM +0100, Dmitry Vyukov wrote: > On Tue, Mar 7, 2017 at 4:27 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > >> > [...] > >> >> >> > >> >> >> What is that mutex? And what locks/unlocks provide synchronization? I > >> >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > >> >> > > >> >> > Both of them. > >> >> > > >> >> > ->exp_mutex is acquired by the task requesting the grace period, and > >> >> > the counter's first increment is done by that task under that mutex. > >> >> > This task then schedules a workqueue, which drives forward the grace > >> >> > period. Upon grace-period completion, the workqueue handler does the > >> >> > second increment (the one that your patch addressed). The workqueue > >> >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > >> >> > ->exp_mutex (along with all other tasks waiting for this grace period), > >> >> > and that task releases ->exp_mutex, which allows the next grace period to > >> >> > start (and the first increment for that next grace period to be carried > >> >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > >> >> > after finishing its wakeups. > >> >> > >> >> Then we need the following for the case when task requesting the grace > >> >> period does not block, right? > >> > > >> > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > >> > smp_mb__before_atomic() in sync_exp_work_done() already provide the > >> > required ordering, no? > >> > >> smp_mb() is probably fine, but smp_mb__before_atomic() is release not > >> acquire. If we want to play that game, then I guess we also need > >> smp_mb__after_atomic() there. But it would be way easier to understand > >> what's happens there and prove that it's correct, if we use > >> store_release/load_acquire. > > > > Fair point, how about the following? > > I am not qualified enough to reason about these smp_mb__after_atomic. > >From practical point of view there may be enough barriers in the > resulting machine code already, but re formal semantics of adding > smp_mb__after_atomic after an unrelated subsequent atomic RMW op I > gave up. You must be the best candidate for this now :) Unfortunately, there are code paths from sync_exp_work_done() that have no memory barriers. :-( And I might be the best candidate, but this email thread has definitely shown that I am not infallable, never mind that there was already plenty of evidence on this particular point. So thank you again for your testing and review efforts! Thanx, Paul > > ------------------------------------------------------------------------ > > > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > Date: Tue Mar 7 07:21:23 2017 -0800 > > > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > > > The sync_exp_work_done() function needs to fully order the counter-check > > operation against anything happening after the corresponding grace period. > > This is a theoretical bug, as all current architectures either provide > > full ordering for atomic operation on the one hand or implement, > > however, a little future-proofing is a good thing. 
This commit > > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > > in sync_exp_work_done(). > > > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 027e123d93c7..652071abd9b4 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > > /* Ensure test happens before caller kfree(). */ > > smp_mb__before_atomic(); /* ^^^ */ > > atomic_long_inc(stat); > > + smp_mb__after_atomic(); /* ^^^ */ > > return true; > > } > > return false; > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 15:27 ` Paul E. McKenney 2017-03-07 18:37 ` Dmitry Vyukov @ 2017-03-07 23:05 ` Boqun Feng 2017-03-07 23:31 ` Paul E. McKenney 1 sibling, 1 reply; 21+ messages in thread From: Boqun Feng @ 2017-03-07 23:05 UTC (permalink / raw) To: Paul E. McKenney Cc: Dmitry Vyukov, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller [-- Attachment #1: Type: text/plain, Size: 4122 bytes --] On Tue, Mar 07, 2017 at 07:27:15AM -0800, Paul E. McKenney wrote: > On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote: > > On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > > > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > > > [...] > > >> >> > > >> >> What is that mutex? And what locks/unlocks provide synchronization? I > > >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > > >> > > > >> > Both of them. > > >> > > > >> > ->exp_mutex is acquired by the task requesting the grace period, and > > >> > the counter's first increment is done by that task under that mutex. > > >> > This task then schedules a workqueue, which drives forward the grace > > >> > period. Upon grace-period completion, the workqueue handler does the > > >> > second increment (the one that your patch addressed). The workqueue > > >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > > >> > ->exp_mutex (along with all other tasks waiting for this grace period), > > >> > and that task releases ->exp_mutex, which allows the next grace period to > > >> > start (and the first increment for that next grace period to be carried > > >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > > >> > after finishing its wakeups. > > >> > > >> Then we need the following for the case when task requesting the grace > > >> period does not block, right? > > > > > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > > > smp_mb__before_atomic() in sync_exp_work_done() already provide the > > > required ordering, no? > > > > smp_mb() is probably fine, but smp_mb__before_atomic() is release not > > acquire. If we want to play that game, then I guess we also need The point is that smp_mb__before_atomic() + atomic_long_inc() will guarantee a smp_mb() before or right along with the atomic operation, and that's enough because rcu_seq_done() followed by a smp_mb() will give it a acquire-like behavior. > > smp_mb__after_atomic() there. But it would be way easier to understand Adding smp_mb__after_atomic() would be pointless as it's the load of ->expedited_sequence that we want to ensure having acquire behavior rather than the atomic increment of @stat. > > what's happens there and prove that it's correct, if we use > > store_release/load_acquire. > > Fair point, how about the following? > > Thanx, Paul > > ------------------------------------------------------------------------ > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Date: Tue Mar 7 07:21:23 2017 -0800 > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > The sync_exp_work_done() function needs to fully order the counter-check > operation against anything happening after the corresponding grace period. > This is a theoretical bug, as all current architectures either provide > full ordering for atomic operation on the one hand or implement, > however, a little future-proofing is a good thing. 
This commit > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > in sync_exp_work_done(). > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 027e123d93c7..652071abd9b4 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > /* Ensure test happens before caller kfree(). */ > smp_mb__before_atomic(); /* ^^^ */ > atomic_long_inc(stat); > + smp_mb__after_atomic(); /* ^^^ */ If we really care about future-proofing, I think it's more safe to change smp_mb__before_atomic() to smp_mb() rather than adding __after_atomic() barrier. Though I think both would be unnecessary ;-) Regards, Boqun > return true; > } > return false; > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
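For concreteness, the alternative Boqun floats at the end of his message would look roughly like the sketch below. This is hypothetical, not a posted patch: a sync_exp_work_done()-style helper whose ordering argument rests on a bare smp_mb() rather than on how smp_mb__before_atomic() happens to be documented. The example_ name is made up; sp is the sequence counter and s a snapshot taken earlier.

    static bool example_work_done(unsigned long *sp, atomic_long_t *stat,
                                  unsigned long s)
    {
            if (ULONG_CMP_GE(READ_ONCE(*sp), s)) {  /* rcu_seq_done()-style counter check */
                    smp_mb();               /* order the check before anything after the GP */
                    atomic_long_inc(stat);  /* statistics only; carries no ordering duty */
                    return true;
            }
            return false;
    }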
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 23:05 ` Boqun Feng @ 2017-03-07 23:31 ` Paul E. McKenney 2017-03-08 1:39 ` Boqun Feng 0 siblings, 1 reply; 21+ messages in thread From: Paul E. McKenney @ 2017-03-07 23:31 UTC (permalink / raw) To: Boqun Feng Cc: Dmitry Vyukov, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Wed, Mar 08, 2017 at 07:05:13AM +0800, Boqun Feng wrote: > On Tue, Mar 07, 2017 at 07:27:15AM -0800, Paul E. McKenney wrote: > > On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote: > > > On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > > > > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > > > > [...] > > > >> >> > > > >> >> What is that mutex? And what locks/unlocks provide synchronization? I > > > >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > >> > > > > >> > Both of them. > > > >> > > > > >> > ->exp_mutex is acquired by the task requesting the grace period, and > > > >> > the counter's first increment is done by that task under that mutex. > > > >> > This task then schedules a workqueue, which drives forward the grace > > > >> > period. Upon grace-period completion, the workqueue handler does the > > > >> > second increment (the one that your patch addressed). The workqueue > > > >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > > > >> > ->exp_mutex (along with all other tasks waiting for this grace period), > > > >> > and that task releases ->exp_mutex, which allows the next grace period to > > > >> > start (and the first increment for that next grace period to be carried > > > >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > > > >> > after finishing its wakeups. > > > >> > > > >> Then we need the following for the case when task requesting the grace > > > >> period does not block, right? > > > > > > > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > > > > smp_mb__before_atomic() in sync_exp_work_done() already provide the > > > > required ordering, no? > > > > > > smp_mb() is probably fine, but smp_mb__before_atomic() is release not > > > acquire. If we want to play that game, then I guess we also need > > The point is that smp_mb__before_atomic() + atomic_long_inc() will > guarantee a smp_mb() before or right along with the atomic operation, > and that's enough because rcu_seq_done() followed by a smp_mb() will > give it a acquire-like behavior. Given current architectures, true enough, from what I can see. However, let's take a look at atomic_ops.rst: If a caller requires memory barrier semantics around an atomic_t operation which does not return a value, a set of interfaces are defined which accomplish this:: void smp_mb__before_atomic(void); void smp_mb__after_atomic(void); For example, smp_mb__before_atomic() can be used like so:: obj->dead = 1; smp_mb__before_atomic(); atomic_dec(&obj->ref_count); It makes sure that all memory operations preceding the atomic_dec() call are strongly ordered with respect to the atomic counter operation. In the above example, it guarantees that the assignment of "1" to obj->dead will be globally visible to other cpus before the atomic counter decrement. Without the explicit smp_mb__before_atomic() call, the implementation could legally allow the atomic counter update visible to other cpus before the "obj->dead = 1;" assignment. So the ordering is guaranteed against the atomic operation, not necessarily the stuff after it. 
But again, the implementations I know of do make the guarantee, hence my calling it a theoretical bug in the commit log. > > > smp_mb__after_atomic() there. But it would be way easier to understand > > Adding smp_mb__after_atomic() would be pointless as it's the load of > ->expedited_sequence that we want to ensure having acquire behavior > rather than the atomic increment of @stat. Again, agreed given current code, but atomic_ops.rst doesn't guarantee ordering past the actual atomic operation itself. Thanx, Paul > > > what's happens there and prove that it's correct, if we use > > > store_release/load_acquire. > > > > Fair point, how about the following? > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > Date: Tue Mar 7 07:21:23 2017 -0800 > > > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > > > The sync_exp_work_done() function needs to fully order the counter-check > > operation against anything happening after the corresponding grace period. > > This is a theoretical bug, as all current architectures either provide > > full ordering for atomic operation on the one hand or implement, > > however, a little future-proofing is a good thing. This commit > > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > > in sync_exp_work_done(). > > > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 027e123d93c7..652071abd9b4 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > > /* Ensure test happens before caller kfree(). */ > > smp_mb__before_atomic(); /* ^^^ */ > > atomic_long_inc(stat); > > + smp_mb__after_atomic(); /* ^^^ */ > > If we really care about future-proofing, I think it's more safe to > change smp_mb__before_atomic() to smp_mb() rather than adding > __after_atomic() barrier. Though I think both would be unnecessary ;-) > > Regards, > Boqun > > > return true; > > } > > return false; > > ^ permalink raw reply [flat|nested] 21+ messages in thread
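To make the distinction in the quoted documentation concrete, here is an illustrative two-CPU snippet with made-up variables (x, y, cnt, r0, r1). The documentation only promises that CPU 0's store to x is ordered before the atomic increment; whether CPU 0's later load of y is also ordered after the increment is exactly what it leaves open, and is why the added barrier is described as future-proofing.

    /* CPU 0 */
    WRITE_ONCE(x, 1);
    smp_mb__before_atomic();        /* documented: orders x = 1 before the increment */
    atomic_inc(&cnt);
    r0 = READ_ONCE(y);              /* ordering of this later load is the open question */

    /* CPU 1 */
    WRITE_ONCE(y, 1);
    smp_mb();
    r1 = READ_ONCE(x);

    /*
     * If CPU 0's load of y were fully ordered after the atomic increment,
     * the outcome r0 == 0 && r1 == 0 would be forbidden.  The quoted
     * documentation by itself does not promise that; current
     * implementations happen to, hence "theoretical bug".
     */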
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 23:31 ` Paul E. McKenney @ 2017-03-08 1:39 ` Boqun Feng 2017-03-08 2:26 ` Paul E. McKenney 0 siblings, 1 reply; 21+ messages in thread From: Boqun Feng @ 2017-03-08 1:39 UTC (permalink / raw) To: Paul E. McKenney Cc: Dmitry Vyukov, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller [-- Attachment #1: Type: text/plain, Size: 6979 bytes --] On Tue, Mar 07, 2017 at 03:31:54PM -0800, Paul E. McKenney wrote: > On Wed, Mar 08, 2017 at 07:05:13AM +0800, Boqun Feng wrote: > > On Tue, Mar 07, 2017 at 07:27:15AM -0800, Paul E. McKenney wrote: > > > On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote: > > > > On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > > > > > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > > > > > [...] > > > > >> >> > > > > >> >> What is that mutex? And what locks/unlocks provide synchronization? I > > > > >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > > >> > > > > > >> > Both of them. > > > > >> > > > > > >> > ->exp_mutex is acquired by the task requesting the grace period, and > > > > >> > the counter's first increment is done by that task under that mutex. > > > > >> > This task then schedules a workqueue, which drives forward the grace > > > > >> > period. Upon grace-period completion, the workqueue handler does the > > > > >> > second increment (the one that your patch addressed). The workqueue > > > > >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > > > > >> > ->exp_mutex (along with all other tasks waiting for this grace period), > > > > >> > and that task releases ->exp_mutex, which allows the next grace period to > > > > >> > start (and the first increment for that next grace period to be carried > > > > >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > > > > >> > after finishing its wakeups. > > > > >> > > > > >> Then we need the following for the case when task requesting the grace > > > > >> period does not block, right? > > > > > > > > > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > > > > > smp_mb__before_atomic() in sync_exp_work_done() already provide the > > > > > required ordering, no? > > > > > > > > smp_mb() is probably fine, but smp_mb__before_atomic() is release not > > > > acquire. If we want to play that game, then I guess we also need > > > > The point is that smp_mb__before_atomic() + atomic_long_inc() will > > guarantee a smp_mb() before or right along with the atomic operation, > > and that's enough because rcu_seq_done() followed by a smp_mb() will > > give it a acquire-like behavior. > > Given current architectures, true enough, from what I can see. > > However, let's take a look at atomic_ops.rst: > > > If a caller requires memory barrier semantics around an atomic_t > operation which does not return a value, a set of interfaces are > defined which accomplish this:: > > void smp_mb__before_atomic(void); > void smp_mb__after_atomic(void); > > For example, smp_mb__before_atomic() can be used like so:: > > obj->dead = 1; > smp_mb__before_atomic(); > atomic_dec(&obj->ref_count); > > It makes sure that all memory operations preceding the atomic_dec() > call are strongly ordered with respect to the atomic counter > operation. In the above example, it guarantees that the assignment of > "1" to obj->dead will be globally visible to other cpus before the > atomic counter decrement. 
> > Without the explicit smp_mb__before_atomic() call, the > implementation could legally allow the atomic counter update visible > to other cpus before the "obj->dead = 1;" assignment. > > So the ordering is guaranteed against the atomic operation, not > necessarily the stuff after it. But again, the implementations I know > of do make the guarantee, hence my calling it a theoretical bug in the > commit log. > Fair enough ;-) It's me who misunderstood this part of document. However, the names of the barriers are smp_mb__{before,after}_atomic(), so if they, semantically, only provide ordering for the corresponding atomic ops rather than a full barrier, I would their names are misleading ;-) > > > > smp_mb__after_atomic() there. But it would be way easier to understand > > > > Adding smp_mb__after_atomic() would be pointless as it's the load of > > ->expedited_sequence that we want to ensure having acquire behavior > > rather than the atomic increment of @stat. > > Again, agreed given current code, but atomic_ops.rst doesn't guarantee > ordering past the actual atomic operation itself. > Neither does atomic_ops.rst guarantee the ordering between a load before the atomic op and memory accesses after the atomic op, right? I.e. atomic_ops.rst doesn't say no for reordering like this: r1 = READ_ONCE(a); ---------+ atomic_long_inc(b); | smp_mb__after_atomic(); | WRITE_ONCE(c); | {r1 = READ_ONCE(a)} <-------+ So it's still not an acquire for READ_ONCE(a), in our case "a" is ->expedited_sequence. To me, we can either fix the atomic_ops.rst or, as I proposed, just change smp_mb__before_atomic() to smp_mb(). Thoughts? Regards, Boqun > Thanx, Paul > > > > > what's happens there and prove that it's correct, if we use > > > > store_release/load_acquire. > > > > > > Fair point, how about the following? > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > Date: Tue Mar 7 07:21:23 2017 -0800 > > > > > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > > > > > The sync_exp_work_done() function needs to fully order the counter-check > > > operation against anything happening after the corresponding grace period. > > > This is a theoretical bug, as all current architectures either provide > > > full ordering for atomic operation on the one hand or implement, > > > however, a little future-proofing is a good thing. This commit > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > > > in sync_exp_work_done(). > > > > > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > > index 027e123d93c7..652071abd9b4 100644 > > > --- a/kernel/rcu/tree_exp.h > > > +++ b/kernel/rcu/tree_exp.h > > > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > > > /* Ensure test happens before caller kfree(). */ > > > smp_mb__before_atomic(); /* ^^^ */ > > > atomic_long_inc(stat); > > > + smp_mb__after_atomic(); /* ^^^ */ > > > > If we really care about future-proofing, I think it's more safe to > > change smp_mb__before_atomic() to smp_mb() rather than adding > > __after_atomic() barrier. 
Though I think both would be unnecessary ;-) > > > > Regards, > > Boqun > > > > > return true; > > > } > > > return false; > > > > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-08 1:39 ` Boqun Feng @ 2017-03-08 2:26 ` Paul E. McKenney 2017-03-08 2:44 ` Boqun Feng 0 siblings, 1 reply; 21+ messages in thread From: Paul E. McKenney @ 2017-03-08 2:26 UTC (permalink / raw) To: Boqun Feng Cc: Dmitry Vyukov, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Wed, Mar 08, 2017 at 09:39:13AM +0800, Boqun Feng wrote: > On Tue, Mar 07, 2017 at 03:31:54PM -0800, Paul E. McKenney wrote: > > On Wed, Mar 08, 2017 at 07:05:13AM +0800, Boqun Feng wrote: > > > On Tue, Mar 07, 2017 at 07:27:15AM -0800, Paul E. McKenney wrote: > > > > On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote: > > > > > On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > > > > > > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > > > > > > [...] > > > > > >> >> > > > > > >> >> What is that mutex? And what locks/unlocks provide synchronization? I > > > > > >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > > > >> > > > > > > >> > Both of them. > > > > > >> > > > > > > >> > ->exp_mutex is acquired by the task requesting the grace period, and > > > > > >> > the counter's first increment is done by that task under that mutex. > > > > > >> > This task then schedules a workqueue, which drives forward the grace > > > > > >> > period. Upon grace-period completion, the workqueue handler does the > > > > > >> > second increment (the one that your patch addressed). The workqueue > > > > > >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > > > > > >> > ->exp_mutex (along with all other tasks waiting for this grace period), > > > > > >> > and that task releases ->exp_mutex, which allows the next grace period to > > > > > >> > start (and the first increment for that next grace period to be carried > > > > > >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > > > > > >> > after finishing its wakeups. > > > > > >> > > > > > >> Then we need the following for the case when task requesting the grace > > > > > >> period does not block, right? > > > > > > > > > > > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > > > > > > smp_mb__before_atomic() in sync_exp_work_done() already provide the > > > > > > required ordering, no? > > > > > > > > > > smp_mb() is probably fine, but smp_mb__before_atomic() is release not > > > > > acquire. If we want to play that game, then I guess we also need > > > > > > The point is that smp_mb__before_atomic() + atomic_long_inc() will > > > guarantee a smp_mb() before or right along with the atomic operation, > > > and that's enough because rcu_seq_done() followed by a smp_mb() will > > > give it a acquire-like behavior. > > > > Given current architectures, true enough, from what I can see. > > > > However, let's take a look at atomic_ops.rst: > > > > > > If a caller requires memory barrier semantics around an atomic_t > > operation which does not return a value, a set of interfaces are > > defined which accomplish this:: > > > > void smp_mb__before_atomic(void); > > void smp_mb__after_atomic(void); > > > > For example, smp_mb__before_atomic() can be used like so:: > > > > obj->dead = 1; > > smp_mb__before_atomic(); > > atomic_dec(&obj->ref_count); > > > > It makes sure that all memory operations preceding the atomic_dec() > > call are strongly ordered with respect to the atomic counter > > operation. 
In the above example, it guarantees that the assignment of > > "1" to obj->dead will be globally visible to other cpus before the > > atomic counter decrement. > > > > Without the explicit smp_mb__before_atomic() call, the > > implementation could legally allow the atomic counter update visible > > to other cpus before the "obj->dead = 1;" assignment. > > > > So the ordering is guaranteed against the atomic operation, not > > necessarily the stuff after it. But again, the implementations I know > > of do make the guarantee, hence my calling it a theoretical bug in the > > commit log. > > Fair enough ;-) It's me who misunderstood this part of document. > > However, the names of the barriers are smp_mb__{before,after}_atomic(), > so if they, semantically, only provide ordering for the corresponding > atomic ops rather than a full barrier, I would their names are > misleading ;-) Well, if you have both ordering before and after, then you have full ordering. > > > > > smp_mb__after_atomic() there. But it would be way easier to understand > > > > > > Adding smp_mb__after_atomic() would be pointless as it's the load of > > > ->expedited_sequence that we want to ensure having acquire behavior > > > rather than the atomic increment of @stat. > > > > Again, agreed given current code, but atomic_ops.rst doesn't guarantee > > ordering past the actual atomic operation itself. > > Neither does atomic_ops.rst guarantee the ordering between a load before > the atomic op and memory accesses after the atomic op, right? I.e. > atomic_ops.rst doesn't say no for reordering like this: > > r1 = READ_ONCE(a); ---------+ > atomic_long_inc(b); | > smp_mb__after_atomic(); | > WRITE_ONCE(c); | > {r1 = READ_ONCE(a)} <-------+ > > So it's still not an acquire for READ_ONCE(a), in our case "a" is > ->expedited_sequence. > > To me, we can either fix the atomic_ops.rst or, as I proposed, just > change smp_mb__before_atomic() to smp_mb(). Or have both an smp_mb__before_atomic() and an smp_mb__after_atomic(), as is the usual approach when you need full ordering. ;-) Thanx, Paul > Thoughts? > > Regards, > Boqun > > > Thanx, Paul > > > > > > > what's happens there and prove that it's correct, if we use > > > > > store_release/load_acquire. > > > > > > > > Fair point, how about the following? > > > > > > > > Thanx, Paul > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > Date: Tue Mar 7 07:21:23 2017 -0800 > > > > > > > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > > > > > > > The sync_exp_work_done() function needs to fully order the counter-check > > > > operation against anything happening after the corresponding grace period. > > > > This is a theoretical bug, as all current architectures either provide > > > > full ordering for atomic operation on the one hand or implement, > > > > however, a little future-proofing is a good thing. This commit > > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > > > > in sync_exp_work_done(). > > > > > > > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > > > > Signed-off-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> > > > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > > > index 027e123d93c7..652071abd9b4 100644 > > > > --- a/kernel/rcu/tree_exp.h > > > > +++ b/kernel/rcu/tree_exp.h > > > > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > > > > /* Ensure test happens before caller kfree(). */ > > > > smp_mb__before_atomic(); /* ^^^ */ > > > > atomic_long_inc(stat); > > > > + smp_mb__after_atomic(); /* ^^^ */ > > > > > > If we really care about future-proofing, I think it's more safe to > > > change smp_mb__before_atomic() to smp_mb() rather than adding > > > __after_atomic() barrier. Though I think both would be unnecessary ;-) > > > > > > Regards, > > > Boqun > > > > > > > return true; > > > > } > > > > return false; > > > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
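Spelling out Paul's point against Boqun's earlier diagram (same made-up variables a, b, c): with a barrier on each side of the non-value-returning atomic, the early load can no longer be reordered past the later store, because each barrier orders its own side with respect to the atomic operation and the two compose.

    r1 = READ_ONCE(a);
    smp_mb__before_atomic();        /* orders the load of a before the increment */
    atomic_long_inc(b);
    smp_mb__after_atomic();         /* orders the increment before the store to c */
    WRITE_ONCE(c, 1);
    /* net effect: the load of a is ordered before the store to c */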
* Re: rcu: WARNING in rcu_seq_end 2017-03-08 2:26 ` Paul E. McKenney @ 2017-03-08 2:44 ` Boqun Feng 2017-03-08 3:08 ` Paul E. McKenney 0 siblings, 1 reply; 21+ messages in thread From: Boqun Feng @ 2017-03-08 2:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Dmitry Vyukov, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller [-- Attachment #1: Type: text/plain, Size: 8332 bytes --] On Tue, Mar 07, 2017 at 06:26:03PM -0800, Paul E. McKenney wrote: > On Wed, Mar 08, 2017 at 09:39:13AM +0800, Boqun Feng wrote: > > On Tue, Mar 07, 2017 at 03:31:54PM -0800, Paul E. McKenney wrote: > > > On Wed, Mar 08, 2017 at 07:05:13AM +0800, Boqun Feng wrote: > > > > On Tue, Mar 07, 2017 at 07:27:15AM -0800, Paul E. McKenney wrote: > > > > > On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote: > > > > > > On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > > > > > > > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > > > > > > > [...] > > > > > > >> >> > > > > > > >> >> What is that mutex? And what locks/unlocks provide synchronization? I > > > > > > >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > > > > >> > > > > > > > >> > Both of them. > > > > > > >> > > > > > > > >> > ->exp_mutex is acquired by the task requesting the grace period, and > > > > > > >> > the counter's first increment is done by that task under that mutex. > > > > > > >> > This task then schedules a workqueue, which drives forward the grace > > > > > > >> > period. Upon grace-period completion, the workqueue handler does the > > > > > > >> > second increment (the one that your patch addressed). The workqueue > > > > > > >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > > > > > > >> > ->exp_mutex (along with all other tasks waiting for this grace period), > > > > > > >> > and that task releases ->exp_mutex, which allows the next grace period to > > > > > > >> > start (and the first increment for that next grace period to be carried > > > > > > >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > > > > > > >> > after finishing its wakeups. > > > > > > >> > > > > > > >> Then we need the following for the case when task requesting the grace > > > > > > >> period does not block, right? > > > > > > > > > > > > > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > > > > > > > smp_mb__before_atomic() in sync_exp_work_done() already provide the > > > > > > > required ordering, no? > > > > > > > > > > > > smp_mb() is probably fine, but smp_mb__before_atomic() is release not > > > > > > acquire. If we want to play that game, then I guess we also need > > > > > > > > The point is that smp_mb__before_atomic() + atomic_long_inc() will > > > > guarantee a smp_mb() before or right along with the atomic operation, > > > > and that's enough because rcu_seq_done() followed by a smp_mb() will > > > > give it a acquire-like behavior. > > > > > > Given current architectures, true enough, from what I can see. 
> > > > > > However, let's take a look at atomic_ops.rst: > > > > > > > > > If a caller requires memory barrier semantics around an atomic_t > > > operation which does not return a value, a set of interfaces are > > > defined which accomplish this:: > > > > > > void smp_mb__before_atomic(void); > > > void smp_mb__after_atomic(void); > > > > > > For example, smp_mb__before_atomic() can be used like so:: > > > > > > obj->dead = 1; > > > smp_mb__before_atomic(); > > > atomic_dec(&obj->ref_count); > > > > > > It makes sure that all memory operations preceding the atomic_dec() > > > call are strongly ordered with respect to the atomic counter > > > operation. In the above example, it guarantees that the assignment of > > > "1" to obj->dead will be globally visible to other cpus before the > > > atomic counter decrement. > > > > > > Without the explicit smp_mb__before_atomic() call, the > > > implementation could legally allow the atomic counter update visible > > > to other cpus before the "obj->dead = 1;" assignment. > > > > > > So the ordering is guaranteed against the atomic operation, not > > > necessarily the stuff after it. But again, the implementations I know > > > of do make the guarantee, hence my calling it a theoretical bug in the > > > commit log. > > > > Fair enough ;-) It's me who misunderstood this part of document. > > > > However, the names of the barriers are smp_mb__{before,after}_atomic(), > > so if they, semantically, only provide ordering for the corresponding > > atomic ops rather than a full barrier, I would their names are > > misleading ;-) > > Well, if you have both ordering before and after, then you have full > ordering. > I mean the names of the barriers are *smp_mb*__before_atomic() and *smp_mb*__after_atomic(), so it's natural to think they provide a smp_mb() in some situations ;-) > > > > > > smp_mb__after_atomic() there. But it would be way easier to understand > > > > > > > > Adding smp_mb__after_atomic() would be pointless as it's the load of > > > > ->expedited_sequence that we want to ensure having acquire behavior > > > > rather than the atomic increment of @stat. > > > > > > Again, agreed given current code, but atomic_ops.rst doesn't guarantee > > > ordering past the actual atomic operation itself. > > > > Neither does atomic_ops.rst guarantee the ordering between a load before > > the atomic op and memory accesses after the atomic op, right? I.e. > > atomic_ops.rst doesn't say no for reordering like this: > > > > r1 = READ_ONCE(a); ---------+ > > atomic_long_inc(b); | > > smp_mb__after_atomic(); | > > WRITE_ONCE(c); | > > {r1 = READ_ONCE(a)} <-------+ > > > > So it's still not an acquire for READ_ONCE(a), in our case "a" is > > ->expedited_sequence. > > > > To me, we can either fix the atomic_ops.rst or, as I proposed, just > > change smp_mb__before_atomic() to smp_mb(). > > Or have both an smp_mb__before_atomic() and an smp_mb__after_atomic(), > as is the usual approach when you need full ordering. ;-) > Yes ;-) It's just that "adding a barrier after one operation to provide acquire semantic for another operation" looks weird to me. Regards, Boqun > Thanx, Paul > > > Thoughts? > > > > Regards, > > Boqun > > > > > Thanx, Paul > > > > > > > > > what's happens there and prove that it's correct, if we use > > > > > > store_release/load_acquire. > > > > > > > > > > Fair point, how about the following? 
> > > > > > > > > > Thanx, Paul > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > > Date: Tue Mar 7 07:21:23 2017 -0800 > > > > > > > > > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > > > > > > > > > The sync_exp_work_done() function needs to fully order the counter-check > > > > > operation against anything happening after the corresponding grace period. > > > > > This is a theoretical bug, as all current architectures either provide > > > > > full ordering for atomic operation on the one hand or implement, > > > > > however, a little future-proofing is a good thing. This commit > > > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > > > > > in sync_exp_work_done(). > > > > > > > > > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > > > > index 027e123d93c7..652071abd9b4 100644 > > > > > --- a/kernel/rcu/tree_exp.h > > > > > +++ b/kernel/rcu/tree_exp.h > > > > > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > > > > > /* Ensure test happens before caller kfree(). */ > > > > > smp_mb__before_atomic(); /* ^^^ */ > > > > > atomic_long_inc(stat); > > > > > + smp_mb__after_atomic(); /* ^^^ */ > > > > > > > > If we really care about future-proofing, I think it's more safe to > > > > change smp_mb__before_atomic() to smp_mb() rather than adding > > > > __after_atomic() barrier. Though I think both would be unnecessary ;-) > > > > > > > > Regards, > > > > Boqun > > > > > > > > > return true; > > > > > } > > > > > return false; > > > > > > > > > > > > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-08 2:44 ` Boqun Feng @ 2017-03-08 3:08 ` Paul E. McKenney 0 siblings, 0 replies; 21+ messages in thread From: Paul E. McKenney @ 2017-03-08 3:08 UTC (permalink / raw) To: Boqun Feng Cc: Dmitry Vyukov, josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Wed, Mar 08, 2017 at 10:44:17AM +0800, Boqun Feng wrote: > On Tue, Mar 07, 2017 at 06:26:03PM -0800, Paul E. McKenney wrote: > > On Wed, Mar 08, 2017 at 09:39:13AM +0800, Boqun Feng wrote: > > > On Tue, Mar 07, 2017 at 03:31:54PM -0800, Paul E. McKenney wrote: > > > > On Wed, Mar 08, 2017 at 07:05:13AM +0800, Boqun Feng wrote: > > > > > On Tue, Mar 07, 2017 at 07:27:15AM -0800, Paul E. McKenney wrote: > > > > > > On Tue, Mar 07, 2017 at 03:43:42PM +0100, Dmitry Vyukov wrote: > > > > > > > On Tue, Mar 7, 2017 at 3:27 PM, Boqun Feng <boqun.feng@gmail.com> wrote: > > > > > > > > On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > > > > > > > > [...] > > > > > > > >> >> > > > > > > > >> >> What is that mutex? And what locks/unlocks provide synchronization? I > > > > > > > >> >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > > > > > >> > > > > > > > > >> > Both of them. > > > > > > > >> > > > > > > > > >> > ->exp_mutex is acquired by the task requesting the grace period, and > > > > > > > >> > the counter's first increment is done by that task under that mutex. > > > > > > > >> > This task then schedules a workqueue, which drives forward the grace > > > > > > > >> > period. Upon grace-period completion, the workqueue handler does the > > > > > > > >> > second increment (the one that your patch addressed). The workqueue > > > > > > > >> > handler then acquires ->exp_wake_mutex and wakes the task that holds > > > > > > > >> > ->exp_mutex (along with all other tasks waiting for this grace period), > > > > > > > >> > and that task releases ->exp_mutex, which allows the next grace period to > > > > > > > >> > start (and the first increment for that next grace period to be carried > > > > > > > >> > out under that lock). The workqueue handler releases ->exp_wake_mutex > > > > > > > >> > after finishing its wakeups. > > > > > > > >> > > > > > > > >> Then we need the following for the case when task requesting the grace > > > > > > > >> period does not block, right? > > > > > > > > > > > > > > > > Won't be necessary I think, as the smp_mb() in rcu_seq_end() and the > > > > > > > > smp_mb__before_atomic() in sync_exp_work_done() already provide the > > > > > > > > required ordering, no? > > > > > > > > > > > > > > smp_mb() is probably fine, but smp_mb__before_atomic() is release not > > > > > > > acquire. If we want to play that game, then I guess we also need > > > > > > > > > > The point is that smp_mb__before_atomic() + atomic_long_inc() will > > > > > guarantee a smp_mb() before or right along with the atomic operation, > > > > > and that's enough because rcu_seq_done() followed by a smp_mb() will > > > > > give it a acquire-like behavior. > > > > > > > > Given current architectures, true enough, from what I can see. 
> > > > > > > > However, let's take a look at atomic_ops.rst: > > > > > > > > > > > > If a caller requires memory barrier semantics around an atomic_t > > > > operation which does not return a value, a set of interfaces are > > > > defined which accomplish this:: > > > > > > > > void smp_mb__before_atomic(void); > > > > void smp_mb__after_atomic(void); > > > > > > > > For example, smp_mb__before_atomic() can be used like so:: > > > > > > > > obj->dead = 1; > > > > smp_mb__before_atomic(); > > > > atomic_dec(&obj->ref_count); > > > > > > > > It makes sure that all memory operations preceding the atomic_dec() > > > > call are strongly ordered with respect to the atomic counter > > > > operation. In the above example, it guarantees that the assignment of > > > > "1" to obj->dead will be globally visible to other cpus before the > > > > atomic counter decrement. > > > > > > > > Without the explicit smp_mb__before_atomic() call, the > > > > implementation could legally allow the atomic counter update visible > > > > to other cpus before the "obj->dead = 1;" assignment. > > > > > > > > So the ordering is guaranteed against the atomic operation, not > > > > necessarily the stuff after it. But again, the implementations I know > > > > of do make the guarantee, hence my calling it a theoretical bug in the > > > > commit log. > > > > > > Fair enough ;-) It's me who misunderstood this part of document. > > > > > > However, the names of the barriers are smp_mb__{before,after}_atomic(), > > > so if they, semantically, only provide ordering for the corresponding > > > atomic ops rather than a full barrier, I would their names are > > > misleading ;-) > > > > Well, if you have both ordering before and after, then you have full > > ordering. > > I mean the names of the barriers are *smp_mb*__before_atomic() and > *smp_mb*__after_atomic(), so it's natural to think they provide a > smp_mb() in some situations ;-) > > > > > > > > smp_mb__after_atomic() there. But it would be way easier to understand > > > > > > > > > > Adding smp_mb__after_atomic() would be pointless as it's the load of > > > > > ->expedited_sequence that we want to ensure having acquire behavior > > > > > rather than the atomic increment of @stat. > > > > > > > > Again, agreed given current code, but atomic_ops.rst doesn't guarantee > > > > ordering past the actual atomic operation itself. > > > > > > Neither does atomic_ops.rst guarantee the ordering between a load before > > > the atomic op and memory accesses after the atomic op, right? I.e. > > > atomic_ops.rst doesn't say no for reordering like this: > > > > > > r1 = READ_ONCE(a); ---------+ > > > atomic_long_inc(b); | > > > smp_mb__after_atomic(); | > > > WRITE_ONCE(c); | > > > {r1 = READ_ONCE(a)} <-------+ > > > > > > So it's still not an acquire for READ_ONCE(a), in our case "a" is > > > ->expedited_sequence. > > > > > > To me, we can either fix the atomic_ops.rst or, as I proposed, just > > > change smp_mb__before_atomic() to smp_mb(). > > > > Or have both an smp_mb__before_atomic() and an smp_mb__after_atomic(), > > as is the usual approach when you need full ordering. ;-) > > Yes ;-) It's just that "adding a barrier after one operation to provide > acquire semantic for another operation" looks weird to me. But they are memory barriers! They are -suppposed- to look weird! ;-) Thanx, Paul > Regards, > Boqun > > > Thanx, Paul > > > > > Thoughts? 
> > > > > > Regards, > > > Boqun > > > > > > > Thanx, Paul > > > > > > > > > > > what's happens there and prove that it's correct, if we use > > > > > > > store_release/load_acquire. > > > > > > > > > > > > Fair point, how about the following? > > > > > > > > > > > > Thanx, Paul > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > > > commit 6fd8074f1976596898e39f5b7ea1755652533906 > > > > > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > > > Date: Tue Mar 7 07:21:23 2017 -0800 > > > > > > > > > > > > rcu: Add smp_mb__after_atomic() to sync_exp_work_done() > > > > > > > > > > > > The sync_exp_work_done() function needs to fully order the counter-check > > > > > > operation against anything happening after the corresponding grace period. > > > > > > This is a theoretical bug, as all current architectures either provide > > > > > > full ordering for atomic operation on the one hand or implement, > > > > > > however, a little future-proofing is a good thing. This commit > > > > > > therefore adds smp_mb__after_atomic() after the atomic_long_inc() > > > > > > in sync_exp_work_done(). > > > > > > > > > > > > Reported-by: Dmitry Vyukov <dvyukov@google.com> > > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > > > > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > > > > > index 027e123d93c7..652071abd9b4 100644 > > > > > > --- a/kernel/rcu/tree_exp.h > > > > > > +++ b/kernel/rcu/tree_exp.h > > > > > > @@ -247,6 +247,7 @@ static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat, > > > > > > /* Ensure test happens before caller kfree(). */ > > > > > > smp_mb__before_atomic(); /* ^^^ */ > > > > > > atomic_long_inc(stat); > > > > > > + smp_mb__after_atomic(); /* ^^^ */ > > > > > > > > > > If we really care about future-proofing, I think it's more safe to > > > > > change smp_mb__before_atomic() to smp_mb() rather than adding > > > > > __after_atomic() barrier. Though I think both would be unnecessary ;-) > > > > > > > > > > Regards, > > > > > Boqun > > > > > > > > > > > return true; > > > > > > } > > > > > > return false; > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: rcu: WARNING in rcu_seq_end 2017-03-07 7:05 ` Dmitry Vyukov 2017-03-07 14:27 ` Boqun Feng @ 2017-03-07 15:16 ` Paul E. McKenney 1 sibling, 0 replies; 21+ messages in thread From: Paul E. McKenney @ 2017-03-07 15:16 UTC (permalink / raw) To: Dmitry Vyukov Cc: josh, Steven Rostedt, Mathieu Desnoyers, jiangshanlai, LKML, syzkaller On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote: > On Tue, Mar 7, 2017 at 12:08 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > >> >> >> >> Hello, > >> >> >> >> > >> >> >> >> Paul, you wanted bugs in rcu. > >> >> >> > > >> >> >> > Well, whether I want them or not, I must deal with them. ;-) > >> >> >> > > >> >> >> >> I've got this WARNING while running syzkaller fuzzer on > >> >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760: > >> >> >> >> > >> >> >> >> ------------[ cut here ]------------ > >> >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533 > >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> >> >> >> Kernel panic - not syncing: panic_on_warn set ... > >> >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276 > >> >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > >> >> >> >> Workqueue: events wait_rcu_exp_gp > >> >> >> >> Call Trace: > >> >> >> >> __dump_stack lib/dump_stack.c:15 [inline] > >> >> >> >> dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 > >> >> >> >> panic+0x1fb/0x412 kernel/panic.c:179 > >> >> >> >> __warn+0x1c4/0x1e0 kernel/panic.c:540 > >> >> >> >> warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 > >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533 > >> >> >> >> rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline] > >> >> >> >> rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517 > >> >> >> >> rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline] > >> >> >> >> wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570 > >> >> >> >> process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096 > >> >> >> >> worker_thread+0x223/0x19c0 kernel/workqueue.c:2230 > >> >> >> >> kthread+0x326/0x3f0 kernel/kthread.c:227 > >> >> >> >> ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 > >> >> >> >> Dumping ftrace buffer: > >> >> >> >> (ftrace buffer empty) > >> >> >> >> Kernel Offset: disabled > >> >> >> >> Rebooting in 86400 seconds.. > >> >> >> >> > >> >> >> >> > >> >> >> >> Not reproducible. But looking at the code, shouldn't it be: > >> >> >> >> > >> >> >> >> static void rcu_seq_end(unsigned long *sp) > >> >> >> >> { > >> >> >> >> smp_mb(); /* Ensure update-side operation before counter increment. */ > >> >> >> >> + WARN_ON_ONCE(!(*sp & 0x1)); > >> >> >> >> WRITE_ONCE(*sp, *sp + 1); > >> >> >> >> - WARN_ON_ONCE(*sp & 0x1); > >> >> >> >> } > >> >> >> >> > >> >> >> >> ? > >> >> >> >> > >> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as soon > >> >> >> >> as WRITE_ONCE(*sp, *sp + 1) finishes. As far as I understand this > >> >> >> >> consequently can allow start of next grace periods. Which in turn can > >> >> >> >> make the warning fire. Am I missing something? > >> >> >> >> > >> >> >> >> I don't see any other bad consequences of this. The rest of > >> >> >> >> rcu_exp_wait_wake can proceed when _synchronize_rcu_expedited has > >> >> >> >> returned and destroyed work on stack and next period has started and > >> >> >> >> ended, but it seems OK. > >> >> >> > > >> >> >> > I believe that this is a heygood change, but I don't see how it will > >> >> >> > help in this case. BTW, may I have your Signed-off-by? 
> >> >> >> > > >> >> >> > The reason I don't believe that it will help is that the > >> >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that > >> >> >> > is invoked holding ->exp_mutex, and this mutex is not released until > >> >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that > >> >> >> > scheduled the workqueue handler. So the ordering above should not matter > >> >> >> > (but I agree that your ordering is cleaner. > >> >> >> > > >> >> >> > That said, it looks like I am missing some memory barriers, please > >> >> >> > see the following patch. > >> >> >> > > >> >> >> > But what architecture did you see this on? > >> >> >> > >> >> >> > >> >> >> This is just x86. > >> >> >> > >> >> >> You seem to assume that wait_event() waits for the wakeup. It does not > >> >> >> work this way. It can return as soon as the condition becomes true > >> >> >> without ever waiting: > >> >> >> > >> >> >> 305 #define wait_event(wq, condition) \ > >> >> >> 306 do { \ > >> >> >> 307 might_sleep(); \ > >> >> >> 308 if (condition) \ > >> >> >> 309 break; \ > >> >> >> 310 __wait_event(wq, condition); \ > >> >> >> 311 } while (0) > >> >> > > >> >> > Agreed, hence my patch in the previous email. I guess I knew that, but > >> >> > >> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done? > >> > > >> > No, there is a mutex release and acquisition that do the synchronization, > >> > but only if the wakeup has appropriate barriers. The issue is that > >> > part of the mutex's critical section executes in a workqueue, possibly > >> > on some other CPU. > >> > >> What is that mutex? And what locks/unlocks provide synchronization? I > >> see that one uses exp_mutex and another -- exp_wake_mutex. > > > > Both of them. > > > > ->exp_mutex is acquired by the task requesting the grace period, and > > the counter's first increment is done by that task under that mutex. > > This task then schedules a workqueue, which drives forward the grace > > period. Upon grace-period completion, the workqueue handler does the > > second increment (the one that your patch addressed). The workqueue > > handler then acquires ->exp_wake_mutex and wakes the task that holds > > ->exp_mutex (along with all other tasks waiting for this grace period), > > and that task releases ->exp_mutex, which allows the next grace period to > > start (and the first increment for that next grace period to be carried > > out under that lock). The workqueue handler releases ->exp_wake_mutex > > after finishing its wakeups. First and foremost, thank you for reviewing this stuff!!! > Then we need the following for the case when task requesting the grace > period does not block, right? > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index d80c2587bed8..aa7ba83f6a56 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp) > static void rcu_seq_end(unsigned long *sp) > { > smp_mb(); /* Ensure update-side operation before counter increment. */ > - WRITE_ONCE(*sp, *sp + 1); > + smp_store_release(sp, *sp + 1); > WARN_ON_ONCE(*sp & 0x1); > } The above is not needed, as the smp_mb() covers it completely. 
> @@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp) > */ > static bool rcu_seq_done(unsigned long *sp, unsigned long s) > { > - return ULONG_CMP_GE(READ_ONCE(*sp), s); > + return ULONG_CMP_GE(smp_load_acquire(sp), s); And this one is covered by the smp_mb__before_atomic() inside the "if" statement in sync_exp_work_done(), via rcu_exp_gp_seq_done(). That is, sync_exp_work_done() invokes rcu_exp_gp_seq_done() which in turn invokes rcu_seq_done(). Why smp_mb() instead of release-acquire? Because the grace-period relationship is quite strong, it cannot be implemented with release-acquire on all platforms. Thanx, Paul ^ permalink raw reply [flat|nested] 21+ messages in thread
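Putting the pieces of the thread together, the overall flow Paul describes looks roughly like the sketch below, reusing the example_ helpers from the first sketch. Everything here (the statics, the DECLARE_WORK() plumbing, and the function names) is illustrative and heavily simplified; the real code lives in kernel/rcu/tree_exp.h.

    static unsigned long gp_seq;                    /* the sequence counter */
    static DEFINE_MUTEX(exp_mutex);                 /* held by the requester driving a GP */
    static DEFINE_MUTEX(exp_wake_mutex);            /* serializes the wakeups */
    static DECLARE_WAIT_QUEUE_HEAD(gp_wq);

    static void example_gp_handler(struct work_struct *unused)
    {
            /* ... wait for every CPU to pass through a quiescent state ... */
            example_seq_end(&gp_seq);               /* second increment: where the WARN fired */
            mutex_lock(&exp_wake_mutex);
            wake_up_all(&gp_wq);                    /* wake the requester and other waiters */
            mutex_unlock(&exp_wake_mutex);
    }

    static DECLARE_WORK(gp_work, example_gp_handler);

    static void example_synchronize_expedited(void)
    {
            unsigned long s = example_seq_snap(&gp_seq);

            mutex_lock(&exp_mutex);
            example_seq_start(&gp_seq);             /* first increment, under ->exp_mutex */
            schedule_work(&gp_work);                /* handler above drives the grace period */
            wait_event(gp_wq, example_seq_done(&gp_seq, s)); /* may return without any wakeup */
            mutex_unlock(&exp_mutex);               /* allows the next grace period to start */
    }

The subtlety discussed in this thread is that wait_event() can observe the second increment and return without any wakeup at all, so the guarantees have to come from the barriers around the counter operations and from ->exp_mutex/->exp_wake_mutex, not from the wakeup path.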
Thread overview: 21+ messages (end of thread; newest: 2017-03-08 3:14 UTC)

2017-03-04 16:01 rcu: WARNING in rcu_seq_end Dmitry Vyukov
2017-03-04 20:40 ` Paul E. McKenney
2017-03-05 10:50 ` Dmitry Vyukov
2017-03-05 18:47 ` Paul E. McKenney
2017-03-06 9:24 ` Dmitry Vyukov
2017-03-06 10:07 ` Paul E. McKenney
2017-03-06 10:11 ` Dmitry Vyukov
2017-03-06 23:08 ` Paul E. McKenney
2017-03-07 7:05 ` Dmitry Vyukov
2017-03-07 14:27 ` Boqun Feng
2017-03-07 14:43 ` Dmitry Vyukov
2017-03-07 15:27 ` Paul E. McKenney
2017-03-07 18:37 ` Dmitry Vyukov
2017-03-07 19:09 ` Paul E. McKenney
2017-03-07 23:05 ` Boqun Feng
2017-03-07 23:31 ` Paul E. McKenney
2017-03-08 1:39 ` Boqun Feng
2017-03-08 2:26 ` Paul E. McKenney
2017-03-08 2:44 ` Boqun Feng
2017-03-08 3:08 ` Paul E. McKenney
2017-03-07 15:16 ` Paul E. McKenney