Re: [PATCH] rcu: exp: Fix "must hold exp_mutex" comments for QS reporting functions

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Boqun Feng <boqun.feng@gmail.com>
Cc: linux-kernel@vger.kernel.org,
	Josh Triplett <josh@joshtriplett.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>
Subject: Re: [PATCH] rcu: exp: Fix "must hold exp_mutex" comments for QS reporting functions
Date: Wed, 7 Mar 2018 20:30:17 -0800	[thread overview]
Message-ID: <20180308043017.GL3918@linux.vnet.ibm.com> (raw)
In-Reply-To: <20180308034627.n2xw32zzujnmy2gb@tardis>

On Thu, Mar 08, 2018 at 11:46:27AM +0800, Boqun Feng wrote:
> On Wed, Mar 07, 2018 at 07:48:29AM -0800, Paul E. McKenney wrote:
> > On Wed, Mar 07, 2018 at 04:49:39PM +0800, Boqun Feng wrote:
> > > Since commit d9a3da0699b2 ("rcu: Add expedited grace-period support for
> > > preemptible RCU"), there are comments for some funtions in
> > > rcu_report_exp_rnp()'s call-chain saying that exp_mutex or its
> > > predecessors needs to be held.
> > > 
> > > However, exp_mutex and its predecessors are merely used for synchronize
> > > between GPs, and it's clearly that all variables visited by those
> > > functions are under the protection of rcu_node's ->lock. Moreover, those
> > > functions are currently called without held exp_mutex, and seems that
> > > doesn't introduce any trouble.
> > > 
> > > So this patch fix this problem by correcting the comments
> > > 
> > > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > > Fixes: d9a3da0699b2 ("rcu: Add expedited grace-period support for preemptible RCU")
> > 
> > Good catch, applied!
> > 
> > These have been around for almost a decade!  ;-)  They happened because
> > I ended up rewriting expedited support several times before it was
> > accepted, and apparently failed to update the comments towards the end.
> > 
> > So thank you for catching this one!
> > 
> > But wouldn't it be better to use lockdep instead of comments in this case?
> > 
> 
> Agreed. That's a good idea. And I might find something about this, so I
> add this for sync_rcu_preempt_exp_done(), 
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 2fd882b08b7c..1fa394fe2f8a 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -20,6 +20,8 @@
>   * Authors: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>   */
>  
> +#include <linux/lockdep.h>
> +
>  /*
>   * Record the start of an expedited grace period.
>   */
> @@ -158,6 +160,8 @@ static void __maybe_unused sync_exp_reset_tree(struct rcu_state *rsp)
>   */
>  static bool sync_rcu_preempt_exp_done(struct rcu_node *rnp)
>  {
> +	BUG_ON(!lockdep_is_held(&rnp->lock));
> +
>  	return rnp->exp_tasks == NULL &&
>  	       READ_ONCE(rnp->expmask) == 0;
>  }
> 
> And I could got this with rcutorture (--configs TREE03 --bootargs
> "rcutorture.gp_exp=1" --duration 10 --kconfig "CONFIG_PROVE_LOCKING=y"):

Certainly stronger medicine than the comment!  ;-)

> [    0.013629] ------------[ cut here ]------------
> [    0.014000] kernel BUG at /home/fixme/linux-rcu/kernel/rcu/tree_exp.h:163!
> [    0.014009] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [    0.014697] Modules linked in:
> [    0.014992] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-rc1+ #1
> [    0.015000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
> [    0.015000] RIP: 0010:sync_rcu_preempt_exp_done+0x30/0x40
> [    0.015000] RSP: 0000:ffffffffb7003d00 EFLAGS: 00010246
> [    0.015000] RAX: 0000000000000000 RBX: ffffffffb7064d00 RCX: 0000000000000003
> [    0.015000] RDX: 0000000080000003 RSI: ffffffffb7064d18 RDI: ffffffffb7024da8
> [    0.015000] RBP: ffffffffb7064d00 R08: 0000000000000000 R09: 0000000000000000
> [    0.015000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffb70686d0
> [    0.015000] R13: ffffffffb765e2e0 R14: ffffffffb7064d00 R15: 0000000000000004
> [    0.015000] FS:  0000000000000000(0000) GS:ffff94fb9fc00000(0000) knlGS:0000000000000000
> [    0.015000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.015000] CR2: ffff94fb8e57f000 CR3: 000000000d01e000 CR4: 00000000000006f0
> [    0.015000] Call Trace:
> [    0.015000]  rcu_exp_wait_wake+0x6d/0x9e0
> [    0.015000]  ? rcu_cleanup_dead_rnp+0x90/0x90
> [    0.015000]  ? rcu_report_exp_cpu_mult+0x60/0x60
> [    0.015000]  _synchronize_rcu_expedited+0x680/0x6f0
> [    0.015000]  ? rcu_cleanup_dead_rnp+0x90/0x90
> [    0.015000]  ? acpi_hw_read_port+0x4c/0xc2
> [    0.015000]  ? acpi_hw_read+0xf4/0x153
> [    0.015000]  synchronize_rcu.part.79+0x53/0x60
> [    0.015000]  ? acpi_read_bit_register+0x38/0x69
> [    0.015000]  rcu_test_sync_prims+0x5/0x20
> [    0.015000]  rest_init+0x6/0xc0
> [    0.015000]  start_kernel+0x447/0x467
> [    0.015000]  secondary_startup_64+0xa5/0xb0
> [    0.015000] Code: ff 48 89 fb 48 83 c7 18 e8 9e 52 fd ff 85 c0 74 1a 31 c0 48 83 bb c0 00 00 00 00 74 02 5b c3 48 8b 43 68 5b 48 85 c0 0f 94 c0 c3 <0f> 0b 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 41 56 49 89 fe 
> [    0.015000] RIP: sync_rcu_preempt_exp_done+0x30/0x40 RSP: ffffffffb7003d00
> [    0.015007] ---[ end trace 09539f6b90637d6c ]---
> 
> 
> And I found in rcu_exp_wait_wake(), some of sync_rcu_preempt_exp_done()
> are not protected by rcu_node's lock. I have a straight-forward fix
> (along with the debug changes I mention above).
> 
> With this fix, rcutorture doesn't complain about the lock missing
> anymore. I will continue to investigate whether missing lock is a
> problem, but in the meanwhile, looking forward to your insight ;-)
> 
> Regards,
> Boqun
> ---------------------------------->8
> Subject: [PATCH] rcu: exp: Protect sync_rcu_preempt_exp_done() with rcu_node
>  lock
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  kernel/rcu/tree_exp.h | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 2fd882b08b7c..0001ac0ce6a7 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -20,6 +20,8 @@
>   * Authors: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>   */
>  
> +#include <linux/lockdep.h>
> +
>  /*
>   * Record the start of an expedited grace period.
>   */
> @@ -158,10 +160,30 @@ static void __maybe_unused sync_exp_reset_tree(struct rcu_state *rsp)
>   */
>  static bool sync_rcu_preempt_exp_done(struct rcu_node *rnp)
>  {
> +	BUG_ON(!lockdep_is_held(&rnp->lock));
> +
>  	return rnp->exp_tasks == NULL &&
>  	       READ_ONCE(rnp->expmask) == 0;
>  }
>  
> +/*
> + * Like sync_rcu_preempt_exp_done(), but this function assumes the caller
> + * doesn't hold the rcu_node's ->lock, and will acquire and release the lock
> + * itself
> + */
> +static bool sync_rcu_preempt_exp_done_unlocked(struct rcu_node *rnp)
> +{
> +	unsigned long flags;
> +	bool ret;
> +
> +	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	ret = sync_rcu_preempt_exp_done(rnp);

Let's see...  The sync_rcu_preempt_exp_done() function checks the
->exp_tasks pointer and the ->expmask bitmask.  The number of bits in the
mask can only decrease, and the ->exp_tasks pointer can only transition
from NULL to non-NULL when there is at least one bit set.  However,
there is no ordering in sync_rcu_preempt_exp_done(), so it is possible
that it could be fooled without the lock:

o	CPU 0 in sync_rcu_preempt_exp_done() reads ->exp_tasks and
	sees that it is NULL.

o	CPU 1 blocks within an RCU read-side critical section, so
	it enqueues the task and points ->exp_tasks at it and
	clears CPU 1's bit in ->expmask.

o	All other CPUs clear their bits in ->expmask.

o	CPU 0 reads ->expmask, sees that it is zero, so incorrectly
	concludes that all quiescent states have completed, despite
	the fact that ->exp_tasks is non-NULL.

So it seems to me that the lock is needed.  Good catch!!!  The problem
would occur only if the task running on CPU 0 received a spurious
wakeup, but that could potentially happen.

If lock contention becomes a problem, memory-ordering tricks could be
applied, but the lock is of course simpler.

I am guessing that this is a prototype patch, and that you are planning
to add lockdep annotations in more places, but either way please let
me know.

						Thanx, Paul

> +	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> +
> +	return ret;
> +}
> +
> +
>  /*
>   * Report the exit from RCU read-side critical section for the last task
>   * that queued itself during or before the current expedited preemptible-RCU
> @@ -490,6 +512,7 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
>  	struct rcu_node *rnp;
>  	struct rcu_node *rnp_root = rcu_get_root(rsp);
>  	int ret;
> +	unsigned long flags;
>  
>  	trace_rcu_exp_grace_period(rsp->name, rcu_exp_gp_seq_endval(rsp), TPS("startwait"));
>  	jiffies_stall = rcu_jiffies_till_stall_check();
> @@ -498,9 +521,9 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
>  	for (;;) {
>  		ret = swait_event_timeout(
>  				rsp->expedited_wq,
> -				sync_rcu_preempt_exp_done(rnp_root),
> +				sync_rcu_preempt_exp_done_unlocked(rnp_root),
>  				jiffies_stall);
> -		if (ret > 0 || sync_rcu_preempt_exp_done(rnp_root))
> +		if (ret > 0 || sync_rcu_preempt_exp_done_unlocked(rnp_root))
>  			return;
>  		WARN_ON(ret < 0);  /* workqueues should not be signaled. */
>  		if (rcu_cpu_stall_suppress)
> @@ -533,8 +556,14 @@ static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
>  			rcu_for_each_node_breadth_first(rsp, rnp) {
>  				if (rnp == rnp_root)
>  					continue; /* printed unconditionally */
> -				if (sync_rcu_preempt_exp_done(rnp))
> +
> +				raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +				if (sync_rcu_preempt_exp_done(rnp)) {
> +					raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  					continue;
> +				}
> +				raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> +
>  				pr_cont(" l=%u:%d-%d:%#lx/%c",
>  					rnp->level, rnp->grplo, rnp->grphi,
>  					rnp->expmask,
> -- 
> 2.16.2
>

next prev parent reply	other threads:[~2018-03-08  4:29 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-07  8:49 [PATCH] rcu: exp: Fix "must hold exp_mutex" comments for QS reporting functions Boqun Feng
2018-03-07 15:48 ` Paul E. McKenney
2018-03-08  3:46   ` Boqun Feng
2018-03-08  4:30     ` Paul E. McKenney [this message]
2018-03-08  4:54       ` Boqun Feng
2018-03-08  8:30         ` Boqun Feng
2018-03-08 15:42           ` Paul E. McKenney
2018-03-09  6:57             ` Boqun Feng
2018-03-09 20:17               ` Paul E. McKenney
2018-03-12  5:28                 ` Boqun Feng
2018-03-13  0:15                   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180308043017.GL3918@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=boqun.feng@gmail.com \
    --cc=jiangshanlai@gmail.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.