Re: [PATCH] rcu: Narrow early boot window of illegal synchronous grace periods

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-kernel@vger.kernel.org, lv.zheng@intel.com,
	stan.kain@gmail.com, waffolz@hotmail.com, josh@joshtriplett.org,
	rostedt@goodmis.org, mathieu.desnoyers@efficios.com,
	jiangshanlai@gmail.com, mingo@kernel.org,
	torvalds@linux-foundation.org, rafael@kernel.org
Subject: Re: [PATCH] rcu: Narrow early boot window of illegal synchronous grace periods
Date: Sat, 14 Jan 2017 00:00:22 -0800	[thread overview]
Message-ID: <20170114080022.GS5238@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170113112519.q6tmariemhhk5e56@pd.tnic>

On Fri, Jan 13, 2017 at 12:25:19PM +0100, Borislav Petkov wrote:
> On Thu, Jan 12, 2017 at 06:38:07PM -0800, Paul E. McKenney wrote:
> > The current preemptible RCU implementation goes through three phases
> > during bootup.  In the first phase, there is only one CPU that is running
> > with preemption disabled, so that a no-op is a synchronous grace period.
> > In the second mid-boot phase, the scheduler is running, but RCU has
> > not yet gotten its kthreads spawned (and, for expedited grace periods,
> > workqueues are not yet running.  During this time, any attempt to do
> > a synchronous grace period will hang the system (or complain bitterly,
> > depending).  In the third and final phase, RCU is fully operational and
> > everything works normally.
> > 
> > This has been OK for some time, but there has recently been some
> > synchronous grace periods showing up during the second mid-boot phase.
> 
> You probably should add the callchain from the thread as an example and
> for future reference:
> 
> early_amd_iommu_init()
> |-> acpi_put_table(ivrs_base);
> |-> acpi_tb_put_table(table_desc);
> |-> acpi_tb_invalidate_table(table_desc);
> |-> acpi_tb_release_table(...)
> |-> acpi_os_unmap_memory
> |-> acpi_os_unmap_iomem
> |-> acpi_os_map_cleanup
> |-> synchronize_rcu_expedited   <-- the kernel/rcu/tree_exp.h version with CONFIG_PREEMPT_RCU=y
> 
> > This commit therefore reworks RCU to permit synchronous grace periods

It now looks like this:

------------------------------------------------------------------------

Note that the code was buggy even before this commit, as it was subject
to failure on real-time systems that forced all expedited grace periods
to run as normal grace periods (for example, using the rcu_normal ksysfs
parameter).  The callchain from the failure case is as follows:

early_amd_iommu_init()
|-> acpi_put_table(ivrs_base);
|-> acpi_tb_put_table(table_desc);
|-> acpi_tb_invalidate_table(table_desc);
|-> acpi_tb_release_table(...)
|-> acpi_os_unmap_memory
|-> acpi_os_unmap_iomem
|-> acpi_os_map_cleanup
|-> synchronize_rcu_expedited

The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
which caused the code to try using workqueues before they were
initialized, which did not go well.

------------------------------------------------------------------------

Does that work?

> Just say "Rework RCU to ..."
> 
> "This commit" and "This patch" and the such are not needed in commit
> messages.

Fair point, but this wording appears in almost all of my patches.  ;-)

My rationale is that it provides a clear transition from describing the
problem to introducing the solution.

> > to proceed during this mid-boot phase.
> > 
> > This commit accomplishes this by setting a flag from the existing
> > rcu_scheduler_starting() function which causes all synchronous grace
> > periods to take the expedited path.  The expedited path now checks this
> > flag, using the requesting task to drive the expedited grace period
> > forward during the mid-boot phase.  Finally, this flag is updated by a
> > core_initcall() function named rcu_exp_runtime_mode(), which causes the
> > runtime codepaths to be used.
> > 
> > Note that this arrangement assumes that tasks are not sent POSIX signals
> > (or anything similar) from the time that the first task is spawned
> > through core_initcall() time.
> > 
> > Reported-by: "Zheng, Lv" <lv.zheng@intel.com>
> > Reported-by: Borislav Petkov <bp@alien8.de>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 321f9ed552a9..01f71e1d2e94 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -444,6 +444,10 @@ bool __rcu_is_watching(void);
> >  #error "Unknown RCU implementation specified to kernel configuration"
> >  #endif
> >  
> > +#define RCU_SCHEDULER_INACTIVE	0
> > +#define RCU_SCHEDULER_INIT	1
> > +#define RCU_SCHEDULER_RUNNING	2
> > +
> >  /*
> >   * init_rcu_head_on_stack()/destroy_rcu_head_on_stack() are needed for dynamic
> >   * initialization and destruction of rcu_head on the stack. rcu_head structures
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 80adef7d4c3d..0d6ff3e471be 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -136,6 +136,7 @@ int rcu_jiffies_till_stall_check(void);
> >  #define TPS(x)  tracepoint_string(x)
> >  
> >  void rcu_early_boot_tests(void);
> > +void rcu_test_sync_prims(void);
> >  
> >  /*
> >   * This function really isn't for public consumption, but RCU is special in
> > diff --git a/kernel/rcu/tiny_plugin.h b/kernel/rcu/tiny_plugin.h
> > index 196f0302e2f4..e3953bdee383 100644
> > --- a/kernel/rcu/tiny_plugin.h
> > +++ b/kernel/rcu/tiny_plugin.h
> > @@ -65,7 +65,7 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
> >  void __init rcu_scheduler_starting(void)
> >  {
> >  	WARN_ON(nr_context_switches() > 0);
> > -	rcu_scheduler_active = 1;
> > +	rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
> 
> This tiny RCU version is setting _RUNNING while the
> kernel/rcu/tree.c-one is setting it to _INIT. The tiny bypasses the
> _INIT step now?
> 
> I'm guessing because you've added a third state - the _RUNNING and
> tiny doesn't need the intermediary _INIT, it is being set straight to
> _RUNNING...

Exactly, but yes, worth a comment.

The header comment for rcu_scheduler_starting() is now as follows:

/*
 * During boot, we forgive RCU lockdep issues.  After this function is
 * invoked, we start taking RCU lockdep issues seriously.  Note that unlike
 * Tree RCU, Tiny RCU transitions directly from RCU_SCHEDULER_INACTIVE
 * to RCU_SCHEDULER_RUNNING, skipping the RCU_SCHEDULER_INIT stage.
 * The reason for this is that Tiny RCU does not need kthreads, so does
 * not have to care about the fact that the scheduler is half-initialized
 * at a certain phase of the boot process.
 */

> >  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 96c52e43f7ca..7bcce4607863 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -127,13 +127,16 @@ int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> >  int sysctl_panic_on_rcu_stall __read_mostly;
> >  
> >  /*
> > - * The rcu_scheduler_active variable transitions from zero to one just
> > - * before the first task is spawned.  So when this variable is zero, RCU
> > + * The rcu_scheduler_active variable transitions from
> > + * RCU_SCHEDULER_INACTIVE to RCU_SCHEDULER_INIT just before the first
> > + * task is spawned.  So when this variable is RCU_SCHEDULER_INACTIVE, RCU
> >   * can assume that there is but one task, allowing RCU to (for example)
> >   * optimize synchronize_rcu() to a simple barrier().  When this variable
> >   * is one, RCU must actually do all the hard work required to detect real
> 
> ... is RCU_SCHEDULER_INIT, RCU must ...
> 
> >   * grace periods.  This variable is also used to suppress boot-time false
> > - * positives from lockdep-RCU error checking.
> > + * positives from lockdep-RCU error checking.  Finally, this variable
> 
> 					       .  Finally, it...
> 
> By now we know it is this variable :-)
> 
> > + * transitions from RCU_SCHEDULER_INIT to RCU_SCHEDULER_RUNNING after RCU
> > + * is fully initialized, including all of its tasks having been spawned.
> 
> s/tasks/kthreads/ ?
> 
> Should make it clearer...

Good catches, fixed!

> ...
> 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index d3053e99fdb6..e59e1849b89a 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -532,18 +532,28 @@ struct rcu_exp_work {
> >  };
> >  
> >  /*
> > + * Common code to drive an expedited grace period forward, used by
> > + * workqueues and mid-boot-time tasks.
> > + */
> > +static void rcu_exp_sel_wait_wake(struct rcu_state *rsp,
> > +				  smp_call_func_t func, unsigned long s)
> > +{
> > +	/* Initialize the rcu_node tree in preparation for the wait. */
> > +	sync_rcu_exp_select_cpus(rsp, func);
> > +
> > +	/* Wait and clean up, including waking everyone. */
> > +	rcu_exp_wait_wake(rsp, s);
> > +}
> > +
> > +/*
> >   * Work-queue handler to drive an expedited grace period forward.
> >   */
> >  static void wait_rcu_exp_gp(struct work_struct *wp)
> >  {
> >  	struct rcu_exp_work *rewp;
> >  
> > -	/* Initialize the rcu_node tree in preparation for the wait. */
> >  	rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > -	sync_rcu_exp_select_cpus(rewp->rew_rsp, rewp->rew_func);
> > -
> > -	/* Wait and clean up, including waking everyone. */
> > -	rcu_exp_wait_wake(rewp->rew_rsp, rewp->rew_s);
> > +	rcu_exp_sel_wait_wake(rewp->rew_rsp, rewp->rew_func, rewp->rew_s);
> >  }
> >  
> >  /*
> > @@ -569,12 +579,18 @@ static void _synchronize_rcu_expedited(struct rcu_state *rsp,
> >  	if (exp_funnel_lock(rsp, s))
> >  		return;  /* Someone else did our work for us. */
> >  
> > -	/* Marshall arguments and schedule the expedited grace period. */
> > -	rew.rew_func = func;
> > -	rew.rew_rsp = rsp;
> > -	rew.rew_s = s;
> > -	INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > -	schedule_work(&rew.rew_work);
> > +	/* Ensure that load happens before action based on it. */
> > +	if (unlikely(rcu_scheduler_active == RCU_SCHEDULER_INIT)) {
> 
> Ok, so this variable is, AFAICT, on some hot paths. And we will query
> it each time we synchronize_sched() when we decide to do the expedited
> grace periods. There's that rcu_gp_is_expedited() which decides but I
> don't have an idea on which paths that happens...

It is marked __read_mostly for this reason, but I have always assumed that
the synchronous grace-period primitives were off the hotpath.

> In any case, should we make this var a jump label or so which gets
> patched properly or are the expedited paths comparatively seldom?

I believe that this would not buy very much, but if this variable starts
showing up on profiles, then perhaps a jump label would be appropriate.
As a separate patch, though!

> > +		/* Direct call during scheduler init and early_initcalls(). */
> > +		rcu_exp_sel_wait_wake(rsp, func, s);
> > +	} else {
> > +		/* Marshall arguments & schedule the expedited grace period. */
> > +		rew.rew_func = func;
> > +		rew.rew_rsp = rsp;
> > +		rew.rew_s = s;
> > +		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > +		schedule_work(&rew.rew_work);
> > +	}
> >  
> >  	/* Wait for expedited grace period to complete. */
> >  	rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
> 
> Rest looks ok to me but WTH do I know about RCU internals...

Thank you for your review and comments!

							Thanx, Paul

next prev parent reply	other threads:[~2017-01-14  8:09 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-13  2:38 [PATCH] rcu: Narrow early boot window of illegal synchronous grace periods Paul E. McKenney
2017-01-13 11:25 ` Borislav Petkov
2017-01-14  8:00   ` Paul E. McKenney [this message]
2017-01-14 10:35     ` Borislav Petkov
2017-01-14 12:27       ` Rafael J. Wysocki
2017-01-14 12:50         ` Borislav Petkov
2017-01-16  1:57           ` Zheng, Lv
2017-01-16  4:56             ` Paul E. McKenney
2017-01-15  5:24       ` Paul E. McKenney
2017-01-15 10:09         ` Borislav Petkov
2017-01-15 20:33           ` Paul E. McKenney
2017-01-13 17:02 ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170114080022.GS5238@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=bp@alien8.de \
    --cc=jiangshanlai@gmail.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lv.zheng@intel.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=rafael@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=stan.kain@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=waffolz@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.