From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Josh Triplett <josh@joshtriplett.org>
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@polymtl.ca,
	niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, Valdis.Kletnieks@vt.edu,
	dhowells@redhat.com, edumazet@google.com, darren@dvhart.com,
	fweisbec@gmail.com, sbw@mit.edu
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ
Date: Sat, 13 Apr 2013 15:09:43 -0700
Message-ID: <20130413220943.GB29861@linux.vnet.ibm.com>
In-Reply-To: <20130413195336.GA14799@leaf>

On Sat, Apr 13, 2013 at 12:53:36PM -0700, Josh Triplett wrote:
> On Sat, Apr 13, 2013 at 12:34:25PM -0700, Paul E. McKenney wrote:
> > On Sat, Apr 13, 2013 at 11:18:00AM -0700, Josh Triplett wrote:
> > > On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > > > 
> > > > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > > > three-jiffy delays between quiescent-state forcing attempts.  This
> > > > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > > > on the value of HZ.  However, this would break very large systems that
> > > > > > require more time between quiescent-state forcing attempts.  This
> > > > > > commit therefore also ups the default delay by one jiffy for each
> > > > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > > > runtime, -not- NR_CPUS at build time).
> > > > > > 
> > > > > > Reported-by: Paul Mackerras <paulus@au1.ibm.com>
> > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > 
> > > > > Something seems very wrong if RCU regularly hits the fqs code during
> > > > > boot; feels like there's some more straightforward solution we're
> > > > > missing.  What causes these CPUs to fall under RCU's scrutiny during
> > > > > boot yet not actually hit the RCU codepaths naturally?
> > > > 
> > > > The problem is that they are running HZ=100, so that RCU will often
> > > > take 30-60 milliseconds per grace period.  At that point, you only
> > > > need 16-30 grace periods to chew up a full second, so it is not all
> > > > that hard to eat up the additional 8-12 seconds of boot time that
> > > > they were seeing.  IIRC, UP boot was costing them 4 seconds.
> > > > 
> > > > For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> > > > near as annoying.
> > > 
> > > That raises two questions, though.  First, who calls synchronize_rcu()
> > > repeatedly during boot, and could they call call_rcu() instead to avoid
> > > blocking for an RCU grace period?  Second, why does RCU need 3-6 jiffies
> > > to resolve a grace period during boot?  That suggests that RCU doesn't
> > > actually resolve a grace period until the force-quiescent-state
> > > machinery kicks in, meaning that the normal quiescent-state mechanism
> > > didn't work.
> > 
> > Indeed, converting synchronize_rcu() to call_rcu() might also be
> > helpful.  The reason that RCU often does not resolve grace periods until
> > force_quiescent_state() is that it is often the case during boot that
> > all but one CPU is idle.  RCU tries hard to avoid waking up idle CPUs,
> > so it must scan them.  Scanning is relatively expensive, so there is
> > reason to wait.
> 
> How are those CPUs going idle without first telling RCU that they're
> quiesced?  Seems like, during boot at least, you want RCU to use its
> idle==quiesced logic to proactively note continuously-quiescent states.
> Ideally, you should not hit the FQS code at all during boot.

FQS is RCU's idle==quiesced logic.  ;-)

In theory, RCU could add logic at idle entry to report a quiescent state;
in fact, CONFIG_RCU_FAST_NO_HZ used to do exactly that.  In practice,
this is not good for energy efficiency at runtime for a goodly number
of workloads, which is why CONFIG_RCU_FAST_NO_HZ now relies on callback
numbering and FQS.
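
To make the "FQS is the idle==quiesced logic" point concrete, here is a
rough user-space sketch of such a scan.  It assumes each CPU keeps a
counter that is even while the CPU is idle and odd otherwise, and that a
snapshot of the counter is taken when the grace period starts.  The
struct, field, and function names are hypothetical and exist only for
this sketch; it is not the kernel's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state, for illustration only. */
struct cpu_state {
	unsigned int dynticks;	/* even: idle, odd: non-idle (assumed convention) */
	unsigned int snap;	/* counter value saved at grace-period start */
	bool qs_reported;	/* quiescent state already noted for this CPU? */
};

/* Snapshot every CPU's counter at the start of a grace period. */
static void gp_start(struct cpu_state *cpus, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		cpus[i].snap = cpus[i].dynticks;
		cpus[i].qs_reported = false;
	}
}

/*
 * One forcing pass.  Without waking any CPU, note a quiescent state for
 * each CPU that was idle at the snapshot, is idle now, or has passed
 * through idle since the snapshot (counter advanced by two or more).
 * Returns the number of CPUs still holding up the grace period.
 */
static size_t fqs_scan(struct cpu_state *cpus, size_t n)
{
	size_t holdouts = 0;

	for (size_t i = 0; i < n; i++) {
		struct cpu_state *c = &cpus[i];

		if (c->qs_reported)
			continue;
		if ((c->snap & 1) == 0 || (c->dynticks & 1) == 0 ||
		    c->dynticks - c->snap >= 2)
			c->qs_reported = true;
		else
			holdouts++;
	}
	return holdouts;
}
```

The scan only reads per-CPU state, which is why idle CPUs can stay
asleep, but someone still has to decide how long to wait before running it.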

I understand that at boot time, energy efficiency is best served by
making boot go faster, but that means that something has to tell RCU
when boot is complete.
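
The boot-time figures quoted earlier in this exchange follow from simple
arithmetic on the tick rate.  A small sketch, with hypothetical helper
names, and with the count of 200 grace periods in the usage note chosen
purely for illustration (nobody in this thread reported a count):

```c
/* Milliseconds spent waiting 'ticks' jiffies at a tick rate of 'hz'. */
static unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
	return ticks * 1000UL / hz;
}

/* Total wait in ms for 'ngps' back-to-back grace periods. */
static unsigned long boot_wait_ms(unsigned long ngps,
				  unsigned long ticks_per_gp,
				  unsigned long hz)
{
	return ngps * ticks_to_ms(ticks_per_gp, hz);
}
```

For example, ticks_to_ms(3, 100) is 30 and ticks_to_ms(6, 100) is 60,
matching the 30-60 milliseconds per grace period at HZ=100.  With the
illustrative 200 grace periods at 6 jiffies each, boot_wait_ms(200, 6, 100)
is 12000 (12 seconds), while boot_wait_ms(200, 6, 1000) is 1200 (1.2 seconds).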

> > One thing that could be done would be to scan immediately during boot,
> > and then back off once boot has completed.  Of course, RCU has no idea
> > when boot has completed, but one way to get this effect is to boot
> > with rcutree.jiffies_till_first_fqs=0, and then use sysfs to set it
> > to 3 once boot has completed.
> 
> What do you mean by "boot has completed" here?  The kernel's early
> initialization, the kernel's initialization up to running /sbin/init, or
> userspace initialization up through supporting user login?

That is exactly the question.  After all, if RCU is going to do something
special during boot, it needs to know when boot ends.  People normally
count boot as up to user login, but RCU currently has no way to know
when this is, at least as far as I know.  Which is why I suggested that
something tell RCU via sysfs.

Regardless, for the usual definition of "boot is complete", user space has
to decide when boot is complete.  The kernel is out of the loop early on.
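
As a sketch of what "tell RCU via sysfs" could look like from user space,
here is a minimal helper that writes a value to a parameter file.  The
helper name is hypothetical, and the path shown in the usage note is an
assumption about where the parameter is exposed; check your own kernel
before relying on it.

```c
#include <stdio.h>

/*
 * Write an unsigned value to a parameter file.
 * Returns 0 on success, -1 on failure (for example, if the file does
 * not exist or the caller lacks the privileges to write it).
 */
static int write_ulong_param(const char *path, unsigned long val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", val) > 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

A late-boot user-space hook might then call, assuming that path:

	write_ulong_param("/sys/module/rcutree/parameters/jiffies_till_first_fqs", 3);

Which hook counts as "boot is complete" is left to each userspace.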

> In any case, I don't think it makes sense to do this with FQS.

OK, let's go through the possibilities I can imagine at the moment:

1.	Force the scheduling-clock interrupt to remain on during
	boot.  This way, each CPU could tell RCU of its idle/non-idle
	state.  Of course, something then needs to tell the kernel
	when boot is over so that it can go back to energy-efficient
	mode.

2.	Set rcutree.jiffies_till_first_fqs=0 at boot time, then when
	boot is complete, set it to 3 via sysfs, or to some magic number
	telling RCU to recompute the default.  This has the virtue of
	allowing different userspaces to handle this differently.

3.	Take a half-step by having RCU register a callback during the
	latest phase of kernel-visible boot.  I am under the impression
	that this is a relatively small fraction of boot, so it would
	be sub-optimal.

4.	Make CPUs announce quiescence on each entry to idle.  This
	covers the transition to idle, but when a given CPU stays idle
	for more than one grace period, RCU has to do something to verify
	that the CPU remains idle.  Right now, that is FQS's job --
	it cycles through the dyntick-idle structures of all CPUs that
	have not already announced quiescence.

5.	Make CPUs IPI RCU's grace-period kthread on each transition
	to and from idle.  I might be missing something, but given the
	cost and disruptiveness of IPIs, this does not seem to me to be
	a strategy to win.

6.	IPI the CPUs to see if they are still idle.  This would defeat
	energy efficiency.  Of course, RCU could take this approach
	only during boot, but it is cheaper and faster to just check
	each CPU's rcu_dynticks structure -- which is what FQS does.

7.	Treat all normal grace periods as expedited grace periods, but
	only during boot.  It is fairly easy for RCU to do this, but
	again, something has to tell RCU when boot is complete.

8.	Your idea here.  Plus more of mine as I remember them.  ;-)
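
For reference, the default that option 2 would have RCU recompute is,
per the patch description quoted at the top, a base delay driven by HZ
plus one jiffy for each 256 possible CPUs.  A sketch of that computation
follows.  The HZ thresholds are assumptions picked for illustration (the
description does not state them), and the function names are hypothetical.

```c
/* Base delay in jiffies.  The HZ thresholds are illustrative assumptions. */
static unsigned long base_fqs_jiffies(unsigned long hz)
{
	return 1 + (hz > 250) + (hz > 500);
}

/*
 * Default delay: the HZ-driven base plus one jiffy per 256 CPUs that
 * might be present, counted at runtime rather than at build time.
 */
static unsigned long default_fqs_jiffies(unsigned long hz,
					 unsigned long nr_cpu_ids)
{
	return base_fqs_jiffies(hz) + nr_cpu_ids / 256;
}
```

Under these assumed thresholds, a small HZ=100 system gets a one-jiffy
delay, a small HZ=1000 system keeps three jiffies, and a 4096-CPU
HZ=1000 system gets 3 + 16 = 19 jiffies.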

So, what am I missing?

							Thanx, Paul