From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Josh Triplett <josh@joshtriplett.org>
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@polymtl.ca,
	niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, Valdis.Kletnieks@vt.edu,
	dhowells@redhat.com, edumazet@google.com, darren@dvhart.com,
	fweisbec@gmail.com, sbw@mit.edu
Subject: Re: [PATCH tip/core/rcu 6/7] rcu: Drive quiescent-state-forcing delay from HZ
Date: Sat, 13 Apr 2013 23:10:01 -0700
Message-ID: <20130414061000.GA16307@linux.vnet.ibm.com>
In-Reply-To: <20130413220943.GB29861@linux.vnet.ibm.com>

On Sat, Apr 13, 2013 at 03:09:43PM -0700, Paul E. McKenney wrote:
> On Sat, Apr 13, 2013 at 12:53:36PM -0700, Josh Triplett wrote:
> > On Sat, Apr 13, 2013 at 12:34:25PM -0700, Paul E. McKenney wrote:
> > > On Sat, Apr 13, 2013 at 11:18:00AM -0700, Josh Triplett wrote:
> > > > On Fri, Apr 12, 2013 at 11:38:04PM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Apr 12, 2013 at 04:54:02PM -0700, Josh Triplett wrote:
> > > > > > On Fri, Apr 12, 2013 at 04:19:13PM -0700, Paul E. McKenney wrote:
> > > > > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > > > > 
> > > > > > > Systems with HZ=100 can have slow bootup times due to the default
> > > > > > > three-jiffy delays between quiescent-state forcing attempts.  This
> > > > > > > commit therefore auto-tunes the RCU_JIFFIES_TILL_FORCE_QS value based
> > > > > > > on the value of HZ.  However, this would break very large systems that
> > > > > > > require more time between quiescent-state forcing attempts.  This
> > > > > > > commit therefore also ups the default delay by one jiffy for each
> > > > > > > 256 CPUs that might be on the system (based off of nr_cpu_ids at
> > > > > > > runtime, -not- NR_CPUS at build time).
> > > > > > > 
> > > > > > > Reported-by: Paul Mackerras <paulus@au1.ibm.com>
> > > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > 
> > > > > > Something seems very wrong if RCU regularly hits the fqs code during
> > > > > > boot; feels like there's some more straightforward solution we're
> > > > > > missing.  What causes these CPUs to fall under RCU's scrutiny during
> > > > > > boot yet not actually hit the RCU codepaths naturally?
> > > > > 
> > > > > The problem is that they are running HZ=100, so that RCU will often
> > > > > take 30-60 milliseconds per grace period.  At that point, you only
> > > > > need 16-30 grace periods to chew up a full second, so it is not all
> > > > > that hard to eat up the additional 8-12 seconds of boot time that
> > > > > they were seeing.  IIRC, UP boot was costing them 4 seconds.
> > > > > 
> > > > > For HZ=1000, this would translate to 800ms to 1.2s, which is nowhere
> > > > > near as annoying.
> > > > 
> > > > That raises two questions, though.  First, who calls synchronize_rcu()
> > > > repeatedly during boot, and could they call call_rcu() instead to avoid
> > > > blocking for an RCU grace period?  Second, why does RCU need 3-6 jiffies
> > > > to resolve a grace period during boot?  That suggests that RCU doesn't
> > > > actually resolve a grace period until the force-quiescent-state
> > > > machinery kicks in, meaning that the normal quiescent-state mechanism
> > > > didn't work.
> > > 
> > > Indeed, converting synchronize_rcu() to call_rcu() might also be
> > > helpful.  The reason that RCU often does not resolve grace periods until
> > > force_quiescent_state() is that it is often the case during boot that
> > > all but one CPU is idle.  RCU tries hard to avoid waking up idle CPUs,
> > > so it must scan them.  Scanning is relatively expensive, so there is
> > > reason to wait.
> > 
> > How are those CPUs going idle without first telling RCU that they're
> > quiesced?  Seems like, during boot at least, you want RCU to use its
> > idle==quiesced logic to proactively note continuously-quiescent states.
> > Ideally, you should not hit the FQS code at all during boot.
> 
> FQS is RCU's idle==quiesced logic.  ;-)
> 
> In theory, RCU could add logic at idle entry to report a quiescent state;
> in fact, CONFIG_RCU_FAST_NO_HZ used to do exactly that.  In practice,
> this is not good for energy efficiency at runtime for a goodly number
> of workloads, which is why CONFIG_RCU_FAST_NO_HZ now relies on callback
> numbering and FQS.
> 
> I understand that at boot time, energy efficiency is best served by
> making boot go faster, but that means that something has to tell RCU
> when boot is complete.
> 
> > > One thing that could be done would be to scan immediately during boot,
> > > and then back off once boot has completed.  Of course, RCU has no idea
> > > when boot has completed, but one way to get this effect is to boot
> > > with rcutree.jiffies_till_first_fqs=0, and then use sysfs to set it
> > > to 3 once boot has completed.
> > 
> > What do you mean by "boot has completed" here?  The kernel's early
> > initialization, the kernel's initialization up to running /sbin/init, or
> > userspace initialization up through supporting user login?
> 
> That is exactly the question.  After all, if RCU is going to do something
> special during boot, it needs to know when boot ends.  People normally
> count boot as up to user login, but RCU currently has no way to know
> when this is, at least as far as I know.  Which is why I suggested that
> something tell RCU via sysfs.
> 
> Regardless, for the usual definition of "boot is complete", user space has
> to decide when boot is complete.  The kernel is out of the loop early on.
> 
> > In any case, I don't think it makes sense to do this with FQS.
> 
> OK, let's go through the possibilities I can imagine at the moment:
> 
> 1.	Force the scheduling-clock interrupt to remain on during
> 	boot.  This way, each CPU could tell RCU of its idle/non-idle
> 	state.  Of course, something then needs to tell the kernel
> 	when boot is over so that it can go back to energy-efficient
> 	mode.
> 
> 2.	Set rcutree.jiffies_till_first_fqs=0 at boot time, then when
> 	boot is complete, set it to 3 via sysfs, or to some magic number
> 	telling RCU to recompute the default.  This has the virtue of
> 	allowing different userspaces to handle this differently.
> 
> 3.	Take a half-step by having RCU register a callback during the
> 	latest phase of kernel-visible boot.  I am under the impression
> 	that this is a relatively small fraction of boot, so it would
> 	be sub-optimal.
> 
> 4.	Make CPUs announce quiescence on each entry to idle.  This
> 	covers the transition to idle, but when a given CPU stays idle
> 	for more than one grace period, RCU has to do something to verify
> 	that the CPU remains idle.  Right now, that is FQS's job --
> 	it cycles through the dyntick-idle structures of all CPUs that
> 	have not already announced quiescence.
> 
> 5.	Make CPUs IPI RCU's grace-period kthread on each transition
> 	to and from idle.  I might be missing something, but given the
> 	cost and disruptiveness of IPIs, this does not seem to me to be
> 	a strategy to win.
> 
> 6.	IPI the CPUs to see if they are still idle.  This would defeat
> 	energy efficiency.  Of course, RCU could take this approach
> 	only during boot, but it is cheaper and faster to just check
> 	each CPU's rcu_dynticks structure -- which is what FQS does.
> 
> 7.	Treat all normal grace periods as expedited grace periods, but
> 	only during boot.  It is fairly easy for RCU to do this, but
> 	again, something has to tell RCU when boot is complete.
> 
> 8.	Your idea here.  Plus more of mine as I remember them.  ;-)
> 
> So, what am I missing?

Hmmm...  I suppose I could have RCU define boot as being (say) the ten
seconds following the early_inits.  That is crude enough that it might
actually work reasonably well.
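
A minimal userspace sketch of that time-window idea follows.  It is
not kernel code: every name in it is hypothetical, and the zero-jiffy
boot delay and three-jiffy runtime delay are simply the values
mentioned earlier in this thread, plugged in as assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* All names are hypothetical; nothing here is taken from the kernel. */
#define BOOT_WINDOW_SECS	10UL	/* "the ten seconds following the early_inits" */
#define FQS_DELAY_BOOT		0UL	/* assumed: scan idle CPUs immediately during boot */
#define FQS_DELAY_RUNTIME	3UL	/* assumed: three-jiffy delay once boot is "over" */

/* True while 'now' (in jiffies) is still inside the assumed boot window. */
bool in_boot_window(unsigned long now, unsigned long boot_start,
		    unsigned long hz)
{
	assert(hz > 0);
	/* Unsigned subtraction stays correct across counter wraparound. */
	return now - boot_start < BOOT_WINDOW_SECS * hz;
}

/* Jiffies to wait before the first quiescent-state-forcing scan. */
unsigned long fqs_delay(unsigned long now, unsigned long boot_start,
			unsigned long hz)
{
	return in_boot_window(now, boot_start, hz) ? FQS_DELAY_BOOT
						   : FQS_DELAY_RUNTIME;
}
```

With hz = 100 and boot_start = 0, fqs_delay() yields 0 at jiffy 500
and 3 from jiffy 1000 onward, so nothing outside the kernel has to
announce that boot is complete.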

							Thanx, Paul
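
For reference, the figures quoted above (30-60 milliseconds per grace
period at HZ=100, and one extra jiffy per 256 CPUs) can be checked
with a small sketch.  The helper names are hypothetical, and rounding
the per-CPU adjustment down is an assumption; the patch itself is not
reproduced here.

```c
#include <assert.h>

/* Hypothetical helpers for checking the numbers quoted in the thread. */

/* Milliseconds spanned by 'jiffies' scheduling-clock ticks at a given HZ. */
unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
	assert(hz > 0);
	return jiffies * 1000UL / hz;
}

/* Whole grace periods that fit in one second if each takes 'gp_jiffies'. */
unsigned long gps_per_second(unsigned long gp_jiffies, unsigned long hz)
{
	assert(gp_jiffies > 0);
	return hz / gp_jiffies;
}

/*
 * Delay scaled by CPU count: one extra jiffy per 256 possible CPUs.
 * Integer division (rounding down) is an assumption of this sketch.
 */
unsigned long tuned_fqs_delay(unsigned long base_jiffies,
			      unsigned long nr_cpus)
{
	return base_jiffies + nr_cpus / 256UL;
}
```

At HZ=100, three to six jiffies come to 30-60 ms, so roughly 16 to 33
grace periods fill a second; at HZ=1000 the same jiffy counts take a
tenth of the time.  A 4096-CPU system with a three-jiffy base would
wait 19 jiffies under the rounding assumed here.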

