All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
	mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org,
	dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com,
	oleg@redhat.com, joel@joelfernandes.org
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs
Date: Fri, 9 Aug 2019 09:51:20 -0700	[thread overview]
Message-ID: <20190809165120.GA5668@linux.ibm.com> (raw)
In-Reply-To: <20190808213012.GA28773@linux.ibm.com>

On Thu, Aug 08, 2019 at 02:30:12PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 08, 2019 at 01:35:41PM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> > > On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > > > like a requirement for actually doing a patch.
> > > > > > > 
> > > > > > > Almost but not quite.  It is a requirement for a patch *that* *is*
> > > > > > > *supposed* *to* *be* *a* *fix*.  If you are trying to prohibit me from
> > > > > > > writing experimental patches, please feel free to take a long walk on
> > > > > > > a short pier.
> > > > > > > 
> > > > > > > Understood???
> > > > > > 
> > > > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > > > > 
> > > > > No problem!
> > > > > 
> > > > > And of course adding tracing decreases the frequency and duration of
> > > > > the multi_cpu_stop().  Re-running with shorter-duration triggering.  ;-)
> > > > 
> > > > And I did eventually get a good trace.  If I am interpreting this trace
> > > > correctly, the torture_-135 task didn't get around to attempting to wake
> > > > up all of the CPUs.  I will try again, but this time with the sched_switch
> > > > trace event enabled.
> > > > 
> > > > As a side note, enabling ftrace from the command line seems to interact
> > > > badly with turning tracing off and on in the kernel, so I eventually
> > > > resorted to trace_printk() in the functions of interest.  The trace
> > > > output is below, followed by the current diagnostic patch.  Please note
> > > > that I am -not- using the desperation hammer-the-scheduler patches.
> > > > 
> > > > More as I learn more!
> > > 
> > > And of course I forgot to dump out the online CPUs, so I really had no
> > > idea whether or not all the CPUs were accounted for.  I added tracing
> > > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > > reworked it a few times to get the problem to happen in reasonable time.
> > > Please see below for the resulting annotated trace.
> > > 
> > > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > > but all the migration threads are running within about 2 milliseconds.
> > > It is then almost two minutes(!) until the next trace message.
> > > 
> > > Looks like time to (very carefully!) instrument multi_cpu_stop().
> > > 
> > > Of course, if you have any ideas, please do not keep them a secret!
> > 
> > Functionally, multi_cpu_stop() is working fine, according to the trace
> > below (search for a line beginning with TAB).  But somehow CPU 2 took
> > almost three -minutes- to do one iteration of the loop.  The prime suspect
> > in that loop is cpu_relax() due to the hypervisor having an opportunity
> > to do something at that point.  The commentary below (again, search for
> > a line beginning with TAB) gives my analysis.
> > 
> > Of course, if I am correct, it should be possible to catch cpu_relax()
> > in the act.  That is the next step, give or take the Heisenbuggy nature
> > of this beast.
> > 
> > Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> > case the earlier runs just got lucky.
> > 
> > Thoughts?
> 
> And it really can happen:
> 
> [ 1881.467922] migratio-33      4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms
> 
> The previous timestamp was 1123391100us, so the cpu_relax() is almost
> exactly the full delay.
> 
> But another instance stalled for many minutes without a ten-second
> cpu_relax().  So it is not just cpu_relax() causing trouble.  I could
> rationalize that vCPU preemption being at fault...
> 
> And my diagnostic patch is below, just in case I am doing something
> stupid with that.

I did a 12-hour run with the same configuration except for leaving out the
"nohz_full=1-7" kernel parameter without problems (aside from the RCU CPU
stall warnings due to the ftrace_dump() at the end of the run -- isn't
there some way to clear the ftrace buffer without actually printing it?).

My next step is to do an over-the-weekend run with the same configuration,
then a similar run with more recent kernel and qemu but with the
"nohz_full=1-7".  If both of those pass, I will consider this to be a
KVM/qemu bug that has since been fixed.

							Thanx, Paul

  reply	other threads:[~2019-08-09 16:52 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-02 15:14 [PATCH tip/core/rcu 0/14] No-CBs bypass addition for v5.4 Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure Paul E. McKenney
2019-08-04 14:50   ` Peter Zijlstra
2019-08-04 14:52     ` Peter Zijlstra
2019-08-04 18:45       ` Paul E. McKenney
2019-08-04 18:42     ` Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing Paul E. McKenney
2019-08-07  0:03   ` Joel Fernandes
2019-08-07  0:16     ` Joel Fernandes
2019-08-07  0:35     ` Paul E. McKenney
2019-08-07  0:40       ` Steven Rostedt
2019-08-07  1:17         ` Paul E. McKenney
2019-08-07  1:24           ` Steven Rostedt
2019-08-07  3:47             ` Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 03/14] rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 04/14] rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 05/14] rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake() Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 06/14] rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks() Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 07/14] rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 08/14] rcu/nocb: Reduce __call_rcu_nocb_wake() " Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 09/14] rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 10/14] rcu: Allow rcu_do_batch() to dynamically adjust batch sizes Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 11/14] EXP nohz: Add TICK_DEP_BIT_RCU Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 12/14] rcu/nohz: Force on tick when invoking lots of callbacks Paul E. McKenney
2019-08-02 15:15 ` [PATCH RFC tip/core/rcu 13/14] rcutorture: Force on tick for readers and callback flooders Paul E. McKenney
2019-08-02 15:15 ` [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs Paul E. McKenney
2019-08-04 14:43   ` Peter Zijlstra
2019-08-04 14:48     ` Peter Zijlstra
2019-08-04 18:41       ` Paul E. McKenney
2019-08-04 20:24         ` Paul E. McKenney
2019-08-05  4:19           ` Paul E. McKenney
2019-08-05  8:07             ` Peter Zijlstra
2019-08-05 14:47               ` Paul E. McKenney
2019-08-05  8:05         ` Peter Zijlstra
2019-08-05 14:54           ` Paul E. McKenney
2019-08-05 15:50             ` Peter Zijlstra
2019-08-05 17:48               ` Paul E. McKenney
2019-08-06 18:08                 ` Paul E. McKenney
2019-08-07 21:41                   ` Paul E. McKenney
2019-08-08 20:35                     ` Paul E. McKenney
2019-08-08 21:30                       ` Paul E. McKenney
2019-08-09 16:51                         ` Paul E. McKenney [this message]
2019-08-09 18:07                           ` Joel Fernandes
2019-08-09 18:39                             ` Paul E. McKenney
2019-08-12 21:02   ` Frederic Weisbecker
2019-08-12 23:23     ` Paul E. McKenney
2019-08-13  1:33       ` Joel Fernandes
2019-08-13 12:30       ` Frederic Weisbecker
2019-08-13 14:48         ` Paul E. McKenney
2019-08-14 17:55           ` Joel Fernandes
2019-08-14 22:05             ` Paul E. McKenney
2019-08-15 15:07               ` Joel Fernandes
2019-08-15 17:23                 ` Paul E. McKenney
2019-08-15 18:15                   ` Joel Fernandes
2019-08-15 18:39                     ` Paul E. McKenney
2019-08-15 19:42                       ` Joel Fernandes
2019-08-13 21:06       ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190809165120.GA5668@linux.ibm.com \
    --to=paulmck@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=dipankar@in.ibm.com \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.