From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org, randy.dunlap@oracle.com,
	Valdis.Kletnieks@vt.edu, a.p.zijlstra@chello.nl
Subject: Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch
Date: Sat, 21 May 2011 13:39:22 -0700
Message-ID: <20110521203922.GI2271@linux.vnet.ibm.com>
In-Reply-To: <20110521191418.GA30688@elte.hu>

On Sat, May 21, 2011 at 09:14:18PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Sat, May 21, 2011 at 04:28:44PM +0200, Ingo Molnar wrote:
> > > 
> > > * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > Hello, Ingo,
> > > > 
> > > > This pull request covers some RCU bug fixes and one patch rework.
> > > > 
> > > > The first group breaks up the infamous now-reverted (but ultimately
> > > > vindicated) "Decrease memory-barrier usage based on semi-formal proof"
> > > > commit into five commits.  These five commits immediately follow the
> > > > revert, and the diff across all six of these commits is empty, so that
> > > > the effect of the five commits is to revert the revert.
> > > 
> > > But ... the regression that was observed with that commit needs to be fixed 
> > > first, or not? In what way was the barrier commit vindicated?
> > 
> > From what I can see, the hang was fixed by Frederic's patch at 
> > https://lkml.org/lkml/2011/5/19/753.  I was interpreting that as vindication, 
> > perhaps ill-advisedly.
> 
> I mean, without Frederic's patch we are getting very long hangs due to the 
> barrier patch, right?

Yes.  We are seeing these hangs because HARDIRQ_ENTER() invoked
irq_enter(), which calls rcu_irq_enter(), but the matching
HARDIRQ_EXIT() invoked __irq_exit(), which does not call rcu_irq_exit().
This resulted in calls to rcu_irq_enter() that were not balanced by
matching calls to rcu_irq_exit().  Therefore, after these tests completed,
RCU's dyntick-idle nesting count was a large number, which caused RCU
to conclude that the affected CPU was not in dyntick-idle mode when in
fact it was.

RCU would therefore incorrectly wait for this dyntick-idle CPU.

With Frederic's patch, these tests don't ever call either rcu_irq_enter()
or rcu_irq_exit(), which works because the CPU running the test is
already marked as not being in dyntick-idle mode.

So, with Frederic's patch, the rcu_irq_enter() and rcu_irq_exit() calls
are balanced and things work.
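
To make the imbalance concrete, here is a minimal user-space sketch.
It is a toy model, not the kernel code: the struct and every model_*()
name are made up for illustration, and it assumes only what is described
above, namely that RCU treats a CPU as dyntick-idle when its nesting
count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of RCU's per-CPU dyntick-idle nesting count. */
struct dyntick_model {
	int nesting;	/* 1 at process level, +1 per nested irq */
};

static inline void model_rcu_irq_enter(struct dyntick_model *m)
{
	m->nesting++;
}

static inline void model_rcu_irq_exit(struct dyntick_model *m)
{
	assert(m->nesting > 0);
	m->nesting--;
}

/* Going idle drops the process-level reference. */
static inline void model_enter_idle(struct dyntick_model *m)
{
	assert(m->nesting > 0);
	m->nesting--;
}

/* In this model, RCU sees the CPU as idle only at a count of zero. */
static inline bool model_cpu_looks_idle(const struct dyntick_model *m)
{
	return m->nesting == 0;
}

/* Buggy pattern: the enter side tells RCU, the exit side does not. */
static inline void model_unbalanced_test(struct dyntick_model *m, int n)
{
	for (int i = 0; i < n; i++) {
		model_rcu_irq_enter(m);	/* HARDIRQ_ENTER() -> irq_enter() */
		/* HARDIRQ_EXIT() -> __irq_exit(): no rcu_irq_exit() here */
	}
}

/* Balanced pattern: every enter is paired with an exit. */
static inline void model_balanced_test(struct dyntick_model *m, int n)
{
	for (int i = 0; i < n; i++) {
		model_rcu_irq_enter(m);
		model_rcu_irq_exit(m);
	}
}
```

In this model, a CPU that starts at a count of 1, runs the unbalanced
loop, and then goes idle is left with a nonzero count, so it never looks
idle.  The same CPU running the balanced loop ends up at zero.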

The reason that the imbalance was not noticed before the barrier patch
was applied is that the old implementation of rcu_enter_nohz() ignored
the nesting depth.  This could still result in delays, but much shorter
ones.  Whenever there was a delay, RCU would IPI the CPU with the
unbalanced nesting level, which would eventually result in rcu_enter_nohz()
being called, which in turn would force RCU to see that the CPU was in
dyntick-idle mode.
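
The difference between the two behaviors can be sketched as follows.
This is again a toy user-space model with made-up names, not the actual
rcu_enter_nohz() code.  It assumes only that the old version marked the
CPU idle regardless of the nesting depth, and that the new version does
so only when the depth drops to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for illustration only. */
struct nohz_model {
	int nesting;		/* irq nesting depth as RCU sees it */
	bool rcu_sees_idle;	/* does RCU believe the CPU is idle? */
};

/* Old-style behavior: ignore the nesting depth entirely. */
static inline void old_style_enter_nohz(struct nohz_model *c)
{
	c->rcu_sees_idle = true;
}

/* New-style behavior: mark idle only when the depth reaches zero. */
static inline void new_style_enter_nohz(struct nohz_model *c)
{
	assert(c->nesting > 0);
	if (--c->nesting == 0)
		c->rcu_sees_idle = true;
}
```

With a leaked count (say 6 instead of 1), the old-style function still
marks the CPU idle, which hides the imbalance.  The new-style function
leaves the CPU marked non-idle, so RCU keeps waiting on it.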

Hmmm...  I should add this line of reasoning to one of the commit logs,
shouldn't I?  (Added it.  Which of course invalidates my pull request.)

> Even if the barrier patch is not to blame - somehow it still managed to produce 
> these hangs - and we do not understand it yet.

From Yinghai's message https://lkml.org/lkml/2011/5/12/465, I believe
that the residual delay he is seeing is not due to the barrier patch,
but rather due to a26ac2455 (move TREE_RCU from softirq to kthread).

More on this below.

> > Yinghai said that he was still seeing a delay, and that he was seeing it even 
> > with the "Decrease memory-barrier usage based on semi-formal proof" reverted: 
> > https://lkml.org/lkml/2011/5/20/427.  This hang seems to happen when he uses 
> > gcc 4.5.0, but not when using gcc 4.5.1, assuming I understood his sequence 
> > of emails.  So I was interpreting that as meaning that the delay was unlikely 
> > to be caused by that commit, probably by one of the later commits.
> > 
> > I clearly need to figure out what is causing this delay.  I asked Yinghai to 
> > apply c7a378603 (Remove waitqueue usage for cpu, node, and boost kthreads) 
> > from Peter Zijlstra because the long delays that Yinghai is seeing (93 
> > seconds for memory_dev_init() rather than 3 or 4 seconds) might be due to my 
> > less-efficient method of awakening the RCU kthreads, so that Peter's 
> > approach might help.
> > 
> > If that doesn't speed things up for Yinghai, then I will work out some 
> > tracing to help localize the slowdown that he is seeing.
> > 
> > Of course, if you would rather that I get to the bottom of this before 
> > pulling, fair enough!
> 
> We should fix the delay regression, I suspect - do we have to revert more stuff 
> perhaps?
> 
> Would it be possible to figure out what caused that other delay for Yinghai?

Earlier, Yinghai reported that reverting a26ac2455ffc (move TREE_RCU
from softirq to kthread) and everything after it made what appears to be
the same sort of delay go away (https://lkml.org/lkml/2011/5/12/465).
This commit replaced raise_softirq() with wait queues, flags, and
wake_up().  Later, Yinghai said that the delay shows up in kernels
built using opensuse 11.3, but not in kernels built using fedora 14
(https://lkml.org/lkml/2011/5/20/469).  Still later, he said that opensuse
11.3 has gcc 4.5.0 and fedora 14 has gcc 4.5.1.

Differences in compilers usually don't produce 20-to-1 latency differences
without something amplifying them.  In this case, that something
is likely to be the wait/wakeup coordination.  Peter's recent patch
(https://lkml.org/lkml/2011/5/19/133) to fix some CPU-hotplug-related
issues in the scheduler (https://lkml.org/lkml/2011/5/13/22) changed
RCU's kthread wait/wakeup coordination.

So I asked that Yinghai try c7a3786030 (Remove waitqueue usage for cpu,
node, and boost kthreads) from Peter currently queued on -rcu.

If that doesn't help, I will probably provide Yinghai some tracing
patches.

							Thanx, Paul
