netdev.vger.kernel.org archive mirror
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: David Witbrodt <dawitbro@sbcglobal.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Yinghai Lu <yhlu.kernel@gmail.com>,
	Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>,
	"H. Peter Anvin" <hpa@zytor.com>, netdev <netdev@vger.kernel.org>
Subject: Re: HPET regression in 2.6.26 versus 2.6.25 -- RCU problem
Date: Sat, 9 Aug 2008 06:56:50 -0700
Message-ID: <20080809135650.GE8125@linux.vnet.ibm.com>
In-Reply-To: <859858.77737.qm@web82105.mail.mud.yahoo.com>

On Sat, Aug 09, 2008 at 05:39:26AM -0700, David Witbrodt wrote:
> 
> 
> > On Fri, 2008-08-08 at 18:23 -0700, David Witbrodt wrote:
> > > I have tracked the regression down to an RCU problem.
> > > [...]
> > > After reading some documentation in Documentation/RCU/, it looks like 
> > > something is misusing RCU -- and, according to the Documentation, those kinds 
> > > of mistakes are easy to make.  Maybe necessary calls to
> > >  
> > >     rcu_read_lock()
> > >     rcu_read_unlock()
> > > 
> > > are missing, and something about my hardware is triggering a freeze that 
> > > doesn't occur on most hardware.
> > > 
> > > 
> > > For some reason, turning off the HPET by booting with "hpet=disabled" keeps
> > > the freeze from happening.  Just reading a couple of those docs about RCU
> > > made me dizzy, so I hope someone familiar with RCU issues will take a look
> > > at the code in the files I've listed.  Surely you guys can take it from here
> > > now?!
> > > 
> > > If not, just give me some experimental code changes to make to get my 2.6.26
> > > and 2.6.27 kernels working again without disabling HPET!!!
> > 
> > 
> > The typical way to deadlock like this is to do something like:
> > 
> > rcu_read_lock();
> > 
> >    synchronize_rcu();
> > 
> > rcu_read_unlock();
> > 
> > While I cannot immediately see any such usage in the function you
> > quoted, it could be one of the callers.. let me browse some code..
> > 
> > Can't seem to find anything like that.
> > 
> > What's weird though - is that HPET makes any difference on these network
> > code paths.
> > 
> > Could we end up calling rcu too soon? I doubt we bring up ipv4 before
> > rcu..
> 
> I'm _way_ over my head in this discussion, but here's some more food
> for thought.  Last weekend, when I first tried 2.6.26 and discovered the
> freeze, I thought an error of my own in .config was causing it.  Before
> I ever sought help, I made about a dozen experiments with different
> .config files.
> 
> One series of those experiments involved turning off most of the kernel...
> including CONFIG_INET.  The kernel still froze, but when entering 
> pci_init().  (This info can be read in my original post to the Debian BTS,
> which I have provided links for a couple of times in this LKML thread.  I
> even went further and removed enough that the freeze was avoided, but so
> much of the kernel was missing that my init scripts couldn't mount a hard
> disk any more.  Trying to restore enough to allow HD mounting just brought
> back the freeze.)
> 
> I am completely ignorant about how the kernel works, so any guesses I have
> are probably worthless... but I'll throw some out anyway:
> 
> 1.  Maybe HPET is used (if present) for timing by RCU, so disabling it
> forces RCU to work differently.  (Pure guess here:  I know nothing about
> RCU, and haven't even tried looking at its code.)

RCU doesn't use HPET directly.  Most of its time-dependent behavior
comes from its being invoked from the scheduling-clock interrupt.

> 2.  Maybe my hardware is broken.  We did see one initcall return that
> reported over 280,000 msecs... when the entire boot->freeze time was about
> 3 secs.  On the other hand, 2.6.25 (and before) work just fine with HPET
> enabled.

For CONFIG_CLASSIC_RCU and !CONFIG_PREEMPT, in-kernel infinite spin loops
will cause synchronize_rcu() to hang.  For other RCU configurations,
spinning with interrupts disabled will result in similar hangs.  Invoking
synchronize_rcu() very early in boot (before rcu_init() has been called)
will of course also hang.
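
The hang conditions above can be illustrated with a userspace toy model.
This is only a sketch under stated assumptions, not the kernel's
implementation: every toy_* name, the struct layout, and the tick loop are
hypothetical.  It assumes a grace period ends only once every CPU has
reported a quiescent state, that a spinning CPU never reports one, and that
nothing drives grace periods before initialization.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical toy state; this is NOT the kernel's RCU data structure. */
struct toy_rcu {
	bool initialized;          /* set by toy_rcu_init(), standing in for rcu_init() */
	unsigned int need_qs_mask; /* bit n set: CPU n has not yet reported a quiescent state */
};

static void toy_rcu_init(struct toy_rcu *rcu)
{
	rcu->initialized = true;
	rcu->need_qs_mask = 0;
}

/*
 * Bounded stand-in for synchronize_rcu().  Each simulated tick, every CPU
 * that is not in spinning_mask reports a quiescent state.  Returns true if
 * the grace period ended within max_ticks, false if it would still be
 * waiting (which models the hang).
 */
static bool toy_synchronize_rcu(struct toy_rcu *rcu,
				unsigned int spinning_mask, int max_ticks)
{
	if (!rcu->initialized)
		return false; /* before init, nothing advances the grace period */

	/* Start a grace period: every CPU owes a quiescent state. */
	rcu->need_qs_mask = (1u << TOY_NR_CPUS) - 1;

	for (int tick = 0; tick < max_ticks; tick++) {
		for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			if (!(spinning_mask & (1u << cpu)))
				rcu->need_qs_mask &= ~(1u << cpu);
		if (rcu->need_qs_mask == 0)
			return true;
	}
	return false;
}
```

In this model, a call made on a zero-initialized struct toy_rcu (before
toy_rcu_init()) never completes, and a call with one bit set in
spinning_mask leaves exactly that bit set in need_qs_mask no matter how
large max_ticks is.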

Could you please let me know whether your config has CONFIG_CLASSIC_RCU
or CONFIG_PREEMPT_RCU?

> 3. I was able to find the commit that introduced the freeze
> (3def3d6ddf43dbe20c00c3cbc38dfacc8586998f), so there has to be a connection
> between that commit and the RCU problem.  Is it possible that a preexisting
> error or oversight in the code was merely exposed by that commit?  (And 
> only on certain hardware?)  Or does that code itself contain the error?

Thank you for finding the commit -- should be quite helpful!!!

A quick look reveals what appears to be reader-writer locking rather
than RCU.  It does run in early boot before rcu_init(), so if it managed
to call synchronize_rcu() somehow you indeed would see a hang.  I do
not see such a call, but then again, I don't know this code much at all.

This is the second time in as many days that something has motivated
making RCU work correctly before rcu_init()...  Hmmm...

> 4. Another bug has been posted on the Debian BTS, which is worked around
> by disabling HPET.  The user provided some links to bugzilla.kernel.org
> where David Brownell is fighting with some HPET/RTC issues (but no mention
> of RCU):
> http://bugzilla.kernel.org/show_bug.cgi?id=11111
> http://bugzilla.kernel.org/show_bug.cgi?id=11153
> 
> I honestly don't know whether this is related to my problem or not.  :-(

Nor me.

> If anyone has any test code I can run to detect massive HPET breakage on
> these motherboards, I'll be glad to do so.  Or any other experimental
> code changes, for that matter.

If you can answer my CONFIG_CLASSIC_RCU vs. CONFIG_PREEMPT_RCU question
above, I should be able to provide you a diagnostic patch that would say
which CPU RCU was waiting on.  That assumes at least one CPU was still
taking the scheduling-clock interrupt, of course.  ;-)
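
The diagnostic patch itself is not part of this message.  As a rough
illustration of the idea only, here is a userspace sketch that reports
which CPUs a grace period would still be waiting on.  It assumes the
waiting state can be represented as a bitmask with one bit per CPU; the
function name, the mask, and TOY_NR_CPUS are all hypothetical and do not
come from the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_NR_CPUS 8

/*
 * Hypothetical helper: need_qs_mask has bit n set if CPU n has not yet
 * reported a quiescent state.  Writes a space-separated list of those CPU
 * numbers into buf and returns how many CPUs are being waited on.
 */
static int toy_format_stalled_cpus(unsigned int need_qs_mask,
				   char *buf, size_t len)
{
	int count = 0;
	size_t used = 0;

	if (len > 0)
		buf[0] = '\0';

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!(need_qs_mask & (1u << cpu)))
			continue;
		count++;
		if (used < len) {
			/* snprintf() truncates safely if buf is too small. */
			int n = snprintf(buf + used, len - used, "%s%d",
					 used ? " " : "", cpu);
			if (n > 0)
				used += (size_t)n;
		}
	}
	return count;
}
```

For example, a mask of 0x24 (bits 2 and 5 set) yields the string "2 5" and
a return value of 2.  Something still has to run this check, which is why
at least one CPU must keep taking the scheduling-clock interrupt.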

							Thanx, Paul
