Re: mcelog stalls on 2.6.39-rc5

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Jim Schutt <jaschut@sandia.gov>
Cc: linux-kernel@vger.kernel.org
Subject: Re: mcelog stalls on 2.6.39-rc5
Date: Thu, 28 Apr 2011 17:31:10 -0700	[thread overview]
Message-ID: <20110429003110.GC2191@linux.vnet.ibm.com> (raw)
In-Reply-To: <4DB990A1.8090205@sandia.gov>

On Thu, Apr 28, 2011 at 10:06:57AM -0600, Jim Schutt wrote:
> Jim Schutt wrote:
> >Paul E. McKenney wrote:
> >>On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
> >>>Hi,
> >>>
> >>>Testing 2.6.39-rc5 is giving me the following stall:
> >>>
> >>>    [ 5767.731001] INFO: rcu_sched_state detected stall on CPU
> >>>1 (t=60001 jiffies)
> >>>    [ 5767.732001] INFO: rcu_sched_state detected stalls on
> >>>CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
> >>>    [ 5947.763001] INFO: rcu_sched_state detected stall on CPU
> >>>1 (t=240032 jiffies)
> >>>    [ 5947.764001] INFO: rcu_sched_state detected stalls on
> >>>CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
> >>>    [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
> >>>    [ 6121.281139] INFO: task mcelog:6513 blocked for more
> >>>than 120 seconds.
> >>>    [ 6121.287575] "echo 0 >
> >>>/proc/sys/kernel/hung_task_timeout_secs" disables this
> >>>message.
> >>>    [ 6121.295397]  ffff880177aefce8 0000000000000082
> >>>ffffffff810339b6 ffff880226d35a40
> >>>    [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0
> >>>0000000000011e80 ffff880226d35a40
> >>>    [ 6121.310284]  ffff88018226c760 ffff880177aefe80
> >>>ffff880177aefd18 ffffffff813af047
> >>>    [ 6121.317725] Call Trace:
> >>>    [ 6121.320176]  [<ffffffff810339b6>] ?
> >>>calc_load_account_idle+0xe/0x1d
> >>>    [ 6121.326437]  [<ffffffff813af047>] schedule+0x159/0x193
> >>>    [ 6121.331569]  [<ffffffff813af449>] schedule_timeout+0x36/0xe2
> >>>    [ 6121.337223]  [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
> >>>    [ 6121.343047]  [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
> >>>    [ 6121.348967]  [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
> >>>    [ 6121.354794]  [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
> >>>    [ 6121.360878]  [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
> >>>    [ 6121.366441]  [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
> >>>    [ 6121.372356]  [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
> >>>    [ 6121.378096]  [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
> >>>    [ 6121.383574]  [<ffffffff81015fe5>] mce_read+0x17f/0x25d
> >>>    [ 6121.388707]  [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
> >>>    [ 6121.394358]  [<ffffffff811121f1>] vfs_read+0xa9/0xe1
> >>>    [ 6121.399317]  [<ffffffff8111248d>] sys_read+0x4c/0x70
> >>>    [ 6121.404278]  [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
> >>>    [ 6127.795001] INFO: rcu_sched_state detected stall on CPU
> >>>1 (t=420064 jiffies)
> >>>    [ 6127.796001] INFO: rcu_sched_state detected stalls on
> >>>CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
> >>>    [ 6241.410171] INFO: task mcelog:6513 blocked for more
> >>>than 120 seconds.
> >>>
> >>>Reverting commit a4dd99250dc makes the stalls go away:
> >>>
> >>>    rcu: create new rcu_access_index() and use in mce
> >>>
> >>>    The MCE subsystem needs to sample an RCU-protected index outside of
> >>>    any protection for that index.  If this was a pointer, we would use
> >>>    rcu_access_pointer(), but there is no corresponding
> >>>rcu_access_index().
> >>>    This commit therefore creates an rcu_access_index() and applies it
> >>>    to MCE.
> >>>
> >>>    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >>>    Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
> >>
> >>Wow!  This is just instructions, just wrapping the access in
> >>ACCESS_ONCE().
> >>
> >>Was the original RCU CPU stall repeatable?
> >
> >Yes.  I saw it on two different machines.
> >Both were running as Ceph clients, doing sustained
> >streaming writes, if that helps any.
> >
> >I can attempt to repeat with any extra debugging
> >you'd like me to try.
> 
> Evidently I cannot reproduce this after all, so my revert
> test was incorrect.
> 
> If I can learn how to reproduce this I'll post with details;
> in the meantime ignore this, and sorry for the noise.

OK -- I must confess that I could not understand how this change
could cause the problem!  ;-)

						Thanx, Paul

     prev parent reply	other threads:[~2011-04-29  0:31 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-27 19:59 mcelog stalls on 2.6.39-rc5 Jim Schutt
2011-04-27 23:03 ` Paul E. McKenney
2011-04-28 14:06   ` Jim Schutt
2011-04-28 16:06     ` Jim Schutt
2011-04-29  0:31       ` Paul E. McKenney [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110429003110.GC2191@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=jaschut@sandia.gov \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.