public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* mcelog stalls on 2.6.39-rc5
@ 2011-04-27 19:59 Jim Schutt
  2011-04-27 23:03 ` Paul E. McKenney
  0 siblings, 1 reply; 5+ messages in thread
From: Jim Schutt @ 2011-04-27 19:59 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Hi,

Testing 2.6.39-rc5 is giving me the following stall:

     [ 5767.731001] INFO: rcu_sched_state detected stall on CPU 1 (t=60001 jiffies)
     [ 5767.732001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
     [ 5947.763001] INFO: rcu_sched_state detected stall on CPU 1 (t=240032 jiffies)
     [ 5947.764001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
     [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
     [ 6121.281139] INFO: task mcelog:6513 blocked for more than 120 seconds.
     [ 6121.287575] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     [ 6121.295397]  ffff880177aefce8 0000000000000082 ffffffff810339b6 ffff880226d35a40
     [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0 0000000000011e80 ffff880226d35a40
     [ 6121.310284]  ffff88018226c760 ffff880177aefe80 ffff880177aefd18 ffffffff813af047
     [ 6121.317725] Call Trace:
     [ 6121.320176]  [<ffffffff810339b6>] ? calc_load_account_idle+0xe/0x1d
     [ 6121.326437]  [<ffffffff813af047>] schedule+0x159/0x193
     [ 6121.331569]  [<ffffffff813af449>] schedule_timeout+0x36/0xe2
     [ 6121.337223]  [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
     [ 6121.343047]  [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
     [ 6121.348967]  [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
     [ 6121.354794]  [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
     [ 6121.360878]  [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
     [ 6121.366441]  [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
     [ 6121.372356]  [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
     [ 6121.378096]  [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
     [ 6121.383574]  [<ffffffff81015fe5>] mce_read+0x17f/0x25d
     [ 6121.388707]  [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
     [ 6121.394358]  [<ffffffff811121f1>] vfs_read+0xa9/0xe1
     [ 6121.399317]  [<ffffffff8111248d>] sys_read+0x4c/0x70
     [ 6121.404278]  [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
     [ 6127.795001] INFO: rcu_sched_state detected stall on CPU 1 (t=420064 jiffies)
     [ 6127.796001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
     [ 6241.410171] INFO: task mcelog:6513 blocked for more than 120 seconds.

Reverting commit a4dd99250dc makes the stalls go away:

     rcu: create new rcu_access_index() and use in mce

     The MCE subsystem needs to sample an RCU-protected index outside of
     any protection for that index.  If this was a pointer, we would use
     rcu_access_pointer(), but there is no corresponding rcu_access_index().
     This commit therefore creates an rcu_access_index() and applies it
     to MCE.

     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
     Tested-by: Zdenek Kabelac <zkabelac@redhat.com>

Here's some bits from my config that might be relevant:

# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_EDAC_DECODE_MCE=y
# CONFIG_EDAC_MCE_INJ is not set
CONFIG_EDAC_MCE=y
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_DETECTOR=y
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_CPU_STALL_DETECTOR_RUNNABLE=y

Please let me know if there is anything I can do to help sort this out.

-- Jim


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: mcelog stalls on 2.6.39-rc5
  2011-04-27 19:59 mcelog stalls on 2.6.39-rc5 Jim Schutt
@ 2011-04-27 23:03 ` Paul E. McKenney
  2011-04-28 14:06   ` Jim Schutt
  0 siblings, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2011-04-27 23:03 UTC (permalink / raw)
  To: Jim Schutt; +Cc: linux-kernel

On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
> Hi,
> 
> Testing 2.6.39-rc5 is giving me the following stall:
> 
>     [ 5767.731001] INFO: rcu_sched_state detected stall on CPU 1 (t=60001 jiffies)
>     [ 5767.732001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
>     [ 5947.763001] INFO: rcu_sched_state detected stall on CPU 1 (t=240032 jiffies)
>     [ 5947.764001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
>     [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
>     [ 6121.281139] INFO: task mcelog:6513 blocked for more than 120 seconds.
>     [ 6121.287575] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     [ 6121.295397]  ffff880177aefce8 0000000000000082 ffffffff810339b6 ffff880226d35a40
>     [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0 0000000000011e80 ffff880226d35a40
>     [ 6121.310284]  ffff88018226c760 ffff880177aefe80 ffff880177aefd18 ffffffff813af047
>     [ 6121.317725] Call Trace:
>     [ 6121.320176]  [<ffffffff810339b6>] ? calc_load_account_idle+0xe/0x1d
>     [ 6121.326437]  [<ffffffff813af047>] schedule+0x159/0x193
>     [ 6121.331569]  [<ffffffff813af449>] schedule_timeout+0x36/0xe2
>     [ 6121.337223]  [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
>     [ 6121.343047]  [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
>     [ 6121.348967]  [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
>     [ 6121.354794]  [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
>     [ 6121.360878]  [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
>     [ 6121.366441]  [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
>     [ 6121.372356]  [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
>     [ 6121.378096]  [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
>     [ 6121.383574]  [<ffffffff81015fe5>] mce_read+0x17f/0x25d
>     [ 6121.388707]  [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
>     [ 6121.394358]  [<ffffffff811121f1>] vfs_read+0xa9/0xe1
>     [ 6121.399317]  [<ffffffff8111248d>] sys_read+0x4c/0x70
>     [ 6121.404278]  [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
>     [ 6127.795001] INFO: rcu_sched_state detected stall on CPU 1 (t=420064 jiffies)
>     [ 6127.796001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
>     [ 6241.410171] INFO: task mcelog:6513 blocked for more than 120 seconds.
> 
> Reverting commit a4dd99250dc makes the stalls go away:
> 
>     rcu: create new rcu_access_index() and use in mce
> 
>     The MCE subsystem needs to sample an RCU-protected index outside of
>     any protection for that index.  If this was a pointer, we would use
>     rcu_access_pointer(), but there is no corresponding rcu_access_index().
>     This commit therefore creates an rcu_access_index() and applies it
>     to MCE.
> 
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Tested-by: Zdenek Kabelac <zkabelac@redhat.com>

Wow!  This is just instructions, just wrapping the access in
ACCESS_ONCE().

Was the original RCU CPU stall repeatable?

							Thanx, Paul

> Here's some bits from my config that might be relevant:
> 
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> # CONFIG_PREEMPT_RCU is not set
> # CONFIG_RCU_TRACE is not set
> CONFIG_RCU_FANOUT=64
> # CONFIG_RCU_FANOUT_EXACT is not set
> # CONFIG_RCU_FAST_NO_HZ is not set
> # CONFIG_TREE_RCU_TRACE is not set
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set
> CONFIG_X86_MCE=y
> CONFIG_X86_MCE_INTEL=y
> CONFIG_X86_MCE_AMD=y
> CONFIG_X86_MCE_THRESHOLD=y
> # CONFIG_X86_MCE_INJECT is not set
> CONFIG_EDAC_DECODE_MCE=y
> # CONFIG_EDAC_MCE_INJ is not set
> CONFIG_EDAC_MCE=y
> # CONFIG_SPARSE_RCU_POINTER is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> CONFIG_RCU_CPU_STALL_DETECTOR=y
> CONFIG_RCU_CPU_STALL_TIMEOUT=60
> CONFIG_RCU_CPU_STALL_DETECTOR_RUNNABLE=y
> 
> Please let me know if there is anything I can do to help sort this out.
> 
> -- Jim
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: mcelog stalls on 2.6.39-rc5
  2011-04-27 23:03 ` Paul E. McKenney
@ 2011-04-28 14:06   ` Jim Schutt
  2011-04-28 16:06     ` Jim Schutt
  0 siblings, 1 reply; 5+ messages in thread
From: Jim Schutt @ 2011-04-28 14:06 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Paul E. McKenney wrote:
> On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
>> Hi,
>>
>> Testing 2.6.39-rc5 is giving me the following stall:
>>
>>     [ 5767.731001] INFO: rcu_sched_state detected stall on CPU 1 (t=60001 jiffies)
>>     [ 5767.732001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
>>     [ 5947.763001] INFO: rcu_sched_state detected stall on CPU 1 (t=240032 jiffies)
>>     [ 5947.764001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
>>     [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
>>     [ 6121.281139] INFO: task mcelog:6513 blocked for more than 120 seconds.
>>     [ 6121.287575] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>     [ 6121.295397]  ffff880177aefce8 0000000000000082 ffffffff810339b6 ffff880226d35a40
>>     [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0 0000000000011e80 ffff880226d35a40
>>     [ 6121.310284]  ffff88018226c760 ffff880177aefe80 ffff880177aefd18 ffffffff813af047
>>     [ 6121.317725] Call Trace:
>>     [ 6121.320176]  [<ffffffff810339b6>] ? calc_load_account_idle+0xe/0x1d
>>     [ 6121.326437]  [<ffffffff813af047>] schedule+0x159/0x193
>>     [ 6121.331569]  [<ffffffff813af449>] schedule_timeout+0x36/0xe2
>>     [ 6121.337223]  [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
>>     [ 6121.343047]  [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
>>     [ 6121.348967]  [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
>>     [ 6121.354794]  [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
>>     [ 6121.360878]  [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
>>     [ 6121.366441]  [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
>>     [ 6121.372356]  [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
>>     [ 6121.378096]  [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
>>     [ 6121.383574]  [<ffffffff81015fe5>] mce_read+0x17f/0x25d
>>     [ 6121.388707]  [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
>>     [ 6121.394358]  [<ffffffff811121f1>] vfs_read+0xa9/0xe1
>>     [ 6121.399317]  [<ffffffff8111248d>] sys_read+0x4c/0x70
>>     [ 6121.404278]  [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
>>     [ 6127.795001] INFO: rcu_sched_state detected stall on CPU 1 (t=420064 jiffies)
>>     [ 6127.796001] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
>>     [ 6241.410171] INFO: task mcelog:6513 blocked for more than 120 seconds.
>>
>> Reverting commit a4dd99250dc makes the stalls go away:
>>
>>     rcu: create new rcu_access_index() and use in mce
>>
>>     The MCE subsystem needs to sample an RCU-protected index outside of
>>     any protection for that index.  If this was a pointer, we would use
>>     rcu_access_pointer(), but there is no corresponding rcu_access_index().
>>     This commit therefore creates an rcu_access_index() and applies it
>>     to MCE.
>>
>>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>     Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
> 
> Wow!  This is just instructions, just wrapping the access in
> ACCESS_ONCE().
> 
> Was the original RCU CPU stall repeatable?

Yes.  I saw it on two different machines.
Both were running as Ceph clients, doing sustained
streaming writes, if that helps any.

I can attempt to repeat with any extra debugging
you'd like me to try.

-- Jim

> 
> 							Thanx, Paul
> 
>> Here's some bits from my config that might be relevant:
>>
>> # RCU Subsystem
>> CONFIG_TREE_RCU=y
>> # CONFIG_PREEMPT_RCU is not set
>> # CONFIG_RCU_TRACE is not set
>> CONFIG_RCU_FANOUT=64
>> # CONFIG_RCU_FANOUT_EXACT is not set
>> # CONFIG_RCU_FAST_NO_HZ is not set
>> # CONFIG_TREE_RCU_TRACE is not set
>> CONFIG_PREEMPT_NOTIFIERS=y
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
>> CONFIG_X86_MCE=y
>> CONFIG_X86_MCE_INTEL=y
>> CONFIG_X86_MCE_AMD=y
>> CONFIG_X86_MCE_THRESHOLD=y
>> # CONFIG_X86_MCE_INJECT is not set
>> CONFIG_EDAC_DECODE_MCE=y
>> # CONFIG_EDAC_MCE_INJ is not set
>> CONFIG_EDAC_MCE=y
>> # CONFIG_SPARSE_RCU_POINTER is not set
>> # CONFIG_RCU_TORTURE_TEST is not set
>> CONFIG_RCU_CPU_STALL_DETECTOR=y
>> CONFIG_RCU_CPU_STALL_TIMEOUT=60
>> CONFIG_RCU_CPU_STALL_DETECTOR_RUNNABLE=y
>>
>> Please let me know if there is anything I can do to help sort this out.
>>
>> -- Jim
>>
> 



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: mcelog stalls on 2.6.39-rc5
  2011-04-28 14:06   ` Jim Schutt
@ 2011-04-28 16:06     ` Jim Schutt
  2011-04-29  0:31       ` Paul E. McKenney
  0 siblings, 1 reply; 5+ messages in thread
From: Jim Schutt @ 2011-04-28 16:06 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Jim Schutt wrote:
> Paul E. McKenney wrote:
>> On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
>>> Hi,
>>>
>>> Testing 2.6.39-rc5 is giving me the following stall:
>>>
>>>     [ 5767.731001] INFO: rcu_sched_state detected stall on CPU 1 
>>> (t=60001 jiffies)
>>>     [ 5767.732001] INFO: rcu_sched_state detected stalls on 
>>> CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
>>>     [ 5947.763001] INFO: rcu_sched_state detected stall on CPU 1 
>>> (t=240032 jiffies)
>>>     [ 5947.764001] INFO: rcu_sched_state detected stalls on 
>>> CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
>>>     [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
>>>     [ 6121.281139] INFO: task mcelog:6513 blocked for more than 120 
>>> seconds.
>>>     [ 6121.287575] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
>>> disables this message.
>>>     [ 6121.295397]  ffff880177aefce8 0000000000000082 
>>> ffffffff810339b6 ffff880226d35a40
>>>     [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0 
>>> 0000000000011e80 ffff880226d35a40
>>>     [ 6121.310284]  ffff88018226c760 ffff880177aefe80 
>>> ffff880177aefd18 ffffffff813af047
>>>     [ 6121.317725] Call Trace:
>>>     [ 6121.320176]  [<ffffffff810339b6>] ? 
>>> calc_load_account_idle+0xe/0x1d
>>>     [ 6121.326437]  [<ffffffff813af047>] schedule+0x159/0x193
>>>     [ 6121.331569]  [<ffffffff813af449>] schedule_timeout+0x36/0xe2
>>>     [ 6121.337223]  [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
>>>     [ 6121.343047]  [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
>>>     [ 6121.348967]  [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
>>>     [ 6121.354794]  [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
>>>     [ 6121.360878]  [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
>>>     [ 6121.366441]  [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
>>>     [ 6121.372356]  [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
>>>     [ 6121.378096]  [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
>>>     [ 6121.383574]  [<ffffffff81015fe5>] mce_read+0x17f/0x25d
>>>     [ 6121.388707]  [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
>>>     [ 6121.394358]  [<ffffffff811121f1>] vfs_read+0xa9/0xe1
>>>     [ 6121.399317]  [<ffffffff8111248d>] sys_read+0x4c/0x70
>>>     [ 6121.404278]  [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
>>>     [ 6127.795001] INFO: rcu_sched_state detected stall on CPU 1 
>>> (t=420064 jiffies)
>>>     [ 6127.796001] INFO: rcu_sched_state detected stalls on 
>>> CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
>>>     [ 6241.410171] INFO: task mcelog:6513 blocked for more than 120 
>>> seconds.
>>>
>>> Reverting commit a4dd99250dc makes the stalls go away:
>>>
>>>     rcu: create new rcu_access_index() and use in mce
>>>
>>>     The MCE subsystem needs to sample an RCU-protected index outside of
>>>     any protection for that index.  If this was a pointer, we would use
>>>     rcu_access_pointer(), but there is no corresponding 
>>> rcu_access_index().
>>>     This commit therefore creates an rcu_access_index() and applies it
>>>     to MCE.
>>>
>>>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>>     Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
>>
>> Wow!  This is just instructions, just wrapping the access in
>> ACCESS_ONCE().
>>
>> Was the original RCU CPU stall repeatable?
> 
> Yes.  I saw it on two different machines.
> Both were running as Ceph clients, doing sustained
> streaming writes, if that helps any.
> 
> I can attempt to repeat with any extra debugging
> you'd like me to try.

Evidently I cannot reproduce this after all, so my revert
test was incorrect.

If I can learn how to reproduce this I'll post with details;
in the meantime ignore this, and sorry for the noise.

-- Jim

> 
> -- Jim
> 
>>
>>                             Thanx, Paul
>>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: mcelog stalls on 2.6.39-rc5
  2011-04-28 16:06     ` Jim Schutt
@ 2011-04-29  0:31       ` Paul E. McKenney
  0 siblings, 0 replies; 5+ messages in thread
From: Paul E. McKenney @ 2011-04-29  0:31 UTC (permalink / raw)
  To: Jim Schutt; +Cc: linux-kernel

On Thu, Apr 28, 2011 at 10:06:57AM -0600, Jim Schutt wrote:
> Jim Schutt wrote:
> >Paul E. McKenney wrote:
> >>On Wed, Apr 27, 2011 at 01:59:01PM -0600, Jim Schutt wrote:
> >>>Hi,
> >>>
> >>>Testing 2.6.39-rc5 is giving me the following stall:
> >>>
> >>>    [ 5767.731001] INFO: rcu_sched_state detected stall on CPU
> >>>1 (t=60001 jiffies)
> >>>    [ 5767.732001] INFO: rcu_sched_state detected stalls on
> >>>CPUs/tasks: { 1} (detected by 0, t=60002 jiffies)
> >>>    [ 5947.763001] INFO: rcu_sched_state detected stall on CPU
> >>>1 (t=240032 jiffies)
> >>>    [ 5947.764001] INFO: rcu_sched_state detected stalls on
> >>>CPUs/tasks: { 1} (detected by 0, t=240034 jiffies)
> >>>    [ 6024.489362] libceph: mon0 172.17.40.34:6789 socket closed
> >>>    [ 6121.281139] INFO: task mcelog:6513 blocked for more
> >>>than 120 seconds.
> >>>    [ 6121.287575] "echo 0 >
> >>>/proc/sys/kernel/hung_task_timeout_secs" disables this
> >>>message.
> >>>    [ 6121.295397]  ffff880177aefce8 0000000000000082
> >>>ffffffff810339b6 ffff880226d35a40
> >>>    [ 6121.302840]  ffff88018226c3b0 ffff88018226c3b0
> >>>0000000000011e80 ffff880226d35a40
> >>>    [ 6121.310284]  ffff88018226c760 ffff880177aefe80
> >>>ffff880177aefd18 ffffffff813af047
> >>>    [ 6121.317725] Call Trace:
> >>>    [ 6121.320176]  [<ffffffff810339b6>] ?
> >>>calc_load_account_idle+0xe/0x1d
> >>>    [ 6121.326437]  [<ffffffff813af047>] schedule+0x159/0x193
> >>>    [ 6121.331569]  [<ffffffff813af449>] schedule_timeout+0x36/0xe2
> >>>    [ 6121.337223]  [<ffffffff810ad9eb>] ? trace_hardirqs_on+0x9/0x20
> >>>    [ 6121.343047]  [<ffffffff813aee2b>] do_wait_for_common+0x97/0xe3
> >>>    [ 6121.348967]  [<ffffffff8103e8fa>] ? try_to_wake_up+0x200/0x200
> >>>    [ 6121.354794]  [<ffffffff8107a4bf>] ? __raw_spin_lock_irq+0x17/0x2f
> >>>    [ 6121.360878]  [<ffffffff813af2a9>] wait_for_common+0x36/0x4d
> >>>    [ 6121.366441]  [<ffffffff813af378>] wait_for_completion+0x1d/0x1f
> >>>    [ 6121.372356]  [<ffffffff8109ae6d>] synchronize_sched+0x40/0x49
> >>>    [ 6121.378096]  [<ffffffff810635b8>] ? find_get_pid+0x1b/0x1b
> >>>    [ 6121.383574]  [<ffffffff81015fe5>] mce_read+0x17f/0x25d
> >>>    [ 6121.388707]  [<ffffffff81111af5>] ? rw_verify_area+0xac/0xdb
> >>>    [ 6121.394358]  [<ffffffff811121f1>] vfs_read+0xa9/0xe1
> >>>    [ 6121.399317]  [<ffffffff8111248d>] sys_read+0x4c/0x70
> >>>    [ 6121.404278]  [<ffffffff813b6deb>] system_call_fastpath+0x16/0x1b
> >>>    [ 6127.795001] INFO: rcu_sched_state detected stall on CPU
> >>>1 (t=420064 jiffies)
> >>>    [ 6127.796001] INFO: rcu_sched_state detected stalls on
> >>>CPUs/tasks: { 1} (detected by 0, t=420066 jiffies)
> >>>    [ 6241.410171] INFO: task mcelog:6513 blocked for more
> >>>than 120 seconds.
> >>>
> >>>Reverting commit a4dd99250dc makes the stalls go away:
> >>>
> >>>    rcu: create new rcu_access_index() and use in mce
> >>>
> >>>    The MCE subsystem needs to sample an RCU-protected index outside of
> >>>    any protection for that index.  If this was a pointer, we would use
> >>>    rcu_access_pointer(), but there is no corresponding
> >>>rcu_access_index().
> >>>    This commit therefore creates an rcu_access_index() and applies it
> >>>    to MCE.
> >>>
> >>>    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >>>    Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
> >>
> >>Wow!  This is just instructions, just wrapping the access in
> >>ACCESS_ONCE().
> >>
> >>Was the original RCU CPU stall repeatable?
> >
> >Yes.  I saw it on two different machines.
> >Both were running as Ceph clients, doing sustained
> >streaming writes, if that helps any.
> >
> >I can attempt to repeat with any extra debugging
> >you'd like me to try.
> 
> Evidently I cannot reproduce this after all, so my revert
> test was incorrect.
> 
> If I can learn how to reproduce this I'll post with details;
> in the meantime ignore this, and sorry for the noise.

OK -- I must confess that I could not understand how this change
could cause the problem!  ;-)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-04-29  0:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-27 19:59 mcelog stalls on 2.6.39-rc5 Jim Schutt
2011-04-27 23:03 ` Paul E. McKenney
2011-04-28 14:06   ` Jim Schutt
2011-04-28 16:06     ` Jim Schutt
2011-04-29  0:31       ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox