linux-alpha.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* rcu_sched_state detected stalls on Alpha with generic config
@ 2011-12-07  6:51 Michael Cree
  2011-12-08  1:16 ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: Michael Cree @ 2011-12-07  6:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-alpha, Shaohua Li, Paul E. McKenney, Richard Henderson,
	Ivan Kokshaysky, Matt Turner

I am seeing "rcu_sched_state detected stall on CPU" messages on Alpha
architecture with a generic SMP config.  Interactive tasks are seen to
lock up, with "INFO: task X blocked for more than 120 seconds" in the
kernel logs, and eventual kernel oops and panic, on latest 3.2-rc4 and
traceable back to 3.0.  Bisection between 2.6.39 and 3.0 leads to commit:

09223371deac67d08ca0b70bd18787920284c967
rcu: Use softirq to address performance regression

as the first bad commit.

Tested on an Alpha ES45 (Titan) with three 1.25 GHz CPUs and 4 GByte
memory.  Testing procedure is to build git software and run its test
suite with -j4 in the make command argument.

The CPU stall messages and eventually system lockup is only seen with a
generic Alpha config, never with a Titan machine specific config.

An example of kernel logs is (this one probably produced when I tried to
shutdown the system when it is falling over):

[45360.930876] INFO: rcu_sched_state detected stall on CPU 1 (t=798848
jiffies)
[45360.931853] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1}
(detected by 0, t=798850 jiffies)
[45489.080225] INFO: task umount:17371 blocked for more than 120 seconds.
[45489.158350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[45489.252100] umount          D fffffc00013461ac     0 17371  17368
0x00000000
[45489.336084] fffffc00fdd53db8 fffffc00fdd97bb8 fffffc000108ca1c
fffffc00dcc9e800
[45489.422998]        fffffc00dcc9e810 fffffc00013b3a5d fffffc000106289c
fffffc00ff0dfda8
[45489.519678]        0000000000000000 fffffc000108c81c fffffc0001cd73f0
0000000000000001
[45489.615381]        fffffc00010627f0 0000000000000000 fffffc00dcc9e920
fffffc00ff0bf780
[45489.712060]        fffffc00010111b8 fffffc00ff0dfda8 fffffc00ff0dfde8
fffffc0001cdaa58
[45489.808740]        0000000000000000 0000000000000000 fffffc0000000000
fffffc0000000000
[45489.907373] Trace:
[45489.930810] [<fffffc000108ca1c>] watchdog+0x200/0x27c
[45489.991357] [<fffffc000106289c>] kthread+0xac/0xc4
[45490.048974] [<fffffc000108c81c>] watchdog+0x0/0x27c
[45490.107568] [<fffffc00010627f0>] kthread+0x0/0xc4
[45490.164209] [<fffffc00010111b8>] kernel_thread+0x28/0x90
[45490.227685]

Let me know if any other information is needed to narrow down the problem.

Cheers
Michael.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: rcu_sched_state detected stalls on Alpha with generic config
  2011-12-07  6:51 rcu_sched_state detected stalls on Alpha with generic config Michael Cree
@ 2011-12-08  1:16 ` Paul E. McKenney
  2012-01-19 23:15   ` Michael Cree
  0 siblings, 1 reply; 4+ messages in thread
From: Paul E. McKenney @ 2011-12-08  1:16 UTC (permalink / raw)
  To: Michael Cree
  Cc: linux-kernel, linux-alpha, Shaohua Li, Richard Henderson,
	Ivan Kokshaysky, Matt Turner

On Wed, Dec 07, 2011 at 07:51:23PM +1300, Michael Cree wrote:
> I am seeing "rcu_sched_state detected stall on CPU" messages on Alpha
> architecture with a generic SMP config.  Interactive tasks are seen to
> lock up, with "INFO: task X blocked for more than 120 seconds" in the
> kernel logs, and eventual kernel oops and panic, on latest 3.2-rc4 and
> traceable back to 3.0.  Bisection between 2.6.39 and 3.0 leads to commit:
> 
> 09223371deac67d08ca0b70bd18787920284c967
> rcu: Use softirq to address performance regression
> 
> as the first bad commit.
> 
> Tested on an Alpha ES45 (Titan) with three 1.25 GHz CPUs and 4 GByte
> memory.  Testing procedure is to build git software and run its test
> suite with -j4 in the make command argument.
> 
> The CPU stall messages and eventually system lockup is only seen with a
> generic Alpha config, never with a Titan machine specific config.
> 
> An example of kernel logs is (this one probably produced when I tried to
> shutdown the system when it is falling over):
> 
> [45360.930876] INFO: rcu_sched_state detected stall on CPU 1 (t=798848
> jiffies)
> [45360.931853] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 1}
> (detected by 0, t=798850 jiffies)
> [45489.080225] INFO: task umount:17371 blocked for more than 120 seconds.
> [45489.158350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [45489.252100] umount          D fffffc00013461ac     0 17371  17368
> 0x00000000
> [45489.336084] fffffc00fdd53db8 fffffc00fdd97bb8 fffffc000108ca1c
> fffffc00dcc9e800
> [45489.422998]        fffffc00dcc9e810 fffffc00013b3a5d fffffc000106289c
> fffffc00ff0dfda8
> [45489.519678]        0000000000000000 fffffc000108c81c fffffc0001cd73f0
> 0000000000000001
> [45489.615381]        fffffc00010627f0 0000000000000000 fffffc00dcc9e920
> fffffc00ff0bf780
> [45489.712060]        fffffc00010111b8 fffffc00ff0dfda8 fffffc00ff0dfde8
> fffffc0001cdaa58
> [45489.808740]        0000000000000000 0000000000000000 fffffc0000000000
> fffffc0000000000
> [45489.907373] Trace:
> [45489.930810] [<fffffc000108ca1c>] watchdog+0x200/0x27c
> [45489.991357] [<fffffc000106289c>] kthread+0xac/0xc4
> [45490.048974] [<fffffc000108c81c>] watchdog+0x0/0x27c
> [45490.107568] [<fffffc00010627f0>] kthread+0x0/0xc4
> [45490.164209] [<fffffc00010111b8>] kernel_thread+0x28/0x90
> [45490.227685]
> 
> Let me know if any other information is needed to narrow down the problem.

Are there any specific differences between the Titan config and the
generic Alpha config that might help me work out what is going on?
Odd -- it looks like the watchdog is complaining at the time of the hang.
Is it possible to use sysrq to get the state of the other tasks?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: rcu_sched_state detected stalls on Alpha with generic config
  2011-12-08  1:16 ` Paul E. McKenney
@ 2012-01-19 23:15   ` Michael Cree
  2012-01-19 23:39     ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: Michael Cree @ 2012-01-19 23:15 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-alpha, Shaohua Li, Richard Henderson,
	Ivan Kokshaysky, Matt Turner

On 8/12/2011, at 2:16 PM, Paul E. McKenney wrote:
> On Wed, Dec 07, 2011 at 07:51:23PM +1300, Michael Cree wrote:
>> I am seeing "rcu_sched_state detected stall on CPU" messages on Alpha
>> architecture with a generic SMP config.  Interactive tasks are seen  
>> to
>> lock up, with "INFO: task X blocked for more than 120 seconds" in the
>> kernel logs, and eventual kernel oops and panic, on latest 3.2-rc4  
>> and
>> traceable back to 3.0.  Bisection between 2.6.39 and 3.0 leads to  
>> commit:
>>
>> 09223371deac67d08ca0b70bd18787920284c967
>> rcu: Use softirq to address performance regression
>>
>> as the first bad commit.
>>
>> Tested on an Alpha ES45 (Titan) with three 1.25 GHz CPUs and 4 GByte
>> memory.  Testing procedure is to build git software and run its test
>> suite with -j4 in the make command argument.
>>
>> The CPU stall messages and eventually system lockup is only seen  
>> with a
>> generic Alpha config, never with a Titan machine specific config.
>>
>> An example of kernel logs is (this one probably produced when I  
>> tried to
>> shutdown the system when it is falling over):
>>
>> [45360.930876] INFO: rcu_sched_state detected stall on CPU 1  
>> (t=798848
>> jiffies)
>> [45360.931853] INFO: rcu_sched_state detected stalls on CPUs/tasks:  
>> { 1}
>> (detected by 0, t=798850 jiffies)
>> [45489.080225] INFO: task umount:17371 blocked for more than 120  
>> seconds.
>> [45489.158350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [45489.252100] umount          D fffffc00013461ac     0 17371  17368
>> 0x00000000
>> [45489.336084] fffffc00fdd53db8 fffffc00fdd97bb8 fffffc000108ca1c
>> fffffc00dcc9e800
>> [45489.422998]        fffffc00dcc9e810 fffffc00013b3a5d  
>> fffffc000106289c
>> fffffc00ff0dfda8
>> [45489.519678]        0000000000000000 fffffc000108c81c  
>> fffffc0001cd73f0
>> 0000000000000001
>> [45489.615381]        fffffc00010627f0 0000000000000000  
>> fffffc00dcc9e920
>> fffffc00ff0bf780
>> [45489.712060]        fffffc00010111b8 fffffc00ff0dfda8  
>> fffffc00ff0dfde8
>> fffffc0001cdaa58
>> [45489.808740]        0000000000000000 0000000000000000  
>> fffffc0000000000
>> fffffc0000000000
>> [45489.907373] Trace:
>> [45489.930810] [<fffffc000108ca1c>] watchdog+0x200/0x27c
>> [45489.991357] [<fffffc000106289c>] kthread+0xac/0xc4
>> [45490.048974] [<fffffc000108c81c>] watchdog+0x0/0x27c
>> [45490.107568] [<fffffc00010627f0>] kthread+0x0/0xc4
>> [45490.164209] [<fffffc00010111b8>] kernel_thread+0x28/0x90
>> [45490.227685]
>>
>> Let me know if any other information is needed to narrow down the  
>> problem.
>
> Are there any specific differences between the Titan config and the
> generic Alpha config that might help me work out what is going on?
> Odd -- it looks like the watchdog is complaining at the time of the  
> hang.
> Is it possible to use sysrq to get the state of the other tasks?

Been rather busy or on holiday in the meantime, but just got back to  
looking at this problem.

It looks like it is a faulty compiler, not a problem in the kernel  
source.

The problems described above are seen when compiling with gcc-4.4 and  
we now realise that gcc-4.4 has an optimisation problem on Alpha and  
is the cause of reported problems/crashes in software other than the  
kernel.  These problems are fixed when compiling with gcc-4.5.

I have now compiled a "bad" kernel with gcc-4.5 and I no longer can  
trigger the failure --- the kernel worked for a couple of hours (and a  
bad one compiled with gcc-4.4 has never lasted that long without  
crashing!).

Cheers
Michael.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: rcu_sched_state detected stalls on Alpha with generic config
  2012-01-19 23:15   ` Michael Cree
@ 2012-01-19 23:39     ` Paul E. McKenney
  0 siblings, 0 replies; 4+ messages in thread
From: Paul E. McKenney @ 2012-01-19 23:39 UTC (permalink / raw)
  To: Michael Cree
  Cc: linux-kernel, linux-alpha, Shaohua Li, Richard Henderson,
	Ivan Kokshaysky, Matt Turner

On Fri, Jan 20, 2012 at 12:15:18PM +1300, Michael Cree wrote:
> On 8/12/2011, at 2:16 PM, Paul E. McKenney wrote:
> >On Wed, Dec 07, 2011 at 07:51:23PM +1300, Michael Cree wrote:
> >>I am seeing "rcu_sched_state detected stall on CPU" messages on Alpha
> >>architecture with a generic SMP config.  Interactive tasks are
> >>seen to
> >>lock up, with "INFO: task X blocked for more than 120 seconds" in the
> >>kernel logs, and eventual kernel oops and panic, on latest
> >>3.2-rc4 and
> >>traceable back to 3.0.  Bisection between 2.6.39 and 3.0 leads
> >>to commit:
> >>
> >>09223371deac67d08ca0b70bd18787920284c967
> >>rcu: Use softirq to address performance regression
> >>
> >>as the first bad commit.
> >>
> >>Tested on an Alpha ES45 (Titan) with three 1.25 GHz CPUs and 4 GByte
> >>memory.  Testing procedure is to build git software and run its test
> >>suite with -j4 in the make command argument.
> >>
> >>The CPU stall messages and eventually system lockup is only seen
> >>with a
> >>generic Alpha config, never with a Titan machine specific config.
> >>
> >>An example of kernel logs is (this one probably produced when I
> >>tried to
> >>shutdown the system when it is falling over):
> >>
> >>[45360.930876] INFO: rcu_sched_state detected stall on CPU 1
> >>(t=798848
> >>jiffies)
> >>[45360.931853] INFO: rcu_sched_state detected stalls on
> >>CPUs/tasks: { 1}
> >>(detected by 0, t=798850 jiffies)
> >>[45489.080225] INFO: task umount:17371 blocked for more than 120
> >>seconds.
> >>[45489.158350] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >>disables this message.
> >>[45489.252100] umount          D fffffc00013461ac     0 17371  17368
> >>0x00000000
> >>[45489.336084] fffffc00fdd53db8 fffffc00fdd97bb8 fffffc000108ca1c
> >>fffffc00dcc9e800
> >>[45489.422998]        fffffc00dcc9e810 fffffc00013b3a5d
> >>fffffc000106289c
> >>fffffc00ff0dfda8
> >>[45489.519678]        0000000000000000 fffffc000108c81c
> >>fffffc0001cd73f0
> >>0000000000000001
> >>[45489.615381]        fffffc00010627f0 0000000000000000
> >>fffffc00dcc9e920
> >>fffffc00ff0bf780
> >>[45489.712060]        fffffc00010111b8 fffffc00ff0dfda8
> >>fffffc00ff0dfde8
> >>fffffc0001cdaa58
> >>[45489.808740]        0000000000000000 0000000000000000
> >>fffffc0000000000
> >>fffffc0000000000
> >>[45489.907373] Trace:
> >>[45489.930810] [<fffffc000108ca1c>] watchdog+0x200/0x27c
> >>[45489.991357] [<fffffc000106289c>] kthread+0xac/0xc4
> >>[45490.048974] [<fffffc000108c81c>] watchdog+0x0/0x27c
> >>[45490.107568] [<fffffc00010627f0>] kthread+0x0/0xc4
> >>[45490.164209] [<fffffc00010111b8>] kernel_thread+0x28/0x90
> >>[45490.227685]
> >>
> >>Let me know if any other information is needed to narrow down
> >>the problem.
> >
> >Are there any specific differences between the Titan config and the
> >generic Alpha config that might help me work out what is going on?
> >Odd -- it looks like the watchdog is complaining at the time of
> >the hang.
> >Is it possible to use sysrq to get the state of the other tasks?
> 
> Been rather busy or on holiday in the meantime, but just got back to
> looking at this problem.
> 
> It looks like it is a faulty compiler, not a problem in the kernel
> source.
> 
> The problems described above are seen when compiling with gcc-4.4
> and we now realise that gcc-4.4 has an optimisation problem on Alpha
> and is the cause of reported problems/crashes in software other than
> the kernel.  These problems are fixed when compiling with gcc-4.5.
> 
> I have now compiled a "bad" kernel with gcc-4.5 and I no longer can
> trigger the failure --- the kernel worked for a couple of hours (and
> a bad one compiled with gcc-4.4 has never lasted that long without
> crashing!).

Thank you for letting me know -- always glad to blame the compiler.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-01-19 23:39 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-07  6:51 rcu_sched_state detected stalls on Alpha with generic config Michael Cree
2011-12-08  1:16 ` Paul E. McKenney
2012-01-19 23:15   ` Michael Cree
2012-01-19 23:39     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).