rcu self-detected stall messages on OMAP3, 4 boards

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* rcu self-detected stall messages on OMAP3, 4 boards
@ 2012-09-12 22:51 Paul Walmsley
  2012-09-13  1:12 ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-12 22:51 UTC (permalink / raw)
  To: linux-arm-kernel


Hi Paul

Recently several of us have been seeing "INFO: rcu_sched self-detected
stall on CPU { 0} (t=20611 jiffies)" stack tracebacks on various OMAP3
and 4 board.

I only noticed it during v3.6-rc3, but I suspect it's been happening 
for users at least since May:

http://www.mail-archive.com/linux-omap at vger.kernel.org/msg68942.html


The only quasi-reproducible test case that I've found so far 
is to boot the board with serial console enabled to a login prompt, then 
wait for a few minutes, then send a keypress to the board via serial.
The tracebacks I get look like this:

[  467.480712] INFO: rcu_sched self-detected stall on CPU { 0}  (t=20611 jiffies)
[  467.484741] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc94>] (rcu_check_callbacks+0x180/0x630)
[  467.489929] [<c00acc94>] (rcu_check_callbacks+0x180/0x630) from [<c0052b18>] (update_process_times+0x38/0x68)
[  467.495361] [<c0052b18>] (update_process_times+0x38/0x68) from [<c008c04c>] (tick_sched_timer+0x80/0xec)
[  467.500518] [<c008c04c>] (tick_sched_timer+0x80/0xec) from [<c0068544>] (__run_hrtimer+0x7c/0x1e0)
[  467.505401] [<c0068544>] (__run_hrtimer+0x7c/0x1e0) from [<c0069328>] (hrtimer_interrupt+0x11c/0x2d0)
[  467.510437] [<c0069328>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
[  467.515350] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a71a0>] (handle_percpu_devid_irq+0x90/0x13c)
[  467.520477] [<c00a71a0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a3914>] (generic_handle_irq+0x30/0x48)
[  467.525939] [<c00a3914>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
[  467.530731] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
[  467.535339] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ce4>] (__irq_svc+0x44/0x5c)
[  467.539886] Exception stack(0xc0729f58 to 0xc0729fa0)
[  467.542663] 9f40:                                                       00047f2a 00000001
[  467.547119] 9f60: 00000000 c074a940 c0728000 c07c4b08 c05045a0 c074be20 00000000 411fc092
[  467.551574] 9f80: c074c040 00000000 00000001 c0729fa0 00047f2b c0014f50 20000113 ffffffff
[  467.556030] [<c04f8ce4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
[  467.560485] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
[  467.564971] [<c001517c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)

Looks like this message was added as of commit
a858af2875fb291d0f4b0a4419fefbf03c2379c0 ("rcu: Print scheduling-clock
information on RCU CPU stall-warning messages").

Do you have any suggestions for how we can determine what is causing
this?

Here's an example of a kernel config that we use:

http://www.pwsan.com/omap/testlogs/am33xx_hwmod_clock_devel_3.7/20120912092510/build/omap2plus_defconfig/Kconfig

A few observations that may or may not be relevant: we use NO_HZ, and 
we also have a clockevents timer that is relatively slow to program.


regards,

- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-12 22:51 rcu self-detected stall messages on OMAP3, 4 boards Paul Walmsley
@ 2012-09-13  1:12 ` Paul E. McKenney
  2012-09-13 18:52   ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-13  1:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Sep 12, 2012 at 10:51:30PM +0000, Paul Walmsley wrote:
> 
> Hi Paul
> 
> Recently several of us have been seeing "INFO: rcu_sched self-detected
> stall on CPU { 0} (t=20611 jiffies)" stack tracebacks on various OMAP3
> and 4 board.
> 
> I only noticed it during v3.6-rc3, but I suspect it's been happening 
> for users at least since May:
> 
> http://www.mail-archive.com/linux-omap at vger.kernel.org/msg68942.html
> 
> 
> The only quasi-reproducible test case that I've found so far 
> is to boot the board with serial console enabled to a login prompt, then 
> wait for a few minutes, then send a keypress to the board via serial.
> The tracebacks I get look like this:

Interesting.  I am assuming that the interrupt in the stack below came
from idle, if not, please let me know what.

Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
give me a bit more information about why RCU thought that there was
a stall.  (CCing Becky Bruce, who saw something similar recently.)

Subodh Nijsure (also CCed) reported something that might be similar on
ARM, and also reported that setting the following got rid of the stalls:

	CONFIG_CPU_IDLE=y
	CONFIG_CPU_IDLE_GOV_LADDER=y
	CONFIG_CPU_IDLE_GOV_MENU=y

At which point he was happy, which was good, but which also left the
underlying problem unsolved.  Do these affect your system?  If so,
do they cause a different ARM idle loop to be executed?

							Thanx, Paul

> [  467.480712] INFO: rcu_sched self-detected stall on CPU { 0}  (t=20611 jiffies)
> [  467.484741] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc94>] (rcu_check_callbacks+0x180/0x630)
> [  467.489929] [<c00acc94>] (rcu_check_callbacks+0x180/0x630) from [<c0052b18>] (update_process_times+0x38/0x68)
> [  467.495361] [<c0052b18>] (update_process_times+0x38/0x68) from [<c008c04c>] (tick_sched_timer+0x80/0xec)
> [  467.500518] [<c008c04c>] (tick_sched_timer+0x80/0xec) from [<c0068544>] (__run_hrtimer+0x7c/0x1e0)
> [  467.505401] [<c0068544>] (__run_hrtimer+0x7c/0x1e0) from [<c0069328>] (hrtimer_interrupt+0x11c/0x2d0)
> [  467.510437] [<c0069328>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
> [  467.515350] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a71a0>] (handle_percpu_devid_irq+0x90/0x13c)
> [  467.520477] [<c00a71a0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a3914>] (generic_handle_irq+0x30/0x48)
> [  467.525939] [<c00a3914>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
> [  467.530731] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
> [  467.535339] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ce4>] (__irq_svc+0x44/0x5c)
> [  467.539886] Exception stack(0xc0729f58 to 0xc0729fa0)
> [  467.542663] 9f40:                                                       00047f2a 00000001
> [  467.547119] 9f60: 00000000 c074a940 c0728000 c07c4b08 c05045a0 c074be20 00000000 411fc092
> [  467.551574] 9f80: c074c040 00000000 00000001 c0729fa0 00047f2b c0014f50 20000113 ffffffff
> [  467.556030] [<c04f8ce4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
> [  467.560485] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
> [  467.564971] [<c001517c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)
> 
> Looks like this message was added as of commit
> a858af2875fb291d0f4b0a4419fefbf03c2379c0 ("rcu: Print scheduling-clock
> information on RCU CPU stall-warning messages").
> 
> Do you have any suggestions for how we can determine what is causing
> this?
> 
> Here's an example of a kernel config that we use:
> 
> http://www.pwsan.com/omap/testlogs/am33xx_hwmod_clock_devel_3.7/20120912092510/build/omap2plus_defconfig/Kconfig
> 
> A few observations that may or may not be relevant: we use NO_HZ, and 
> we also have a clockevents timer that is relatively slow to program.
> 
> 
> regards,
> 
> - Paul
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-13  1:12 ` Paul E. McKenney
@ 2012-09-13 18:52   ` Paul Walmsley
  2012-09-20  0:03     ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-13 18:52 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul,

thanks for the reply,

On Wed, 12 Sep 2012, Paul E. McKenney wrote:

> Interesting.  I am assuming that the interrupt in the stack below came
> from idle, if not, please let me know what.

According to the exception stack section in the original traceback, it
appears that the serial interrupt took the SoC out of idle.

> Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
> give me a bit more information about why RCU thought that there was
> a stall.  (CCing Becky Bruce, who saw something similar recently.)

At the bottom of this mail is a series of tracebacks with
CONFIG_RCU_CPU_STALL_INFO=y.  Unlike the traceback that was sent in
the last message, these were not triggered by serial activity.  These
appeared every 300 seconds.

> Subodh Nijsure (also CCed) reported something that might be similar on
> ARM, and also reported that setting the following got rid of the stalls:
> 
> 	CONFIG_CPU_IDLE=y
> 	CONFIG_CPU_IDLE_GOV_LADDER=y
> 	CONFIG_CPU_IDLE_GOV_MENU=y
> 
> At which point he was happy, which was good, but which also left the
> underlying problem unsolved.  Do these affect your system?  If so,
> do they cause a different ARM idle loop to be executed?

Will give this a try.  What board was Subodh using?


- Paul


Debian GNU/Linux wheezy/sid armel ttyO2

armel login: [  305.942108] INFO: rcu_sched self-detected stall on CPU
[  305.944946]  1: (7 GPs behind) idle=57b/1/0 
[  305.947265]   (t=22811 jiffies)
[  305.949066] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678)
[  305.954223] [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678) from [<c00529e0>] (update_process_times+0x38/0x68)
[  305.959625] [<c00529e0>] (update_process_times+0x38/0x68) from [<c008bf14>] (tick_sched_timer+0x80/0xec)
[  305.964813] [<c008bf14>] (tick_sched_timer+0x80/0xec) from [<c006840c>] (__run_hrtimer+0x7c/0x1e0)
[  305.969696] [<c006840c>] (__run_hrtimer+0x7c/0x1e0) from [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0)
[  305.974731] [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
[  305.979644] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c)
[  305.984741] [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a37dc>] (generic_handle_irq+0x30/0x48)
[  305.990234] [<c00a37dc>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
[  305.995025] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
[  305.999633] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ca4>] (__irq_svc+0x44/0x5c)
[  306.004180] Exception stack(0xde86ff88 to 0xde86ffd0)
[  306.006927] ff80:                   0003c6d0 00000001 00000000 de8660c0 de86e000 c07c23c8
[  306.011383] ffa0: c0504590 c0749e20 00000000 411fc092 c074a040 00000000 00000001 de86ffd0
[  306.015838] ffc0: 0003c6d1 c0014f50 20000113 ffffffff
[  306.018585] [<c04f8ca4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
[  306.023040] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
[  306.027526] [<c001517c>] (cpu_idle+0x9c/0x114) from [<804f1af4>] (0x804f1af4)
[  602.004486] INFO: rcu_sched detected stalls on CPUs/tasks:
[  602.007476]  (detected by 0, t=60707 jiffies)
[  602.009857] INFO: Stall ended before state dump start
[  906.027893] INFO: rcu_sched self-detected stall on CPU
[  906.030700]  1: (6 GPs behind) idle=647/1/0 
[  906.033020]   (t=38379 jiffies)
[  906.034790] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678)
[  906.039947] [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678) from [<c00529e0>] (update_process_times+0x38/0x68)
[  906.045349] [<c00529e0>] (update_process_times+0x38/0x68) from [<c008bf14>] (tick_sched_timer+0x80/0xec)
[  906.050537] [<c008bf14>] (tick_sched_timer+0x80/0xec) from [<c006840c>] (__run_hrtimer+0x7c/0x1e0)
[  906.055419] [<c006840c>] (__run_hrtimer+0x7c/0x1e0) from [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0)
[  906.060424] [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
[  906.065307] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c)
[  906.070434] [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a37dc>] (generic_handle_irq+0x30/0x48)
[  906.075897] [<c00a37dc>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
[  906.080688] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
[  906.085296] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ca4>] (__irq_svc+0x44/0x5c)
[  906.089843] Exception stack(0xde86ff88 to 0xde86ffd0)
[  906.092590] ff80:                   0003cb06 00000001 00000000 de8660c0 de86e000 c07c23c8
[  906.097045] ffa0: c0504590 c0749e20 00000000 411fc092 c074a040 00000000 00000001 de86ffd0
[  906.101501] ffc0: 0003cb07 c0014f50 20000113 ffffffff
[  906.104278] [<c04f8ca4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
[  906.108734] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
[  906.113189] [<c001517c>] (cpu_idle+0x9c/0x114) from [<804f1af4>] (0x804f1af4)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-13 18:52   ` Paul Walmsley
@ 2012-09-20  0:03     ` Paul E. McKenney
  2012-09-20  7:56       ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-20  0:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 13, 2012 at 06:52:10PM +0000, Paul Walmsley wrote:
> Hi Paul,
> 
> thanks for the reply,
> 
> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
> 
> > Interesting.  I am assuming that the interrupt in the stack below came
> > from idle, if not, please let me know what.
> 
> According to the exception stack section in the original traceback, it
> appears that the serial interrupt took the SoC out of idle.
> 
> > Could you please reproduce with CONFIG_RCU_CPU_STALL_INFO=y?  That would
> > give me a bit more information about why RCU thought that there was
> > a stall.  (CCing Becky Bruce, who saw something similar recently.)
> 
> At the bottom of this mail is a series of tracebacks with
> CONFIG_RCU_CPU_STALL_INFO=y.  Unlike the traceback that was sent in
> the last message, these were not triggered by serial activity.  These
> appeared every 300 seconds.
> 
> > Subodh Nijsure (also CCed) reported something that might be similar on
> > ARM, and also reported that setting the following got rid of the stalls:
> > 
> > 	CONFIG_CPU_IDLE=y
> > 	CONFIG_CPU_IDLE_GOV_LADDER=y
> > 	CONFIG_CPU_IDLE_GOV_MENU=y
> > 
> > At which point he was happy, which was good, but which also left the
> > underlying problem unsolved.  Do these affect your system?  If so,
> > do they cause a different ARM idle loop to be executed?
> 
> Will give this a try.  What board was Subodh using?

Hello, Paul,

Any news on trying the above settings?

							Thanx, Paul

> - Paul
> 
> 
> Debian GNU/Linux wheezy/sid armel ttyO2
> 
> armel login: [  305.942108] INFO: rcu_sched self-detected stall on CPU
> [  305.944946]  1: (7 GPs behind) idle=57b/1/0 
> [  305.947265]   (t=22811 jiffies)
> [  305.949066] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678)
> [  305.954223] [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678) from [<c00529e0>] (update_process_times+0x38/0x68)
> [  305.959625] [<c00529e0>] (update_process_times+0x38/0x68) from [<c008bf14>] (tick_sched_timer+0x80/0xec)
> [  305.964813] [<c008bf14>] (tick_sched_timer+0x80/0xec) from [<c006840c>] (__run_hrtimer+0x7c/0x1e0)
> [  305.969696] [<c006840c>] (__run_hrtimer+0x7c/0x1e0) from [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0)
> [  305.974731] [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
> [  305.979644] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c)
> [  305.984741] [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a37dc>] (generic_handle_irq+0x30/0x48)
> [  305.990234] [<c00a37dc>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
> [  305.995025] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
> [  305.999633] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ca4>] (__irq_svc+0x44/0x5c)
> [  306.004180] Exception stack(0xde86ff88 to 0xde86ffd0)
> [  306.006927] ff80:                   0003c6d0 00000001 00000000 de8660c0 de86e000 c07c23c8
> [  306.011383] ffa0: c0504590 c0749e20 00000000 411fc092 c074a040 00000000 00000001 de86ffd0
> [  306.015838] ffc0: 0003c6d1 c0014f50 20000113 ffffffff
> [  306.018585] [<c04f8ca4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
> [  306.023040] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
> [  306.027526] [<c001517c>] (cpu_idle+0x9c/0x114) from [<804f1af4>] (0x804f1af4)
> [  602.004486] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  602.007476]  (detected by 0, t=60707 jiffies)
> [  602.009857] INFO: Stall ended before state dump start
> [  906.027893] INFO: rcu_sched self-detected stall on CPU
> [  906.030700]  1: (6 GPs behind) idle=647/1/0 
> [  906.033020]   (t=38379 jiffies)
> [  906.034790] [<c001b7cc>] (unwind_backtrace+0x0/0xf0) from [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678)
> [  906.039947] [<c00acc28>] (rcu_check_callbacks+0x1b0/0x678) from [<c00529e0>] (update_process_times+0x38/0x68)
> [  906.045349] [<c00529e0>] (update_process_times+0x38/0x68) from [<c008bf14>] (tick_sched_timer+0x80/0xec)
> [  906.050537] [<c008bf14>] (tick_sched_timer+0x80/0xec) from [<c006840c>] (__run_hrtimer+0x7c/0x1e0)
> [  906.055419] [<c006840c>] (__run_hrtimer+0x7c/0x1e0) from [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0)
> [  906.060424] [<c00691f0>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a04c>] (twd_handler+0x30/0x44)
> [  906.065307] [<c001a04c>] (twd_handler+0x30/0x44) from [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c)
> [  906.070434] [<c00a7068>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a37dc>] (generic_handle_irq+0x30/0x48)
> [  906.075897] [<c00a37dc>] (generic_handle_irq+0x30/0x48) from [<c0014c58>] (handle_IRQ+0x4c/0xac)
> [  906.080688] [<c0014c58>] (handle_IRQ+0x4c/0xac) from [<c0008478>] (gic_handle_irq+0x28/0x5c)
> [  906.085296] [<c0008478>] (gic_handle_irq+0x28/0x5c) from [<c04f8ca4>] (__irq_svc+0x44/0x5c)
> [  906.089843] Exception stack(0xde86ff88 to 0xde86ffd0)
> [  906.092590] ff80:                   0003cb06 00000001 00000000 de8660c0 de86e000 c07c23c8
> [  906.097045] ffa0: c0504590 c0749e20 00000000 411fc092 c074a040 00000000 00000001 de86ffd0
> [  906.101501] ffc0: 0003cb07 c0014f50 20000113 ffffffff
> [  906.104278] [<c04f8ca4>] (__irq_svc+0x44/0x5c) from [<c0014f50>] (default_idle+0x20/0x44)
> [  906.108734] [<c0014f50>] (default_idle+0x20/0x44) from [<c001517c>] (cpu_idle+0x9c/0x114)
> [  906.113189] [<c001517c>] (cpu_idle+0x9c/0x114) from [<804f1af4>] (0x804f1af4)
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20  0:03     ` Paul E. McKenney
@ 2012-09-20  7:56       ` Paul Walmsley
  2012-09-20 15:03         ` Bruce, Becky
  2012-09-20 21:49         ` Bruce, Becky
  0 siblings, 2 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-20  7:56 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Wed, 19 Sep 2012, Paul E. McKenney wrote:

> On Thu, Sep 13, 2012 at 06:52:10PM +0000, Paul Walmsley wrote:
> 
> > On Wed, 12 Sep 2012, Paul E. McKenney wrote:
>  
> > > Subodh Nijsure (also CCed) reported something that might be similar on
> > > ARM, and also reported that setting the following got rid of the stalls:
> > > 
> > > 	CONFIG_CPU_IDLE=y
> > > 	CONFIG_CPU_IDLE_GOV_LADDER=y
> > > 	CONFIG_CPU_IDLE_GOV_MENU=y
> > > 
> > > At which point he was happy, which was good, but which also left the
> > > underlying problem unsolved.  Do these affect your system?  If so,
> > > do they cause a different ARM idle loop to be executed?
> > 
> > Will give this a try.  What board was Subodh using?
> 
> Any news on trying the above settings?

Sorry, haven't had the chance to try it yet due to the impending merge 
window opening.  Once things settle down I'll give it a try -- or maybe 
someone else can test it in the meantime.


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20  7:56       ` Paul Walmsley
@ 2012-09-20 15:03         ` Bruce, Becky
  2012-09-20 21:49         ` Bruce, Becky
  1 sibling, 0 replies; 45+ messages in thread
From: Bruce, Becky @ 2012-09-20 15:03 UTC (permalink / raw)
  To: linux-arm-kernel

I'll give it a try on my panda later today.

B

Sent from my iPhone

On Sep 20, 2012, at 2:56 AM, "Paul Walmsley" <paul@pwsan.com> wrote:

> Hi,
> 
> On Wed, 19 Sep 2012, Paul E. McKenney wrote:
> 
>> On Thu, Sep 13, 2012 at 06:52:10PM +0000, Paul Walmsley wrote:
>> 
>>> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
>> 
>>>> Subodh Nijsure (also CCed) reported something that might be similar on
>>>> ARM, and also reported that setting the following got rid of the stalls:
>>>> 
>>>>    CONFIG_CPU_IDLE=y
>>>>    CONFIG_CPU_IDLE_GOV_LADDER=y
>>>>    CONFIG_CPU_IDLE_GOV_MENU=y
>>>> 
>>>> At which point he was happy, which was good, but which also left the
>>>> underlying problem unsolved.  Do these affect your system?  If so,
>>>> do they cause a different ARM idle loop to be executed?
>>> 
>>> Will give this a try.  What board was Subodh using?
>> 
>> Any news on trying the above settings?
> 
> Sorry, haven't had the chance to try it yet due to the impending merge 
> window opening.  Once things settle down I'll give it a try -- or maybe 
> someone else can test it in the meantime.
> 
> 
> - Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20  7:56       ` Paul Walmsley
  2012-09-20 15:03         ` Bruce, Becky
@ 2012-09-20 21:49         ` Bruce, Becky
  2012-09-20 22:01           ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Bruce, Becky @ 2012-09-20 21:49 UTC (permalink / raw)
  To: linux-arm-kernel


On Sep 20, 2012, at 2:56 AM, Paul Walmsley wrote:

> Hi,
> 
> On Wed, 19 Sep 2012, Paul E. McKenney wrote:
> 
>> On Thu, Sep 13, 2012 at 06:52:10PM +0000, Paul Walmsley wrote:
>> 
>>> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
>> 
>>>> Subodh Nijsure (also CCed) reported something that might be similar on
>>>> ARM, and also reported that setting the following got rid of the stalls:
>>>> 
>>>> 	CONFIG_CPU_IDLE=y
>>>> 	CONFIG_CPU_IDLE_GOV_LADDER=y
>>>> 	CONFIG_CPU_IDLE_GOV_MENU=y
>>>> 
>>>> At which point he was happy, which was good, but which also left the
>>>> underlying problem unsolved.  Do these affect your system?  If so,
>>>> do they cause a different ARM idle loop to be executed?
>>> 
>>> Will give this a try.  What board was Subodh using?
>> 
>> Any news on trying the above settings?
> 
> Sorry, haven't had the chance to try it yet due to the impending merge 
> window opening.  Once things settle down I'll give it a try -- or maybe 
> someone else can test it in the meantime.
> 

OK, people, you can stop heckling me about "sent from my iPhone" - I'm in the wilds of rural Louisiana with really bad internet service and was trying to work on my phone (but, alas, did not notice the CC list included the entire universe).   Shame on me.

With the above set, I don't seem to see any stalls with the RCU timeout set to 60s (the default).  I left the board running for 25 minutes; I will fire it up again later and let it run for a bit longer, but usually I end up seeing the problem pretty quickly so I don't expect that to result in anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact (it wasn't on when I tested Paul's tree).

If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do see stalls (both with CPU_IDLE stuff and without). 

I'm using the default omap2plus config, with RCU stall info enabled and the cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 (OMAP4460)

root at omap4430-panda:~# zcat /proc/config/gz. .. .. ..gz | grep RCU
# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_FANOUT=32
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=5
CONFIG_RCU_CPU_STALL_INFO=y
# CONFIG_RCU_TRACE is not set
root at omap4430-panda:~# zcat /proc/config.gz | grep IDLE
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y

Paul, let me know if you want me to try anything else.  My internet connection is spotty today but (obviously :) I will see emails on my phone and will test when I can.

Cheers,
B

Console output:

root at omap4430-panda:~# [  377.495361] INFO: rcu_sched self-detected stall on CPU
[  377.500762] .1: (1 ticks this GP) idle=dcd/1/0 
[  377.505523] . (t=761 jiffies)
[  377.508666] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
[  377.518615] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
[  377.529022] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
[  377.538970] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
[  377.548339] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
[  377.558013] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
[  377.567413] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
[  377.577270] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
[  377.587768] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
[  377.596984] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
[  377.605834] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
[  377.614593] Exception stack(0xee06ff08 to 0xee06ff50)
[  377.619873] ff00:                   00000001 00000001 00000000 3b9aca00 c608bc44 00000057
[  377.628448] ff20: c146a4f0 00000002 c54e3b8f 00000056 c048fb3c c0c47654 00000001 ee06ff50
[  377.637023] ff40: c0084774 c0390fac 20000113 ffffffff
[  377.642333] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
[  377.651458] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
[  377.661499] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
[  377.672515] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
[  377.683563] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
[  377.692779] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
root at omap4430-panda:~# [  821.495361] INFO: rcu_sched self-detected stall on CPU
[  821.500762] .1: (1 ticks this GP) idle=0ad/1/0 
[  821.505523] . (t=755 jiffies)
[  821.508666] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
[  821.518615] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
[  821.529022] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
[  821.538940] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
[  821.548339] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
[  821.558013] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
[  821.567413] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
[  821.577270] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
[  821.587768] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
[  821.596984] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
[  821.605834] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
[  821.614593] Exception stack(0xee06ff08 to 0xee06ff50)
[  821.619873] ff00:                   00000001 00000001 00000000 3b9aca00 267f1536 000000bf
[  821.628448] ff20: c146a4f0 00000002 7da95560 000000be c048fb3c c0c47654 00000000 ee06ff50
[  821.637023] ff40: c0084774 c0390fac 20000113 ffffffff
[  821.642333] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
[  821.651458] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
[  821.661468] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
[  821.672515] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
[  821.683563] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
[  821.692749] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
[  827.495361] INFO: rcu_sched self-detected stall on CPU
[  827.500762] .1: (1 ticks this GP) idle=0d1/1/0 
[  827.505523] . (t=733 jiffies)
[  827.508636] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
[  827.518585] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
[  827.528991] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
[  827.538940] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
[  827.548339] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
[  827.558013] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
[  827.567382] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
[  827.577239] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
[  827.587738] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
[  827.596954] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
[  827.605804] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
[  827.614562] Exception stack(0xee06ff08 to 0xee06ff50)
[  827.619842] ff00:                   00000001 00000001 00000000 3b9aca00 8c1fd142 000000c0
[  827.628417] ff20: c146a4f0 00000002 a8004dd7 000000bf c048fb3c c0c47654 00000000 ee06ff50
[  827.636993] ff40: c0084774 c0390fac 20000113 ffffffff
[  827.642303] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
[  827.651428] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
[  827.661437] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
[  827.672485] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
[  827.683502] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
[  827.692718] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
[  833.495391] INFO: rcu_sched self-detected stall on CPU
[  833.500793] .1: (3 GPs behind) idle=0d9/1/0 
[  833.505279] . (t=733 jiffies)
[  833.508392] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
[  833.518341] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
[  833.528747] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
[  833.538696] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
[  833.548095] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
[  833.557769] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
[  833.567138] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
[  833.576995] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
[  833.587493] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
[  833.596710] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
[  833.605560] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
[  833.614318] Exception stack(0xee06ff08 to 0xee06ff50)
[  833.619598] ff00:                   00000001 00000001 00000000 3b9aca00 f1c10484 000000c1
[  833.628173] ff20: c146a4f0 00000002 d257bd83 000000c0 c048fb3c c0c47654 00000001 ee06ff50
[  833.636749] ff40: c0084774 c0390fac 20000113 ffffffff
[  833.642059] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
[  833.651184] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
[  833.661193] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
[  833.672241] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
[  833.683288] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
[  833.692474] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
[  839.495422] INFO: rcu_sched self-detected stall on CPU
[  839.500823] .1: (1 ticks this GP) idle=0fd/1/0 
[  839.505554] . (t=733 jiffies)
[  839.508697] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
[  839.518646] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
[  839.529052] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
[  839.538970] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
[  839.548370] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
[  839.558044] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
[  839.567443] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
[  839.577301] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
[  839.587799] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
[  839.597015] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
[  839.605865] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
[  839.614593] Exception stack(0xee06ff08 to 0xee06ff50)
[  839.619903] ff00:                   00000001 00000001 00000000 3b9aca00 576237c7 000000c3
[  839.628479] ff20: c146a4f0 00000002 284df8f1 000000c3 c048fb3c c0c47654 00000000 ee06ff50
[  839.637054] ff40: c0084774 c0390fac 20000113 ffffffff
[  839.642333] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
[  839.651458] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
[  839.661499] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
[  839.672546] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
[  839.683563] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
[  839.692779] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)

....... ad infinitum

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20 21:49         ` Bruce, Becky
@ 2012-09-20 22:01           ` Paul E. McKenney
  2012-09-20 22:47             ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-20 22:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 20, 2012 at 09:49:13PM +0000, Bruce, Becky wrote:
> 
> On Sep 20, 2012, at 2:56 AM, Paul Walmsley wrote:
> 
> > Hi,
> > 
> > On Wed, 19 Sep 2012, Paul E. McKenney wrote:
> > 
> >> On Thu, Sep 13, 2012 at 06:52:10PM +0000, Paul Walmsley wrote:
> >> 
> >>> On Wed, 12 Sep 2012, Paul E. McKenney wrote:
> >> 
> >>>> Subodh Nijsure (also CCed) reported something that might be similar on
> >>>> ARM, and also reported that setting the following got rid of the stalls:
> >>>> 
> >>>> 	CONFIG_CPU_IDLE=y
> >>>> 	CONFIG_CPU_IDLE_GOV_LADDER=y
> >>>> 	CONFIG_CPU_IDLE_GOV_MENU=y
> >>>> 
> >>>> At which point he was happy, which was good, but which also left the
> >>>> underlying problem unsolved.  Do these affect your system?  If so,
> >>>> do they cause a different ARM idle loop to be executed?
> >>> 
> >>> Will give this a try.  What board was Subodh using?
> >> 
> >> Any news on trying the above settings?
> > 
> > Sorry, haven't had the chance to try it yet due to the impending merge 
> > window opening.  Once things settle down I'll give it a try -- or maybe 
> > someone else can test it in the meantime.
> > 
> 
> OK, people, you can stop heckling me about "sent from my iPhone" - I'm in the wilds of rural Louisiana with really bad internet service and was trying to work on my phone (but, alas, did not notice the CC list included the entire universe).   Shame on me.
> 
> With the above set, I don't seem to see any stalls with the RCU timeout set to 60s (the default).  I left the board running for 25 minutes; I will fire it up again later and let it run for a bit longer, but usually I end up seeing the problem pretty quickly so I don't expect that to result in anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact (it wasn't on when I tested Paul's tree).
> 
> If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do see stalls (both with CPU_IDLE stuff and without). 
> 
> I'm using the default omap2plus config, with RCU stall info enabled and the cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 (OMAP4460)

Thank you for the testing, Becky!

Paul Walmsley, please let me know if the config below doesn't clear things
up for you or if there is some reason why this config is infeasible.

							Thanx, Paul

> root at omap4430-panda:~# zcat /proc/config/gz. .. .. ..gz | grep RCU
> # RCU Subsystem
> CONFIG_TREE_RCU=y
> # CONFIG_PREEMPT_RCU is not set
> CONFIG_RCU_FANOUT=32
> CONFIG_RCU_FANOUT_LEAF=16
> # CONFIG_RCU_FANOUT_EXACT is not set
> # CONFIG_RCU_FAST_NO_HZ is not set
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PROVE_RCU is not set
> # CONFIG_SPARSE_RCU_POINTER is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=5
> CONFIG_RCU_CPU_STALL_INFO=y
> # CONFIG_RCU_TRACE is not set
> root at omap4430-panda:~# zcat /proc/config.gz | grep IDLE
> CONFIG_GENERIC_SMP_IDLE_THREAD=y
> CONFIG_CPU_IDLE=y
> CONFIG_CPU_IDLE_GOV_LADDER=y
> CONFIG_CPU_IDLE_GOV_MENU=y
> CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y
> 
> Paul, let me know if you want me to try anything else.  My internet connection is spotty today but (obviously :) I will see emails on my phone and will test when I can.
> 
> Cheers,
> B
> 
> Console output:
> 
> root at omap4430-panda:~# [  377.495361] INFO: rcu_sched self-detected stall on CPU
> [  377.500762] .1: (1 ticks this GP) idle=dcd/1/0 
> [  377.505523] . (t=761 jiffies)
> [  377.508666] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
> [  377.518615] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
> [  377.529022] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
> [  377.538970] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
> [  377.548339] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
> [  377.558013] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
> [  377.567413] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
> [  377.577270] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
> [  377.587768] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
> [  377.596984] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
> [  377.605834] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
> [  377.614593] Exception stack(0xee06ff08 to 0xee06ff50)
> [  377.619873] ff00:                   00000001 00000001 00000000 3b9aca00 c608bc44 00000057
> [  377.628448] ff20: c146a4f0 00000002 c54e3b8f 00000056 c048fb3c c0c47654 00000001 ee06ff50
> [  377.637023] ff40: c0084774 c0390fac 20000113 ffffffff
> [  377.642333] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
> [  377.651458] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
> [  377.661499] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
> [  377.672515] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
> [  377.683563] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
> [  377.692779] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
> root at omap4430-panda:~# [  821.495361] INFO: rcu_sched self-detected stall on CPU
> [  821.500762] .1: (1 ticks this GP) idle=0ad/1/0 
> [  821.505523] . (t=755 jiffies)
> [  821.508666] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
> [  821.518615] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
> [  821.529022] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
> [  821.538940] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
> [  821.548339] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
> [  821.558013] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
> [  821.567413] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
> [  821.577270] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
> [  821.587768] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
> [  821.596984] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
> [  821.605834] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
> [  821.614593] Exception stack(0xee06ff08 to 0xee06ff50)
> [  821.619873] ff00:                   00000001 00000001 00000000 3b9aca00 267f1536 000000bf
> [  821.628448] ff20: c146a4f0 00000002 7da95560 000000be c048fb3c c0c47654 00000000 ee06ff50
> [  821.637023] ff40: c0084774 c0390fac 20000113 ffffffff
> [  821.642333] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
> [  821.651458] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
> [  821.661468] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
> [  821.672515] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
> [  821.683563] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
> [  821.692749] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
> [  827.495361] INFO: rcu_sched self-detected stall on CPU
> [  827.500762] .1: (1 ticks this GP) idle=0d1/1/0 
> [  827.505523] . (t=733 jiffies)
> [  827.508636] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
> [  827.518585] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
> [  827.528991] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
> [  827.538940] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
> [  827.548339] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
> [  827.558013] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
> [  827.567382] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
> [  827.577239] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
> [  827.587738] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
> [  827.596954] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
> [  827.605804] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
> [  827.614562] Exception stack(0xee06ff08 to 0xee06ff50)
> [  827.619842] ff00:                   00000001 00000001 00000000 3b9aca00 8c1fd142 000000c0
> [  827.628417] ff20: c146a4f0 00000002 a8004dd7 000000bf c048fb3c c0c47654 00000000 ee06ff50
> [  827.636993] ff40: c0084774 c0390fac 20000113 ffffffff
> [  827.642303] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
> [  827.651428] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
> [  827.661437] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
> [  827.672485] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
> [  827.683502] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
> [  827.692718] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
> [  833.495391] INFO: rcu_sched self-detected stall on CPU
> [  833.500793] .1: (3 GPs behind) idle=0d9/1/0 
> [  833.505279] . (t=733 jiffies)
> [  833.508392] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
> [  833.518341] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
> [  833.528747] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
> [  833.538696] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
> [  833.548095] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
> [  833.557769] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
> [  833.567138] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
> [  833.576995] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
> [  833.587493] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
> [  833.596710] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
> [  833.605560] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
> [  833.614318] Exception stack(0xee06ff08 to 0xee06ff50)
> [  833.619598] ff00:                   00000001 00000001 00000000 3b9aca00 f1c10484 000000c1
> [  833.628173] ff20: c146a4f0 00000002 d257bd83 000000c0 c048fb3c c0c47654 00000001 ee06ff50
> [  833.636749] ff40: c0084774 c0390fac 20000113 ffffffff
> [  833.642059] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
> [  833.651184] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
> [  833.661193] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
> [  833.672241] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
> [  833.683288] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
> [  833.692474] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
> [  839.495422] INFO: rcu_sched self-detected stall on CPU
> [  839.500823] .1: (1 ticks this GP) idle=0fd/1/0 
> [  839.505554] . (t=733 jiffies)
> [  839.508697] [<c0019da0>] (unwind_backtrace+0x0/0xf8) from [<c009b138>] (rcu_check_callbacks+0x204/0x790)
> [  839.518646] [<c009b138>] (rcu_check_callbacks+0x204/0x790) from [<c0045890>] (update_process_times+0x38/0x68)
> [  839.529052] [<c0045890>] (update_process_times+0x38/0x68) from [<c007d47c>] (tick_sched_timer+0x80/0xec)
> [  839.538970] [<c007d47c>] (tick_sched_timer+0x80/0xec) from [<c005a2fc>] (__run_hrtimer+0x7c/0x218)
> [  839.548370] [<c005a2fc>] (__run_hrtimer+0x7c/0x218) from [<c005b040>] (hrtimer_interrupt+0x130/0x2d8)
> [  839.558044] [<c005b040>] (hrtimer_interrupt+0x130/0x2d8) from [<c0018998>] (twd_handler+0x30/0x44)
> [  839.567443] [<c0018998>] (twd_handler+0x30/0x44) from [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158)
> [  839.577301] [<c00960cc>] (handle_percpu_devid_irq+0x90/0x158) from [<c00929ac>] (generic_handle_irq+0x30/0x44)
> [  839.587799] [<c00929ac>] (generic_handle_irq+0x30/0x44) from [<c0013bd8>] (handle_IRQ+0x4c/0xac)
> [  839.597015] [<c0013bd8>] (handle_IRQ+0x4c/0xac) from [<c0008470>] (gic_handle_irq+0x24/0x58)
> [  839.605865] [<c0008470>] (gic_handle_irq+0x24/0x58) from [<c0487604>] (__irq_svc+0x44/0x58)
> [  839.614593] Exception stack(0xee06ff08 to 0xee06ff50)
> [  839.619903] ff00:                   00000001 00000001 00000000 3b9aca00 576237c7 000000c3
> [  839.628479] ff20: c146a4f0 00000002 284df8f1 000000c3 c048fb3c c0c47654 00000000 ee06ff50
> [  839.637054] ff40: c0084774 c0390fac 20000113 ffffffff
> [  839.642333] [<c0487604>] (__irq_svc+0x44/0x58) from [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4)
> [  839.651458] [<c0390fac>] (cpuidle_wrap_enter+0x4c/0xa4) from [<c0390a80>] (cpuidle_enter_state+0x14/0x68)
> [  839.661499] [<c0390a80>] (cpuidle_enter_state+0x14/0x68) from [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0)
> [  839.672546] [<c0392890>] (cpuidle_enter_state_coupled+0x210/0x2a0) from [<c0390c48>] (cpuidle_idle_call+0x174/0x308)
> [  839.683563] [<c0390c48>] (cpuidle_idle_call+0x174/0x308) from [<c0014098>] (cpu_idle+0x54/0x12c)
> [  839.692779] [<c0014098>] (cpu_idle+0x54/0x12c) from [<8047c6f4>] (0x8047c6f4)
> 
> ....... ad infinitum

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20 22:01           ` Paul E. McKenney
@ 2012-09-20 22:47             ` Paul Walmsley
  2012-09-20 23:21               ` Paul E. McKenney
  2012-09-21 17:47               ` Paul Walmsley
  0 siblings, 2 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-20 22:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 20 Sep 2012, Paul E. McKenney wrote:

> On Thu, Sep 20, 2012 at 09:49:13PM +0000, Bruce, Becky wrote:
>
> > OK, people, you can stop heckling me about "sent from my iPhone" - I'm in the wilds of rural Louisiana with really bad internet service and was trying to work on my phone (but, alas, did not notice the CC list included the entire universe).   Shame on me.
> > 
> > With the above set, I don't seem to see any stalls with the RCU timeout set to 60s (the default).  I left the board running for 25 minutes; I will fire it up again later and let it run for a bit longer, but usually I end up seeing the problem pretty quickly so I don't expect that to result in anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact (it wasn't on when I tested Paul's tree).
> > 
> > If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do see stalls (both with CPU_IDLE stuff and without). 
> > 
> > I'm using the default omap2plus config, with RCU stall info enabled and the cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 (OMAP4460)
> 
> Thank you for the testing, Becky!
> 
> Paul Walmsley, please let me know if the config below doesn't clear things
> up for you or if there is some reason why this config is infeasible.

Will certainly test it here once things settle down.  But I wouldn't 
consider the use of CONFIG_CPU_IDLE to be a 'fix' until the problem can be 
understood.  We use !CONFIG_CPU_IDLE kernels for testing here.


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20 22:47             ` Paul Walmsley
@ 2012-09-20 23:21               ` Paul E. McKenney
  2012-09-21 18:08                 ` Paul Walmsley
  2012-09-21 17:47               ` Paul Walmsley
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-20 23:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 20, 2012 at 10:47:25PM +0000, Paul Walmsley wrote:
> On Thu, 20 Sep 2012, Paul E. McKenney wrote:
> 
> > On Thu, Sep 20, 2012 at 09:49:13PM +0000, Bruce, Becky wrote:
> >
> > > OK, people, you can stop heckling me about "sent from my iPhone" - I'm in the wilds of rural Louisiana with really bad internet service and was trying to work on my phone (but, alas, did not notice the CC list included the entire universe).   Shame on me.
> > > 
> > > With the above set, I don't seem to see any stalls with the RCU timeout set to 60s (the default).  I left the board running for 25 minutes; I will fire it up again later and let it run for a bit longer, but usually I end up seeing the problem pretty quickly so I don't expect that to result in anything.  I also didn't see any stalls on Paul's RCU tree as of a week ago at 60s, so as far as I can tell the CPU_IDLE stuff didn't have any impact (it wasn't on when I tested Paul's tree).
> > > 
> > > If I drop the timeout to 5s as Paul M. suggested for debug a while back, I do see stalls (both with CPU_IDLE stuff and without). 
> > > 
> > > I'm using the default omap2plus config, with RCU stall info enabled and the cpu idle stuff turned on (console dump below).  This is a Panda ES 1.1 (OMAP4460)
> > 
> > Thank you for the testing, Becky!
> > 
> > Paul Walmsley, please let me know if the config below doesn't clear things
> > up for you or if there is some reason why this config is infeasible.
> 
> Will certainly test it here once things settle down.  But I wouldn't 
> consider the use of CONFIG_CPU_IDLE to be a 'fix' until the problem can be 
> understood.  We use !CONFIG_CPU_IDLE kernels for testing here.

Fair point.  I am wondering whether there is some path into the idle
loop that somehow avoids telling RCU that the CPU has in face entered
idle.  There needs to be an rcu_idle_enter() call on the way to idle,
otherwise RCU CPU stall warnings are expected behavior.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20 22:47             ` Paul Walmsley
  2012-09-20 23:21               ` Paul E. McKenney
@ 2012-09-21 17:47               ` Paul Walmsley
  2012-09-21 17:51                 ` Paul Walmsley
  2012-09-21 21:20                 ` Paul E. McKenney
  1 sibling, 2 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-21 17:47 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul

On Thu, 20 Sep 2012, Paul Walmsley wrote:

> On Thu, 20 Sep 2012, Paul E. McKenney wrote:
>
> > Paul Walmsley, please let me know if the config below doesn't clear things
> > up for you or if there is some reason why this config is infeasible.
> 
> Will certainly test it here once things settle down.

I built an OMAP kernel from Linus' commit 
4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.

Long story short, the rcu_sched stall messages continue to appear.  
Transcript below.

The above config should be easy to reproduce from mainline.  Paul, not 
sure if you're still working with Linaro, but you should be able to get 
access to a Pandaboard through them to try it yourself, should you wish.


- Paul

## Booting kernel from Legacy Image at 82000000 ...                                                   
   Image Name:   Linux-3.6.0-rc6-00025-g4651afb                                                       
   Image Type:   ARM Linux Kernel Image (uncompressed)                                                
   Data Size:    4061576 Bytes = 3.9 MiB                                                              
   Load Address: 80008000                                                                             
   Entry Point:  80008000                                                                             
   Verifying Checksum ... OK                                                                          
   Loading Kernel Image ... OK                                                                        
OK                                                                                                    
                                                                                                      
Starting kernel ...                                                                                   
                                                                                                      
Uncompressing Linux... done, booting the kernel.                                                      
[    0.000000] Booting Linux on physical CPU 0                                                        
[    0.000000] Linux version 3.6.0-rc6-00025-g4651afb (paul at dusk) (gcc version 4.5.1 (Sourcery G++ Lite 2010.09-50) ) #342 SMP Fri Sep 21 11:28:00 MDT 2012                                                                         
[    0.000000] CPU: ARMv7 Processor [411fc092] revision 2 (ARMv7), cr=10c53c7d                        
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache               
[    0.000000] Machine: OMAP4 Panda board                                                             
[    0.000000] debug: ignoring loglevel setting.                                                      
[    0.000000] Reserving 16777216 bytes SDRAM for VRAM                                                
[    0.000000] Memory policy: ECC disabled, Data cache writealloc
[    0.000000] On node 0 totalpages: 126720                                                           
[    0.000000] free_area_init_node: node 0, pgdat c07c6140, node_mem_map c0d23000
[    0.000000]   Normal zone: 1024 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 125696 pages, LIFO batch:31
[    0.000000] OMAP4430 ES2.0
[    0.000000] PERCPU: Embedded 9 pages/cpu @c112b000 s14016 r8192 d14656 u36864
[    0.000000] pcpu-alloc: s14016 r8192 d14656 u36864 alloc=9*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 
[    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 125696
[    0.000000] Kernel command line: console=ttyO2,230400n8 vram=16M root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait earlyprintk debug ignore_loglevel
[    0.000000] PID hash table entries: 2048 (order: 1, 8192 bytes)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[    0.000000] Memory: 495MB = 495MB total
[    0.000000] Memory: 488828k/488828k available, 35460k reserved, 0K highmem
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
[    0.000000]     vmalloc : 0xe0800000 - 0xff000000   ( 488 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
[    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
[    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
[    0.000000]       .text : 0xc0008000 - 0xc06dad24   (6988 kB)
[    0.000000]       .init : 0xc06db000 - 0xc072a6c0   ( 318 kB)
[    0.000000]       .data : 0xc072c000 - 0xc07c8940   ( 627 kB)
[    0.000000]        .bss : 0xc07c8964 - 0xc0d22a4c   (5481 kB)
[    0.000000] Hierarchical RCU implementation.
[    0.000000]  Additional per-CPU info printed with stalls.
[    0.000000] NR_IRQS:474
[    0.000000] omap_hwmod: dpll_mpu_m2_ck: missing clockdomain for dpll_mpu_m2_ck.
[    0.000000] OMAP clockevent source: GPTIMER1 at 32768 Hz
[    0.000000] omap_hwmod: sys_32k_ck: missing clockdomain for sys_32k_ck.
[    0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
[    0.000000] OMAP clocksource: 32k_counter at 32768 Hz
[    0.000000] Console: colour dummy device 80x30
[    0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[    0.000000] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.000000] ... MAX_LOCK_DEPTH:          48
[    0.000000] ... MAX_LOCKDEP_KEYS:        8191
[    0.000000] ... CLASSHASH_SIZE:          4096
[    0.000000] ... MAX_LOCKDEP_ENTRIES:     16384
[    0.000000] ... MAX_LOCKDEP_CHAINS:      32768
[    0.000000] ... CHAINHASH_SIZE:          16384
[    0.000000]  memory used by lock dependency info: 3695 kB
[    0.000000]  per task-struct memory footprint: 1152 bytes
[    0.001190] Calibrating delay loop... 2007.19 BogoMIPS (lpj=7839744)
[    0.124725] pid_max: default: 32768 minimum: 301
[    0.125518] Security Framework initialized
[    0.125793] Mount-cache hash table entries: 512
[    0.131286] CPU: Testing write buffer coherency: ok
[    0.132751] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0value (unknown osc rate)
[    0.402343] twl 1-0048: PIH (irq 39) chaining IRQs 352..372
[    0.403930] VUSB: 3300 mV normal standby
[    0.404907] vdd_mpu: 500 <--> 1500 mV normal 
[    0.405792] vdd_iva: 500 <--> 1500 mV normal 
[    0.406890] vdd_core: 500 <--> 1500 mV normal 
[    0.408081] V1V8: 1800 mV normal standby
[    0.408966] V2V1: 2100 mV normal standby
[    0.409973] VMMC: 1200 <--> 3000 mV at 3000 mV normal standby
[    0.411132] VPP: 1800 <--> 2500 mV at 1900 mV normal standby
[    0.412445] VCXIO: 1800 mV normal standby
[    0.412506] VCXIO: supplied by V2V1
[    0.414367] VDAC: 1800 mV normal standby
[    0.414459] VDAC: supplied by V2V1
[    0.415863] VAUX2_6030: 1200 <--> 2800 mV at 1800 mV normal standby
[    0.417053] VAUX3_6030: 1000 <--> 3000 mV at 1200 mV normal standby
[    0.418487] VANA: 2100 mV normal standby
[    0.430084] omap_i2c omap_i2c.2: bus 2 rev2.4.0 at 400 kHz
[    0gistered twl_rtc as rtc0
[    1.241333] i2c /dev entries driver
[    1.245208] Driver for 1-wire Dallas network protocol.
[    1.249908] omap_wdt: OMAP Watchdog Timer Rev 0x00: initial timeout 60 sec
[    1.254272] cpuidle: using governor ladder
[    1.256530] cpuidle: using governor menu
[    1.259948] omap_hsmmc omap_hsmmc.0: Failed to get debounce clk
[    1.263305] omap-dma-engine omap-dma-engine: allocating channel for 62
[    1.266906] omap-dma-engine omap-dma-engine: allocating channel for 61
[    1.317443] omap_hsmmc omap_hsmmc.4: Failed to get debounce clk
[    1.320709] omap-dma-engine omap-dma-engine: allocating channel for 60
[    1.324279] omap-dma-engine omap-dma-engine: allocating channel for 59
[    1.444854] usbcore: registered new interface driver usbhid
[    1.447906] usbhid: USB HID core driver
[    1.450012] oprofile: hardware counters not available
[    1.452789] oprofile: using timer interrupt.
[    1.455657] TCP: cubic registered
[    1.457489] Initializing XFRM netlink socket
[    1.459930] NET: Registered protocol family 17
[    1.462402] NET: Registered protocol family 15
[    1.465057] Key type dns_resolver registered
[    1.467559] VFP support v0.3: implementor 41 architecture 3 part 30 variant 9 rev 1
[    1.476593] omap_vc_i2c_init: I2C config for vdd_iva does not match other channels (0).
[    1.480895] omap_vc_i2c_init: I2C config for vdd_mpu does not match other channels (0).[    1.486236] Power Management for TI OMAP4.
[    1.489135] ThumbEE CPU extension supported.
[    1.508178] clock: disabling unused clocks to save power
[    1.513977] VANA: incomplete constraints, leaving on
[    1.517669] VDAC: incomplete constraints, leaving on
[    1.521118] VUSB: incomplete constraints, leaving on
[    1.525665] twl_rtc twl_rtc: setting system clock to 2000-01-01 00:00:00 UTC (946684800)
[    1.532409] Waiting for root device /dev/mmcblk0p2...
[    1.599426] mmc0: host does not support reading read-only switch. assuming write-enable.
[    1.603942] mmc0: new SD card at address aaaa
[    1.607849] mmcblk0: mmc0:aaaa SD02G 1.89 GiB 
[    1.614410]  mmcblk0: p1 p2
[    3.508514] kjournald starting.  Commit interval 5 seconds
[    3.668579] EXT3-fs (mmcblk0p2): using internal journal
[    3.675384] EXT3-fs (mmcblk0p2): recovery complete
[    3.678039] EXT3-fs (mmcblk0p2): mounted filesystem with ordered data mode
[    3.682037] VFS: Mounted root (ext3 filesystem) on device 179:2.
[    3.685607] Freeing init memory: 316K
modprobe: FATAL: Could not load /lib/modules/3.6.0-rc6-00025-g4651afb/modules.dep: No such file or directory

INIT: version 2.88 booting
Using makefile-style concurrent boot in runlevel S.
Starting the hotplug events dispatcher: udevd.
Synthesizing the initial hotplug events...done.
Waiting for /dev to be fully populated...done.
Activating swap...done.
Cleaning up ifupdown....
Setting up networking....
Loading kernel modules...done.
Activating lvm and md swap...done.
Checking file systems...fsck from util-linux 2.19.1
done.
Mounting local filesystems...done.
Activating swapfile swap...done.
Cleaning up temporary files....
Configuring network interfaces...done.
Cleaning up temporary files....
Setting kernel variables ...done.
INIT: Entering runlevel: 2
Using makefile-style concurrent boot in runlevel 2.

Debian GNU/Linux wheezy/sid armel ttyO2

armel login: [  305.848632] INFO: rcu_sched self-detected stall on CPU
[  305.851470]  1: (17 GPs behind) idle=439/1/0 
[  305.853851]   (t=33915 jiffies)
[  305.855621] [<c001bcb0>] (unwind_backtrace+0x0/0xf0) from [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678)
[  305.860809] [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678) from [<c00537e8>] (update_process_times+0x38/0x68)
[  305.866210] [<c00537e8>] (update_process_times+0x38/0x68) from [<c008cf28>] (tick_sched_timer+0x80/0xec)
[  305.871398] [<c008cf28>] (tick_sched_timer+0x80/0xec) from [<c0069414>] (__run_hrtimer+0x7c/0x1e0)
[  305.876281] [<c0069414>] (__run_hrtimer+0x7c/0x1e0) from [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0)
[  305.881317] [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3ec>] (twd_handler+0x30/0x44)
[  305.886199] [<c001a3ec>] (twd_handler+0x30/0x44) from [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c)
[  305.891326] [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4884>] (generic_handle_irq+0x30/0x48)
[  305.896820] [<c00a4884>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
[  305.901611] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
[  305.906219] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fdf64>] (__irq_svc+0x44/0x5c)
[  305.910766] Exception stack(0xde86ff08 to 0xde86ff50)
[  305.913513] ff00:                   0002a80e 00000001 00000000 de8660c0 35b88a9a 00000047
[  305.917968] ff20: c11345e0 00000002 efc9ba42 00000046 c0509618 00000000 00000001 de86ff50
[  305.922424] ff40: 0002a80f c03f8f1c 20000113 ffffffff
[  305.925201] [<c04fdf64>] (__irq_svc+0x44/0x5c) from [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0)
[  305.929962] [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0) from [<c03f8a10>] (cpuidle_enter_state+0x14/0x68)
[  305.935180] [<c03f8a10>] (cpuidle_enter_state+0x14/0x68) from [<c03fab34>] (cpuidle_enter_state_coupled+0x240/0x2e8)
[  305.940917] [<c03fab34>] (cpuidle_enter_state_coupled+0x240/0x2e8) from [<c03f8b48>] (cpuidle_idle_call+0xe4/0x2e0)
[  305.946594] [<c03f8b48>] (cpuidle_idle_call+0xe4/0x2e0) from [<c0015358>] (cpu_idle+0x98/0x124)
[  305.951324] [<c0015358>] (cpu_idle+0x98/0x124) from [<804f6df4>] (0x804f6df4)
root
[  375.629669] INFO: rcu_sched self-detected stall on CPU
[  375.632476]  0: (2 GPs behind) idle=c1b/1/0 
[  375.634796]   (t=8914 jiffies)
[  375.636505] [<c001bcb0>] (unwind_backtrace+0x0/0xf0) from [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678)
[  375.641693] [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678) from [<c00537e8>] (update_process_times+0x38/0x68)
[  375.647094] [<c00537e8>] (update_process_times+0x38/0x68) from [<c008cf28>] (tick_sched_timer+0x80/0xec)
[  375.652252] [<c008cf28>] (tick_sched_timer+0x80/0xec) from [<c0069414>] (__run_hrtimer+0x7c/0x1e0)
[  375.657135] [<c0069414>] (__run_hrtimer+0x7c/0x1e0) from [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0)
[  375.662170] [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3ec>] (twd_handler+0x30/0x44)
[  375.667053] [<c001a3ec>] (twd_handler+0x30/0x44) from [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c)
[  375.672180] [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4884>] (generic_handle_irq+0x30/0x48)
[  375.677642] [<c00a4884>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
[  375.682434] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
[  375.687042] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fdf64>] (__irq_svc+0x44/0x5c)
[  375.691589] Exception stack(0xc072df08 to 0xc072df50)
[  375.694366] df00:                   0002fb7e 00000001 00000000 c074e8e8 74fea80e 00000057
[  375.698822] df20: c112b5e0 00000000 74fc5402 00000057 c0d1c94c 00000000 00000001 c072df50
[  375.703277] df40: 0002fb7f c03f8f1c 20000013 ffffffff
[  375.706024] [<c04fdf64>] (__irq_svc+0x44/0x5c) from [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0)
[  375.710784] [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0) from [<c03f8ba8>] (cpuidle_idle_call+0x144/0x2e0)
[  375.716003] [<c03f8ba8>] (cpuidle_idle_call+0x144/0x2e0) from [<c0015358>] (cpu_idle+0x98/0x124)
[  375.720794] [<c0015358>] (cpu_idle+0x98/0x124) from [<c06db7b0>] (start_kernel+0x2b4/0x304)
Last login: Sat Jan  1 00:09:14 UTC 2000 on ttyO2
Linux armel 3.6.0-rc6-00025-g4651afb #342 SMP Fri Sep 21 11:28:00 MDT 2012 armv7l

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root at armel:~# cat /sys/devices/system/cpu/cpuidle/*
omap4_idle
menu
root at armel:~# 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 17:47               ` Paul Walmsley
@ 2012-09-21 17:51                 ` Paul Walmsley
  2012-09-21 21:20                 ` Paul E. McKenney
  1 sibling, 0 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-21 17:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 21 Sep 2012, Paul Walmsley wrote:

> The config used was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE 
> by hand. 

One other thing I forgot to mention - CONFIG_RCU_CPU_STALL_INFO was 
enabled by hand also.  Below is the diff between omap2plus_defconfig and 
the config that was used here.


- Paul

--- .config	2012-09-21 11:50:19.565972113 -0600
+++ .testconfig	2012-09-21 11:50:12.233946950 -0600
@@ -536,7 +536,9 @@
 # CPU Frequency scaling
 #
 # CONFIG_CPU_FREQ is not set
-# CONFIG_CPU_IDLE is not set
+CONFIG_CPU_IDLE=y
+CONFIG_CPU_IDLE_GOV_LADDER=y
+CONFIG_CPU_IDLE_GOV_MENU=y
 CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED=y
 
 #
@@ -2535,7 +2537,7 @@
 # CONFIG_BOOT_PRINTK_DELAY is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=60
-# CONFIG_RCU_CPU_STALL_INFO is not set
+CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 # CONFIG_KPROBES_SANITY_TEST is not set
 # CONFIG_BACKTRACE_SELF_TEST is not set

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-20 23:21               ` Paul E. McKenney
@ 2012-09-21 18:08                 ` Paul Walmsley
  2012-09-21 18:58                   ` Paul E. McKenney
  2012-09-21 18:59                   ` Paul Walmsley
  0 siblings, 2 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-21 18:08 UTC (permalink / raw)
  To: linux-arm-kernel

cc Frederic Weisbecker - context is here: 

   http://marc.info/?l=linux-kernel&m=134749030206016&w=2

On Thu, 20 Sep 2012, Paul E. McKenney wrote:

> Fair point.  I am wondering whether there is some path into the idle
> loop that somehow avoids telling RCU that the CPU has in face entered
> idle.  There needs to be an rcu_idle_enter() call on the way to idle,
> otherwise RCU CPU stall warnings are expected behavior.

As far as I know, our only idle entry point is in 
arch/arm/common/process.c:cpu_idle().

Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
{stop,start}_critical_timings().  Making that change here didn't help.

Also tried commenting out the code from the stop_critical_timings() call 
to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
also didn't help, which suggests that the problem is not caused by the 
OMAP-specific PM idle code.

- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 18:08                 ` Paul Walmsley
@ 2012-09-21 18:58                   ` Paul E. McKenney
  2012-09-21 19:11                     ` Paul Walmsley
  2012-09-21 18:59                   ` Paul Walmsley
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-21 18:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 21, 2012 at 06:08:59PM +0000, Paul Walmsley wrote:
> cc Frederic Weisbecker - context is here: 
> 
>    http://marc.info/?l=linux-kernel&m=134749030206016&w=2
> 
> On Thu, 20 Sep 2012, Paul E. McKenney wrote:
> 
> > Fair point.  I am wondering whether there is some path into the idle
> > loop that somehow avoids telling RCU that the CPU has in face entered
> > idle.  There needs to be an rcu_idle_enter() call on the way to idle,
> > otherwise RCU CPU stall warnings are expected behavior.
> 
> As far as I know, our only idle entry point is in 
> arch/arm/common/process.c:cpu_idle().

In mainline, this is arch/arm/kernel/process.c, correct?

> Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
> {stop,start}_critical_timings().  Making that change here didn't help.

The reason x86 does this is that they have idle notifiers deeper in the
idle loop that use RCU read-side critical sections.  So this was an
expected result.

> Also tried commenting out the code from the stop_critical_timings() call 
> to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
> also didn't help, which suggests that the problem is not caused by the 
> OMAP-specific PM idle code.

I must admit that you make a convincing case here.  Though it does leave
me wondering what is different about Panda (and MX28, IIRC).

I may take your advice of remote access to a Panda board, though that
is likely to take a bit of time due to timezones.  Regardless of the
underlying issue here, I clearly need to make the stall-warning messages
do a better job of printing out needed information.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 18:08                 ` Paul Walmsley
  2012-09-21 18:58                   ` Paul E. McKenney
@ 2012-09-21 18:59                   ` Paul Walmsley
  1 sibling, 0 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-21 18:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi

Just did a test with CONFIG_NO_HZ=n, and no rcu_sched stall messages 
appeared for 60 minutes.

Following is the diff from omap2plus_defconfig.

- Paul

--- .config	2012-09-21 12:51:19.114938954 -0600
+++ .testconfig	2012-09-21 12:51:15.686926318 -0600
@@ -69,7 +69,7 @@
 # Timers subsystem
 #
 CONFIG_TICK_ONESHOT=y
-CONFIG_NO_HZ=y
+# CONFIG_NO_HZ is not set
 CONFIG_HIGH_RES_TIMERS=y
 
 #
@@ -80,7 +80,6 @@
 CONFIG_RCU_FANOUT=32
 CONFIG_RCU_FANOUT_LEAF=16
 # CONFIG_RCU_FANOUT_EXACT is not set
-# CONFIG_RCU_FAST_NO_HZ is not set
 # CONFIG_TREE_RCU_TRACE is not set
 CONFIG_IKCONFIG=y
 CONFIG_IKCONFIG_PROC=y
@@ -2535,7 +2534,7 @@
 # CONFIG_BOOT_PRINTK_DELAY is not set
 # CONFIG_RCU_TORTURE_TEST is not set
 CONFIG_RCU_CPU_STALL_TIMEOUT=60
-# CONFIG_RCU_CPU_STALL_INFO is not set
+CONFIG_RCU_CPU_STALL_INFO=y
 # CONFIG_RCU_TRACE is not set
 # CONFIG_KPROBES_SANITY_TEST is not set
 # CONFIG_BACKTRACE_SELF_TEST is not set

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 18:58                   ` Paul E. McKenney
@ 2012-09-21 19:11                     ` Paul Walmsley
  2012-09-21 19:57                       ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-21 19:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> On Fri, Sep 21, 2012 at 06:08:59PM +0000, Paul Walmsley wrote:
> 
> > As far as I know, our only idle entry point is in 
> > arch/arm/common/process.c:cpu_idle().
> 
> In mainline, this is arch/arm/kernel/process.c, correct?

Indeed; sorry about that, mistyped.

> > Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
> > {stop,start}_critical_timings().  Making that change here didn't help.
> 
> The reason x86 does this is that they have idle notifiers deeper in the
> idle loop that use RCU read-side critical sections.  So this was an
> expected result.

OK

> > Also tried commenting out the code from the stop_critical_timings() call 
> > to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
> > also didn't help, which suggests that the problem is not caused by the 
> > OMAP-specific PM idle code.
> 
> I must admit that you make a convincing case here.  Though it does leave
> me wondering what is different about Panda (and MX28, IIRC).

Given the dependency on CONFIG_NO_HZ, the stalls are probably dependent on 
the userspace in use.  The userspaces here are quite minimal and so allow 
the system to stay idle for relatively long periods of time.

> I may take your advice of remote access to a Panda board, though that
> is likely to take a bit of time due to timezones.  Regardless of the
> underlying issue here, I clearly need to make the stall-warning messages
> do a better job of printing out needed information.

If you've got a patch in mind for that, I'll boot it here.

One other observation.  omap2plus_defconfig sets CONFIG_NO_HZ=y but 
doesn't set CONFIG_RCU_FAST_NO_HZ.  The stall warning messages still 
appear when CONFIG_RCU_FAST_NO_HZ=y.  One of them is attached below (with 
CONFIG_RCU_CPU_STALL_INFO set as well, obviously).

As an aside, in the CONFIG_RCU_FAST_NO_HZ=y build, I dropped a printk() 
into rcu_idle_gp_timer_func() and it doesn't look like it ever executed.


- Paul

[  305.832000] INFO: rcu_sched self-detected stall on CPU
[  305.834838]  1: (2 GPs behind) idle=5b1/1/0 drain=0 . timer=4294967295
[  305.838378]   (t=17463 jiffies)
[  305.840118] [<c001be10>] (unwind_backtrace+0x0/0xf0) from [<c00ad65c>] (rcu_pending+0xd0/0x540)
[  305.844848] [<c00ad65c>] (rcu_pending+0xd0/0x540) from [<c00ae5cc>] (rcu_check_callbacks+0x110/0x198)
[  305.849884] [<c00ae5cc>] (rcu_check_callbacks+0x110/0x198) from [<c0053800>] (update_process_times+0x38/0x68)
[  305.855285] [<c0053800>] (update_process_times+0x38/0x68) from [<c008cf40>] (tick_sched_timer+0x80/0xec)
[  305.860473] [<c008cf40>] (tick_sched_timer+0x80/0xec) from [<c006942c>] (__run_hrtimer+0x7c/0x1e0)
[  305.865356] [<c006942c>] (__run_hrtimer+0x7c/0x1e0) from [<c006a210>] (hrtimer_interrupt+0x11c/0x2d0)
[  305.870361] [<c006a210>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a54c>] (twd_handler+0x30/0x44)
[  305.875274] [<c001a54c>] (twd_handler+0x30/0x44) from [<c00a8128>] (handle_percpu_devid_irq+0x90/0x13c)
[  305.880371] [<c00a8128>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a489c>] (generic_handle_irq+0x30/0x48)
[  305.885833] [<c00a489c>] (generic_handle_irq+0x30/0x48) from [<c0014fb8>] (handle_IRQ+0x4c/0xac)
[  305.890624] [<c0014fb8>] (handle_IRQ+0x4c/0xac) from [<c000864c>] (gic_handle_irq+0x28/0x5c)
[  305.895233] [<c000864c>] (gic_handle_irq+0x28/0x5c) from [<c04fbc64>] (__irq_svc+0x44/0x5c)
[  305.899780] Exception stack(0xde86ff88 to 0xde86ffd0)
[  305.902526] ff80:                   0004c062 00000001 00000000 de8660c0 de86e000 c07c42c8
[  305.906982] ffa0: c05075a0 c074bdd0 00000000 411fc092 c074bff0 00000000 00000001 de86ffd0
[  305.911437] ffc0: 0004c063 c00152b0 20000113 ffffffff
[  305.914184] [<c04fbc64>] (__irq_svc+0x44/0x5c) from [<c00152b0>] (default_idle+0x20/0x44)
[  305.918640] [<c00152b0>] (default_idle+0x20/0x44) from [<c00154dc>] (cpu_idle+0x9c/0x114)
[  305.923126] [<c00154dc>] (cpu_idle+0x9c/0x114) from [<804f4a34>] (0x804f4a34)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 19:11                     ` Paul Walmsley
@ 2012-09-21 19:57                       ` Paul E. McKenney
  2012-09-21 20:31                         ` Tony Lindgren
                                           ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-21 19:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 21, 2012 at 07:11:14PM +0000, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > On Fri, Sep 21, 2012 at 06:08:59PM +0000, Paul Walmsley wrote:
> > 
> > > As far as I know, our only idle entry point is in 
> > > arch/arm/common/process.c:cpu_idle().
> > 
> > In mainline, this is arch/arm/kernel/process.c, correct?
> 
> Indeed; sorry about that, mistyped.

No problem!

> > > Looking at the x86 idle entry, they call rcu_idle_{enter,exit}() inside 
> > > {stop,start}_critical_timings().  Making that change here didn't help.
> > 
> > The reason x86 does this is that they have idle notifiers deeper in the
> > idle loop that use RCU read-side critical sections.  So this was an
> > expected result.
> 
> OK
> 
> > > Also tried commenting out the code from the stop_critical_timings() call 
> > > to the WARN_ON(irqs_disabled()), and adding a local_irq_enable().  That 
> > > also didn't help, which suggests that the problem is not caused by the 
> > > OMAP-specific PM idle code.
> > 
> > I must admit that you make a convincing case here.  Though it does leave
> > me wondering what is different about Panda (and MX28, IIRC).
> 
> Given the dependency on CONFIG_NO_HZ, the stalls are probably dependent on 
> the userspace in use.  The userspaces here are quite minimal and so allow 
> the system to stay idle for relatively long periods of time.

Could you please point me to a recipe for creating a minimal userspace?
Just in case it is the userspac erather than the architecture/hardware
that makes the difference.

> > I may take your advice of remote access to a Panda board, though that
> > is likely to take a bit of time due to timezones.  Regardless of the
> > underlying issue here, I clearly need to make the stall-warning messages
> > do a better job of printing out needed information.
> 
> If you've got a patch in mind for that, I'll boot it here.

Hammering it out, will send it along when it is a bit less destructive.  ;-)

> One other observation.  omap2plus_defconfig sets CONFIG_NO_HZ=y but 
> doesn't set CONFIG_RCU_FAST_NO_HZ.  The stall warning messages still 
> appear when CONFIG_RCU_FAST_NO_HZ=y.  One of them is attached below (with 
> CONFIG_RCU_CPU_STALL_INFO set as well, obviously).

Just to make sure I understand the combinations:

o	All stalls have happened when running a minimal userspace.
o	CONFIG_NO_HZ=n suppresses the stalls.
o	CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
	no observable effect on the stalls.

Did I get that right, or am I missing a combination?

> As an aside, in the CONFIG_RCU_FAST_NO_HZ=y build, I dropped a printk() 
> into rcu_idle_gp_timer_func() and it doesn't look like it ever executed.

Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is cancelled
upon exit from idle, and therefore should (almost) never actually execute.
Its sole purpose is to wake up the CPU.  ;-)

							Thanx, Paul

> - Paul
> 
> [  305.832000] INFO: rcu_sched self-detected stall on CPU
> [  305.834838]  1: (2 GPs behind) idle=5b1/1/0 drain=0 . timer=4294967295
> [  305.838378]   (t=17463 jiffies)
> [  305.840118] [<c001be10>] (unwind_backtrace+0x0/0xf0) from [<c00ad65c>] (rcu_pending+0xd0/0x540)
> [  305.844848] [<c00ad65c>] (rcu_pending+0xd0/0x540) from [<c00ae5cc>] (rcu_check_callbacks+0x110/0x198)
> [  305.849884] [<c00ae5cc>] (rcu_check_callbacks+0x110/0x198) from [<c0053800>] (update_process_times+0x38/0x68)
> [  305.855285] [<c0053800>] (update_process_times+0x38/0x68) from [<c008cf40>] (tick_sched_timer+0x80/0xec)
> [  305.860473] [<c008cf40>] (tick_sched_timer+0x80/0xec) from [<c006942c>] (__run_hrtimer+0x7c/0x1e0)
> [  305.865356] [<c006942c>] (__run_hrtimer+0x7c/0x1e0) from [<c006a210>] (hrtimer_interrupt+0x11c/0x2d0)
> [  305.870361] [<c006a210>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a54c>] (twd_handler+0x30/0x44)
> [  305.875274] [<c001a54c>] (twd_handler+0x30/0x44) from [<c00a8128>] (handle_percpu_devid_irq+0x90/0x13c)
> [  305.880371] [<c00a8128>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a489c>] (generic_handle_irq+0x30/0x48)
> [  305.885833] [<c00a489c>] (generic_handle_irq+0x30/0x48) from [<c0014fb8>] (handle_IRQ+0x4c/0xac)
> [  305.890624] [<c0014fb8>] (handle_IRQ+0x4c/0xac) from [<c000864c>] (gic_handle_irq+0x28/0x5c)
> [  305.895233] [<c000864c>] (gic_handle_irq+0x28/0x5c) from [<c04fbc64>] (__irq_svc+0x44/0x5c)
> [  305.899780] Exception stack(0xde86ff88 to 0xde86ffd0)
> [  305.902526] ff80:                   0004c062 00000001 00000000 de8660c0 de86e000 c07c42c8
> [  305.906982] ffa0: c05075a0 c074bdd0 00000000 411fc092 c074bff0 00000000 00000001 de86ffd0
> [  305.911437] ffc0: 0004c063 c00152b0 20000113 ffffffff
> [  305.914184] [<c04fbc64>] (__irq_svc+0x44/0x5c) from [<c00152b0>] (default_idle+0x20/0x44)
> [  305.918640] [<c00152b0>] (default_idle+0x20/0x44) from [<c00154dc>] (cpu_idle+0x9c/0x114)
> [  305.923126] [<c00154dc>] (cpu_idle+0x9c/0x114) from [<804f4a34>] (0x804f4a34)
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 19:57                       ` Paul E. McKenney
@ 2012-09-21 20:31                         ` Tony Lindgren
  2012-09-21 22:03                           ` Paul E. McKenney
  2012-09-21 22:12                         ` Paul E. McKenney
  2012-09-22 18:42                         ` Paul Walmsley
  2 siblings, 1 reply; 45+ messages in thread
From: Tony Lindgren @ 2012-09-21 20:31 UTC (permalink / raw)
  To: linux-arm-kernel

* Paul E. McKenney <paulmck@linux.vnet.ibm.com> [120921 12:58]:
> 
> Just to make sure I understand the combinations:
> 
> o	All stalls have happened when running a minimal userspace.
> o	CONFIG_NO_HZ=n suppresses the stalls.
> o	CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> 	no observable effect on the stalls.

The reason why you may need minimal userspace is to cut down
the number of timers waking up the system with NO_HZ.
Booting with init=/bin/sh might also do the trick for that.

Regards,

Tony

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 17:47               ` Paul Walmsley
  2012-09-21 17:51                 ` Paul Walmsley
@ 2012-09-21 21:20                 ` Paul E. McKenney
  2012-09-21 22:41                   ` Paul Walmsley
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-21 21:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 21, 2012 at 05:47:31PM +0000, Paul Walmsley wrote:
> Hi Paul
> 
> On Thu, 20 Sep 2012, Paul Walmsley wrote:
> 
> > On Thu, 20 Sep 2012, Paul E. McKenney wrote:
> >
> > > Paul Walmsley, please let me know if the config below doesn't clear things
> > > up for you or if there is some reason why this config is infeasible.
> > 
> > Will certainly test it here once things settle down.
> 
> I built an OMAP kernel from Linus' commit 
> 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
> was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
> on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.

Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
If not, could you please try it?  (This patch cleared up a similar
problem for Becky, also on OMAP.)

						Thanx, Paul

> Long story short, the rcu_sched stall messages continue to appear.  
> Transcript below.
> 
> The above config should be easy to reproduce from mainline.  Paul, not 
> sure if you're still working with Linaro, but you should be able to get 
> access to a Pandaboard through them to try it yourself, should you wish.
> 
> 
> - Paul
> 
> ## Booting kernel from Legacy Image at 82000000 ...                                                   
>    Image Name:   Linux-3.6.0-rc6-00025-g4651afb                                                       
>    Image Type:   ARM Linux Kernel Image (uncompressed)                                                
>    Data Size:    4061576 Bytes = 3.9 MiB                                                              
>    Load Address: 80008000                                                                             
>    Entry Point:  80008000                                                                             
>    Verifying Checksum ... OK                                                                          
>    Loading Kernel Image ... OK                                                                        
> OK                                                                                                    
>                                                                                                       
> Starting kernel ...                                                                                   
>                                                                                                       
> Uncompressing Linux... done, booting the kernel.                                                      
> [    0.000000] Booting Linux on physical CPU 0                                                        
> [    0.000000] Linux version 3.6.0-rc6-00025-g4651afb (paul at dusk) (gcc version 4.5.1 (Sourcery G++ Lite 2010.09-50) ) #342 SMP Fri Sep 21 11:28:00 MDT 2012                                                                         
> [    0.000000] CPU: ARMv7 Processor [411fc092] revision 2 (ARMv7), cr=10c53c7d                        
> [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache               
> [    0.000000] Machine: OMAP4 Panda board                                                             
> [    0.000000] debug: ignoring loglevel setting.                                                      
> [    0.000000] Reserving 16777216 bytes SDRAM for VRAM                                                
> [    0.000000] Memory policy: ECC disabled, Data cache writealloc
> [    0.000000] On node 0 totalpages: 126720                                                           
> [    0.000000] free_area_init_node: node 0, pgdat c07c6140, node_mem_map c0d23000
> [    0.000000]   Normal zone: 1024 pages used for memmap
> [    0.000000]   Normal zone: 0 pages reserved
> [    0.000000]   Normal zone: 125696 pages, LIFO batch:31
> [    0.000000] OMAP4430 ES2.0
> [    0.000000] PERCPU: Embedded 9 pages/cpu @c112b000 s14016 r8192 d14656 u36864
> [    0.000000] pcpu-alloc: s14016 r8192 d14656 u36864 alloc=9*4096
> [    0.000000] pcpu-alloc: [0] 0 [0] 1 
> [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 125696
> [    0.000000] Kernel command line: console=ttyO2,230400n8 vram=16M root=/dev/mmcblk0p2 rw rootfstype=ext3 rootwait earlyprintk debug ignore_loglevel
> [    0.000000] PID hash table entries: 2048 (order: 1, 8192 bytes)
> [    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
> [    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
> [    0.000000] Memory: 495MB = 495MB total
> [    0.000000] Memory: 488828k/488828k available, 35460k reserved, 0K highmem
> [    0.000000] Virtual kernel memory layout:
> [    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
> [    0.000000]     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
> [    0.000000]     vmalloc : 0xe0800000 - 0xff000000   ( 488 MB)
> [    0.000000]     lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
> [    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
> [    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
> [    0.000000]       .text : 0xc0008000 - 0xc06dad24   (6988 kB)
> [    0.000000]       .init : 0xc06db000 - 0xc072a6c0   ( 318 kB)
> [    0.000000]       .data : 0xc072c000 - 0xc07c8940   ( 627 kB)
> [    0.000000]        .bss : 0xc07c8964 - 0xc0d22a4c   (5481 kB)
> [    0.000000] Hierarchical RCU implementation.
> [    0.000000]  Additional per-CPU info printed with stalls.
> [    0.000000] NR_IRQS:474
> [    0.000000] omap_hwmod: dpll_mpu_m2_ck: missing clockdomain for dpll_mpu_m2_ck.
> [    0.000000] OMAP clockevent source: GPTIMER1 at 32768 Hz
> [    0.000000] omap_hwmod: sys_32k_ck: missing clockdomain for sys_32k_ck.
> [    0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
> [    0.000000] OMAP clocksource: 32k_counter at 32768 Hz
> [    0.000000] Console: colour dummy device 80x30
> [    0.000000] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
> [    0.000000] ... MAX_LOCKDEP_SUBCLASSES:  8
> [    0.000000] ... MAX_LOCK_DEPTH:          48
> [    0.000000] ... MAX_LOCKDEP_KEYS:        8191
> [    0.000000] ... CLASSHASH_SIZE:          4096
> [    0.000000] ... MAX_LOCKDEP_ENTRIES:     16384
> [    0.000000] ... MAX_LOCKDEP_CHAINS:      32768
> [    0.000000] ... CHAINHASH_SIZE:          16384
> [    0.000000]  memory used by lock dependency info: 3695 kB
> [    0.000000]  per task-struct memory footprint: 1152 bytes
> [    0.001190] Calibrating delay loop... 2007.19 BogoMIPS (lpj=7839744)
> [    0.124725] pid_max: default: 32768 minimum: 301
> [    0.125518] Security Framework initialized
> [    0.125793] Mount-cache hash table entries: 512
> [    0.131286] CPU: Testing write buffer coherency: ok
> [    0.132751] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
> [    0value (unknown osc rate)
> [    0.402343] twl 1-0048: PIH (irq 39) chaining IRQs 352..372
> [    0.403930] VUSB: 3300 mV normal standby
> [    0.404907] vdd_mpu: 500 <--> 1500 mV normal 
> [    0.405792] vdd_iva: 500 <--> 1500 mV normal 
> [    0.406890] vdd_core: 500 <--> 1500 mV normal 
> [    0.408081] V1V8: 1800 mV normal standby
> [    0.408966] V2V1: 2100 mV normal standby
> [    0.409973] VMMC: 1200 <--> 3000 mV at 3000 mV normal standby
> [    0.411132] VPP: 1800 <--> 2500 mV at 1900 mV normal standby
> [    0.412445] VCXIO: 1800 mV normal standby
> [    0.412506] VCXIO: supplied by V2V1
> [    0.414367] VDAC: 1800 mV normal standby
> [    0.414459] VDAC: supplied by V2V1
> [    0.415863] VAUX2_6030: 1200 <--> 2800 mV at 1800 mV normal standby
> [    0.417053] VAUX3_6030: 1000 <--> 3000 mV at 1200 mV normal standby
> [    0.418487] VANA: 2100 mV normal standby
> [    0.430084] omap_i2c omap_i2c.2: bus 2 rev2.4.0 at 400 kHz
> [    0gistered twl_rtc as rtc0
> [    1.241333] i2c /dev entries driver
> [    1.245208] Driver for 1-wire Dallas network protocol.
> [    1.249908] omap_wdt: OMAP Watchdog Timer Rev 0x00: initial timeout 60 sec
> [    1.254272] cpuidle: using governor ladder
> [    1.256530] cpuidle: using governor menu
> [    1.259948] omap_hsmmc omap_hsmmc.0: Failed to get debounce clk
> [    1.263305] omap-dma-engine omap-dma-engine: allocating channel for 62
> [    1.266906] omap-dma-engine omap-dma-engine: allocating channel for 61
> [    1.317443] omap_hsmmc omap_hsmmc.4: Failed to get debounce clk
> [    1.320709] omap-dma-engine omap-dma-engine: allocating channel for 60
> [    1.324279] omap-dma-engine omap-dma-engine: allocating channel for 59
> [    1.444854] usbcore: registered new interface driver usbhid
> [    1.447906] usbhid: USB HID core driver
> [    1.450012] oprofile: hardware counters not available
> [    1.452789] oprofile: using timer interrupt.
> [    1.455657] TCP: cubic registered
> [    1.457489] Initializing XFRM netlink socket
> [    1.459930] NET: Registered protocol family 17
> [    1.462402] NET: Registered protocol family 15
> [    1.465057] Key type dns_resolver registered
> [    1.467559] VFP support v0.3: implementor 41 architecture 3 part 30 variant 9 rev 1
> [    1.476593] omap_vc_i2c_init: I2C config for vdd_iva does not match other channels (0).
> [    1.480895] omap_vc_i2c_init: I2C config for vdd_mpu does not match other channels (0).[    1.486236] Power Management for TI OMAP4.
> [    1.489135] ThumbEE CPU extension supported.
> [    1.508178] clock: disabling unused clocks to save power
> [    1.513977] VANA: incomplete constraints, leaving on
> [    1.517669] VDAC: incomplete constraints, leaving on
> [    1.521118] VUSB: incomplete constraints, leaving on
> [    1.525665] twl_rtc twl_rtc: setting system clock to 2000-01-01 00:00:00 UTC (946684800)
> [    1.532409] Waiting for root device /dev/mmcblk0p2...
> [    1.599426] mmc0: host does not support reading read-only switch. assuming write-enable.
> [    1.603942] mmc0: new SD card at address aaaa
> [    1.607849] mmcblk0: mmc0:aaaa SD02G 1.89 GiB 
> [    1.614410]  mmcblk0: p1 p2
> [    3.508514] kjournald starting.  Commit interval 5 seconds
> [    3.668579] EXT3-fs (mmcblk0p2): using internal journal
> [    3.675384] EXT3-fs (mmcblk0p2): recovery complete
> [    3.678039] EXT3-fs (mmcblk0p2): mounted filesystem with ordered data mode
> [    3.682037] VFS: Mounted root (ext3 filesystem) on device 179:2.
> [    3.685607] Freeing init memory: 316K
> modprobe: FATAL: Could not load /lib/modules/3.6.0-rc6-00025-g4651afb/modules.dep: No such file or directory
> 
> INIT: version 2.88 booting
> Using makefile-style concurrent boot in runlevel S.
> Starting the hotplug events dispatcher: udevd.
> Synthesizing the initial hotplug events...done.
> Waiting for /dev to be fully populated...done.
> Activating swap...done.
> Cleaning up ifupdown....
> Setting up networking....
> Loading kernel modules...done.
> Activating lvm and md swap...done.
> Checking file systems...fsck from util-linux 2.19.1
> done.
> Mounting local filesystems...done.
> Activating swapfile swap...done.
> Cleaning up temporary files....
> Configuring network interfaces...done.
> Cleaning up temporary files....
> Setting kernel variables ...done.
> INIT: Entering runlevel: 2
> Using makefile-style concurrent boot in runlevel 2.
> 
> Debian GNU/Linux wheezy/sid armel ttyO2
> 
> armel login: [  305.848632] INFO: rcu_sched self-detected stall on CPU
> [  305.851470]  1: (17 GPs behind) idle=439/1/0 
> [  305.853851]   (t=33915 jiffies)
> [  305.855621] [<c001bcb0>] (unwind_backtrace+0x0/0xf0) from [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678)
> [  305.860809] [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678) from [<c00537e8>] (update_process_times+0x38/0x68)
> [  305.866210] [<c00537e8>] (update_process_times+0x38/0x68) from [<c008cf28>] (tick_sched_timer+0x80/0xec)
> [  305.871398] [<c008cf28>] (tick_sched_timer+0x80/0xec) from [<c0069414>] (__run_hrtimer+0x7c/0x1e0)
> [  305.876281] [<c0069414>] (__run_hrtimer+0x7c/0x1e0) from [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0)
> [  305.881317] [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3ec>] (twd_handler+0x30/0x44)
> [  305.886199] [<c001a3ec>] (twd_handler+0x30/0x44) from [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c)
> [  305.891326] [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4884>] (generic_handle_irq+0x30/0x48)
> [  305.896820] [<c00a4884>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
> [  305.901611] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
> [  305.906219] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fdf64>] (__irq_svc+0x44/0x5c)
> [  305.910766] Exception stack(0xde86ff08 to 0xde86ff50)
> [  305.913513] ff00:                   0002a80e 00000001 00000000 de8660c0 35b88a9a 00000047
> [  305.917968] ff20: c11345e0 00000002 efc9ba42 00000046 c0509618 00000000 00000001 de86ff50
> [  305.922424] ff40: 0002a80f c03f8f1c 20000113 ffffffff
> [  305.925201] [<c04fdf64>] (__irq_svc+0x44/0x5c) from [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0)
> [  305.929962] [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0) from [<c03f8a10>] (cpuidle_enter_state+0x14/0x68)
> [  305.935180] [<c03f8a10>] (cpuidle_enter_state+0x14/0x68) from [<c03fab34>] (cpuidle_enter_state_coupled+0x240/0x2e8)
> [  305.940917] [<c03fab34>] (cpuidle_enter_state_coupled+0x240/0x2e8) from [<c03f8b48>] (cpuidle_idle_call+0xe4/0x2e0)
> [  305.946594] [<c03f8b48>] (cpuidle_idle_call+0xe4/0x2e0) from [<c0015358>] (cpu_idle+0x98/0x124)
> [  305.951324] [<c0015358>] (cpu_idle+0x98/0x124) from [<804f6df4>] (0x804f6df4)
> root
> [  375.629669] INFO: rcu_sched self-detected stall on CPU
> [  375.632476]  0: (2 GPs behind) idle=c1b/1/0 
> [  375.634796]   (t=8914 jiffies)
> [  375.636505] [<c001bcb0>] (unwind_backtrace+0x0/0xf0) from [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678)
> [  375.641693] [<c00adcd0>] (rcu_check_callbacks+0x1b0/0x678) from [<c00537e8>] (update_process_times+0x38/0x68)
> [  375.647094] [<c00537e8>] (update_process_times+0x38/0x68) from [<c008cf28>] (tick_sched_timer+0x80/0xec)
> [  375.652252] [<c008cf28>] (tick_sched_timer+0x80/0xec) from [<c0069414>] (__run_hrtimer+0x7c/0x1e0)
> [  375.657135] [<c0069414>] (__run_hrtimer+0x7c/0x1e0) from [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0)
> [  375.662170] [<c006a1f8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3ec>] (twd_handler+0x30/0x44)
> [  375.667053] [<c001a3ec>] (twd_handler+0x30/0x44) from [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c)
> [  375.672180] [<c00a8110>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4884>] (generic_handle_irq+0x30/0x48)
> [  375.677642] [<c00a4884>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
> [  375.682434] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
> [  375.687042] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fdf64>] (__irq_svc+0x44/0x5c)
> [  375.691589] Exception stack(0xc072df08 to 0xc072df50)
> [  375.694366] df00:                   0002fb7e 00000001 00000000 c074e8e8 74fea80e 00000057
> [  375.698822] df20: c112b5e0 00000000 74fc5402 00000057 c0d1c94c 00000000 00000001 c072df50
> [  375.703277] df40: 0002fb7f c03f8f1c 20000013 ffffffff
> [  375.706024] [<c04fdf64>] (__irq_svc+0x44/0x5c) from [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0)
> [  375.710784] [<c03f8f1c>] (cpuidle_wrap_enter+0x4c/0xa0) from [<c03f8ba8>] (cpuidle_idle_call+0x144/0x2e0)
> [  375.716003] [<c03f8ba8>] (cpuidle_idle_call+0x144/0x2e0) from [<c0015358>] (cpu_idle+0x98/0x124)
> [  375.720794] [<c0015358>] (cpu_idle+0x98/0x124) from [<c06db7b0>] (start_kernel+0x2b4/0x304)
> Last login: Sat Jan  1 00:09:14 UTC 2000 on ttyO2
> Linux armel 3.6.0-rc6-00025-g4651afb #342 SMP Fri Sep 21 11:28:00 MDT 2012 armv7l
> 
> The programs included with the Debian GNU/Linux system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
> 
> Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
> permitted by applicable law.
> root at armel:~# cat /sys/devices/system/cpu/cpuidle/*
> omap4_idle
> menu
> root at armel:~# 
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 20:31                         ` Tony Lindgren
@ 2012-09-21 22:03                           ` Paul E. McKenney
  2012-09-22 15:45                             ` Frederic Weisbecker
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-21 22:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> [120921 12:58]:
> > 
> > Just to make sure I understand the combinations:
> > 
> > o	All stalls have happened when running a minimal userspace.
> > o	CONFIG_NO_HZ=n suppresses the stalls.
> > o	CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> > 	no observable effect on the stalls.
> 
> The reason why you may need minimal userspace is to cut down
> the number of timers waking up the system with NO_HZ.
> Booting with init=/bin/sh might also do the trick for that.

Good point!  This does make for a very quiet system, but does not
reproduce the problem under kvm, even after waiting for four minutes.
I will leave it for more time, but it looks like I really might need to
ask Linaro for remote access to a Panda.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 19:57                       ` Paul E. McKenney
  2012-09-21 20:31                         ` Tony Lindgren
@ 2012-09-21 22:12                         ` Paul E. McKenney
  2012-09-22 18:42                         ` Paul Walmsley
  2 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-21 22:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 21, 2012 at 12:57:17PM -0700, Paul E. McKenney wrote:
> On Fri, Sep 21, 2012 at 07:11:14PM +0000, Paul Walmsley wrote:
> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:

[ . . . ]

> > > I may take your advice of remote access to a Panda board, though that
> > > is likely to take a bit of time due to timezones.  Regardless of the
> > > underlying issue here, I clearly need to make the stall-warning messages
> > > do a better job of printing out needed information.
> > 
> > If you've got a patch in mind for that, I'll boot it here.
> 
> Hammering it out, will send it along when it is a bit less destructive.  ;-)

And I eventually decided to keep things simple, please see below.
This applies on top of https://lkml.org/lkml/2012/8/30/290.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Add grace-period information to RCU CPU stall warnings

This commit causes the last grace period started and completed to be
printed on RCU CPU stall warning messages in order to aid diagnosis.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cdaa7aa..fdb6854 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -925,8 +925,9 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 	print_cpu_stall_info_end();
-	printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
-	       smp_processor_id(), (long)(jiffies - rsp->gp_start));
+	pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu)\n",
+	       smp_processor_id(), (long)(jiffies - rsp->gp_start),
+	       rsp->gpnum, rsp->completed);
 	if (ndetected == 0)
 		printk(KERN_ERR "INFO: Stall ended before state dump start\n");
 	else if (!trigger_all_cpu_backtrace())
@@ -953,7 +954,8 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	print_cpu_stall_info_begin();
 	print_cpu_stall_info(rsp, smp_processor_id());
 	print_cpu_stall_info_end();
-	printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
+	pr_cont(" (t=%lu jiffies g=%lu c=%lu)\n",
+		jiffies - rsp->gp_start, rsp->gpnum, rsp->completed);
 	if (!trigger_all_cpu_backtrace())
 		dump_stack();
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 21:20                 ` Paul E. McKenney
@ 2012-09-21 22:41                   ` Paul Walmsley
  2012-09-22  0:05                     ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-21 22:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> On Fri, Sep 21, 2012 at 05:47:31PM +0000, Paul Walmsley wrote:
>
> > I built an OMAP kernel from Linus' commit 
> > 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
> > was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
> > on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
> 
> Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?

No, it's just as described above.

> If not, could you please try it?  (This patch cleared up a similar
> problem for Becky, also on OMAP.)

Did not seem to help, either with or without CONFIG_CPU_IDLE.


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 22:41                   ` Paul Walmsley
@ 2012-09-22  0:05                     ` Paul E. McKenney
  2012-09-22 18:16                       ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22  0:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 21, 2012 at 10:41:14PM +0000, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > On Fri, Sep 21, 2012 at 05:47:31PM +0000, Paul Walmsley wrote:
> >
> > > I built an OMAP kernel from Linus' commit 
> > > 4651afbbae968772efd6dc4ba461cba9b49bb9d8 ("Merge branch 'for-3.6-fixes' of 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq").  The config used 
> > > was 'omap2plus_defconfig', and enabled CONFIG_CPU_IDLE by hand.  Booted it 
> > > on a Pandaboard (OMAP4430ES2) into a very minimal Debian rootfs.
> > 
> > Did you have the patch at https://lkml.org/lkml/2012/8/30/290 applied?
> 
> No, it's just as described above.
> 
> > If not, could you please try it?  (This patch cleared up a similar
> > problem for Becky, also on OMAP.)
> 
> Did not seem to help, either with or without CONFIG_CPU_IDLE.

I was hoping!  ;-)

And my init=/bin/sh kernel ran idle for more than an hour without
any RCU CPU stall warnings...

I am wondering if your system somehow figured out how to start a grace
period that had no RCU callbacks waiting for it.  If that happened,
then a CONFIG_NO_HZ=y system could in theory get into a state where all
CPUs are in dyntick-idle mode, so that none of them is doing anything
to force the grace period to complete.

That should be easy to diagnose, anyway.  Please see below, which
includes the earlier diagnostic patch.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 307caf1..696f189 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -879,6 +879,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	unsigned long flags;
 	int ndetected = 0;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	long totqlen = 0;
 
 	/* Only let one CPU complain about others per time interval. */
 
@@ -923,8 +924,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 	print_cpu_stall_info_end();
-	printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
-	       smp_processor_id(), (long)(jiffies - rsp->gp_start));
+	for_each_possible_cpu(cpu)
+		totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+	pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu, q=%lu)\n",
+	       smp_processor_id(), (long)(jiffies - rsp->gp_start),
+	       rsp->gpnum, rsp->completed, totqlen);
 	if (ndetected == 0)
 		printk(KERN_ERR "INFO: Stall ended before state dump start\n");
 	else if (!trigger_all_cpu_backtrace())
@@ -939,8 +943,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 
 static void print_cpu_stall(struct rcu_state *rsp)
 {
+	int cpu;
 	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	long totqlen = 0;
 
 	/*
 	 * OK, time to rat on ourselves...
@@ -951,7 +957,10 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	print_cpu_stall_info_begin();
 	print_cpu_stall_info(rsp, smp_processor_id());
 	print_cpu_stall_info_end();
-	printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
+	for_each_possible_cpu(cpu)
+		totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+	pr_cont(" (t=%lu jiffies g=%lu c=%lu q=%lu)\n",
+		jiffies - rsp->gp_start, rsp->gpnum, rsp->completed, totqlen);
 	if (!trigger_all_cpu_backtrace())
 		dump_stack();
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 22:03                           ` Paul E. McKenney
@ 2012-09-22 15:45                             ` Frederic Weisbecker
  2012-09-22 16:00                               ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2012-09-22 15:45 UTC (permalink / raw)
  To: linux-arm-kernel

2012/9/22 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
>> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> [120921 12:58]:
>> >
>> > Just to make sure I understand the combinations:
>> >
>> > o   All stalls have happened when running a minimal userspace.
>> > o   CONFIG_NO_HZ=n suppresses the stalls.
>> > o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
>> >     no observable effect on the stalls.
>>
>> The reason why you may need minimal userspace is to cut down
>> the number of timers waking up the system with NO_HZ.
>> Booting with init=/bin/sh might also do the trick for that.
>
> Good point!  This does make for a very quiet system, but does not
> reproduce the problem under kvm, even after waiting for four minutes.
> I will leave it for more time, but it looks like I really might need to
> ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it and I'll try to
manage to build
a kernel and reproduce the issue.

I'll give more news soon.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 15:45                             ` Frederic Weisbecker
@ 2012-09-22 16:00                               ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22 16:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 05:45:12PM +0200, Frederic Weisbecker wrote:
> 2012/9/22 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> > On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
> >> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> [120921 12:58]:
> >> >
> >> > Just to make sure I understand the combinations:
> >> >
> >> > o   All stalls have happened when running a minimal userspace.
> >> > o   CONFIG_NO_HZ=n suppresses the stalls.
> >> > o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> >> >     no observable effect on the stalls.
> >>
> >> The reason why you may need minimal userspace is to cut down
> >> the number of timers waking up the system with NO_HZ.
> >> Booting with init=/bin/sh might also do the trick for that.
> >
> > Good point!  This does make for a very quiet system, but does not
> > reproduce the problem under kvm, even after waiting for four minutes.
> > I will leave it for more time, but it looks like I really might need to
> > ask Linaro for remote access to a Panda.
> 
> I have one. I'm currently installing Ubuntu on it and I'll try to
> manage to build
> a kernel and reproduce the issue.
> 
> I'll give more news soon.

Thank you!

My bet is that you have to have a userspace that is so small that it
registers only a few (but at least one!) RCU callback at boot time,
then never registers any callbacks ever again.  I have coded up a
crude test case, using Tony Lindgren's suggestion of "init=/bin/sh",
but I appear to have inadvertently fixed this bug in current -rcu
(git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git,
branch rcu/next).

But I have been wrong a few times already on this particular bug...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22  0:05                     ` Paul E. McKenney
@ 2012-09-22 18:16                       ` Paul Walmsley
  2012-09-22 19:52                         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-22 18:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> I am wondering if your system somehow figured out how to start a grace
> period that had no RCU callbacks waiting for it.  If that happened,
> then a CONFIG_NO_HZ=y system could in theory get into a state where all
> CPUs are in dyntick-idle mode, so that none of them is doing anything
> to force the grace period to complete.
>
> That should be easy to diagnose, anyway.  Please see below, which
> includes the earlier diagnostic patch.

Here you go.
 
- Paul

[  248.902618] INFO: rcu_sched self-detected stall on CPU
[  248.905456]  0: (1 ticks this GP) idle=933/1/0 
[  248.907897]   (t=26570 jiffies g=11 c=10 q=0)
[  248.910339] [<c001bc90>] (unwind_backtrace+0x0/0xf0) from [<c00ad800>] (rcu_check_callbacks+0x220/0x714)
[  248.915527] [<c00ad800>] (rcu_check_callbacks+0x220/0x714) from [<c00532a0>] (update_process_times+0x38/0x68)
[  248.920928] [<c00532a0>] (update_process_times+0x38/0x68) from [<c008c9e8>] (tick_sched_timer+0x80/0xec)
[  248.926116] [<c008c9e8>] (tick_sched_timer+0x80/0xec) from [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0)
[  248.930999] [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0) from [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0)
[  248.936035] [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3cc>] (twd_handler+0x30/0x44)
[  248.940948] [<c001a3cc>] (twd_handler+0x30/0x44) from [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c)
[  248.946075] [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4344>] (generic_handle_irq+0x30/0x48)
[  248.951538] [<c00a4344>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
[  248.956329] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
[  248.960937] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fb1a4>] (__irq_svc+0x44/0x5c)
[  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
[  248.968231] 9f40:                                                       0003b832 00000001
[  248.972686] 9f60: 00000000 c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 00000000 411fc092
[  248.977142] 9f80: c074bfe8 00000000 00000001 c0729fa0 0003b833 c0015130 20000113 ffffffff
[  248.981597] [<c04fb1a4>] (__irq_svc+0x44/0x5c) from [<c0015130>] (default_idle+0x20/0x44)
[  248.986083] [<c0015130>] (default_idle+0x20/0x44) from [<c001535c>] (cpu_idle+0x9c/0x114)
[  248.990539] [<c001535c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-21 19:57                       ` Paul E. McKenney
  2012-09-21 20:31                         ` Tony Lindgren
  2012-09-21 22:12                         ` Paul E. McKenney
@ 2012-09-22 18:42                         ` Paul Walmsley
  2012-09-22 20:10                           ` Paul E. McKenney
  2 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-22 18:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

> Could you please point me to a recipe for creating a minimal userspace?
> Just in case it is the userspac erather than the architecture/hardware
> that makes the difference.

Tony's suggestion is pretty good.  Note that there may also be differences 
in kernel timers -- either between x86 and ARM architectures, or loaded 
device drivers -- that may confound the problem.

> Just to make sure I understand the combinations:
> 
> o	All stalls have happened when running a minimal userspace.
> o	CONFIG_NO_HZ=n suppresses the stalls.
> o	CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> 	no observable effect on the stalls.
> 
> Did I get that right, or am I missing a combination?

That's correct.

> Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
> cancelled upon exit from idle, and therefore should (almost) never 
> actually execute. Its sole purpose is to wake up the CPU.  ;-)

Right.  Just curious, what would wake up the kernel from idle to handle a 
grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
system, the time between timer ticks could potentially be several tens of 
seconds.


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 18:16                       ` Paul Walmsley
@ 2012-09-22 19:52                         ` Paul E. McKenney
  2012-09-22 22:20                           ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22 19:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 06:16:15PM +0000, Paul Walmsley wrote:
> Hi Paul
> 
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > I am wondering if your system somehow figured out how to start a grace
> > period that had no RCU callbacks waiting for it.  If that happened,
> > then a CONFIG_NO_HZ=y system could in theory get into a state where all
> > CPUs are in dyntick-idle mode, so that none of them is doing anything
> > to force the grace period to complete.
> >
> > That should be easy to diagnose, anyway.  Please see below, which
> > includes the earlier diagnostic patch.
> 
> Here you go.
> 
> - Paul
> 
> [  248.902618] INFO: rcu_sched self-detected stall on CPU
> [  248.905456]  0: (1 ticks this GP) idle=933/1/0 
> [  248.907897]   (t=26570 jiffies g=11 c=10 q=0)

Bingo!!!  (q=0, in case you were wondering.  And thank you for testing this!)

Strangely enough, I believe that I have inadvertently fixed this in
my -rcu tree:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next

Nevertheless, if you get a chance to try it, I would be interested to
hear if my guess is correct.  The trick is that a kthread drives the
grace period in -rcu, regardless of whether or not there are callbacks.

However, the backport would not be something that -stable would be happy
with, so I will be putting together a fix for mainline.  This thing
has been in the kernel since about 2004, not sure why you didn't hit
it earlier.

							Thanx, Paul

> [  248.910339] [<c001bc90>] (unwind_backtrace+0x0/0xf0) from [<c00ad800>] (rcu_check_callbacks+0x220/0x714)
> [  248.915527] [<c00ad800>] (rcu_check_callbacks+0x220/0x714) from [<c00532a0>] (update_process_times+0x38/0x68)
> [  248.920928] [<c00532a0>] (update_process_times+0x38/0x68) from [<c008c9e8>] (tick_sched_timer+0x80/0xec)
> [  248.926116] [<c008c9e8>] (tick_sched_timer+0x80/0xec) from [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0)
> [  248.930999] [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0) from [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0)
> [  248.936035] [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3cc>] (twd_handler+0x30/0x44)
> [  248.940948] [<c001a3cc>] (twd_handler+0x30/0x44) from [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c)
> [  248.946075] [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4344>] (generic_handle_irq+0x30/0x48)
> [  248.951538] [<c00a4344>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
> [  248.956329] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
> [  248.960937] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fb1a4>] (__irq_svc+0x44/0x5c)
> [  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
> [  248.968231] 9f40:                                                       0003b832 00000001
> [  248.972686] 9f60: 00000000 c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 00000000 411fc092
> [  248.977142] 9f80: c074bfe8 00000000 00000001 c0729fa0 0003b833 c0015130 20000113 ffffffff
> [  248.981597] [<c04fb1a4>] (__irq_svc+0x44/0x5c) from [<c0015130>] (default_idle+0x20/0x44)
> [  248.986083] [<c0015130>] (default_idle+0x20/0x44) from [<c001535c>] (cpu_idle+0x9c/0x114)
> [  248.990539] [<c001535c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 18:42                         ` Paul Walmsley
@ 2012-09-22 20:10                           ` Paul E. McKenney
  2012-09-22 21:59                             ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22 20:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 06:42:08PM +0000, Paul Walmsley wrote:
> On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> > Could you please point me to a recipe for creating a minimal userspace?
> > Just in case it is the userspac erather than the architecture/hardware
> > that makes the difference.
> 
> Tony's suggestion is pretty good.  Note that there may also be differences 
> in kernel timers -- either between x86 and ARM architectures, or loaded 
> device drivers -- that may confound the problem.

For example, there must be at least one RCU callback outstanding after
the boot sequence quiets down.  Of course, the last time I tried Tony's
approach, I was doing it on top of my -rcu stack, so am retrying on
v3.6-rc6.

> > Just to make sure I understand the combinations:
> > 
> > o	All stalls have happened when running a minimal userspace.
> > o	CONFIG_NO_HZ=n suppresses the stalls.
> > o	CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> > 	no observable effect on the stalls.
> > 
> > Did I get that right, or am I missing a combination?
> 
> That's correct.
> 
> > Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
> > cancelled upon exit from idle, and therefore should (almost) never 
> > actually execute. Its sole purpose is to wake up the CPU.  ;-)
> 
> Right.  Just curious, what would wake up the kernel from idle to handle a 
> grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
> system, the time between timer ticks could potentially be several tens of 
> seconds.

If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
be awakened every jiffy.  The problem is that there appears to be a way
to get an RCU grace period started without any CPU having any callbacks,
which, as you surmise, would result in all the CPUs going to sleep and
the grace period never ending.  So if a CPU is awakened for any reason
after this everlasting grace period has extended for more than a minute,
the first thing that CPU will do is print an RCU CPU stall warning.

I believe that I see how to prevent callback-free grace periods from
ever starting.  (Famous last words...)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 20:10                           ` Paul E. McKenney
@ 2012-09-22 21:59                             ` Paul E. McKenney
  2012-09-22 22:25                               ` Paul Walmsley
                                                 ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22 21:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
> On Sat, Sep 22, 2012 at 06:42:08PM +0000, Paul Walmsley wrote:
> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> > 
> > > Could you please point me to a recipe for creating a minimal userspace?
> > > Just in case it is the userspac erather than the architecture/hardware
> > > that makes the difference.
> > 
> > Tony's suggestion is pretty good.  Note that there may also be differences 
> > in kernel timers -- either between x86 and ARM architectures, or loaded 
> > device drivers -- that may confound the problem.
> 
> For example, there must be at least one RCU callback outstanding after
> the boot sequence quiets down.  Of course, the last time I tried Tony's
> approach, I was doing it on top of my -rcu stack, so am retrying on
> v3.6-rc6.
> 
> > > Just to make sure I understand the combinations:
> > > 
> > > o	All stalls have happened when running a minimal userspace.
> > > o	CONFIG_NO_HZ=n suppresses the stalls.
> > > o	CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
> > > 	no observable effect on the stalls.
> > > 
> > > Did I get that right, or am I missing a combination?
> > 
> > That's correct.
> > 
> > > Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
> > > cancelled upon exit from idle, and therefore should (almost) never 
> > > actually execute. Its sole purpose is to wake up the CPU.  ;-)
> > 
> > Right.  Just curious, what would wake up the kernel from idle to handle a 
> > grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
> > system, the time between timer ticks could potentially be several tens of 
> > seconds.
> 
> If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
> to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
> be awakened every jiffy.  The problem is that there appears to be a way
> to get an RCU grace period started without any CPU having any callbacks,
> which, as you surmise, would result in all the CPUs going to sleep and
> the grace period never ending.  So if a CPU is awakened for any reason
> after this everlasting grace period has extended for more than a minute,
> the first thing that CPU will do is print an RCU CPU stall warning.
> 
> I believe that I see how to prevent callback-free grace periods from
> ever starting.  (Famous last words...)

And here is a patch.  I am still having trouble reproducing the problem,
but figured that I should avoid serializing things.

							Thanx, Paul

------------------------------------------------------------------------

 b/kernel/rcutree.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

rcu: Fix day-one dyntick-idle stall-warning bug

Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

Unless RCU CPU stall warnings are also enabled, as they are in recent
kernels.  In this case, if a CPU wakes up after at least one minute
of inactivity, an RCU CPU stall warning will result.  The reason that
no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that get into this mode quite easily.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait
for another grace period, never mind the fact that the grace period
that they were waiting for just completed.  This CPU can therefore
erroneously decide to start a new grace period.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU figures out that the old grace period has ended.

This change is to cpu_needs_another_gp(), which is called in a number
of places.  The only one that really matters is in rcu_start_gp(), where
the root rcu_node structure's ->lock is held, which prevents any
other CPU from starting or completing a grace period, so that the
comparison that determines whether the CPU is missing the completion
of a grace period is stable.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f280e54..f7bcd9e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -305,7 +305,9 @@ cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
 static int
 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 {
-	return *rdp->nxttail[RCU_DONE_TAIL] && !rcu_gp_in_progress(rsp);
+	return *rdp->nxttail[RCU_DONE_TAIL +
+			     ACCESS_ONCE(rsp->completed) != rdp->completed] &&
+	       !rcu_gp_in_progress(rsp);
 }

 /*

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 19:52                         ` Paul E. McKenney
@ 2012-09-22 22:20                           ` Paul Walmsley
  2012-09-22 23:17                             ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-22 22:20 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> Strangely enough, I believe that I have inadvertently fixed this in
> my -rcu tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
> 
> Nevertheless, if you get a chance to try it, I would be interested to
> hear if my guess is correct.

Yes, good news: the stall warnings go away with that branch.

> The trick is that a kthread drives the grace period in -rcu, regardless 
> of whether or not there are callbacks.

This is "rcu: Move quiescent-state forcing into kthread" ?

Added some debugging into rcu_gp_kthread() after that commit and can 
confirm that the quiescent-state forcing loop does start a few times when 
there are zero callbacks pending (modulo any races in my measurement 
code).

> However, the backport would not be something that -stable would be happy
> with, so I will be putting together a fix for mainline.  This thing
> has been in the kernel since about 2004, not sure why you didn't hit
> it earlier.

One other data point in that regard - noticed the warnings don't appear 
when the board is booted with:

commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
Author: Paul E. McKenney <paul.mckenney@linaro.org>
Date:   Tue Jun 5 15:53:53 2012 -0700

    rcu: Fix qlen_lazy breakage

...


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 21:59                             ` Paul E. McKenney
@ 2012-09-22 22:25                               ` Paul Walmsley
  2012-09-22 23:11                                 ` Paul E. McKenney
  2012-09-23  1:42                                 ` Paul Walmsley
  2012-09-24  9:41                               ` Shilimkar, Santosh
  2012-10-01  8:55                               ` Linus Walleij
  2 siblings, 2 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-22 22:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> And here is a patch.  I am still having trouble reproducing the problem,
> but figured that I should avoid serializing things.

Thanks, testing this now on v3.6-rc6.  One question though about the patch 
description:

> All this begs the question of exactly how a callback-free grace period
> gets started in the first place.  This can happen due to the fact that
> CPUs do not necessarily agree on which grace period is in progress.
> If a CPU still believes that the grace period that just completed is
> still ongoing, it will believe that it has callbacks that need to wait
> for another grace period, never mind the fact that the grace period
> that they were waiting for just completed.  This CPU can therefore
> erroneously decide to start a new grace period.

Doesn't this imply that this bug would only affect multi-CPU systems?  

The recent tests here have been on Pandaboard, which is dual-CPU, but my 
recollection is that I also observed the warnings on a single-core 
Beagleboard.  Will re-test.


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 22:25                               ` Paul Walmsley
@ 2012-09-22 23:11                                 ` Paul E. McKenney
  2012-09-23  7:55                                   ` Paul Walmsley
  2012-09-23  1:42                                 ` Paul Walmsley
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22 23:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 10:25:59PM +0000, Paul Walmsley wrote:
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> 
> Thanks, testing this now on v3.6-rc6.

Very cool, thank you!

>                                        One question though about the patch 
> description:
> 
> > All this begs the question of exactly how a callback-free grace period
> > gets started in the first place.  This can happen due to the fact that
> > CPUs do not necessarily agree on which grace period is in progress.
> > If a CPU still believes that the grace period that just completed is
> > still ongoing, it will believe that it has callbacks that need to wait
> > for another grace period, never mind the fact that the grace period
> > that they were waiting for just completed.  This CPU can therefore
> > erroneously decide to start a new grace period.
> 
> Doesn't this imply that this bug would only affect multi-CPU systems?  

Surprisingly not, at least when running TREE_RCU or TREE_PREEMPT_RCU.
In order to keep lock contention down to a dull roar on larger systems,
TREE_RCU keeps three sets of books: (1) the global state in the rcu_state
structure, (2) the combining-tree per-node state in the rcu_node
structure, and the per-CPU state in the rcu_data structure.  A CPU is
not officially aware of the end of a grace period until it is reflected
in its rcu_data structure.  This has the perhaps-surprising consequence
that the CPU that detected the end of the old grace period might start
a new one before becoming officially aware that the old one ended.

Why not have the CPU inform itself immediately upon noticing that the
old grace period ended?  Deadlock.  The rcu_node locks must be acquired
from leaf towards root, and the CPU is holding the root rcu_node lock
when it notices that the grace period has ended.

I have made this a bit less problematic in the bigrt branch, working
towards a goal of getting RCU into a state where automatic formal
validation might one day be possible.  And yes, I am starting to get some
formal-validation people interested in this lofty goal, see for example:
http://sites.google.com/site/popl13grace/paper.pdf.

> The recent tests here have been on Pandaboard, which is dual-CPU, but my 
> recollection is that I also observed the warnings on a single-core 
> Beagleboard.  Will re-test.

Anxiously awaiting the results.  This has been a strange one, even by
RCU's standards.

Plus I need to add a few Reported-by lines.  Next version...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 22:20                           ` Paul Walmsley
@ 2012-09-22 23:17                             ` Paul E. McKenney
  2012-09-24 21:54                               ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-22 23:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 10:20:19PM +0000, Paul Walmsley wrote:
> Hi Paul
> 
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > Strangely enough, I believe that I have inadvertently fixed this in
> > my -rcu tree:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
> > 
> > Nevertheless, if you get a chance to try it, I would be interested to
> > hear if my guess is correct.
> 
> Yes, good news: the stall warnings go away with that branch.

Very good!

> > The trick is that a kthread drives the grace period in -rcu, regardless 
> > of whether or not there are callbacks.
> 
> This is "rcu: Move quiescent-state forcing into kthread" ?

Yep, plus the preceding commits moving grace-period initialization and
cleanup into that same kthread.  This was motivated by a bug report
last February complaining about 200-microsecond latency spikes from
RCU grace-period initialization.  On systems with 4096 CPUs.

Real-time response.  It is far bigger than I thought.  ;-)

> Added some debugging into rcu_gp_kthread() after that commit and can 
> confirm that the quiescent-state forcing loop does start a few times when 
> there are zero callbacks pending (modulo any races in my measurement 
> code).

Cool, thank you!  Assuming it works, that indicates that there is long-term
value to the fix for this problem.  On larger systems, extra grace periods
are not what you want, as their expense increases with the number of CPUs.

> > However, the backport would not be something that -stable would be happy
> > with, so I will be putting together a fix for mainline.  This thing
> > has been in the kernel since about 2004, not sure why you didn't hit
> > it earlier.
> 
> One other data point in that regard - noticed the warnings don't appear 
> when the board is booted with:
> 
> commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
> Author: Paul E. McKenney <paul.mckenney@linaro.org>
> Date:   Tue Jun 5 15:53:53 2012 -0700
> 
>     rcu: Fix qlen_lazy breakage

You lost me on this one.  This is already in mainline, so if you were
using (say) 3.6-rc6, you would already have this commit applied.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 22:25                               ` Paul Walmsley
  2012-09-22 23:11                                 ` Paul E. McKenney
@ 2012-09-23  1:42                                 ` Paul Walmsley
  2012-09-23  1:56                                   ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-23  1:42 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> 
> Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

    http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:

    http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2

Booted that, and the stall warnings did not appear within 30 minutes.

To confirm that the problem being solved matched your hypothesis, the
debugging patch below[1] was added to the RCU idle entry/exit code.

Without the bugfix patch, a boot log transcript was obtained
indicating that the idle loop was entered with tick_nohz_enabled=1
during a grace period with no callbacks present:

    http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the
boot.  ENTER was pressed about 464 seconds in; this triggered the
rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that
indicated that the condition under test never occurred after waiting
about 20 minutes:

    http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's 
appreciated, and it's been quite instructive as well.  Will address some 
remaining loose ends in follow-up E-mails.


- Paul


[1] Debugging patch to printk() if the previous idle loop entry occurred 
with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
present:


---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@ void rcu_idle_enter(void)
 	unsigned long flags;
 	long long oldval;
 	struct rcu_dynticks *rdtp;
+	int cpu;
+	long totqlen = 0;
+	struct rcu_data *rdp;
 
 	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_sched_data);
 	rdtp = &__get_cpu_var(rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
 	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@ void rcu_idle_enter(void)
 	else
 		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 	rcu_idle_enter_common(rdtp, oldval);
+	if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
+		for_each_possible_cpu(cpu)
+			totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
+		if (totqlen == 0)
+			no_cbs_idle_entry_count = 1;
+	}
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@ void rcu_idle_exit(void)
 		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 	rcu_idle_exit_common(rdtp, oldval);
 	local_irq_restore(flags);
+	if (no_cbs_idle_entry_count) {
+		no_cbs_idle_entry_count = 0;
+		pr_err("* Tickless idle was entered with zero RCU callbacks\n");
+	}
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-23  1:42                                 ` Paul Walmsley
@ 2012-09-23  1:56                                   ` Paul E. McKenney
  2012-09-23  2:01                                     ` Paul Walmsley
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-23  1:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Sep 23, 2012 at 01:42:10AM +0000, Paul Walmsley wrote:
> Hi Paul
> 
> On Sat, 22 Sep 2012, Paul Walmsley wrote:
> 
> > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> > 
> > > And here is a patch.  I am still having trouble reproducing the problem,
> > > but figured that I should avoid serializing things.
> > 
> > Thanks, testing this now on v3.6-rc6.
> 
> Looks like you solved it!
> 
> Tested v3.6-rc6 + your stall diagnostic patch:
> 
>     http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2
> 
> on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
> CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.
> 
> Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:
> 
>     http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2
> 
> Booted that, and the stall warnings did not appear within 30 minutes.

Very cool, thank you for your testing efforts!!!

May I apply your Tested-by to this patch?

And good show on the debugging patch -- it is quite good to have such
solid evidence that the bug that the fix was intended for was actually
occurring.

						Thanx, Paul

> To confirm that the problem being solved matched your hypothesis, the
> debugging patch below[1] was added to the RCU idle entry/exit code.
> 
> Without the bugfix patch, a boot log transcript was obtained
> indicating that the idle loop was entered with tick_nohz_enabled=1
> during a grace period with no callbacks present:
> 
>     http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt
> 
> The debugging events started to appear at 1.867370 seconds into the
> boot.  ENTER was pressed about 464 seconds in; this triggered the
> rcu_sched stall traceback.
> 
> With the bugfix patch, a boot log transcript was obtained that
> indicated that the condition under test never occurred after waiting
> about 20 minutes:
> 
>     http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt
> 
> Thanks for being so willing to root-cause the issue, Paul; it's 
> appreciated, and it's been quite instructive as well.  Will address some 
> remaining loose ends in follow-up E-mails.
> 
> 
> - Paul
> 
> 
> [1] Debugging patch to printk() if the previous idle loop entry occurred 
> with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
> present:
> 
> 
> ---
>  kernel/rcutree.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index f1eb7ad..f42941b 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -60,6 +60,9 @@
> 
>  /* Data structures. */
> 
> +extern int tick_nohz_enabled;
> +static int no_cbs_idle_entry_count;
> +
>  static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
> 
>  #define RCU_STATE_INITIALIZER(sname, cr) { \
> @@ -400,8 +403,12 @@ void rcu_idle_enter(void)
>  	unsigned long flags;
>  	long long oldval;
>  	struct rcu_dynticks *rdtp;
> +	int cpu;
> +	long totqlen = 0;
> +	struct rcu_data *rdp;
> 
>  	local_irq_save(flags);
> +	rdp = &__get_cpu_var(rcu_sched_data);
>  	rdtp = &__get_cpu_var(rcu_dynticks);
>  	oldval = rdtp->dynticks_nesting;
>  	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
> @@ -410,6 +417,12 @@ void rcu_idle_enter(void)
>  	else
>  		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
>  	rcu_idle_enter_common(rdtp, oldval);
> +	if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
> +		for_each_possible_cpu(cpu)
> +			totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
> +		if (totqlen == 0)
> +			no_cbs_idle_entry_count = 1;
> +	}
>  	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_enter);
> @@ -503,6 +516,10 @@ void rcu_idle_exit(void)
>  		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
>  	rcu_idle_exit_common(rdtp, oldval);
>  	local_irq_restore(flags);
> +	if (no_cbs_idle_entry_count) {
> +		no_cbs_idle_entry_count = 0;
> +		pr_err("* Tickless idle was entered with zero RCU callbacks\n");
> +	}
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_exit);
> 
> -- 
> 1.7.10.4
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-23  1:56                                   ` Paul E. McKenney
@ 2012-09-23  2:01                                     ` Paul Walmsley
  0 siblings, 0 replies; 45+ messages in thread
From: Paul Walmsley @ 2012-09-23  2:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> Very cool, thank you for your testing efforts!!!

You're welcome.

> May I apply your Tested-by to this patch?

Please do:

Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP4430

Am testing on OMAP3730 (single-core) now.


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 23:11                                 ` Paul E. McKenney
@ 2012-09-23  7:55                                   ` Paul Walmsley
  2012-09-23 12:11                                     ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-23  7:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> On Sat, Sep 22, 2012 at 10:25:59PM +0000, Paul Walmsley wrote:
> 
> > The recent tests here have been on Pandaboard, which is dual-CPU, but my 
> > recollection is that I also observed the warnings on a single-core 
> > Beagleboard.  Will re-test.
> 
> Anxiously awaiting the results.

The same problem exists on BeagleBoard XM (OMAP3730, single-core 
Cortex-A8):

   http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-pre-fix.txt

and the same patch fixes it:

   http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-post-fix.txt

Please feel free to update my Tested-by:, if you wish.

Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430


- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-23  7:55                                   ` Paul Walmsley
@ 2012-09-23 12:11                                     ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-23 12:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Sep 23, 2012 at 07:55:50AM +0000, Paul Walmsley wrote:
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > On Sat, Sep 22, 2012 at 10:25:59PM +0000, Paul Walmsley wrote:
> > 
> > > The recent tests here have been on Pandaboard, which is dual-CPU, but my 
> > > recollection is that I also observed the warnings on a single-core 
> > > Beagleboard.  Will re-test.
> > 
> > Anxiously awaiting the results.
> 
> The same problem exists on BeagleBoard XM (OMAP3730, single-core 
> Cortex-A8):
> 
>    http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-pre-fix.txt
> 
> and the same patch fixes it:
> 
>    http://www.pwsan.com/omap/transcripts/20120922-beaglexm-rcu-stall-debug-post-fix.txt
> 
> Please feel free to update my Tested-by:, if you wish.
> 
> Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430

Very good, thank you very much!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 21:59                             ` Paul E. McKenney
  2012-09-22 22:25                               ` Paul Walmsley
@ 2012-09-24  9:41                               ` Shilimkar, Santosh
  2012-09-24 13:18                                 ` Paul E. McKenney
  2012-10-01  8:55                               ` Linus Walleij
  2 siblings, 1 reply; 45+ messages in thread
From: Shilimkar, Santosh @ 2012-09-24  9:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Sep 23, 2012 at 3:29 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
>> On Sat, Sep 22, 2012 at 06:42:08PM +0000, Paul Walmsley wrote:
>> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:

[...]

>
> And here is a patch.  I am still having trouble reproducing the problem,
> but figured that I should avoid serializing things.
>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
>  b/kernel/rcutree.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> rcu: Fix day-one dyntick-idle stall-warning bug
>
> Each grace period is supposed to have at least one callback waiting
> for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
> extra callback-free grace period is no big problem -- it will chew up
> a tiny bit of CPU time, but it will complete normally.  In contrast,
> CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
> sleep indefinitely, in turn indefinitely delaying completion of the
> callback-free grace period.  Given that nothing is waiting on this grace
> period, this is also not a problem.
>
> Unless RCU CPU stall warnings are also enabled, as they are in recent
> kernels.  In this case, if a CPU wakes up after at least one minute
> of inactivity, an RCU CPU stall warning will result.  The reason that
> no one noticed until quite recently is that most systems have enough
> OS noise that they will never remain absolutely idle for a full minute.
> But there are some embedded systems with cut-down userspace configurations
> that get into this mode quite easily.
>
> All this begs the question of exactly how a callback-free grace period
> gets started in the first place.  This can happen due to the fact that
> CPUs do not necessarily agree on which grace period is in progress.
> If a CPU still believes that the grace period that just completed is
> still ongoing, it will believe that it has callbacks that need to wait
> for another grace period, never mind the fact that the grace period
> that they were waiting for just completed.  This CPU can therefore
> erroneously decide to start a new grace period.
>
> Once this CPU notices that the earlier grace period completed, it will
> invoke its callbacks.  It then won't have any callbacks left.  If no
> other CPU has any callbacks, we now have a callback-free grace period.
>
> This commit therefore makes CPUs check more carefully before starting a
> new grace period.  This new check relies on an array of tail pointers
> into each CPU's list of callbacks.  If the CPU is up to date on which
> grace periods have completed, it checks to see if any callbacks follow
> the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
> follow the RCU_WAIT_TAIL segment.  The reason that this works is that
> the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
> as soon as the CPU figures out that the old grace period has ended.
>
> This change is to cpu_needs_another_gp(), which is called in a number
> of places.  The only one that really matters is in rcu_start_gp(), where
> the root rcu_node structure's ->lock is held, which prevents any
> other CPU from starting or completing a grace period, so that the
> comparison that determines whether the CPU is missing the completion
> of a grace period is stable.
>
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
As already confirmed by Paul W and others, I too no longer see the rcu dumps
any more with above patch. Thanks a lot for the fix.

Regards
Santosh

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-24  9:41                               ` Shilimkar, Santosh
@ 2012-09-24 13:18                                 ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-24 13:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 24, 2012 at 03:11:34PM +0530, Shilimkar, Santosh wrote:
> On Sun, Sep 23, 2012 at 3:29 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
> >> On Sat, Sep 22, 2012 at 06:42:08PM +0000, Paul Walmsley wrote:
> >> > On Fri, 21 Sep 2012, Paul E. McKenney wrote:
> 
> [...]
> 
> >
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> >
> >                                                         Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> >  b/kernel/rcutree.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > rcu: Fix day-one dyntick-idle stall-warning bug
> >
> > Each grace period is supposed to have at least one callback waiting
> > for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
> > extra callback-free grace period is no big problem -- it will chew up
> > a tiny bit of CPU time, but it will complete normally.  In contrast,
> > CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
> > sleep indefinitely, in turn indefinitely delaying completion of the
> > callback-free grace period.  Given that nothing is waiting on this grace
> > period, this is also not a problem.
> >
> > Unless RCU CPU stall warnings are also enabled, as they are in recent
> > kernels.  In this case, if a CPU wakes up after at least one minute
> > of inactivity, an RCU CPU stall warning will result.  The reason that
> > no one noticed until quite recently is that most systems have enough
> > OS noise that they will never remain absolutely idle for a full minute.
> > But there are some embedded systems with cut-down userspace configurations
> > that get into this mode quite easily.
> >
> > All this begs the question of exactly how a callback-free grace period
> > gets started in the first place.  This can happen due to the fact that
> > CPUs do not necessarily agree on which grace period is in progress.
> > If a CPU still believes that the grace period that just completed is
> > still ongoing, it will believe that it has callbacks that need to wait
> > for another grace period, never mind the fact that the grace period
> > that they were waiting for just completed.  This CPU can therefore
> > erroneously decide to start a new grace period.
> >
> > Once this CPU notices that the earlier grace period completed, it will
> > invoke its callbacks.  It then won't have any callbacks left.  If no
> > other CPU has any callbacks, we now have a callback-free grace period.
> >
> > This commit therefore makes CPUs check more carefully before starting a
> > new grace period.  This new check relies on an array of tail pointers
> > into each CPU's list of callbacks.  If the CPU is up to date on which
> > grace periods have completed, it checks to see if any callbacks follow
> > the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
> > follow the RCU_WAIT_TAIL segment.  The reason that this works is that
> > the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
> > as soon as the CPU figures out that the old grace period has ended.
> >
> > This change is to cpu_needs_another_gp(), which is called in a number
> > of places.  The only one that really matters is in rcu_start_gp(), where
> > the root rcu_node structure's ->lock is held, which prevents any
> > other CPU from starting or completing a grace period, so that the
> > comparison that determines whether the CPU is missing the completion
> > of a grace period is stable.
> >
> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> As already confirmed by Paul W and others, I too no longer see the rcu dumps
> any more with above patch. Thanks a lot for the fix.

Glad it finally works!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 23:17                             ` Paul E. McKenney
@ 2012-09-24 21:54                               ` Paul Walmsley
  2012-09-24 22:00                                 ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul Walmsley @ 2012-09-24 21:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> On Sat, Sep 22, 2012 at 10:20:19PM +0000, Paul Walmsley wrote:
> > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> >
> > > This thing has been in the kernel since about 2004, not sure why you 
> > > didn't hit it earlier.
> > 
> > One other data point in that regard - noticed the warnings don't appear 
> > when the board is booted with:
> > 
> > commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
> > Author: Paul E. McKenney <paul.mckenney@linaro.org>
> > Date:   Tue Jun 5 15:53:53 2012 -0700
> > 
> >     rcu: Fix qlen_lazy breakage
> 
> You lost me on this one.  This is already in mainline, so if you were
> using (say) 3.6-rc6, you would already have this commit applied.

If I check out a kernel at this commit 
4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 + the 
zero-callback-in-tickless-idle diagnostic patch, build and boot it, then 
the stall warnings don't appear (in 25 minutes of testing).

Messages from the diagnostic patch indicate that the kernel is entering 
idle during a grace period with no RCU callbacks, though.

This is not a big deal and does not need any further attention.  Just 
wanted to place a time boundary on the point when these messages started 
appearing.  (It is unlikely to be an optimal bound: i.e., there are 
probably later commits where the warnings also don't appear.)

- Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-24 21:54                               ` Paul Walmsley
@ 2012-09-24 22:00                                 ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-09-24 22:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 24, 2012 at 09:54:00PM +0000, Paul Walmsley wrote:
> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > On Sat, Sep 22, 2012 at 10:20:19PM +0000, Paul Walmsley wrote:
> > > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> > >
> > > > This thing has been in the kernel since about 2004, not sure why you 
> > > > didn't hit it earlier.
> > > 
> > > One other data point in that regard - noticed the warnings don't appear 
> > > when the board is booted with:
> > > 
> > > commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
> > > Author: Paul E. McKenney <paul.mckenney@linaro.org>
> > > Date:   Tue Jun 5 15:53:53 2012 -0700
> > > 
> > >     rcu: Fix qlen_lazy breakage
> > 
> > You lost me on this one.  This is already in mainline, so if you were
> > using (say) 3.6-rc6, you would already have this commit applied.
> 
> If I check out a kernel at this commit 
> 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 + the 
> zero-callback-in-tickless-idle diagnostic patch, build and boot it, then 
> the stall warnings don't appear (in 25 minutes of testing).
> 
> Messages from the diagnostic patch indicate that the kernel is entering 
> idle during a grace period with no RCU callbacks, though.
> 
> This is not a big deal and does not need any further attention.  Just 
> wanted to place a time boundary on the point when these messages started 
> appearing.  (It is unlikely to be an optimal bound: i.e., there are 
> probably later commits where the warnings also don't appear.)

Ah, got it, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-09-22 21:59                             ` Paul E. McKenney
  2012-09-22 22:25                               ` Paul Walmsley
  2012-09-24  9:41                               ` Shilimkar, Santosh
@ 2012-10-01  8:55                               ` Linus Walleij
  2012-10-01 13:28                                 ` Paul E. McKenney
  2 siblings, 1 reply; 45+ messages in thread
From: Linus Walleij @ 2012-10-01  8:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Sep 22, 2012 at 11:59 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:

> rcu: Fix day-one dyntick-idle stall-warning bug

As mentioned in another thread this solves the same problem for ux500.
Reported/Tested-by: Linus Walleij <linus.walleij@linaro.org>

But now it appears that this commit didn't make it into v3.6 so
it definately needs to be tagged with Cc: stable at kernel.org
before it gets merged since the stall warnings are kinda scary.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 45+ messages in thread

* rcu self-detected stall messages on OMAP3, 4 boards
  2012-10-01  8:55                               ` Linus Walleij
@ 2012-10-01 13:28                                 ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-10-01 13:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 01, 2012 at 10:55:11AM +0200, Linus Walleij wrote:
> On Sat, Sep 22, 2012 at 11:59 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> 
> > rcu: Fix day-one dyntick-idle stall-warning bug
> 
> As mentioned in another thread this solves the same problem for ux500.
> Reported/Tested-by: Linus Walleij <linus.walleij@linaro.org>
> 
> But now it appears that this commit didn't make it into v3.6 so
> it definately needs to be tagged with Cc: stable at kernel.org
> before it gets merged since the stall warnings are kinda scary.

Ingo submitting this to Linus Torvalds earlier today, so we should be
able to send to stable shortly.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-10-01 13:28 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-12 22:51 rcu self-detected stall messages on OMAP3, 4 boards Paul Walmsley
2012-09-13  1:12 ` Paul E. McKenney
2012-09-13 18:52   ` Paul Walmsley
2012-09-20  0:03     ` Paul E. McKenney
2012-09-20  7:56       ` Paul Walmsley
2012-09-20 15:03         ` Bruce, Becky
2012-09-20 21:49         ` Bruce, Becky
2012-09-20 22:01           ` Paul E. McKenney
2012-09-20 22:47             ` Paul Walmsley
2012-09-20 23:21               ` Paul E. McKenney
2012-09-21 18:08                 ` Paul Walmsley
2012-09-21 18:58                   ` Paul E. McKenney
2012-09-21 19:11                     ` Paul Walmsley
2012-09-21 19:57                       ` Paul E. McKenney
2012-09-21 20:31                         ` Tony Lindgren
2012-09-21 22:03                           ` Paul E. McKenney
2012-09-22 15:45                             ` Frederic Weisbecker
2012-09-22 16:00                               ` Paul E. McKenney
2012-09-21 22:12                         ` Paul E. McKenney
2012-09-22 18:42                         ` Paul Walmsley
2012-09-22 20:10                           ` Paul E. McKenney
2012-09-22 21:59                             ` Paul E. McKenney
2012-09-22 22:25                               ` Paul Walmsley
2012-09-22 23:11                                 ` Paul E. McKenney
2012-09-23  7:55                                   ` Paul Walmsley
2012-09-23 12:11                                     ` Paul E. McKenney
2012-09-23  1:42                                 ` Paul Walmsley
2012-09-23  1:56                                   ` Paul E. McKenney
2012-09-23  2:01                                     ` Paul Walmsley
2012-09-24  9:41                               ` Shilimkar, Santosh
2012-09-24 13:18                                 ` Paul E. McKenney
2012-10-01  8:55                               ` Linus Walleij
2012-10-01 13:28                                 ` Paul E. McKenney
2012-09-21 18:59                   ` Paul Walmsley
2012-09-21 17:47               ` Paul Walmsley
2012-09-21 17:51                 ` Paul Walmsley
2012-09-21 21:20                 ` Paul E. McKenney
2012-09-21 22:41                   ` Paul Walmsley
2012-09-22  0:05                     ` Paul E. McKenney
2012-09-22 18:16                       ` Paul Walmsley
2012-09-22 19:52                         ` Paul E. McKenney
2012-09-22 22:20                           ` Paul Walmsley
2012-09-22 23:17                             ` Paul E. McKenney
2012-09-24 21:54                               ` Paul Walmsley
2012-09-24 22:00                                 ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).