* Re: Bug report for RCU stalled warning [3.10.69]
       [not found] <20171011042139.GA5038@udknight>
@ 2017-10-12 20:38 ` Paul E. McKenney
  2017-10-14 12:51   ` Paul E. McKenney
From: Paul E. McKenney @ 2017-10-12 20:38 UTC (permalink / raw)
  To: Wang YanQing; +Cc: linux-kernel

[ Adding LKML on CC so that others can find this. ]

On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> Hi, Paul McKenney.
> 
> I have received many reports of machines that stopped responding. After
> rebooting and inspecting the messages, all of them show RCU stalls, but I
> can't figure out how to fix it. The painful point is that I can't update
> the kernel, so I need to fix it in 3.10. I have attached four messages that
> come from different CPUs and boards (so I guess it is a bug rather than a
> hardware fault). Any suggestions are welcome.

The first step is of course to report this to your distro, as they are
the ones who do the care and feeding of such old kernels.  Please include
the information below in that report, as it might help your distro find
and fix the problem.

It looks like the stalled CPU is idle, and that the activity resulting
from the stall-warning message gets things going again.  Callbacks are
being processed, so no OOM.  But you are getting the splat every 60
seconds.  The system has only two CPUs, and is x86.

If you cannot upgrade the kernel, my ability to help is limited.  And the
diagnostics printed with the v3.10 CPU stall warnings are also quite
limited.  However, there are some things you could try as workarounds:

1.	Check to make sure that the rcu_sched kthread is getting
	the CPU time that it needs.  Preventing this kthread from
	running would create exactly this output, assuming that
	the stall warning got it going again temporarily.

2.	It looks like the disturbance of the RCU CPU stall warning
	is getting things going again.  Try artificially providing
	this disturbance, for example, by running a usermode program
	or script that runs on each CPU in turn, then sleeps for
	(say) five seconds.

3.	If you can reconfigure your kernel, try building with
	CONFIG_RCU_FAST_NO_HZ=n.

4.	Was the system running reliably on some earlier version?
	If so, consider reverting to that version, and include
	the version information in your report to your distro.  If
	your distro provides individual patches, you should consider
	bisecting so as to locate the offending patch.
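
Workarounds 1 and 2 above can be sketched together in a small userspace
script.  This is only a minimal sketch, assuming Python 3.3 or later is
available on the affected machine and that the grace-period kthread's
comm is "rcu_sched" as named above.  The function names
(parse_cpu_ticks, find_kthread, touch_each_cpu, run) are hypothetical
helpers invented for illustration, not part of any kernel tooling.

```python
#!/usr/bin/env python3
"""Sketch: run on each CPU in turn, sleep, and report rcu_sched CPU time."""
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/stat line."""
    # The comm field is parenthesised and may itself contain spaces or ')',
    # so split after the last ')'.  What remains starts at field 3 (state),
    # which puts utime (field 14) and stime (field 15) at indexes 11 and 12.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])


def find_kthread(name="rcu_sched"):
    """Return the PID of the first task whose comm equals name, else None."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    return int(entry)
        except OSError:
            continue  # the task exited while we were scanning
    return None


def touch_each_cpu():
    """Migrate this process to each CPU it is allowed to use, one at a time."""
    allowed = os.sched_getaffinity(0)
    try:
        for cpu in sorted(allowed):
            os.sched_setaffinity(0, {cpu})
    finally:
        os.sched_setaffinity(0, allowed)  # restore the original CPU mask


def run(interval=5.0, rounds=None):
    """Disturb every CPU, print rcu_sched's CPU time, sleep, and repeat.

    rounds=None loops until interrupted; an integer limits the iterations.
    """
    pid = find_kthread("rcu_sched")
    done = 0
    while rounds is None or done < rounds:
        touch_each_cpu()
        if pid is not None:
            with open("/proc/%d/stat" % pid) as f:
                print("rcu_sched pid=%d cpu_ticks=%d"
                      % (pid, parse_cpu_ticks(f.read())), flush=True)
        time.sleep(interval)
        done += 1
```

Calling run() with no arguments loops until interrupted, using the
five-second sleep suggested in workaround 2.  If the printed tick count
never increases while stall warnings keep appearing, that would be
consistent with the kthread not getting the CPU time it needs, although
it does not by itself prove it.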

Good luck with it!

							Thanx, Paul

* Re: Bug report for RCU stalled warning [3.10.69]
  2017-10-12 20:38 ` Bug report for RCU stalled warning [3.10.69] Paul E. McKenney
@ 2017-10-14 12:51   ` Paul E. McKenney
From: Paul E. McKenney @ 2017-10-14 12:51 UTC (permalink / raw)
  To: Wang YanQing; +Cc: linux-kernel

On Thu, Oct 12, 2017 at 01:38:24PM -0700, Paul E. McKenney wrote:
> [ Adding LKML on CC so that others can find this. ]
> 
> On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> > Hi, Paul McKenney.
> > 
> > I have received many reports of machines that stopped responding. After
> > rebooting and inspecting the messages, all of them show RCU stalls, but I
> > can't figure out how to fix it. The painful point is that I can't update
> > the kernel, so I need to fix it in 3.10. I have attached four messages that
> > come from different CPUs and boards (so I guess it is a bug rather than a
> > hardware fault). Any suggestions are welcome.
> 
> The first step is of course to report this to your distro, as they are
> the ones who do the care and feeding of such old kernels.  Please include
> the information below in that report, as it might help your distro find
> and fix the problem.
> 
> It looks like the stalled CPU is idle, and that the activity resulting
> from the stall-warning message gets things going again.  Callbacks are
> being processed, so no OOM.  But you are getting the splat every 60
> seconds.  The system has only two CPUs, and is x86.
> 
> If you cannot upgrade the kernel, my ability to help is limited.  And the
> diagnostics printed with the v3.10 CPU stall warnings are also quite
> limited.  However, there are some things you could try as workarounds:
> 
> 1.	Check to make sure that the rcu_sched kthread is getting
> 	the CPU time that it needs.  Preventing this kthread from
> 	running would create exactly this output, assuming that
> 	the stall warning got it going again temporarily.
> 
> 2.	It looks like the disturbance of the RCU CPU stall warning
> 	is getting things going again.  Try artificially providing
> 	this disturbance, for example, by running a usermode program
> 	or script that runs on each CPU in turn, then sleeps for
> 	(say) five seconds.
> 
> 3.	If you can reconfigure your kernel, try building with
> 	CONFIG_RCU_FAST_NO_HZ=n.

And if you can reconfigure your kernel, then in v3.10, building with
CONFIG_RCU_CPU_STALL_INFO=y and CONFIG_RCU_CPU_STALL_VERBOSE=y will provide
more information on the CPUs and tasks stalling the grace period.
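
Before rebuilding, it may help to check how the running kernel was
configured.  The following is a minimal sketch that compares a kernel
config file against the three options mentioned in this thread.  It
assumes the config text is available somewhere such as
/boot/config-<version>, which varies by distro.  The names WANTED,
parse_config, and mismatches are hypothetical and exist only for this
illustration.

```python
import re

# Settings suggested in this thread; "n" means the option should be unset.
WANTED = {
    "CONFIG_RCU_FAST_NO_HZ": "n",
    "CONFIG_RCU_CPU_STALL_INFO": "y",
    "CONFIG_RCU_CPU_STALL_VERBOSE": "y",
}


def parse_config(text):
    """Map option name to value; '# CONFIG_X is not set' is recorded as 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        unset = re.match(r"# (CONFIG_\w+) is not set$", line)
        if unset:
            options[unset.group(1)] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def mismatches(options, wanted=WANTED):
    """Return {name: (actual, wanted)} for each option that differs.

    An option that does not appear in the config at all is treated as 'n'.
    """
    return {
        name: (options.get(name, "n"), want)
        for name, want in wanted.items()
        if options.get(name, "n") != want
    }
```

For example, mismatches(parse_config(open(path).read())) returns an
empty dict when all three options already match, where path is whatever
location holds your kernel's config.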

							Thanx, Paul

> 4.	Was the system running reliably on some earlier version?
> 	If so, consider reverting to that version, and include
> 	the version information in your report to your distro.  If
> 	your distro provides individual patches, you should consider
> 	bisecting so as to locate the offending patch.
> 
> Good luck with it!
> 
> 							Thanx, Paul
