From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Subject: Re: RCU stalls on 32-bit pmac SMP
Date: Tue, 19 Jun 2012 07:19:44 -0700
Message-ID: <20120619141943.GA4127@linux.vnet.ibm.com>
In-Reply-To: <20120618215857.GG2400@linux.vnet.ibm.com>

On Mon, Jun 18, 2012 at 02:58:57PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 19, 2012 at 06:45:31AM +1000, Benjamin Herrenschmidt wrote:
> > Hi Paul !
> >
> > On Mon, 2012-06-18 at 10:05 -0700, Paul E. McKenney wrote:
> >
> > > > sd 0:0:0:0: Attached scsi generic sg0 type 0
> > > > sda: [mac] sda1 sda2 sda3 sda4 sda5 sda6 sda7 sda8 sda9 sda10 sda11 sda12 sda13 sda14 sda15
> > > > sd 0:0:0:0: [sda] Attached SCSI disk
> > > > scsi 0:0:1:0: Direct-Access ATA ST380021A 3.75 PQ: 0 ANSI: 5
> > > > sd 0:0:1:0: [sdb] 156301488 512-byte logical blocks: (80.0 GB/74.5 GiB)
> > > > sd 0:0:1:0: [sdb] Write Protect is off
> > > > sd 0:0:1:0: Attached scsi generic sg1 type 0
> > > > sd 0:0:1:0: [sdb] Mode Sense: 00 3a 00 00
> > > > sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> > > > INFO: rcu_sched self-detected stall on CPU { 0} (t=16163 jiffies)
> > > > Call Trace:
> > > > INFO: rcu_sched self-detected stall on CPU { 1} (t=16163 jiffies)
> > > > Call Trace:
> > > > [ef877d30] [c0008d04] show_stack+0x50/0x158 (unreliable)
> > > > [ef877d70] [c0097fe4] __rcu_pending+0x184/0x46c
> > > > [ef877da0] [c00991a0] rcu_check_callbacks+0x7c/0x168
> > > > [ef877dc0] [c0044a40] update_process_times+0x3c/0x70
> > > > [ef877de0] [c0083a3c] tick_sched_timer+0x88/0x100
> > > > [ef877e10] [c005b11c] __run_hrtimer.clone.29+0x54/0x104
> > > > [ef877e30] [c005bf44] hrtimer_interrupt+0x158/0x3f8
> > > > [ef877ea0] [c000b5c4] timer_interrupt+0x1cc/0x204
> > > > [ef877ed0] [c0011b88] ret_from_except+0x0/0x1c
> > > > --- Exception: 901 at cpu_idle+0xe4/0x188
> > >
> > > Am I reading this correctly? Did the system really take a scheduling-clock
> > > interrupt from within another scheduling-clock interrupt?
> >
> > Not that I can see. I see only one interrupt on that CPU. It was idle,
> > took a timer interrupt -> hrtimer ... etc... RCU boom !
> >
> > > If so, my first question is "Why did the scheduling-clock interrupt
> > > run for so long?" ;-)
> >
> > This is still that CPU:
> >
> > > > LR = cpu_idle+0xc8/0x188
> > > > [ef877f90] [c00097e8] cpu_idle+0x60/0x188 (unreliable)
> > > > [ef877fc0] [c046531c] start_secondary+0x2c8/0x2cc
> > > > [ef877ff0] [00003278] 0x3278
> >
> > And now we have the other one:
> >
> > > > [ef873b60] [c0008d04] show_stack+0x50/0x158 (unreliable)
> > > > [ef873ba0] [c0097fe4] __rcu_pending+0x184/0x46c
> > > > [ef873bd0] [c0099240] rcu_check_callbacks+0x11c/0x168
> > > > [ef873bf0] [c0044a40] update_process_times+0x3c/0x70
> > > > [ef873c10] [c0083a3c] tick_sched_timer+0x88/0x100
> > > > [ef873c40] [c005b11c] __run_hrtimer.clone.29+0x54/0x104
> > > > [ef873c60] [c005bf44] hrtimer_interrupt+0x158/0x3f8
> > > > [ef873cd0] [c000b5c4] timer_interrupt+0x1cc/0x204
> > > > [ef873d00] [c0011b88] ret_from_except+0x0/0x1c
> > > > --- Exception: 901 at wake_up_new_task+0x134/0x16c
> > > > LR = wake_up_new_task+0x134/0x16c
> > > > [ef873dc0] [c0065f08] wake_up_new_task+0xfc/0x16c (unreliable)
> > > > [ef873df0] [c0035530] do_fork+0xe8/0x2bc
> > > > [ef873e30] [c0008a4c] sys_clone+0x50/0x90
> > > > [ef873e50] [c00114b8] ret_from_syscall+0x0/0x40
> > > > --- Exception: c00 at kernel_thread+0x28/0x68
> > > > LR = __call_usermodehelper+0x40/0xdc
> > > > [ef873f10] [c005f5c8] async_run_entry_fn+0x128/0x1e4 (unreliable)
> > > > [ef873f20] [ef878c00] 0xef878c00
> > > > [ef873f40] [c004e668] process_one_work+0x150/0x3f0
> > > > [ef873f70] [c00514f0] worker_thread+0x18c/0x37c
> > > > [ef873fb0] [c00567bc] kthread+0x84/0x88
> > > > [ef873ff0] [c000f514] kernel_thread+0x4c/0x68
> >
> > Here what we have is a kernel thread trying to call_usermodehelper
> > (probably the hotplug stuff from discovering the disk) taking a timer
> > interrupt from wake_up_new_task.
> >
> > Looks like the hang has to be with some RCU stuff update_process_times
> > does vs. taking the interrupt in wake_up_new_task ? Hard to tell...
>
> The RCU stuff from update_process_times() is usually the diagnostic.
>
> > This happens more/less reliably once at boot on this machine, then no
> > more (after the long pause the machine moves on, apparently fine).
>
> What happens if you reduce the stall-warning timeout? You can use either
> the CONFIG_RCU_CPU_STALL_TIMEOUT kernel config parameter or the
> rcutree.rcu_cpu_stall_timeout boot parameter.
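
For reference, a minimal sketch of the two settings named above. The
timeout is expressed in seconds; the value 10 is only an illustrative
assumption, not a recommendation for this machine.

```
# Option 1: append to the kernel command line in the bootloader
rcutree.rcu_cpu_stall_timeout=10

# Option 2: set at build time in .config
CONFIG_RCU_CPU_STALL_TIMEOUT=10
```

Either one should make the stall warning fire sooner, which makes it
more likely that a second warning appears before the hang clears.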
>
> If you can get two stall-warning messages to appear during the hang,
> comparing the stacks is usually informative.
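
If two traces are captured, a rough way to compare them is sketched
below. This is only a helper under stated assumptions, not an existing
kernel tool: the names parse_frames() and diff_stacks() are hypothetical,
and the regular expression assumes the 32-bit powerpc frame format seen
in the traces quoted above. The sample input reuses three frames from
each of the two traces in this thread purely to exercise the parser.

```python
import re

# Matches frame lines such as:
#   [ef877d70] [c0097fe4] __rcu_pending+0x184/0x46c
# search() is used, so leading quote markers ("> > ") are ignored.
FRAME_RE = re.compile(
    r"\[[0-9a-f]{8}\]\s+\[[0-9a-f]{8}\]\s+([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+"
)


def parse_frames(trace_text):
    """Return a list of (symbol, offset) tuples, one per recognised frame."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group(1), match.group(2)))
    return frames


def diff_stacks(first, second):
    """Split frames into those present in both traces and those in only one."""
    a, b = parse_frames(first), parse_frames(second)
    return {
        "same": [f for f in a if f in b],
        "only_first": [f for f in a if f not in b],
        "only_second": [f for f in b if f not in a],
    }


if __name__ == "__main__":
    first = """\
[ef877d70] [c0097fe4] __rcu_pending+0x184/0x46c
[ef877da0] [c00991a0] rcu_check_callbacks+0x7c/0x168
[ef877dc0] [c0044a40] update_process_times+0x3c/0x70
"""
    second = """\
[ef873ba0] [c0097fe4] __rcu_pending+0x184/0x46c
[ef873bd0] [c0099240] rcu_check_callbacks+0x11c/0x168
[ef873bf0] [c0044a40] update_process_times+0x3c/0x70
"""
    for key, frames in diff_stacks(first, second).items():
        print(key, frames)
```

Frames that land in "same" across two successive warnings suggest the
CPU has not moved, while frames that differ show where it has made
progress between the two warnings.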

Also, would you be willing to send along your .config?

							Thanx, Paul