Re: rcu stalls and soft lockups with recent kernels

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ion Badulescu <ionut@badula.org>, linux-kernel@vger.kernel.org
Subject: Re: rcu stalls and soft lockups with recent kernels
Date: Wed, 23 Mar 2016 08:36:09 -0700	[thread overview]
Message-ID: <20160323153609.GB4287@linux.vnet.ibm.com> (raw)
In-Reply-To: <1458708614.14170.11.camel@gmail.com>

On Wed, Mar 23, 2016 at 05:50:14AM +0100, Mike Galbraith wrote:
> (cc)
> 
> On Tue, 2016-03-22 at 16:22 -0400, Ion Badulescu wrote:
> > On 03/17/2016 10:28 PM, Mike Galbraith wrote:
> > > On Wed, 2016-03-16 at 12:15 -0400, Ion Badulescu wrote:
> > > > Just following up to my own email:
> > > > 
> > > > It turns out that we can eliminate the RCU stalls by changing from
> > > > CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu
> > > > handle its own RCU callbacks completely fixes the problems for us.
> > > > 
> > > > Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL is the default config
> > > > for fedora and rhel7. Ho-humm...
> > > 
> > > All RCU offloaded to CPU0 of a big box seems like a very bad idea.
> > 
> > 
> > It's not offloaded to CPU0, is it? Those rcuo* threads are not cpu-bound 
> > and can run on any cpu the scheduler will put them on. In any case, 
> > there was no indication that the rcuo* threads wanted to run but 
> > couldn't get cpu time.
> 
> Right, my thinker was screwed on cross-threaded.
> 
> > Anyway, looks like I spoke too soon. It's less often with 
> > RCU_NOCB_CPU_NONE than with RCU_NOCB_CPU_ALL, but the soft lockups and 
> > rcu stalls are still happening.
> > 
> > [44206.316711] clocksource: timekeeping watchdog: Marking clocksource 
> > 'tsc' as unstable because the skew is too large:
> > [44206.328463] clocksource:                       'hpet' wd_now: 
> > ffffffff wd_last: 5f03cdca mask: ffffffff
> > [44206.339037] clocksource:                       'tsc' cs_now: 
> > 64788b443c3a cs_last: 647840eea919 mask: ffffffffffffffff
> > [44206.351253] clocksource: Switched to clocksource hpet
> > [44922.301452] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [44922.307644]  0-...: (1 GPs behind) idle=53d/140000000000001/0 
> > softirq=8515474/8515477 fqs=6994
> > [44922.317435]  (detected by 1, t=21019 jiffies, g=2011397, c=2011396, 
> > q=3263)
> > [44922.325274] Task dump for CPU 0:
> > [44922.325276] python          R  running task        0 257113 257112 
> > 0x00080088   0 FAIR    1        0   152294    48373
> > [44922.325283]  ffffffff8152ca8e ffff881b76870000 ffff880e83669000 
> > 0000000000007d54
> > [44922.333671]  ffff881b1cdc7a48 ffff880a58a57e58 0000000000000086 
> > 0000000000000000
> > [44922.342060]  0000000000000000 0000000000003fa1 ffff880a58a54000 
> > ffff880a58a57e88
> > [44922.350446] Call Trace:
> > [44922.353215]  [] ? __schedule+0x38e/0xa90
> > [44922.359388]  [] ? rcu_eqs_enter_common+0x66/0x130
> > [44922.366437]  [] ? acct_account_cputime+0x1c/0x20
> > [44922.373388]  [] ? account_user_time+0x78/0x80
> > [44922.380045]  [] ? vtime_account_user+0x43/0x60
> > [44922.386801]  [] ? __context_tracking_exit+0x70/0xc0
> > [44922.394044]  [] ? enter_from_user_mode+0x1f/0x50
> > [44922.400994]  [] ? apic_timer_interrupt+0x69/0x90
> > [44923.210453] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! 
> > [python:257113]
> > [44923.218890] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd 
> > grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat 
> > fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt 
> > iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg l
> > pc_ich mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core 
> > ixgbe ptp pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm 
> > acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror 
> > dm_region_hash dm_log dm_mod
> > [44923.269289] CPU: 0 PID: 257113 Comm: python Tainted: G          I 
> > 4.4.5-el6.ia32e.lime.0 #1
> > [44923.279089] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS 
> > SE5C610.86B.01.01.0011.081020151200 08/10/2015
> > [44923.290824] task: ffff880e83669000 ti: ffff880a58a54000 task.ti: 
> > ffff880a58a54000
> > [44923.299253] RIP: 0033:[<00002b5ee4e0502d>]  [<00002b5ee4e0502d>] 
> > 0x2b5ee4e0502d
> > [44923.307508] RSP: 002b:00007ffe4120b170  EFLAGS: 00000212
> > [44923.313494] RAX: 0000000000000008 RBX: 0000000003fe1480 RCX: 
> > 0000000003fc9b00
> > [44923.321513] RDX: 00002b5ee34b4260 RSI: 0000000000002248 RDI: 
> > 00000000000000d4
> > [44923.329530] RBP: 0000000003fe1800 R08: 0000000000000078 R09: 
> > 0000000000000800
> > [44923.337548] R10: 00007ffe4120b550 R11: 0000000000011240 R12: 
> > 00002b5ee34ba938
> > [44923.345566] R13: 00002b5ee34c1010 R14: 00007ffe4120b428 R15: 
> > 0000000000000400
> > [44923.353580] FS:  00002b5ec7fe98c0(0000) GS:ffff88103fc00000(0000) 
> > knlGS:0000000000000000
> > [44923.362689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [44923.369147] CR2: 00002b5ee4190000 CR3: 00000005ac81b000 CR4: 
> > 00000000001406f0
> > 
> > That rcu_eqs_enter_common function seems to be a fairly common 
> > occurrence in these stack traces. Not sure if it means anything, though.
> > 
> > Also, this seems to be a sock lockup with RIP in userspace. Does it mean 
> > timer interrupts are disabled? Somehow it fails to reschedule the NMI timer.
> > 
> > We're at our wits' end here...
> > 
> > This, btw, is a 2x12 core Haswell box.

This one would definitely make Salvador Dali proud!

All I can suggest is to reconfigure to get the highest error rate, then
bisect.

							Thanx, Paul

     prev parent reply	other threads:[~2016-03-23 15:36 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <56B39F0C.7070406@badula.org>
2016-02-04 19:12 ` rcu stalls and soft lockups with recent kernels Ion Badulescu
2016-03-16 16:15   ` Ion Badulescu
2016-03-18  2:28     ` Mike Galbraith
2016-03-22 20:22       ` Ion Badulescu
2016-03-23  4:50         ` Mike Galbraith
2016-03-23 15:36           ` Paul E. McKenney [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160323153609.GB4287@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=ionut@badula.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=umgwanakikbuti@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox