Re: Need your help on rcu_bh_state stall on CPU 0 (t=0 jiffies)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Sumit Gemini <sumit.gemini0@gmail.com>
Cc: ralf.hildebrandt@charite.de, linux-kernel@vger.kernel.org
Subject: Re: Need your help on rcu_bh_state stall on CPU 0 (t=0 jiffies)
Date: Mon, 20 Feb 2017 19:01:05 -0800	[thread overview]
Message-ID: <20170221030105.GA30506@linux.vnet.ibm.com> (raw)
In-Reply-To: <CAD6qDH7GY6xYDtZAF3LWFFYZZuusj7UuQg43WhEc44jXBfeNpA@mail.gmail.com>

On Tue, Feb 21, 2017 at 06:36:34AM +0530, Sumit Gemini wrote:
> Hi Pual,
> 
> Thanks for your time.
> 
> we are using HA pair with TIPC and opensf functionality. Now problem is on
> active machine TIPC link lost message was observed on 06:45:00 but standby
> machine detected this link lost at 06:45:06 because  at 06:45:00 standby
> machine having rcu_bh_state (CPU stall). Once standby machine detected link
> lost, it fired reboot command for active machine with the help of opensf at
> 06:45:07 to make itself active and standby to previously active machine.
> 
> I raised this matter before TIPC open source community, and they told
> rcu_bh_state was the reason for delaying to get link down event at standby
> end.
> 
> In normal case TIPC link broken detects in < 2 second. But in our case took
> 7 second because of that stall. It is first time, we observed such
> situation.
> 
> Now if we suppres this warning, will we observe this delay again?

That depends.  If the delay was due to the NMIs and stack-trace printing,
suppressing the warning will make the delay go away.  On the other hand,
if there really is a stall independent of the warning, then the delay
will still be there.  Why not try it and find out?

> Can you enlighten on this stall. Why does this stall happen at ideal
> machine? And if possible to debug it, what we need to do?
> 
> Any pointers will be highly appreciated.

I suggest reading Documentation/stallwarn.txt, especially the list
of possible causes near the end:

	https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt

If the delay persist when you suppress the warnings, then I suggest that
you use something like ftrace to gather information on the delay.

							Thanx, Paul

> Thanks & Regards
> ~Sumit Gemini
> 
> 
> On Feb 21, 2017 2:28 AM, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> wrote:
> 
> > On Mon, Feb 20, 2017 at 03:26:32PM +0530, Sumit Gemini wrote:
> > > Hi Paul,
> > >
> > >
> > > I got you while reading 'http://www.serverphorums.com/read.php?12,468996".
> > I'm
> > > facing same problem but not able to detect what exectly the cause of
> > > it.
> > >
> > >
> > > I'm observed weird problem, I got rcu_bh_state on cpu0 but my system
> > > was in ideal state when i got rcu_bh_state (Cpu stall) problem. After
> > > 7 second my system recovered itself from stall to active. But i did
> > > not get what happen at that time.
> >
> > One thing that you might look into below...
> >
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state
> > detected stall on CPU 0 (t=0 jiffies)
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending NMI to all
> > CPUs:
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI backtrace for cpu
> > 0
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules linked in:
> > nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit
> > ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port
> > xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle
> > ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter
> > ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr
> > twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic
> > cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null
> > af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables
> > iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding
> > binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat
> > nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf
> > edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp
> > nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas
> > sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button
> > container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor
> > thermal_sys [last unloaded: ipt_PORTMAP]
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042]
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0, comm: swapper
> > Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP:
> > 0010:[<ffffffff81007f51>] [<ffffffff81007f51>] native_read_tsc+0x2/0xf
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP:
> > 0018:ffff88043ee03db0 EFLAGS: 00000803
> > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX: 0000000037185395
> > RBX: 00000000000003e9 RCX: 0000000000000001
> > > Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on active
> > controller f2 detected at standby immd!! f1. Possible failover
> > > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking applications
> > node id:2020f ret val:0
> > > Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface status:0
> > Role:STANDBY
> > > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role: STANDBY
> > > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF NodeId =
> > 131599 EE Name = , Reason: Received Node Down for peer controller, >
> > OwnNodeId = 131343, SupervisionTime = 60
> >
> > Is this "WA IMMND DOWN" a normal occurrence?  Perhaps it is pointing
> > at a root cause.
> >
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX: 0000000000bf0977
> > RSI: 0000000000000002 RDI: 0000000000032bd4
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP: 0000000000032bd4
> > R08: 0000000000000000 R09: ffffffff819232b0
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10: 7fffffffffffffff
> > R11: 7fffffffffffffff R12: 0000000000000000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13: ffffffff819232b0
> > R14: 0000000000000001 R15: ffffffff81065c28
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS:
> > 0000000000000000(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010 DS: 0000 ES:
> > 0000 CR0: 000000008005003b
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2: 000000000069e034
> > CR3: 0000000001805000 CR4: 00000000000406f0
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0: 0000000000000000
> > DR1: 0000000000000000 DR2: 0000000000000000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3: 0000000000000000
> > DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process swapper (pid:
> > 0, threadinfo ffffffff81800000, task ffffffff8180d020)
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack:
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] ffffffff81200eb5
> > ffffffff81200f44 00000000000003e9 0000000000001000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002
> > ffffffff819232b0 ffffffff81017698 7fffffffffffffff
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002
> > 0000000000000002 ffffffff81017fdf 0000000000000001
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call Trace:
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200eb5>]
> > paravirt_read_tsc+0x5/0x8
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200f44>]
> > delay_tsc+0x1d/0x68
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017698>]
> > native_safe_apic_wait_icr_idle+0x27/0x32
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017fdf>]
> > default_send_IPI_dest_field.constprop.0+0x19/0x4d
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101804b>]
> > default_send_IPI_mask_sequence_phys+0x38/0x6a
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101815e>]
> > arch_trigger_all_cpu_backtrace+0x4d/0x7b
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109801b>]
> > check_cpu_stall+0x66/0xdb
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff810980aa>]
> > rcu_pending+0x1a/0x10a
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109852c>]
> > rcu_check_callbacks+0x9d/0xae
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8104c56c>]
> > update_process_times+0x31/0x63
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81065c92>]
> > tick_sched_timer+0x6a/0x90
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105b872>]
> > __run_hrtimer+0xa4/0x148
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105c08e>]
> > hrtimer_interrupt+0xdb/0x19a
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017768>]
> > smp_apic_timer_interrupt+0x6e/0x80
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff813efcde>]
> > apic_timer_interrupt+0x6e/0x80
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81239655>]
> > intel_idle+0xdd/0x104
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81304773>]
> > cpuidle_idle_call+0xda/0x169
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81001200>]
> > cpu_idle+0x51/0x95
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193db0f>]
> > start_kernel+0x388/0x393
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193d3af>]
> > x86_64_start_kernel+0xcf/0xdc
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74 03 e6 80 c3
> > e6 ed c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6 70 e4 71 c3
> > 40 88 f0 e6 70 40 88 f8 e6 71 c3 0f 31
> > > Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89 d0 48 c1 e0
> > 20 48 09 c8 c3 41 56 41 55 41 54 55 53
> > >
> > > .
> > > .
> > > .
> > > .
> > > .
> > > .
> > >
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI backtrace for cpu
> > 31
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules linked in:
> > nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit
> > ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port
> > xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle
> > ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter
> > ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr twofish_x86_64
> > twofish_common camellia serpent > blowfish cast5 des_generic cbc xcbc
> > rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key
> > iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw
> > iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc
> > iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat
> > nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf
> > edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp
> > nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas
> > sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button
> > container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor
> > thermal_sys [last unloaded: ipt_PORTMAP]
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734]
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0, comm:
> > kworker/0:1 Tainted: P 3.1.10-gb17-default #1 Intel Corporation
> > S2600CO/S2600CO
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP:
> > 0010:[<ffffffff81239624>] [<ffffffff81239624>] intel_idle+0xac/0x104
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP:
> > 0018:ffff880425e33eb8 EFLAGS: 00000046
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX: 0000000000000030
> > RBX: 0000000000000010 RCX: 0000000000000001
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX: 0000000000000000
> > RSI: ffff880425e33fd8 RDI: ffffffff81810500
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP: 0000000000000030
> > R08: 000000000000006d R09: 0000000000034b10
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10: ffff88083eded830
> > R11: ffff88083eded830 R12: 149739342cb2ca49
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13: 0000000000000004
> > R14: 000000000000001f R15: 0000000000000000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS:
> > 0000000000000000(0000) GS:ffff88083ede0000(0000) knlGS:0000000000000000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010 DS: 0000 ES:
> > 0000 CR0: 000000008005003b
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2: 00007fe89df67120
> > CR3: 0000000001805000 CR4: 00000000000406e0
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0: 0000000000000000
> > DR1: 0000000000000000 DR2: 0000000000000000
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3: 0000000000000000
> > DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process kworker/0:1
> > (pid: 0, threadinfo ffff880425e32000, task ffff880425e30580)
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack:
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link
> > <1.1.1:bond0-1.1.2:bond0>, requested by peer
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost link
> > <1.1.1:bond0-1.1.2:bond0> on network plane A
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost contact
> > with <1.1.2>
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000
> > 000000000cdd3a47 0000000000000000 000000000cdd3a47
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] ffff880425e33fd8
> > 0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000
> > ffffffff81304773 ffffffff819232b0 ffff880425e33fd8
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>]
> > cpuidle_idle_call+0xda/0x169
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>]
> > cpu_idle+0x51/0x95
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28 e0 ff ff 80
> > e2 08 75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff
> > ff a8 08 75 08 b1 01 48 89 e8 0f 01 c9 <e8> 3f 6e e2 ff 4c 29 e0 48 89 c7
> > e8 10 ae e0 ff 48 69 e8 40 42
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>]
> > cpuidle_idle_call+0xda/0x169
> > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>]
> > cpu_idle+0x51/0x95
> >
> > Since NMIs were sent to all CPUs, it is not surprising that one was found
> > to be idle.  Or is there something that I am missing in CPU 31's
> > backtrace?
> >
> > > I'm using suse kernel 3.1.10
> >
> > That is going back a ways!
> >
> > There are quite a few patches to RCU CPU stall warnings that could
> > be applied.  It might be easier to move to a newer kernel.
> >
> > Alternatively, if you are convinced that these stall warnings are
> > false positives, you could boot with the rcu_cpu_stall_suppress
> > kernel boot parameter set, or rebuild the kernel with a much larger
> > value of the RCU_CPU_STALL_TIMEOUT Kconfig option.
> >
> >                                                         Thanx, Paul
> >
> >

next prev parent reply	other threads:[~2017-02-21  3:01 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAD6qDH7j1Fr6jFsuaeJm+xGX8Q41m7nDJb7C329jF65yrO5a7g@mail.gmail.com>
2017-02-20 20:58 ` Need your help on rcu_bh_state stall on CPU 0 (t=0 jiffies) Paul E. McKenney
     [not found]   ` <CAD6qDH7GY6xYDtZAF3LWFFYZZuusj7UuQg43WhEc44jXBfeNpA@mail.gmail.com>
2017-02-21  3:01     ` Paul E. McKenney [this message]
     [not found]       ` <CAD6qDH6iQzf29ZzgFHxUCwRtbdvRCW2C5a4nvuYzhhwHgnOnPQ@mail.gmail.com>
2017-02-21  8:05         ` Paul E. McKenney
     [not found]           ` <CAD6qDH6piF-tyTUjbGj679EBg_SneOUnC6jzxB24r_qy79dYvQ@mail.gmail.com>
     [not found]             ` <20170221195753.GF30506@linux.vnet.ibm.com>
     [not found]               ` <CAD6qDH5OrQdKJNKb1EENvWMyZOZK7kQ-pHJQL5q9DaKKJbrOpw@mail.gmail.com>
2017-02-22  0:39                 ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170221030105.GA30506@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ralf.hildebrandt@charite.de \
    --cc=sumit.gemini0@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.