From: Corey Minyard <minyard@acm.org>
To: "Janne Huttunen (Nokia)" <janne.huttunen@nokia.com>
Cc: "stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: IPMI related kernel panics since v4.19.286
Date: Mon, 19 Jun 2023 07:52:28 -0500 [thread overview]
Message-ID: <ZJBPjOL8chqtPck2@mail.minyard.net> (raw)
In-Reply-To: <7ae67dbec16b93f0e6356337e52bf21921b0897c.camel@nokia.com>
On Mon, Jun 19, 2023 at 11:11:16AM +0000, Janne Huttunen (Nokia) wrote:
>
> We recently updated an internal test server from kernel v4.19.273
> to v4.19.286 and since then it has already multiple times triggered
> a kernel panic due to a hard lockup. The lockups look e.g. like
> this:
It looks like
b4a34aa6d "ipmi: Fix how the lower layers are told to watch for messages"
was backported to fullfill a dependency for another backport, but there
was another change:
e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately when not needed"
That is needed to avoid calling a lower layer function with
xmit_msgs_lock held. It doesn't apply completely cleanly because of
other changes, but you just need to leave in the free_user_work()
function and delete the other function in the conflict. In addition to
that, you will also need:
383035211c79 "ipmi: move message error checking to avoid deadlock"
to fix a bug in that change.
Can you try this out?
Thanks,
-corey
>
> [29397.950589] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950590] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
> [29397.950591] RSP: 0000:ffff93d07f703de8 EFLAGS: 00000002
> [29397.950591] RAX: 0000000000000101 RBX: ffff93cf90087220 RCX: 0000000000000006
> [29397.950592] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff93cf90087220
> [29397.950592] RBP: ffff93d07f703de8 R08: 0000000000000000 R09: ffff93d07f71fb70
> [29397.950593] R10: ffff93d07f71fb28 R11: ffff93d07f703ee8 R12: 0000000000000002
> [29397.950593] R13: ffff93cf852cde58 R14: ffff93cf852cc000 R15: 0000000000000003
> [29397.950594] ? native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950594] ? native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950594] </NMI>
> [29397.950595] <IRQ>
> [29397.950595] _raw_spin_lock_irqsave+0x46/0x50
> [29397.950595] set_need_watch+0x2d/0x70 [ipmi_si]
> [29397.950596] ? _raw_spin_lock_irqsave+0x25/0x50
> [29397.950596] ipmi_timeout+0x2b4/0x530 [ipmi_msghandler]
> [29397.950597] ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
> [29397.950597] call_timer_fn+0x30/0x130
> [29397.950597] ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
> [29397.950598] run_timer_softirq+0x1ce/0x3f0
> [29397.950598] ? ktime_get+0x40/0xa0
> [29397.950598] ? sched_clock+0x9/0x10
> [29397.950599] ? sched_clock_cpu+0x11/0xc0
> [29397.950599] __do_softirq+0x104/0x32d
> [29397.950600] ? sched_clock_cpu+0x11/0xc0
> [29397.950600] irq_exit+0x11b/0x120
> [29397.950600] smp_apic_timer_interrupt+0x79/0x140
> [29397.950601] apic_timer_interrupt+0xf/0x20
> [29397.950601] </IRQ>
>
>
> And like this:
>
> [16944.269585] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269586] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
> [16944.269587] RSP: 0018:ffff9a4e40b03dd0 EFLAGS: 00000002
> [16944.269588] RAX: 0000000000580101 RBX: ffff9a4ac5089e98 RCX: ffff9a2b9574b538
> [16944.269588] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a4ac5089e98
> [16944.269589] RBP: ffff9a4e40b03dd0 R08: ffff9a4ac5089e58 R09: ffff9a4e40b1fb70
> [16944.269589] R10: ffff9a4e40b1fb28 R11: ffff9a4e40b03ee8 R12: 0000000000000002
> [16944.269590] R13: ffff9a4ac5089e98 R14: ffff9a4ac5089e58 R15: ffff9a4ac5089e54
> [16944.269590] ? native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269591] ? native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269591] </NMI>
> [16944.269592] <IRQ>
> [16944.269592] _raw_spin_lock_irqsave+0x46/0x50
> [16944.269592] ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
> [16944.269593] smi_event_handler+0x26c/0x650 [ipmi_si]
> [16944.269593] smi_timeout+0x46/0xc0 [ipmi_si]
> [16944.269594] ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
> [16944.269594] call_timer_fn+0x30/0x130
> [16944.269595] ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
> [16944.269595] run_timer_softirq+0x1ce/0x3f0
> [16944.269595] ? ktime_get+0x40/0xa0
> [16944.269596] ? sched_clock+0x9/0x10
> [16944.269596] ? sched_clock_cpu+0x11/0xc0
> [16944.269597] __do_softirq+0x104/0x32d
> [16944.269597] ? sched_clock_cpu+0x11/0xc0
> [16944.269597] irq_exit+0x11b/0x120
> [16944.269598] smp_apic_timer_interrupt+0x79/0x140
> [16944.269598] apic_timer_interrupt+0xf/0x20
> [16944.269599] </IRQ>
>
>
> To me these would look like an ordering violation between
> "smi_info->si_lock" and "intf->xmit_msgs_lock", probably
> introduced by this commit:
>
> commit b4a34aa6dfbca67610e56ad84a3595f537c85af9
> Author: Corey Minyard <cminyard@mvista.com>
> Date: Tue Oct 23 11:29:02 2018 -0500
>
> ipmi: Fix how the lower layers are told to watch for messages
>
> [ Upstream commit c65ea996595005be470fbfa16711deba414fd33b ]
>
>
> In order to test the theory further I built and booted
> a kernel with lockdep and this happened:
>
> [ 215.679605] kipmi0/1465 is trying to acquire lock:
> [ 215.684490] 00000000fc1528d3 (&(&new_smi->si_lock)->rlock){..-.}, at: set_need_watch+0x2d/0x70 [ipmi_si]
> [ 215.694073]
> but task is already holding lock:
> [ 215.699995] 00000000e2eea01c (&(&intf->xmit_msgs_lock)->rlock){..-.}, at: smi_recv_tasklet+0x170/0x260 [ipmi_msghandler]
> [ 215.710966]
> which lock already depends on the new lock.
>
> [ 215.719243]
> the existing dependency chain (in reverse order) is:
> [ 215.726824]
> -> #1 (&(&intf->xmit_msgs_lock)->rlock){..-.}:
> [ 215.733890] lock_acquire+0xae/0x180
> [ 215.738111] _raw_spin_lock_irqsave+0x4d/0x8a
> [ 215.743113] ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
> [ 215.749766] smi_event_handler+0x26c/0x660 [ipmi_si]
> [ 215.755378] ipmi_thread+0x5d/0x200 [ipmi_si]
> [ 215.760377] kthread+0x13c/0x160
> [ 215.764247] ret_from_fork+0x24/0x50
> [ 215.768457]
> -> #0 (&(&new_smi->si_lock)->rlock){..-.}:
> [ 215.775220] __lock_acquire+0xa61/0xfb0
> [ 215.779700] lock_acquire+0xae/0x180
> [ 215.783912] _raw_spin_lock_irqsave+0x4d/0x8a
> [ 215.788915] set_need_watch+0x2d/0x70 [ipmi_si]
> [ 215.794091] smi_tell_to_watch.constprop.39+0x4a/0x50 [ipmi_msghandler]
> [ 215.801353] smi_recv_tasklet+0xe8/0x260 [ipmi_msghandler]
> [ 215.807484] tasklet_action_common.isra.14+0x83/0x1a0
> [ 215.813177] tasklet_action+0x22/0x30
> [ 215.817479] __do_softirq+0xd4/0x3ef
> [ 215.821698] irq_exit+0x120/0x130
> [ 215.825649] smp_irq_work_interrupt+0x8c/0x190
> [ 215.830732] irq_work_interrupt+0xf/0x20
> [ 215.835296] _raw_spin_unlock_irqrestore+0x56/0x60
> [ 215.840725] ipmi_thread+0x13e/0x200 [ipmi_si]
> [ 215.845811] kthread+0x13c/0x160
> [ 215.849683] ret_from_fork+0x24/0x50
> [ 215.853894]
> other info that might help us debug this:
>
> [ 215.862075] Possible unsafe locking scenario:
>
> [ 215.868139] CPU0 CPU1
> [ 215.872788] ---- ----
> [ 215.877433] lock(&(&intf->xmit_msgs_lock)->rlock);
> [ 215.882512] lock(&(&new_smi->si_lock)->rlock);
> [ 215.889776] lock(&(&intf->xmit_msgs_lock)->rlock);
> [ 215.897413] lock(&(&new_smi->si_lock)->rlock);
> [ 215.902151]
> *** DEADLOCK ***
>
> [ 215.908250] 2 locks held by kipmi0/1465:
>
>
> This seems to also indicate the same locks as culprits
> although the backtraces look different from the actual
> crashes.
>
next prev parent reply other threads:[~2023-06-19 12:53 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-19 11:11 IPMI related kernel panics since v4.19.286 Janne Huttunen (Nokia)
2023-06-19 12:23 ` Greg KH
2023-06-19 13:32 ` Corey Minyard
2023-06-19 12:52 ` Corey Minyard [this message]
2023-06-20 9:56 ` Janne Huttunen (Nokia)
2023-06-20 11:54 ` Corey Minyard
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZJBPjOL8chqtPck2@mail.minyard.net \
--to=minyard@acm.org \
--cc=janne.huttunen@nokia.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).