Re: IPMI related kernel panics since v4.19.286

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Corey Minyard <minyard@acm.org>
To: "Janne Huttunen (Nokia)" <janne.huttunen@nokia.com>
Cc: "stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: IPMI related kernel panics since v4.19.286
Date: Mon, 19 Jun 2023 07:52:28 -0500	[thread overview]
Message-ID: <ZJBPjOL8chqtPck2@mail.minyard.net> (raw)
In-Reply-To: <7ae67dbec16b93f0e6356337e52bf21921b0897c.camel@nokia.com>

On Mon, Jun 19, 2023 at 11:11:16AM +0000, Janne Huttunen (Nokia) wrote:
> 
> We recently updated an internal test server from kernel v4.19.273
> to v4.19.286 and since then it has already multiple times triggered
> a kernel panic due to a hard lockup. The lockups look e.g. like
> this:

It looks like

  b4a34aa6d "ipmi: Fix how the lower layers are told to watch for messages"

was backported to fullfill a dependency for another backport, but there
was another change:

  e1891cffd4c4 "ipmi: Make the smi watcher be disabled immediately when not needed"

That is needed to avoid calling a lower layer function with
xmit_msgs_lock held.  It doesn't apply completely cleanly because of
other changes, but you just need to leave in the free_user_work()
function and delete the other function in the conflict.  In addition to
that, you will also need:

  383035211c79 "ipmi: move message error checking to avoid deadlock"

to fix a bug in that change.

Can you try this out?

Thanks,

-corey

> 
> [29397.950589] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950590] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
> [29397.950591] RSP: 0000:ffff93d07f703de8 EFLAGS: 00000002
> [29397.950591] RAX: 0000000000000101 RBX: ffff93cf90087220 RCX: 0000000000000006
> [29397.950592] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff93cf90087220
> [29397.950592] RBP: ffff93d07f703de8 R08: 0000000000000000 R09: ffff93d07f71fb70
> [29397.950593] R10: ffff93d07f71fb28 R11: ffff93d07f703ee8 R12: 0000000000000002
> [29397.950593] R13: ffff93cf852cde58 R14: ffff93cf852cc000 R15: 0000000000000003
> [29397.950594]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950594]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [29397.950594]  </NMI>
> [29397.950595]  <IRQ>
> [29397.950595]  _raw_spin_lock_irqsave+0x46/0x50
> [29397.950595]  set_need_watch+0x2d/0x70 [ipmi_si]
> [29397.950596]  ? _raw_spin_lock_irqsave+0x25/0x50
> [29397.950596]  ipmi_timeout+0x2b4/0x530 [ipmi_msghandler]
> [29397.950597]  ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
> [29397.950597]  call_timer_fn+0x30/0x130
> [29397.950597]  ? ipmi_set_gets_events+0x260/0x260 [ipmi_msghandler]
> [29397.950598]  run_timer_softirq+0x1ce/0x3f0
> [29397.950598]  ? ktime_get+0x40/0xa0
> [29397.950598]  ? sched_clock+0x9/0x10
> [29397.950599]  ? sched_clock_cpu+0x11/0xc0
> [29397.950599]  __do_softirq+0x104/0x32d
> [29397.950600]  ? sched_clock_cpu+0x11/0xc0
> [29397.950600]  irq_exit+0x11b/0x120
> [29397.950600]  smp_apic_timer_interrupt+0x79/0x140
> [29397.950601]  apic_timer_interrupt+0xf/0x20
> [29397.950601]  </IRQ>
> 
> 
> And like this:
> 
> [16944.269585] RIP: 0010:native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269586] Code: 74 38 81 e6 00 ff ff ff 75 60 f0 0f ba 2f 08 8b 07 72 57 89 c2 30 e6 a9 00 00 ff ff 75 48 85 d2 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 8b 37 81 fe 00 01
> [16944.269587] RSP: 0018:ffff9a4e40b03dd0 EFLAGS: 00000002
> [16944.269588] RAX: 0000000000580101 RBX: ffff9a4ac5089e98 RCX: ffff9a2b9574b538
> [16944.269588] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9a4ac5089e98
> [16944.269589] RBP: ffff9a4e40b03dd0 R08: ffff9a4ac5089e58 R09: ffff9a4e40b1fb70
> [16944.269589] R10: ffff9a4e40b1fb28 R11: ffff9a4e40b03ee8 R12: 0000000000000002
> [16944.269590] R13: ffff9a4ac5089e98 R14: ffff9a4ac5089e58 R15: ffff9a4ac5089e54
> [16944.269590]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269591]  ? native_queued_spin_lock_slowpath+0x57/0x190
> [16944.269591]  </NMI>
> [16944.269592]  <IRQ>
> [16944.269592]  _raw_spin_lock_irqsave+0x46/0x50
> [16944.269592]  ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
> [16944.269593]  smi_event_handler+0x26c/0x650 [ipmi_si]
> [16944.269593]  smi_timeout+0x46/0xc0 [ipmi_si]
> [16944.269594]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
> [16944.269594]  call_timer_fn+0x30/0x130
> [16944.269595]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
> [16944.269595]  run_timer_softirq+0x1ce/0x3f0
> [16944.269595]  ? ktime_get+0x40/0xa0
> [16944.269596]  ? sched_clock+0x9/0x10
> [16944.269596]  ? sched_clock_cpu+0x11/0xc0
> [16944.269597]  __do_softirq+0x104/0x32d
> [16944.269597]  ? sched_clock_cpu+0x11/0xc0
> [16944.269597]  irq_exit+0x11b/0x120
> [16944.269598]  smp_apic_timer_interrupt+0x79/0x140
> [16944.269598]  apic_timer_interrupt+0xf/0x20
> [16944.269599]  </IRQ>
> 
> 
> To me these would look like an ordering violation between
> "smi_info->si_lock" and "intf->xmit_msgs_lock", probably
> introduced by this commit:
> 
> commit b4a34aa6dfbca67610e56ad84a3595f537c85af9
> Author: Corey Minyard <cminyard@mvista.com>
> Date:   Tue Oct 23 11:29:02 2018 -0500
> 
>     ipmi: Fix how the lower layers are told to watch for messages
>     
>     [ Upstream commit c65ea996595005be470fbfa16711deba414fd33b ]
>     
> 
> In order to test the theory further I built and booted
> a kernel with lockdep and this happened:
> 
> [  215.679605] kipmi0/1465 is trying to acquire lock:
> [  215.684490] 00000000fc1528d3 (&(&new_smi->si_lock)->rlock){..-.}, at: set_need_watch+0x2d/0x70 [ipmi_si]
> [  215.694073] 
>                but task is already holding lock:
> [  215.699995] 00000000e2eea01c (&(&intf->xmit_msgs_lock)->rlock){..-.}, at: smi_recv_tasklet+0x170/0x260 [ipmi_msghandler]
> [  215.710966] 
>                which lock already depends on the new lock.
> 
> [  215.719243] 
>                the existing dependency chain (in reverse order) is:
> [  215.726824] 
>                -> #1 (&(&intf->xmit_msgs_lock)->rlock){..-.}:
> [  215.733890]        lock_acquire+0xae/0x180
> [  215.738111]        _raw_spin_lock_irqsave+0x4d/0x8a
> [  215.743113]        ipmi_smi_msg_received+0x1bc/0x300 [ipmi_msghandler]
> [  215.749766]        smi_event_handler+0x26c/0x660 [ipmi_si]
> [  215.755378]        ipmi_thread+0x5d/0x200 [ipmi_si]
> [  215.760377]        kthread+0x13c/0x160
> [  215.764247]        ret_from_fork+0x24/0x50
> [  215.768457] 
>                -> #0 (&(&new_smi->si_lock)->rlock){..-.}:
> [  215.775220]        __lock_acquire+0xa61/0xfb0
> [  215.779700]        lock_acquire+0xae/0x180
> [  215.783912]        _raw_spin_lock_irqsave+0x4d/0x8a
> [  215.788915]        set_need_watch+0x2d/0x70 [ipmi_si]
> [  215.794091]        smi_tell_to_watch.constprop.39+0x4a/0x50 [ipmi_msghandler]
> [  215.801353]        smi_recv_tasklet+0xe8/0x260 [ipmi_msghandler]
> [  215.807484]        tasklet_action_common.isra.14+0x83/0x1a0
> [  215.813177]        tasklet_action+0x22/0x30
> [  215.817479]        __do_softirq+0xd4/0x3ef
> [  215.821698]        irq_exit+0x120/0x130
> [  215.825649]        smp_irq_work_interrupt+0x8c/0x190
> [  215.830732]        irq_work_interrupt+0xf/0x20
> [  215.835296]        _raw_spin_unlock_irqrestore+0x56/0x60
> [  215.840725]        ipmi_thread+0x13e/0x200 [ipmi_si]
> [  215.845811]        kthread+0x13c/0x160
> [  215.849683]        ret_from_fork+0x24/0x50
> [  215.853894] 
>                other info that might help us debug this:
> 
> [  215.862075]  Possible unsafe locking scenario:
> 
> [  215.868139]        CPU0                    CPU1
> [  215.872788]        ----                    ----
> [  215.877433]   lock(&(&intf->xmit_msgs_lock)->rlock);
> [  215.882512]                                lock(&(&new_smi->si_lock)->rlock);
> [  215.889776]                                lock(&(&intf->xmit_msgs_lock)->rlock);
> [  215.897413]   lock(&(&new_smi->si_lock)->rlock);
> [  215.902151] 
>                 *** DEADLOCK ***
> 
> [  215.908250] 2 locks held by kipmi0/1465:
> 
> 
> This seems to also indicate the same locks as culprits
> although the backtraces look different from the actual
> crashes.
>

next prev parent reply	other threads:[~2023-06-19 12:53 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-19 11:11 IPMI related kernel panics since v4.19.286 Janne Huttunen (Nokia)
2023-06-19 12:23 ` Greg KH
2023-06-19 13:32   ` Corey Minyard
2023-06-19 12:52 ` Corey Minyard [this message]
2023-06-20  9:56   ` Janne Huttunen (Nokia)
2023-06-20 11:54     ` Corey Minyard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZJBPjOL8chqtPck2@mail.minyard.net \
    --to=minyard@acm.org \
    --cc=janne.huttunen@nokia.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.