Re: [PATCH] PCI/THP: Fix hang due to incorrect guard lock

From: Thomas Gleixner <tglx@linutronix.de>
To: himanshu.madhani@oracle.com, linux-kernel@vger.kernel.org
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>,
	Jorge Lopez <jorge.jo.lopez@oracle.com>,
	Bjorn Helgaas <helgaas@kernel.org>,
	linux-pci@vger.kernel.org
Subject: Re: [PATCH] PCI/THP: Fix hang due to incorrect guard lock
Date: Thu, 10 Jul 2025 23:31:11 +0200
Message-ID: <87frf3wx5s.ffs@tglx>
In-Reply-To: <20250708222530.1041477-1-himanshu.madhani@oracle.com>

On Tue, Jul 08 2025 at 22:25, himanshu madhani wrote:

> From: Himanshu Madhani <himanshu.madhani@oracle.com>

The subject line is misleading because the problem is not in the TPH
code. It is in the PCI/MSI implementation of the function which TPH
uses.
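
For readers following the thread, here is a minimal user-space sketch of
the failure mode. It assumes that "incorrect guard lock" means a mutex was
held across a call that takes the same mutex again, which is what the
quoted trace suggests (__mutex_lock() under msi_domain_get_virq(), called
from pci_msix_write_tph_tag()). All names below are hypothetical stand-ins,
not the kernel code. A POSIX error-checking mutex is used so the recursion
is reported as EDEADLK; a kernel mutex would simply block forever.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a per-device descriptor mutex. */
static pthread_mutex_t desc_lock;

/* Error-checking type: relocking by the owner returns EDEADLK. */
static void init_lock(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&desc_lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

/* Stand-in for a lookup helper that takes the lock internally. */
static int lookup_virq(int *virq)
{
	int ret = pthread_mutex_lock(&desc_lock);

	if (ret)
		return ret;	/* EDEADLK if the caller already holds it */
	*virq = 42;		/* made-up interrupt number */
	pthread_mutex_unlock(&desc_lock);
	return 0;
}

/* Broken ordering: lock first, then call the helper that locks again. */
static int write_tag_broken(void)
{
	int virq, ret;

	pthread_mutex_lock(&desc_lock);
	ret = lookup_virq(&virq);
	pthread_mutex_unlock(&desc_lock);
	return ret;
}

/* Fixed ordering: call the helper first, take the lock afterwards. */
static int write_tag_fixed(void)
{
	int virq, ret;

	ret = lookup_virq(&virq);
	if (ret)
		return ret;
	pthread_mutex_lock(&desc_lock);
	/* ... work that needs the lock and virq goes here ... */
	pthread_mutex_unlock(&desc_lock);
	return 0;
}
```

After init_lock(), write_tag_broken() returns EDEADLK while
write_tag_fixed() returns 0. The only difference is where the lock is
taken relative to the helper call.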

> Fix system hang due to incorrect mutex lock placement.
>
> following stack trace will be seen on system boot

Please trim stack traces down to the bare minimum required to
illustrate your point.

> [  525.664681] task:systemd-shutdow state:D stack:0     pid:1     tgid:1     ppid:0      task_flags:0x400100 flags:0x00004002
> [  525.796878] Call Trace:
> [  525.826116]  <TASK>
> [  525.851195]  __schedule+0x2d1/0x730
> [  525.892917]  schedule+0x27/0x80
> [  525.930478]  schedule_preempt_disabled+0x15/0x30
> [  525.985718]  __mutex_lock.constprop.0+0x4be/0x8a0
> [  526.041993]  msi_domain_get_virq+0xcc/0x110
> [  526.092031]  pci_msix_write_tph_tag+0x3c/0x100
> [  526.145186]  pcie_tph_set_st_entry+0x125/0x1d0
> [  526.198346]  bnxt_irq_affinity_release+0x35/0x50 [bnxt_en]
> [  526.264015]  irq_set_affinity_notifier+0xe0/0x130
> [  526.320291]  bnxt_free_irq+0x6e/0x110 [bnxt_en]
> [  526.374507]  __bnxt_close_nic.isra.0+0x1eb/0x220 [bnxt_en]
> [  526.440175]  bnxt_close+0x3a/0x100 [bnxt_en]
> [  526.491264]  __dev_close_many+0xae/0x220
> [  526.538179]  dev_close_many+0xc2/0x1b0
> [  526.583014]  netif_close+0x9d/0xd0
> [  526.623693]  bnxt_shutdown+0xb1/0xe0 [bnxt_en]
> [  526.676874]  pci_device_shutdown+0x35/0x70
> [  526.725871]  device_shutdown+0x118/0x1a0

You trimmed the interesting information that this is a softlockup and
kept all the gunk below, which is completely useless.

> [  526.772788]  kernel_restart+0x3a/0x70
> [  526.816588]  __do_sys_reboot+0x150/0x250
> [  526.863504]  do_syscall_64+0x84/0x940
> [  526.907300]  ? __put_user_8+0xd/0x20
> [  526.950059]  ? rseq_ip_fixup+0x90/0x1e0
> [  526.995937]  ? task_mm_cid_work+0x1ad/0x220
> [  527.045971]  ? __rseq_handle_notify_resume+0x35/0x90
> [  527.105367]  ? arch_exit_to_user_mode_prepare.isra.0+0x98/0xb0
> [  527.175166]  ? do_syscall_64+0xba/0x940
> [  527.221040]  ? do_filp_open+0xd7/0x1a0
> [  527.265882]  ? alloc_fd+0xba/0x110
> [  527.306556]  ? do_sys_openat2+0xa4/0xf0
> [  527.352434]  ? __x64_sys_openat+0x54/0xb0
> [  527.400389]  ? arch_exit_to_user_mode_prepare.isra.0+0x9/0xb0
> [  527.469150]  ? do_syscall_64+0xba/0x940
> [  527.515023]  ? do_user_addr_fault+0x221/0x690
> [  527.567141]  ? clear_bhb_loop+0x30/0x80
> [  527.613017]  ? clear_bhb_loop+0x30/0x80
> [  527.658895]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  527.719332] RIP: 0033:0x7fc3ec504777
> [  527.762091] RSP: 002b:00007ffecd62c4f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
> [  527.852685] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3ec504777
> [  527.938085] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
> [  528.023485] RBP: 00007ffecd62c700 R08: 0000000000000000 R09: 00007ffecd62b8e0
> [  528.108878] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffecd62c568
> [  528.194273] R13: 00007ffecd62c548 R14: 00007ffecd62c568 R15: 0000000000000000
> [  528.279672]  </TASK>

See https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces

> Fixes: d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")

This Fixes tag is correct.

> Fixes: 71296eae5887 ("PCI/TPH: Replace the broken MSI-X control word update")

This one is not, because it's a subsequent problem caused by the above.

> Reported-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>

Other than that this looks good.

I'll pick it up tomorrow through the tip irq/urgent branch and fix up
the changelog, so no need to resend.

Thanks,

        tglx
