* Re: [PATCH] PCI/THP: Fix hang due to incorrect guard lock
[not found] <20250708222530.1041477-1-himanshu.madhani@oracle.com>
@ 2025-07-10 16:47 ` Himanshu Madhani
2025-07-10 21:31 ` Thomas Gleixner
1 sibling, 0 replies; 2+ messages in thread
From: Himanshu Madhani @ 2025-07-10 16:47 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org, Bjorn Helgaas
Cc: tglx@linutronix.de, Jorge Lopez, linux-pci@vger.kernel.org
Just realized this patch was missing maintainer email address and mailing list for reviews.
Bjorn,
Apologies. Can you please queue this up in the PCI tree for inclusion in 6.16-rc6.
> On Jul 8, 2025, at 15:25, himanshu.madhani@oracle.com wrote:
>
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
>
> Fix system hang due to incorrect mutex lock placement.
>
> following stack trace will be seen on system boot
>
> [ 525.664681] task:systemd-shutdow state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x00004002
> [ 525.796878] Call Trace:
> [ 525.826116] <TASK>
> [ 525.851195] __schedule+0x2d1/0x730
> [ 525.892917] schedule+0x27/0x80
> [ 525.930478] schedule_preempt_disabled+0x15/0x30
> [ 525.985718] __mutex_lock.constprop.0+0x4be/0x8a0
> [ 526.041993] msi_domain_get_virq+0xcc/0x110
> [ 526.092031] pci_msix_write_tph_tag+0x3c/0x100
> [ 526.145186] pcie_tph_set_st_entry+0x125/0x1d0
> [ 526.198346] bnxt_irq_affinity_release+0x35/0x50 [bnxt_en]
> [ 526.264015] irq_set_affinity_notifier+0xe0/0x130
> [ 526.320291] bnxt_free_irq+0x6e/0x110 [bnxt_en]
> [ 526.374507] __bnxt_close_nic.isra.0+0x1eb/0x220 [bnxt_en]
> [ 526.440175] bnxt_close+0x3a/0x100 [bnxt_en]
> [ 526.491264] __dev_close_many+0xae/0x220
> [ 526.538179] dev_close_many+0xc2/0x1b0
> [ 526.583014] netif_close+0x9d/0xd0
> [ 526.623693] bnxt_shutdown+0xb1/0xe0 [bnxt_en]
> [ 526.676874] pci_device_shutdown+0x35/0x70
> [ 526.725871] device_shutdown+0x118/0x1a0
> [ 526.772788] kernel_restart+0x3a/0x70
> [ 526.816588] __do_sys_reboot+0x150/0x250
> [ 526.863504] do_syscall_64+0x84/0x940
> [ 526.907300] ? __put_user_8+0xd/0x20
> [ 526.950059] ? rseq_ip_fixup+0x90/0x1e0
> [ 526.995937] ? task_mm_cid_work+0x1ad/0x220
> [ 527.045971] ? __rseq_handle_notify_resume+0x35/0x90
> [ 527.105367] ? arch_exit_to_user_mode_prepare.isra.0+0x98/0xb0
> [ 527.175166] ? do_syscall_64+0xba/0x940
> [ 527.221040] ? do_filp_open+0xd7/0x1a0
> [ 527.265882] ? alloc_fd+0xba/0x110
> [ 527.306556] ? do_sys_openat2+0xa4/0xf0
> [ 527.352434] ? __x64_sys_openat+0x54/0xb0
> [ 527.400389] ? arch_exit_to_user_mode_prepare.isra.0+0x9/0xb0
> [ 527.469150] ? do_syscall_64+0xba/0x940
> [ 527.515023] ? do_user_addr_fault+0x221/0x690
> [ 527.567141] ? clear_bhb_loop+0x30/0x80
> [ 527.613017] ? clear_bhb_loop+0x30/0x80
> [ 527.658895] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 527.719332] RIP: 0033:0x7fc3ec504777
> [ 527.762091] RSP: 002b:00007ffecd62c4f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
> [ 527.852685] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3ec504777
> [ 527.938085] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
> [ 528.023485] RBP: 00007ffecd62c700 R08: 0000000000000000 R09: 00007ffecd62b8e0
> [ 528.108878] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffecd62c568
> [ 528.194273] R13: 00007ffecd62c548 R14: 00007ffecd62c568 R15: 0000000000000000
> [ 528.279672] </TASK>
>
> Fixes: d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")
> Fixes: 71296eae5887 ("PCI/TPH: Replace the broken MSI-X control word update")
> Reported-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> ---
> drivers/pci/msi/msi.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index 6ede55a7c5e6..d686488f4111 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -934,10 +934,12 @@ int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int index, u16 tag)
> if (!pdev->msix_enabled)
> return -ENXIO;
>
> - guard(msi_descs_lock)(&pdev->dev);
> virq = msi_get_virq(&pdev->dev, index);
> if (!virq)
> return -ENXIO;
> +
> + guard(msi_descs_lock)(&pdev->dev);
> +
> /*
> * This is a horrible hack, but short of implementing a PCI
> * specific interrupt chip callback and a huge pile of
> --
> 2.41.0.rc2
>
--
Himanshu Madhani Oracle Linux Engineering
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [PATCH] PCI/THP: Fix hang due to incorrect guard lock
[not found] <20250708222530.1041477-1-himanshu.madhani@oracle.com>
2025-07-10 16:47 ` [PATCH] PCI/THP: Fix hang due to incorrect guard lock Himanshu Madhani
@ 2025-07-10 21:31 ` Thomas Gleixner
1 sibling, 0 replies; 2+ messages in thread
From: Thomas Gleixner @ 2025-07-10 21:31 UTC (permalink / raw)
To: himanshu.madhani, linux-kernel
Cc: Himanshu Madhani, Jorge Lopez, Bjorn Helgaas, linux-pci
On Tue, Jul 08 2025 at 22:25, himanshu madhani wrote:
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
The subject line is misleading because the problem is not in the THP
code. It's in the PCI/MSI code implementation of the function which is
used by THP.
> Fix system hang due to incorrect mutex lock placement.
>
> following stack trace will be seen on system boot
Please trim down stack traces to the bare minimum which is required to
illustrate your point.
> [ 525.664681] task:systemd-shutdow state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x00004002
> [ 525.796878] Call Trace:
> [ 525.826116] <TASK>
> [ 525.851195] __schedule+0x2d1/0x730
> [ 525.892917] schedule+0x27/0x80
> [ 525.930478] schedule_preempt_disabled+0x15/0x30
> [ 525.985718] __mutex_lock.constprop.0+0x4be/0x8a0
> [ 526.041993] msi_domain_get_virq+0xcc/0x110
> [ 526.092031] pci_msix_write_tph_tag+0x3c/0x100
> [ 526.145186] pcie_tph_set_st_entry+0x125/0x1d0
> [ 526.198346] bnxt_irq_affinity_release+0x35/0x50 [bnxt_en]
> [ 526.264015] irq_set_affinity_notifier+0xe0/0x130
> [ 526.320291] bnxt_free_irq+0x6e/0x110 [bnxt_en]
> [ 526.374507] __bnxt_close_nic.isra.0+0x1eb/0x220 [bnxt_en]
> [ 526.440175] bnxt_close+0x3a/0x100 [bnxt_en]
> [ 526.491264] __dev_close_many+0xae/0x220
> [ 526.538179] dev_close_many+0xc2/0x1b0
> [ 526.583014] netif_close+0x9d/0xd0
> [ 526.623693] bnxt_shutdown+0xb1/0xe0 [bnxt_en]
> [ 526.676874] pci_device_shutdown+0x35/0x70
> [ 526.725871] device_shutdown+0x118/0x1a0
You trimmed the interesting information that this is a softlockup and
kept all the gunk below whihc is completely useless.
> [ 526.772788] kernel_restart+0x3a/0x70
> [ 526.816588] __do_sys_reboot+0x150/0x250
> [ 526.863504] do_syscall_64+0x84/0x940
> [ 526.907300] ? __put_user_8+0xd/0x20
> [ 526.950059] ? rseq_ip_fixup+0x90/0x1e0
> [ 526.995937] ? task_mm_cid_work+0x1ad/0x220
> [ 527.045971] ? __rseq_handle_notify_resume+0x35/0x90
> [ 527.105367] ? arch_exit_to_user_mode_prepare.isra.0+0x98/0xb0
> [ 527.175166] ? do_syscall_64+0xba/0x940
> [ 527.221040] ? do_filp_open+0xd7/0x1a0
> [ 527.265882] ? alloc_fd+0xba/0x110
> [ 527.306556] ? do_sys_openat2+0xa4/0xf0
> [ 527.352434] ? __x64_sys_openat+0x54/0xb0
> [ 527.400389] ? arch_exit_to_user_mode_prepare.isra.0+0x9/0xb0
> [ 527.469150] ? do_syscall_64+0xba/0x940
> [ 527.515023] ? do_user_addr_fault+0x221/0x690
> [ 527.567141] ? clear_bhb_loop+0x30/0x80
> [ 527.613017] ? clear_bhb_loop+0x30/0x80
> [ 527.658895] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 527.719332] RIP: 0033:0x7fc3ec504777
> [ 527.762091] RSP: 002b:00007ffecd62c4f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
> [ 527.852685] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3ec504777
> [ 527.938085] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
> [ 528.023485] RBP: 00007ffecd62c700 R08: 0000000000000000 R09: 00007ffecd62b8e0
> [ 528.108878] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffecd62c568
> [ 528.194273] R13: 00007ffecd62c548 R14: 00007ffecd62c568 R15: 0000000000000000
> [ 528.279672] </TASK>
See https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces
> Fixes: d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")
This fixes tag is correct
> Fixes: 71296eae5887 ("PCI/TPH: Replace the broken MSI-X control word update")
This one not because it's a subsequent problem caused by the above.
> Reported-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Jorge Lopez <jorge.jo.lopez@oracle.com>
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Other than that this looks good.
I pick it up tomorrow through the tip irq/urgent branch and fixup the
changelog, so no need to resend.
Thanks,
tglx
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-07-10 21:31 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20250708222530.1041477-1-himanshu.madhani@oracle.com>
2025-07-10 16:47 ` [PATCH] PCI/THP: Fix hang due to incorrect guard lock Himanshu Madhani
2025-07-10 21:31 ` Thomas Gleixner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).