* System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)
@ 2025-07-03 18:27 Himanshu Madhani
       [not found] ` <7279DC28-17BF-4A28-96ED-7AE9857BC2E3@oracle.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Himanshu Madhani @ 2025-07-03 18:27 UTC (permalink / raw)
  To: glx@linutronix.de, linux-kernel@vger.kernel.org

Hi Folks, 

We are seeing a kernel hang while booting after the new 6.16-rc1 kernel is installed.

Here is the stack trace that shows up:

[  297.656683] systemd-shutdown[1]: Rebooting with kexec.
[  513.790993] INFO: task kexec:19038 blocked for more than 122 seconds.
[  513.868087]       Not tainted 6.16.0-rc1.master.20250611.ol9.x86_64 #1
[  513.946210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  514.039923] task:kexec           state:D stack:0     pid:19038 tgid:19038 ppid:1      task_flags:0x400100 flags:0x00004002
[  514.172122] Call Trace:
[  514.201356]  <TASK>
[  514.226438]  __schedule+0x2d1/0x730
[  514.268161]  schedule+0x27/0x80
[  514.305717]  schedule_preempt_disabled+0x15/0x30
[  514.360954]  __mutex_lock.constprop.0+0x4be/0x8a0
[  514.417232]  msi_domain_get_virq+0xcc/0x110
[  514.467279]  pci_msix_write_tph_tag+0x3c/0x100
[  514.520441]  pcie_tph_set_st_entry+0x125/0x1d0
[  514.573605]  bnxt_irq_affinity_release+0x35/0x50 [bnxt_en]
[  514.639258]  irq_set_affinity_notifier+0xdd/0x130
[  514.695534]  bnxt_free_irq+0x6e/0x110 [bnxt_en]
[  514.749746]  __bnxt_close_nic.isra.0+0x1eb/0x220 [bnxt_en]
[  514.815404]  bnxt_close+0x3a/0x100 [bnxt_en]
[  514.866498]  __dev_close_many+0xab/0x220
[  514.913423]  __dev_change_flags+0x102/0x240
[  514.963464]  netif_change_flags+0x26/0x70
[  515.011424]  dev_change_flags+0x40/0xc0
[  515.057304]  devinet_ioctl+0x3aa/0x7a0
[  515.102142]  inet_ioctl+0x1d3/0x1f0
[  515.143863]  sock_do_ioctl+0x7a/0x140
[  515.187667]  __x64_sys_ioctl+0x9b/0x100
[  515.233545]  ? syscall_trace_enter+0x10c/0x1d0
[  515.286704]  do_syscall_64+0x84/0x940
[  515.330502]  ? refill_obj_stock+0x143/0x240
[  515.380543]  ? __dentry_kill+0x12e/0x190
[  515.427459]  ? __memcg_slab_free_hook+0xf4/0x150
[  515.482698]  ? __x64_sys_close+0x3d/0x80
[  515.529616]  ? kmem_cache_free+0x3fe/0x460
[  515.578614]  ? syscall_exit_work+0x118/0x150
[  515.629695]  ? arch_exit_to_user_mode_prepare.isra.0+0x9/0xb0
[  515.698453]  ? do_syscall_64+0xba/0x940
[  515.744330]  ? mod_memcg_lruvec_state+0x1a2/0x1f0
[  515.800608]  ? __lruvec_stat_mod_folio+0x83/0xd0
[  515.855843]  ? __folio_mod_stat+0x26/0x80
[  515.903801]  ? set_ptes.isra.0+0x36/0x90
[  515.950723]  ? do_anonymous_page+0x103/0x4b0
[  516.001802]  ? __handle_mm_fault+0x394/0x6f0
[  516.052886]  ? count_memcg_events+0x15a/0x1a0
[  516.105008]  ? handle_mm_fault+0x24a/0x350
[  516.154003]  ? do_user_addr_fault+0x221/0x690
[  516.206122]  ? arch_exit_to_user_mode_prepare.isra.0+0x9/0xb0
[  516.274887]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  516.335330] RIP: 0033:0x7fc96e903bcb
[  516.378086] RSP: 002b:00007ffcc7f78518 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  516.468683] RAX: ffffffffffffffda RBX: 000055dc432d8f80 RCX: 00007fc96e903bcb
[  516.554080] RDX: 00007ffcc7f78680 RSI: 0000000000008914 RDI: 0000000000000003
[  516.639482] RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000007
[  516.724882] R10: 000000000000005e R11: 0000000000000202 R12: 000055dc095468dd
[  516.810278] R13: 000055dc095468e4 R14: 00007ffcc7f78680 R15: 000055dc432d9020
[  516.895676]  </TASK>
[  516.921808] INFO: task kexec:19038 is blocked on a mutex likely owned by task kexec:19038.
[  517.020728] task:kexec           state:D stack:0     pid:19038 tgid:19038 ppid:1      task_flags:0x400100 flags:0x00004002


Git bisect points to this merge commit:

commit 6376c0770656f3bdf7f411faf068371b6932aeca
Merge: 5e8bbb2caa4e 29857e6f4e30
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue May 27 09:01:26 2025 -0700

    Merge tag 'timers-clocksource-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    
    Pull clocksource updates from Thomas Gleixner:
     "Updates for clocksource/clockevent drivers:
    
       - The final conversion of text formatted device tree binding to
         schemas
    
       - A new driver fot the System Timer Module on S32G NXP SoCs
    
       - A new driver fot the Econet HPT timer
    
       - The usual improvements and device tree binding updates"
    
    * tag 'timers-clocksource-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
      clocksource/drivers/renesas-ostm: Unconditionally enable reprobe support
      dt-bindings: timer: renesas,ostm: Document RZ/V2N (R9A09G056) support
      dt-bindings: timer: Convert marvell,armada-370-timer to DT schema
      dt-bindings: timer: Convert ti,keystone-timer to DT schema
      dt-bindings: timer: Convert st,spear-timer to DT schema
      dt-bindings: timer: Convert socionext,milbeaut-timer to DT schema
      dt-bindings: timer: Convert snps,arc-timer to DT schema
      dt-bindings: timer: Convert snps,archs-rtc to DT schema
      dt-bindings: timer: Convert snps,archs-gfrc to DT schema
      dt-bindings: timer: Convert lsi,zevio-timer to DT schema
      dt-bindings: timer: Convert jcore,pit to DT schema
      dt-bindings: timer: Convert img,pistachio-gptimer to DT schema
      dt-bindings: timer: Convert ezchip,nps400-timer to DT schema
      dt-bindings: timer: Convert cirrus,clps711x-timer to DT schema
      dt-bindings: timer: Convert altr,timer-1.0 to DT schema
      dt-bindings: timer: Add ESWIN EIC7700 CLINT
      clocksource/drivers: Add EcoNet Timer HPT driver
      dt-bindings: timer: Add EcoNet EN751221 "HPT" CPU Timer
      dt-bindings: timer: Convert arm,mps2-timer to DT schema
      dt-bindings: timer: Add Sophgo SG2044 ACLINT timer
      …

Looking further into this merge, the only series I see with changes that may or may not be related to the hang is this one:

https://lore.kernel.org/all/20250429065337.117370076@linutronix.de/

I am not very familiar with this subsystem and was hoping somebody can spot the offending commit and possibly provide a fix for this hang.

Note that we also tried rc3 to see whether a fix had been applied in a later RC, and we still see the same issue:

[  525.390801] INFO: task systemd-shutdow:1 blocked for more than 122 seconds.
[  525.474133]       Tainted: G S                  6.16.0-rc3.master.20250625.ol9.x86_64 #1
[  525.570969] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  525.664681] task:systemd-shutdow state:D stack:0     pid:1     tgid:1     ppid:0      task_flags:0x400100 flags:0x00004002
[  525.796878] Call Trace:
[  525.826116]  <TASK>
[  525.851195]  __schedule+0x2d1/0x730
[  525.892917]  schedule+0x27/0x80
[  525.930478]  schedule_preempt_disabled+0x15/0x30
[  525.985718]  __mutex_lock.constprop.0+0x4be/0x8a0
[  526.041993]  msi_domain_get_virq+0xcc/0x110
[  526.092031]  pci_msix_write_tph_tag+0x3c/0x100
[  526.145186]  pcie_tph_set_st_entry+0x125/0x1d0
[  526.198346]  bnxt_irq_affinity_release+0x35/0x50 [bnxt_en]
[  526.264015]  irq_set_affinity_notifier+0xe0/0x130
[  526.320291]  bnxt_free_irq+0x6e/0x110 [bnxt_en]
[  526.374507]  __bnxt_close_nic.isra.0+0x1eb/0x220 [bnxt_en]
[  526.440175]  bnxt_close+0x3a/0x100 [bnxt_en]
[  526.491264]  __dev_close_many+0xae/0x220
[  526.538179]  dev_close_many+0xc2/0x1b0
[  526.583014]  netif_close+0x9d/0xd0
[  526.623693]  bnxt_shutdown+0xb1/0xe0 [bnxt_en]
[  526.676874]  pci_device_shutdown+0x35/0x70
[  526.725871]  device_shutdown+0x118/0x1a0
[  526.772788]  kernel_restart+0x3a/0x70
[  526.816588]  __do_sys_reboot+0x150/0x250
[  526.863504]  do_syscall_64+0x84/0x940
[  526.907300]  ? __put_user_8+0xd/0x20
[  526.950059]  ? rseq_ip_fixup+0x90/0x1e0
[  526.995937]  ? task_mm_cid_work+0x1ad/0x220
[  527.045971]  ? __rseq_handle_notify_resume+0x35/0x90
[  527.105367]  ? arch_exit_to_user_mode_prepare.isra.0+0x98/0xb0
[  527.175166]  ? do_syscall_64+0xba/0x940
[  527.221040]  ? do_filp_open+0xd7/0x1a0
[  527.265882]  ? alloc_fd+0xba/0x110
[  527.306556]  ? do_sys_openat2+0xa4/0xf0
[  527.352434]  ? __x64_sys_openat+0x54/0xb0
[  527.400389]  ? arch_exit_to_user_mode_prepare.isra.0+0x9/0xb0
[  527.469150]  ? do_syscall_64+0xba/0x940
[  527.515023]  ? do_user_addr_fault+0x221/0x690
[  527.567141]  ? clear_bhb_loop+0x30/0x80
[  527.613017]  ? clear_bhb_loop+0x30/0x80
[  527.658895]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  527.719332] RIP: 0033:0x7fc3ec504777
[  527.762091] RSP: 002b:00007ffecd62c4f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[  527.852685] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3ec504777
[  527.938085] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
[  528.023485] RBP: 00007ffecd62c700 R08: 0000000000000000 R09: 00007ffecd62b8e0
[  528.108878] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffecd62c568
[  528.194273] R13: 00007ffecd62c548 R14: 00007ffecd62c568 R15: 0000000000000000
[  528.279672]  </TASK>

-- 
Himanshu Madhani	Oracle Linux Engineering



* Re: System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)
       [not found] ` <7279DC28-17BF-4A28-96ED-7AE9857BC2E3@oracle.com>
@ 2025-07-03 20:21   ` Thomas Gleixner
  2025-07-03 20:34     ` Himanshu Madhani
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2025-07-03 20:21 UTC (permalink / raw)
  To: Himanshu Madhani, linux-kernel@vger.kernel.org

On Thu, Jul 03 2025 at 18:32, Himanshu Madhani wrote:
> On Jul 3, 2025, at 11:27, Himanshu Madhani <himanshu.madhani@oracle.com> wrote:
> Git-bisect point to this merge commit
>
> commit 6376c0770656f3bdf7f411faf068371b6932aeca
> Merge: 5e8bbb2caa4e 29857e6f4e30
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Tue May 27 09:01:26 2025 -0700
>
>    Merge tag 'timers-clocksource-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>
>    Pull clocksource updates from Thomas Gleixner:
>     "Updates for clocksource/clockevent drivers:
>
>       - The final conversion of text formatted device tree binding to
>         schemas
>
>       - A new driver fot the System Timer Module on S32G NXP SoCs
>
>       - A new driver fot the Econet HPT timer
>
>       - The usual improvements and device tree binding updates"

That obviously does not make sense, so your bisect went sideways.

> Following further in this commit, I only see this following series
> that had changes which may or may not be related to hang.
>
> https://lore.kernel.org/all/20250429065337.117370076@linutronix.de/

They are not. There is a hint in both backtraces:

> [  514.305717]  schedule_preempt_disabled+0x15/0x30
> [  514.360954]  __mutex_lock.constprop.0+0x4be/0x8a0
> [  514.417232]  msi_domain_get_virq+0xcc/0x110
> [  514.467279]  pci_msix_write_tph_tag+0x3c/0x100

and

> [  525.930478]  schedule_preempt_disabled+0x15/0x30
> [  525.985718]  __mutex_lock.constprop.0+0x4be/0x8a0
> [  526.041993]  msi_domain_get_virq+0xcc/0x110
> [  526.092031]  pci_msix_write_tph_tag+0x3c/0x100

pci_msix_write_tph_tag() is the function which ends up trying to lock
the mutex and gets stuck. This function was introduced with commit

  d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")

and the subsequent commit

  71296eae5887 ("PCI/TPH: Replace the broken MSI-X control word update")

flipped the TPH code over to use that.

The problem is obvious, and if you had enabled CONFIG_PROVE_LOCKING
you would have got the reason presented on a silver platter in dmesg.
I encourage you to do so nevertheless.

I definitely screwed that one up in the most stupid way.

As I had no idea how to exercise that code path, I did not test it. It
seems this code is not really tested by any of the CI stuff either
before it hits Linus' tree, and as some folks start testing only post
rc1, it takes some time to surface :(

The fix is as obvious as the problem. See uncompiled and untested patch
below. If it solves the problem, which it should, feel free to take it
and create a proper patch with changelog and Fixes tag yourself (Adding
Suggested-by: Thomas ... is good enough). Otherwise let me know, and
I'll take care of it in my copious spare time :)

Thanks,

        tglx
---
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 6ede55a7c5e6..eb26f3816922 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -934,10 +934,11 @@ int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int index, u16 tag)
 	if (!pdev->msix_enabled)
 		return -ENXIO;
 
-	guard(msi_descs_lock)(&pdev->dev);
 	virq = msi_get_virq(&pdev->dev, index);
 	if (!virq)
 		return -ENXIO;
+
+	guard(msi_descs_lock)(&pdev->dev);
 	/*
 	 * This is a horrible hack, but short of implementing a PCI
 	 * specific interrupt chip callback and a huge pile of


* Re: System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)
  2025-07-03 19:31 Himanshu Madhani
@ 2025-07-03 20:24 ` Thomas Gleixner
  2025-07-03 20:31   ` Himanshu Madhani
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2025-07-03 20:24 UTC (permalink / raw)
  To: Himanshu Madhani, linux-kernel

On Thu, Jul 03 2025 at 12:31, Himanshu Madhani wrote:
> We are seeing kernel hang while booting after new 6.16-rc1 kernel is 
> installed.

Please don't resend stuff within a few hours just because nobody
replies. People live in different time zones and are not waiting in
front of their computer for your important messages.


* Re: System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)
  2025-07-03 20:24 ` Thomas Gleixner
@ 2025-07-03 20:31   ` Himanshu Madhani
  0 siblings, 0 replies; 7+ messages in thread
From: Himanshu Madhani @ 2025-07-03 20:31 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel@vger.kernel.org



> On Jul 3, 2025, at 13:24, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 03 2025 at 12:31, Himanshu Madhani wrote:
>> We are seeing kernel hang while booting after new 6.16-rc1 kernel is 
>> installed.
> 
> Please don't resend stuff within a few hours just because nobody
> replies. People live in different time zones and are not waiting in
> front of their computer for your important messages.

I got a rejection notice from the kernel mailing list complaining about HTML format,
so I converted the mail to plain text and resent it. I did not realize that both mails made it to
the mailing list. Apologies for the noise.

Thanks,
Himanshu


* Re: System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)
  2025-07-03 20:21   ` Thomas Gleixner
@ 2025-07-03 20:34     ` Himanshu Madhani
  2025-07-03 21:51       ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Himanshu Madhani @ 2025-07-03 20:34 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel@vger.kernel.org



> On Jul 3, 2025, at 13:21, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Jul 03 2025 at 18:32, Himanshu Madhani wrote:
>> On Jul 3, 2025, at 11:27, Himanshu Madhani <himanshu.madhani@oracle.com> wrote:
>> Git-bisect point to this merge commit
>> 
>> commit 6376c0770656f3bdf7f411faf068371b6932aeca
>> Merge: 5e8bbb2caa4e 29857e6f4e30
>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>> Date:   Tue May 27 09:01:26 2025 -0700
>> 
>>   Merge tag 'timers-clocksource-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> 
>>   Pull clocksource updates from Thomas Gleixner:
>>    "Updates for clocksource/clockevent drivers:
>> 
>>      - The final conversion of text formatted device tree binding to
>>        schemas
>> 
>>      - A new driver fot the System Timer Module on S32G NXP SoCs
>> 
>>      - A new driver fot the Econet HPT timer
>> 
>>      - The usual improvements and device tree binding updates"
> 
> That obviously does not make sense, so your bisect went sideways.
> 
>> Following further in this commit, I only see this following series
>> that had changes which may or may not be related to hang.
>> 
>> https://lore.kernel.org/all/20250429065337.117370076@linutronix.de/
> 
> They are not. There is a hint in both backtraces:
> 
>> [  514.305717]  schedule_preempt_disabled+0x15/0x30
>> [  514.360954]  __mutex_lock.constprop.0+0x4be/0x8a0
>> [  514.417232]  msi_domain_get_virq+0xcc/0x110
>> [  514.467279]  pci_msix_write_tph_tag+0x3c/0x100
> 
> and
> 
>> [  525.930478]  schedule_preempt_disabled+0x15/0x30
>> [  525.985718]  __mutex_lock.constprop.0+0x4be/0x8a0
>> [  526.041993]  msi_domain_get_virq+0xcc/0x110
>> [  526.092031]  pci_msix_write_tph_tag+0x3c/0x100
> 
> pci_msix_write_tph_tag() is the function which ends up trying to lock
> the mutex and gets stuck. This function was introduced with commit
> 
>  d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH")
> 
> and the subsequent commit
> 
>  71296eae5887 ("PCI/TPH: Replace the broken MSI-X control word update")
> 
> flipped the TPH code over to use that.
> 
> The problem is obvious, and if you had enabled CONFIG_PROVE_LOCKING
> you would have got the reason presented on a silver platter in dmesg.
> I encourage you to do so nevertheless.
> 
Great tip on this. I’ll keep that in mind for future debugging efforts. 

> I definitely screwed that one up in the most stupid way.
> 
> As I had no idea how to exercise that code path I did not test it. It
> seems this code is not really tested by any of the CI stuff either
> before it hits Linus tree and as some folks start testing only post rc1
> it takes some time to surface :( 
> 
> The fix is as obvious as the problem. See uncompiled and untested patch
> below. If it solves the problem, which it should, feel free to take it
> and create a proper patch with changelog and Fixes tag yourself (Adding
> Suggested-by: Thomas ... is good enough). Otherwise let me know, and I
> take care of it in my copious spare time :)
> 

Sure. I’ll get this tested in our test bed and report back in a couple of days.

> Thanks,
> 
>        tglx
> ---
> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index 6ede55a7c5e6..eb26f3816922 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -934,10 +934,11 @@ int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int index, u16 tag)
>  	if (!pdev->msix_enabled)
>  		return -ENXIO;
>  
> -	guard(msi_descs_lock)(&pdev->dev);
>  	virq = msi_get_virq(&pdev->dev, index);
>  	if (!virq)
>  		return -ENXIO;
> +
> +	guard(msi_descs_lock)(&pdev->dev);
>  	/*
>  	 * This is a horrible hack, but short of implementing a PCI
>  	 * specific interrupt chip callback and a huge pile of



-- 
Himanshu Madhani	Oracle Linux Engineering



* Re: System hang with latest kernel v6.16.0-rc1 (rc2 & rc3)
  2025-07-03 20:34     ` Himanshu Madhani
@ 2025-07-03 21:51       ` Thomas Gleixner
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2025-07-03 21:51 UTC (permalink / raw)
  To: Himanshu Madhani; +Cc: linux-kernel@vger.kernel.org

On Thu, Jul 03 2025 at 20:34, Himanshu Madhani wrote:
>> On Jul 3, 2025, at 13:21, Thomas Gleixner <tglx@linutronix.de> wrote:
>> The problem is obvious, and if you had enabled CONFIG_PROVE_LOCKING
>> you would have got the reason presented on a silver platter in dmesg.
>> I encourage you to do so nevertheless.
>> 
> Great tip on this. I’ll keep that in mind for future debugging efforts. 

Actually, the very first thing in testing a new kernel should be to
run it with a copious amount of debug options. That avoids all the
headaches of painfully chasing, later on, fallout that these options
would have caught.

I'm truly surprised that this is not done already and that testing
blindly assumes that rc1 has already been subjected to such tests
completely.

It's bloody obvious that with a code base of the complexity of the
kernel and its gazillions of drivers, the CI coverage is far from
complete and only best effort based.

Obviously the companies that have access to and care about their
specialized hardware should run CI against linux-next to begin
with. Then such problems would be caught way before they hit Linus' tree.

I know there is no budget for this kind of effort. Companies rather
waste their budget on chasing problems, which could have been avoided
upfront. That's a huge cost saving, which is proven by applying magic to
the relevant Excel-sh*ts.

Not your decision, I know.

Shrug,

        tglx


Thread overview: 7+ messages
2025-07-03 18:27 System hang with latest kernel v6.16.0-rc1 (rc2 & rc3) Himanshu Madhani
     [not found] ` <7279DC28-17BF-4A28-96ED-7AE9857BC2E3@oracle.com>
2025-07-03 20:21   ` Thomas Gleixner
2025-07-03 20:34     ` Himanshu Madhani
2025-07-03 21:51       ` Thomas Gleixner
  -- strict thread matches above, loose matches on Subject: below --
2025-07-03 19:31 Himanshu Madhani
2025-07-03 20:24 ` Thomas Gleixner
2025-07-03 20:31   ` Himanshu Madhani
