Re: [patch V5 00/20] sched: Rewrite MM CID management

public inbox for bpf@vger.kernel.org

* Re: [patch V5 00/20] sched: Rewrite MM CID management
       [not found] <20251119171016.815482037@linutronix.de>
@ 2026-01-28  0:01 ` Ihor Solodrai
  2026-01-28  8:46   ` Peter Zijlstra
  2026-01-28 11:57   ` Thomas Gleixner
  0 siblings, 2 replies; 9+ messages in thread
From: Ihor Solodrai @ 2026-01-28  0:01 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Gabriele Monaco, Mathieu Desnoyers,
	Michael Jeanson, Jens Axboe, Paul E. McKenney, Gautham R. Shenoy,
	Florian Weimer, Tim Chen, Yury Norov, Shrikanth Hegde, bpf,
	sched-ext, Kernel Team, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Puranjay Mohan, Tejun Heo

On 11/19/25 9:26 AM, Thomas Gleixner wrote:
> This is a follow up on the V4 series which can be found here:
> 
>     https://lore.kernel.org/20251104075053.700034556@linutronix.de
> 
> The V1 cover letter contains a detailed analysis of the issues:
> 
>     https://lore.kernel.org/20251015164952.694882104@linutronix.de
> 
> TLDR: The CID management is way too complex and adds significant overhead
> into scheduler hotpaths.
> 
> The series rewrites MM CID management in a more simplistic way which
> focusses on low overhead in the scheduler while maintaining per task CIDs
> as long as the number of threads is not exceeding the number of possible
> CPUs.

Hello Thomas, everyone.

BPF CI caught a deadlock on current bpf-next tip (35538dba51b4).
Job: https://github.com/kernel-patches/bpf/actions/runs/21417415035/job/61670254640

It appears to be related to this series. Pasting a splat below.

Any ideas what might be going on?

Thanks!

[   45.009755] watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2
[   45.009763] Modules linked in: bpf_testmod(OE)
[   45.009769] irq event stamp: 685710
[   45.009771] hardirqs last  enabled at (685709): [<ffffffffb5bfa8b8>] _raw_spin_unlock_irq+0x28/0x50
[   45.009786] hardirqs last disabled at (685710): [<ffffffffb5bfa651>] _raw_spin_lock_irqsave+0x51/0x60
[   45.009789] softirqs last  enabled at (685650): [<ffffffffb3345e2a>] fpu_clone+0xda/0x4f0
[   45.009795] softirqs last disabled at (685648): [<ffffffffb3345dd2>] fpu_clone+0x82/0x4f0
[   45.009803] CPU: 2 UID: 0 PID: 126 Comm: test_progs Tainted: G           OE       6.19.0-rc5-g748c6d52700a-dirty #1 PREEMPT(full)
[   45.009808] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   45.009810] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   45.009813] RIP: 0010:queued_spin_lock_slowpath+0x6cc/0xac0
[   45.009820] Code: 0c 24 8b 03 66 85 c0 74 38 48 b8 00 00 00 00 00 fc ff df 48 89 da 49 89 de 48 c1 ea 03 41 83 e6 07 48 01 c2 41 83 c6 03 f3 90 <0f> b6 02 41 38 c6 7c 08 84 c0 0f 85 90 02 00 00 8b 03 66 85 c0 75
[   45.009823] RSP: 0018:ffffc9000128f750 EFLAGS: 00000002
[   45.009828] RAX: 0000000000100101 RBX: ffff8881520ba000 RCX: 0000000000000000
[   45.009830] RDX: ffffed102a417400 RSI: 0000000000000002 RDI: ffff8881520ba002
[   45.009832] RBP: 1ffff92000251eec R08: ffffffffb5bfb6c9 R09: ffffed102a417400
[   45.009834] R10: ffffed102a417401 R11: 0000000000000004 R12: ffff88815213b100
[   45.009836] R13: 00000000000c0000 R14: 0000000000000003 R15: 0000000000000002
[   45.009838] FS:  00007f6ab3e0de00(0000) GS:ffff8881998dd000(0000) knlGS:0000000000000000
[   45.009841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   45.009843] CR2: 00007f6ab2873d58 CR3: 0000000103357005 CR4: 0000000000770ef0
[   45.009845] PKRU: 55555554
[   45.009846] Call Trace:
[   45.009850]  <TASK>
[   45.009855]  ? __pfx_queued_spin_lock_slowpath+0x10/0x10
[   45.009862]  do_raw_spin_lock+0x1d9/0x270
[   45.009868]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   45.009871]  ? __pfx___might_resched+0x10/0x10
[   45.009878]  task_rq_lock+0xcf/0x3c0
[   45.009884]  mm_cid_fixup_task_to_cpu+0xb0/0x460
[   45.009888]  ? __pfx_mm_cid_fixup_task_to_cpu+0x10/0x10
[   45.009892]  ? lock_acquire+0x14e/0x2b0
[   45.009896]  ? mark_held_locks+0x40/0x70
[   45.009901]  sched_mm_cid_fork+0x6da/0xc20
[   45.009905]  ? __pfx_sched_mm_cid_fork+0x10/0x10
[   45.009908]  ? copy_process+0x217b/0x6950
[   45.009913]  copy_process+0x2bce/0x6950
[   45.009919]  ? __pfx_copy_process+0x10/0x10
[   45.009921]  ? find_held_lock+0x2b/0x80
[   45.009926]  ? _copy_from_user+0x53/0xa0
[   45.009933]  kernel_clone+0xce/0x600
[   45.009937]  ? __pfx_kernel_clone+0x10/0x10
[   45.009942]  ? __lock_acquire+0x481/0x2590
[   45.009947]  __do_sys_clone3+0x16e/0x1b0
[   45.009950]  ? __pfx___do_sys_clone3+0x10/0x10
[   45.009952]  ? lock_acquire+0x14e/0x2b0
[   45.009955]  ? __might_fault+0x9b/0x140
[   45.009963]  ? _copy_to_user+0x5c/0x70
[   45.009967]  ? __x64_sys_rt_sigprocmask+0x258/0x400
[   45.009974]  ? do_user_addr_fault+0x4c2/0xa40
[   45.009978]  ? lockdep_hardirqs_on_prepare+0xd7/0x180
[   45.009982]  do_syscall_64+0x6b/0x3a0
[   45.009988]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   45.009992] RIP: 0033:0x7f6ab430fc5d
[   45.009996] Code: 79 14 0e 00 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
[   45.009998] RSP: 002b:00007fffb282a148 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
[   45.010002] RAX: ffffffffffffffda RBX: 00007f6ab4282720 RCX: 00007f6ab430fc5d
[   45.010004] RDX: 00007f6ab4282720 RSI: 0000000000000058 RDI: 00007fffb282a1a0
[   45.010005] RBP: 00007fffb282a180 R08: 00007f6ab28736c0 R09: 00007fffb282a2a7
[   45.010007] R10: 0000000000000008 R11: 0000000000000202 R12: 00007f6ab28736c0
[   45.010009] R13: ffffffffffffff08 R14: 0000000000000000 R15: 00007fffb282a1a0
[   45.010015]  </TASK>
[   45.010018] Kernel panic - not syncing: Hard LOCKUP
[   45.010020] CPU: 2 UID: 0 PID: 126 Comm: test_progs Tainted: G           OE       6.19.0-rc5-g748c6d52700a-dirty #1 PREEMPT(full)
[   45.010025] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   45.010026] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   45.010027] Call Trace:
[   45.010029]  <NMI>
[   45.010031]  dump_stack_lvl+0x5d/0x80
[   45.010036]  vpanic+0x133/0x3f0
[   45.010042]  panic+0xce/0xce
[   45.010045]  ? __pfx_panic+0x10/0x10
[   45.010050]  ? __show_trace_log_lvl+0x2ee/0x323
[   45.010053]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   45.010057]  ? nmi_panic+0x91/0x130
[   45.010061]  nmi_panic.cold+0x14/0x14
[   45.010065]  ? __pfx_nmi_panic+0x10/0x10
[   45.010070]  watchdog_hardlockup_check.cold+0x12a/0x1c5
[   45.010076]  __perf_event_overflow+0x2fe/0xeb0
[   45.010082]  ? __pfx___perf_event_overflow+0x10/0x10
[   45.010085]  ? __pfx_x86_perf_event_set_period+0x10/0x10
[   45.010091]  handle_pmi_common+0x405/0x920
[   45.010096]  ? __pfx_handle_pmi_common+0x10/0x10
[   45.010109]  ? __pfx_intel_bts_interrupt+0x10/0x10
[   45.010115]  intel_pmu_handle_irq+0x1c5/0x5d0
[   45.010119]  ? lock_acquire+0x1e9/0x2b0
[   45.010122]  ? nmi_handle.part.0+0x2f/0x370
[   45.010127]  perf_event_nmi_handler+0x3e/0x70
[   45.010130]  nmi_handle.part.0+0x13f/0x370
[   45.010134]  ? trace_rcu_watching+0x105/0x150
[   45.010140]  default_do_nmi+0x3b/0x110
[   45.010144]  ? irqentry_nmi_enter+0x6f/0x80
[   45.010147]  exc_nmi+0xe3/0x110
[   45.010151]  end_repeat_nmi+0xf/0x53
[   45.010154] RIP: 0010:queued_spin_lock_slowpath+0x6cc/0xac0
[   45.010157] Code: 0c 24 8b 03 66 85 c0 74 38 48 b8 00 00 00 00 00 fc ff df 48 89 da 49 89 de 48 c1 ea 03 41 83 e6 07 48 01 c2 41 83 c6 03 f3 90 <0f> b6 02 41 38 c6 7c 08 84 c0 0f 85 90 02 00 00 8b 03 66 85 c0 75
[   45.010159] RSP: 0018:ffffc9000128f750 EFLAGS: 00000002
[   45.010162] RAX: 0000000000100101 RBX: ffff8881520ba000 RCX: 0000000000000000
[   45.010164] RDX: ffffed102a417400 RSI: 0000000000000002 RDI: ffff8881520ba002
[   45.010165] RBP: 1ffff92000251eec R08: ffffffffb5bfb6c9 R09: ffffed102a417400
[   45.010167] R10: ffffed102a417401 R11: 0000000000000004 R12: ffff88815213b100
[   45.010169] R13: 00000000000c0000 R14: 0000000000000003 R15: 0000000000000002
[   45.010172]  ? queued_spin_lock_slowpath+0x559/0xac0
[   45.010177]  ? queued_spin_lock_slowpath+0x6cc/0xac0
[   45.010181]  ? queued_spin_lock_slowpath+0x6cc/0xac0
[   45.010185]  </NMI>
[   45.010186]  <TASK>
[   45.010187]  ? __pfx_queued_spin_lock_slowpath+0x10/0x10
[   45.010194]  do_raw_spin_lock+0x1d9/0x270
[   45.010198]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   45.010201]  ? __pfx___might_resched+0x10/0x10
[   45.010206]  task_rq_lock+0xcf/0x3c0
[   45.010211]  mm_cid_fixup_task_to_cpu+0xb0/0x460
[   45.010215]  ? __pfx_mm_cid_fixup_task_to_cpu+0x10/0x10
[   45.010219]  ? lock_acquire+0x14e/0x2b0
[   45.010223]  ? mark_held_locks+0x40/0x70
[   45.010228]  sched_mm_cid_fork+0x6da/0xc20
[   45.010232]  ? __pfx_sched_mm_cid_fork+0x10/0x10
[   45.010234]  ? copy_process+0x217b/0x6950
[   45.010238]  copy_process+0x2bce/0x6950
[   45.010245]  ? __pfx_copy_process+0x10/0x10
[   45.010247]  ? find_held_lock+0x2b/0x80
[   45.010251]  ? _copy_from_user+0x53/0xa0
[   45.010256]  kernel_clone+0xce/0x600
[   45.010259]  ? __pfx_kernel_clone+0x10/0x10
[   45.010264]  ? __lock_acquire+0x481/0x2590
[   45.010269]  __do_sys_clone3+0x16e/0x1b0
[   45.010272]  ? __pfx___do_sys_clone3+0x10/0x10
[   45.010274]  ? lock_acquire+0x14e/0x2b0
[   45.010277]  ? __might_fault+0x9b/0x140
[   45.010284]  ? _copy_to_user+0x5c/0x70
[   45.010288]  ? __x64_sys_rt_sigprocmask+0x258/0x400
[   45.010293]  ? do_user_addr_fault+0x4c2/0xa40
[   45.010296]  ? lockdep_hardirqs_on_prepare+0xd7/0x180
[   45.010300]  do_syscall_64+0x6b/0x3a0
[   45.010305]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   45.010307] RIP: 0033:0x7f6ab430fc5d
[   45.010309] Code: 79 14 0e 00 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
[   45.010311] RSP: 002b:00007fffb282a148 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
[   45.010314] RAX: ffffffffffffffda RBX: 00007f6ab4282720 RCX: 00007f6ab430fc5d
[   45.010316] RDX: 00007f6ab4282720 RSI: 0000000000000058 RDI: 00007fffb282a1a0
[   45.010317] RBP: 00007fffb282a180 R08: 00007f6ab28736c0 R09: 00007fffb282a2a7
[   45.010319] R10: 0000000000000008 R11: 0000000000000202 R12: 00007f6ab28736c0
[   45.010320] R13: ffffffffffffff08 R14: 0000000000000000 R15: 00007fffb282a1a0
[   45.010326]  </TASK>
[   46.053092]
[   46.053095] ================================
[   46.053096] WARNING: inconsistent lock state
[   46.053098] 6.19.0-rc5-g748c6d52700a-dirty #1 Tainted: G           OE
[   46.053101] --------------------------------
[   46.053102] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   46.053103] test_progs/126 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   46.053107] ffffffffb6eace78 (&nmi_desc[NMI_LOCAL].lock){....}-{2:2}, at: __register_nmi_handler+0x83/0x350
[   46.053119] {INITIAL USE} state was registered at:
[   46.053120]   lock_acquire+0x14e/0x2b0
[   46.053123]   _raw_spin_lock_irqsave+0x39/0x60
[   46.053127]   __register_nmi_handler+0x83/0x350
[   46.053130]   init_hw_perf_events+0x1d0/0x850
[   46.053135]   do_one_initcall+0xd0/0x3a0
[   46.053138]   kernel_init_freeable+0x34c/0x580
[   46.053141]   kernel_init+0x1c/0x150
[   46.053145]   ret_from_fork+0x48c/0x590
[   46.053149]   ret_from_fork_asm+0x1a/0x30
[   46.053151] irq event stamp: 685710
[   46.053153] hardirqs last  enabled at (685709): [<ffffffffb5bfa8b8>] _raw_spin_unlock_irq+0x28/0x50
[   46.053156] hardirqs last disabled at (685710): [<ffffffffb5bfa651>] _raw_spin_lock_irqsave+0x51/0x60
[   46.053159] softirqs last  enabled at (685650): [<ffffffffb3345e2a>] fpu_clone+0xda/0x4f0
[   46.053163] softirqs last disabled at (685648): [<ffffffffb3345dd2>] fpu_clone+0x82/0x4f0
[   46.053166]
[   46.053166] other info that might help us debug this:
[   46.053168]  Possible unsafe locking scenario:
[   46.053168]
[   46.053168]        CPU0
[   46.053169]        ----
[   46.053170]   lock(&nmi_desc[NMI_LOCAL].lock);
[   46.053172]   <Interrupt>
[   46.053173]     lock(&nmi_desc[NMI_LOCAL].lock);
[   46.053174]
[   46.053174]  *** DEADLOCK ***
[   46.053174]
[   46.053175] 5 locks held by test_progs/126:
[   46.053177]  #0: ffffffffb6f49790 (scx_fork_rwsem){.+.+}-{0:0}, at: sched_fork+0xf9/0x6b0
[   46.053184]  #1: ffff88810c4930e8 (&mm->mm_cid.mutex){+.+.}-{4:4}, at: sched_mm_cid_fork+0xdf/0xc20
[   46.053190]  #2: ffffffffb7671a80 (rcu_read_lock){....}-{1:3}, at: sched_mm_cid_fork+0x692/0xc20
[   46.053195]  #3: ffff888110548a90 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x3c0
[   46.053201]  #4: ffff8881520ba018 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0xcf/0x3c0
[   46.053207]
[   46.053207] stack backtrace:
[   46.053209] CPU: 2 UID: 0 PID: 126 Comm: test_progs Tainted: G           OE       6.19.0-rc5-g748c6d52700a-dirty #1 PREEMPT(full)
[   46.053214] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   46.053215] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   46.053217] Call Trace:
[   46.053220]  <NMI>
[   46.053223]  dump_stack_lvl+0x5d/0x80
[   46.053227]  print_usage_bug.part.0+0x22b/0x2c0
[   46.053231]  lock_acquire+0x272/0x2b0
[   46.053235]  ? __register_nmi_handler+0x83/0x350
[   46.053240]  _raw_spin_lock_irqsave+0x39/0x60
[   46.053242]  ? __register_nmi_handler+0x83/0x350
[   46.053246]  __register_nmi_handler+0x83/0x350
[   46.053250]  native_stop_other_cpus+0x31c/0x460
[   46.053255]  ? __pfx_native_stop_other_cpus+0x10/0x10
[   46.053260]  vpanic+0x1c5/0x3f0
[   46.053265]  panic+0xce/0xce
[   46.053268]  ? __pfx_panic+0x10/0x10
[   46.053272]  ? __show_trace_log_lvl+0x2ee/0x323
[   46.053276]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   46.053279]  ? nmi_panic+0x91/0x130
[   46.053283]  nmi_panic.cold+0x14/0x14
[   46.053287]  ? __pfx_nmi_panic+0x10/0x10
[   46.053291]  watchdog_hardlockup_check.cold+0x12a/0x1c5
[   46.053296]  __perf_event_overflow+0x2fe/0xeb0
[   46.053300]  ? __pfx___perf_event_overflow+0x10/0x10
[   46.053303]  ? __pfx_x86_perf_event_set_period+0x10/0x10
[   46.053308]  handle_pmi_common+0x405/0x920
[   46.053312]  ? __pfx_handle_pmi_common+0x10/0x10
[   46.053322]  ? __pfx_intel_bts_interrupt+0x10/0x10
[   46.053327]  intel_pmu_handle_irq+0x1c5/0x5d0
[   46.053330]  ? lock_acquire+0x1e9/0x2b0
[   46.053334]  ? nmi_handle.part.0+0x2f/0x370
[   46.053337]  perf_event_nmi_handler+0x3e/0x70
[   46.053340]  nmi_handle.part.0+0x13f/0x370
[   46.053343]  ? trace_rcu_watching+0x105/0x150
[   46.053348]  default_do_nmi+0x3b/0x110
[   46.053351]  ? irqentry_nmi_enter+0x6f/0x80
[   46.053355]  exc_nmi+0xe3/0x110
[   46.053358]  end_repeat_nmi+0xf/0x53
[   46.053361] RIP: 0010:queued_spin_lock_slowpath+0x6cc/0xac0
[   46.053365] Code: 0c 24 8b 03 66 85 c0 74 38 48 b8 00 00 00 00 00 fc ff df 48 89 da 49 89 de 48 c1 ea 03 41 83 e6 07 48 01 c2 41 83 c6 03 f3 90 <0f> b6 02 41 38 c6 7c 08 84 c0 0f 85 90 02 00 00 8b 03 66 85 c0 75
[   46.053367] RSP: 0018:ffffc9000128f750 EFLAGS: 00000002
[   46.053370] RAX: 0000000000100101 RBX: ffff8881520ba000 RCX: 0000000000000000
[   46.053372] RDX: ffffed102a417400 RSI: 0000000000000002 RDI: ffff8881520ba002
[   46.053374] RBP: 1ffff92000251eec R08: ffffffffb5bfb6c9 R09: ffffed102a417400
[   46.053376] R10: ffffed102a417401 R11: 0000000000000004 R12: ffff88815213b100
[   46.053378] R13: 00000000000c0000 R14: 0000000000000003 R15: 0000000000000002
[   46.053380]  ? queued_spin_lock_slowpath+0x559/0xac0
[   46.053385]  ? queued_spin_lock_slowpath+0x6cc/0xac0
[   46.053389]  ? queued_spin_lock_slowpath+0x6cc/0xac0
[   46.053392]  </NMI>
[   46.053393]  <TASK>
[   46.053394]  ? __pfx_queued_spin_lock_slowpath+0x10/0x10
[   46.053400]  do_raw_spin_lock+0x1d9/0x270
[   46.053404]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   46.053407]  ? __pfx___might_resched+0x10/0x10
[   46.053411]  task_rq_lock+0xcf/0x3c0
[   46.053416]  mm_cid_fixup_task_to_cpu+0xb0/0x460
[   46.053420]  ? __pfx_mm_cid_fixup_task_to_cpu+0x10/0x10
[   46.053423]  ? lock_acquire+0x14e/0x2b0
[   46.053427]  ? mark_held_locks+0x40/0x70
[   46.053431]  sched_mm_cid_fork+0x6da/0xc20
[   46.053435]  ? __pfx_sched_mm_cid_fork+0x10/0x10
[   46.053437]  ? copy_process+0x217b/0x6950
[   46.053441]  copy_process+0x2bce/0x6950
[   46.053446]  ? __pfx_copy_process+0x10/0x10
[   46.053448]  ? find_held_lock+0x2b/0x80
[   46.053452]  ? _copy_from_user+0x53/0xa0
[   46.053457]  kernel_clone+0xce/0x600
[   46.053460]  ? __pfx_kernel_clone+0x10/0x10
[   46.053465]  ? __lock_acquire+0x481/0x2590
[   46.053469]  __do_sys_clone3+0x16e/0x1b0
[   46.053472]  ? __pfx___do_sys_clone3+0x10/0x10
[   46.053474]  ? lock_acquire+0x14e/0x2b0
[   46.053477]  ? __might_fault+0x9b/0x140
[   46.053483]  ? _copy_to_user+0x5c/0x70
[   46.053486]  ? __x64_sys_rt_sigprocmask+0x258/0x400
[   46.053491]  ? do_user_addr_fault+0x4c2/0xa40
[   46.053495]  ? lockdep_hardirqs_on_prepare+0xd7/0x180
[   46.053498]  do_syscall_64+0x6b/0x3a0
[   46.053503]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   46.053506] RIP: 0033:0x7f6ab430fc5d
[   46.053509] Code: 79 14 0e 00 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
[   46.053511] RSP: 002b:00007fffb282a148 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
[   46.053514] RAX: ffffffffffffffda RBX: 00007f6ab4282720 RCX: 00007f6ab430fc5d
[   46.053516] RDX: 00007f6ab4282720 RSI: 0000000000000058 RDI: 00007fffb282a1a0
[   46.053517] RBP: 00007fffb282a180 R08: 00007f6ab28736c0 R09: 00007fffb282a2a7
[   46.053519] R10: 0000000000000008 R11: 0000000000000202 R12: 00007f6ab28736c0
[   46.053521] R13: ffffffffffffff08 R14: 0000000000000000 R15: 00007fffb282a1a0
[   46.053525]  </TASK>
[   46.053527] Shutting down cpus with NMI
[   46.053722] Kernel Offset: 0x32000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)


> 
> [...]



* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28  0:01 ` [patch V5 00/20] sched: Rewrite MM CID management Ihor Solodrai
@ 2026-01-28  8:46   ` Peter Zijlstra
  2026-01-28 11:57   ` Thomas Gleixner
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2026-01-28  8:46 UTC (permalink / raw)
  To: Ihor Solodrai
  Cc: Thomas Gleixner, LKML, Gabriele Monaco, Mathieu Desnoyers,
	Michael Jeanson, Jens Axboe, Paul E. McKenney, Gautham R. Shenoy,
	Florian Weimer, Tim Chen, Yury Norov, Shrikanth Hegde, bpf,
	sched-ext, Kernel Team, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Puranjay Mohan, Tejun Heo

On Tue, Jan 27, 2026 at 04:01:11PM -0800, Ihor Solodrai wrote:
> On 11/19/25 9:26 AM, Thomas Gleixner wrote:
> > This is a follow up on the V4 series which can be found here:
> > 
> >     https://lore.kernel.org/20251104075053.700034556@linutronix.de
> > 
> > The V1 cover letter contains a detailed analysis of the issues:
> > 
> >     https://lore.kernel.org/20251015164952.694882104@linutronix.de
> > 
> > TLDR: The CID management is way too complex and adds significant overhead
> > into scheduler hotpaths.
> > 
> > The series rewrites MM CID management in a more simplistic way which
> > focusses on low overhead in the scheduler while maintaining per task CIDs
> > as long as the number of threads is not exceeding the number of possible
> > CPUs.
> 
> Hello Thomas, everyone.
> 
> BPF CI caught a deadlock on current bpf-next tip (35538dba51b4).
> Job: https://github.com/kernel-patches/bpf/actions/runs/21417415035/job/61670254640
> 
> It appears to be related to this series. Pasting a splat below.
> 
> Any ideas what might be going on?

That splat is only CPU2, that's not typically very useful in a lockup
scenario.


* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28  0:01 ` [patch V5 00/20] sched: Rewrite MM CID management Ihor Solodrai
  2026-01-28  8:46   ` Peter Zijlstra
@ 2026-01-28 11:57   ` Thomas Gleixner
  2026-01-28 12:58     ` Shrikanth Hegde
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2026-01-28 11:57 UTC (permalink / raw)
  To: Ihor Solodrai, LKML
  Cc: Peter Zijlstra, Gabriele Monaco, Mathieu Desnoyers,
	Michael Jeanson, Jens Axboe, Paul E. McKenney, Gautham R. Shenoy,
	Florian Weimer, Tim Chen, Yury Norov, Shrikanth Hegde, bpf,
	sched-ext, Kernel Team, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Puranjay Mohan, Tejun Heo

On Tue, Jan 27 2026 at 16:01, Ihor Solodrai wrote:
> BPF CI caught a deadlock on current bpf-next tip (35538dba51b4).
> Job: https://github.com/kernel-patches/bpf/actions/runs/21417415035/job/61670254640
>
> It appears to be related to this series. Pasting a splat below.

The deadlock splat is completely unrelated as it is a consequence of the
panic which is triggered by the watchdog:

> [   45.009755] watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2

...

> [   46.053170]   lock(&nmi_desc[NMI_LOCAL].lock);
> [   46.053172]   <Interrupt>
> [   46.053173]     lock(&nmi_desc[NMI_LOCAL].lock);

...

> Any ideas what might be going on?

Without a full backtrace of all CPUs it's hard to tell because it's
unclear what is holding the runqueue lock of CPU2 long enough to trigger
the hard lockup watchdog.

I'm pretty sure the CID changes are unrelated; that new code just happens
to show up as the messenger which gets stuck on the lock forever.
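
For anyone trying to reproduce this: one way to get backtraces from all
CPUs when the hard lockup watchdog fires is the
kernel.hardlockup_all_cpu_backtrace sysctl. The following is a minimal
sketch, assuming the kernel under test is built with the hard lockup
detector and exposes that knob under /proc/sys. The helper name
enable_knob() and the macro HARDLOCKUP_ALL_BT are made up for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed location of the sysctl; only present on suitably configured kernels. */
#define HARDLOCKUP_ALL_BT "/proc/sys/kernel/hardlockup_all_cpu_backtrace"

/*
 * Hypothetical helper: write "1\n" to the file at 'path'.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int enable_knob(const char *path)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs("1\n", f) >= 0;
	/* fclose() flushes the buffer, so a failure here means the write was lost. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling enable_knob(HARDLOCKUP_ALL_BT) as root before starting the test
run should make the watchdog report include the other CPUs, which is
what is needed to see who holds the runqueue lock.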

> [   46.053209] CPU: 2 UID: 0 PID: 126 Comm: test_progs Tainted: G           OE       6.19.0-rc5-g748c6d52700a-dirty #1 PREEMPT(full)
> [   46.053214] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [   46.053215] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   46.053217] Call Trace:
> [   46.053220]  <NMI>
> [   46.053223]  dump_stack_lvl+0x5d/0x80
> [   46.053227]  print_usage_bug.part.0+0x22b/0x2c0
> [   46.053231]  lock_acquire+0x272/0x2b0
> [   46.053235]  ? __register_nmi_handler+0x83/0x350
> [   46.053240]  _raw_spin_lock_irqsave+0x39/0x60
> [   46.053242]  ? __register_nmi_handler+0x83/0x350
> [   46.053246]  __register_nmi_handler+0x83/0x350
> [   46.053250]  native_stop_other_cpus+0x31c/0x460
> [   46.053255]  ? __pfx_native_stop_other_cpus+0x10/0x10
> [   46.053260]  vpanic+0x1c5/0x3f0

vpanic() really should disable lockdep here before taking that lock in
NMI context. The resulting lockdep splat is not really useful.

Thanks.

        tglx


* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28 11:57   ` Thomas Gleixner
@ 2026-01-28 12:58     ` Shrikanth Hegde
  2026-01-28 13:56       ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Shrikanth Hegde @ 2026-01-28 12:58 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ihor Solodrai, LKML
  Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
	Paul E. McKenney, Gautham R. Shenoy, Florian Weimer, Tim Chen,
	Yury Norov, bpf, sched-ext, Kernel Team, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan, Tejun Heo



On 1/28/26 5:27 PM, Thomas Gleixner wrote:
> On Tue, Jan 27 2026 at 16:01, Ihor Solodrai wrote:
>> BPF CI caught a deadlock on current bpf-next tip (35538dba51b4).
>> Job: https://github.com/kernel-patches/bpf/actions/runs/21417415035/job/61670254640
>>
>> It appears to be related to this series. Pasting a splat below.
> 
> The deadlock splat is completely unrelated as it is a consequence of the
> panic which is triggered by the watchdog:
> 
>> [   45.009755] watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2
> 
> ...
> 
>> [   46.053170]   lock(&nmi_desc[NMI_LOCAL].lock);
>> [   46.053172]   <Interrupt>
>> [   46.053173]     lock(&nmi_desc[NMI_LOCAL].lock);
> 
> ...
> 
>> Any ideas what might be going on?
> 
> Without a full backtrace of all CPUs it's hard to tell because it's
> unclear what is holding the runqueue lock of CPU2 long enough to trigger
> the hard lockup watchdog.
> 
> I'm pretty sure the CID changes are unrelated; that new code just happens
> to show up as the messenger which gets stuck on the lock forever.
> 
>> [   46.053209] CPU: 2 UID: 0 PID: 126 Comm: test_progs Tainted: G           OE       6.19.0-rc5-g748c6d52700a-dirty #1 PREEMPT(full)
>> [   46.053214] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
>> [   46.053215] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [   46.053217] Call Trace:
>> [   46.053220]  <NMI>
>> [   46.053223]  dump_stack_lvl+0x5d/0x80
>> [   46.053227]  print_usage_bug.part.0+0x22b/0x2c0
>> [   46.053231]  lock_acquire+0x272/0x2b0
>> [   46.053235]  ? __register_nmi_handler+0x83/0x350
>> [   46.053240]  _raw_spin_lock_irqsave+0x39/0x60
>> [   46.053242]  ? __register_nmi_handler+0x83/0x350
>> [   46.053246]  __register_nmi_handler+0x83/0x350
>> [   46.053250]  native_stop_other_cpus+0x31c/0x460
>> [   46.053255]  ? __pfx_native_stop_other_cpus+0x10/0x10
>> [   46.053260]  vpanic+0x1c5/0x3f0
> 
> vpanic() really should disable lockdep here before taking that lock in
> NMI context. The resulting lockdep splat is not really useful.
> 
> Thanks.
> 
>          tglx

Hi Thomas, Peter.


I remember running into this panic once, but it wasn't consistent and I
couldn't hit it again. The system had vCPU overcommit and a fair bit of
steal time.


The trace was like below from different CPUs.
------------------------

  watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
  watchdog: CPU 23 TB:1434903268401795, last heartbeat TB:1434897252302837 (11750ms ago)
  NIP [c0000000001b7134] mm_get_cid+0xe8/0x188
  LR [c0000000001b7154] mm_get_cid+0x108/0x188
  Call Trace:
  [c000000004c37db0] [c000000001145d84] cpuidle_enter_state+0xf8/0x6a4 (unreliable)
  [c000000004c37e00] [c0000000001b95ac] mm_cid_switch_to+0x3c4/0x52c
  [c000000004c37e60] [c000000001147264] __schedule+0x47c/0x700
  [c000000004c37ee0] [c000000001147a70] schedule_idle+0x3c/0x64
  [c000000004c37f10] [c0000000001f6d70] do_idle+0x160/0x1b0
  [c000000004c37f60] [c0000000001f7084] cpu_startup_entry+0x48/0x50
  [c000000004c37f90] [c00000000005f570] start_secondary+0x284/0x288
  [c000000004c37fe0] [c00000000000e158] start_secondary_prolog+0x10/0x14


  watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
  watchdog: CPU 11 TB:1434903340004919, last heartbeat TB:1434897249749892 (11895ms ago)
  NIP [c0000000000f84fc] plpar_hcall_norets_notrace+0x18/0x2c
  LR [c000000001152588] queued_spin_lock_slowpath+0xd88/0x15d0
  Call Trace:
  [c00000056b69fb10] [c00000056b69fba0] 0xc00000056b69fba0 (unreliable)
  [c00000056b69fc30] [c000000001153ce0] _raw_spin_lock+0x80/0xa0
  [c00000056b69fc50] [c0000000001b9a34] raw_spin_rq_lock_nested+0x3c/0xf8
  [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
  [c00000056b69fd00] [c0000000001bff34] sched_mm_cid_exit+0x108/0x22c
  [c00000056b69fd40] [c000000000167b08] do_exit+0xf4/0x5d0
  [c00000056b69fdf0] [c00000000016800c] make_task_dead+0x0/0x178
  [c00000056b69fe10] [c0000000000316c8] system_call_exception+0x128/0x390
  [c00000056b69fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec


  watchdog: CPU 65 self-detected hard LOCKUP @ queued_spin_lock_slowpath+0x10ec/0x15d0
  watchdog: CPU 65 TB:1434905824977447, last heartbeat TB:1434899309522065 (12725ms ago)
  NIP [c0000000011528ec] queued_spin_lock_slowpath+0x10ec/0x15d0
  LR [c000000001152d0c] queued_spin_lock_slowpath+0x150c/0x15d0
  Call Trace:
  [c000000777e27a60] [0000000000000009] 0x9 (unreliable)
  [c000000777e27b80] [c000000001153ce0] _raw_spin_lock+0x80/0xa0
  [c000000777e27ba0] [c0000000001b9a34] raw_spin_rq_lock_nested+0x3c/0xf8
  [c000000777e27bd0] [c0000000001babb8] ___task_rq_lock+0x64/0x140
  [c000000777e27c20] [c0000000001c8294] wake_up_new_task+0x180/0x484
  [c000000777e27ca0] [c00000000015bea4] kernel_clone+0x120/0x5bc
  [c000000777e27d30] [c00000000015c4c0] __do_sys_clone+0x88/0xc8
  [c000000777e27e10] [c0000000000316c8] system_call_exception+0x128/0x390
  [c000000777e27e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec




I am wondering if it is this loop in mm_get_cid(), which may not be getting
a CID for a long time. Is that possible?

static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
         unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));

         while (cid == MM_CID_UNSET) {
                 cpu_relax();
                 cid = __mm_get_cid(mm, num_possible_cpus());
         }
         return cid;
}
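
To make the concern concrete, here is a minimal userspace sketch. It is
not the kernel implementation: it assumes a toy array-based allocator,
and toy_mm, toy_get_cid(), toy_put_cid(), toy_get_cid_bounded(),
TOY_NR_CIDS and TOY_CID_UNSET are all hypothetical names. The retry
bound exists only so that the model terminates; the loop quoted above
has no such bound. The model shows that when every CID is held and
nobody drops one, the retry can never succeed.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CIDS   4u           /* stands in for num_possible_cpus() */
#define TOY_CID_UNSET 0xffffffffu  /* stands in for MM_CID_UNSET */

struct toy_mm {
	bool used[TOY_NR_CIDS];    /* one slot per CID */
};

/* Claim the first free CID below 'max', or return TOY_CID_UNSET. */
static unsigned int toy_get_cid(struct toy_mm *mm, unsigned int max)
{
	for (unsigned int cid = 0; cid < max && cid < TOY_NR_CIDS; cid++) {
		if (!mm->used[cid]) {
			mm->used[cid] = true;
			return cid;
		}
	}
	return TOY_CID_UNSET;
}

/* Release a previously claimed CID. */
static void toy_put_cid(struct toy_mm *mm, unsigned int cid)
{
	assert(cid < TOY_NR_CIDS);
	mm->used[cid] = false;
}

/*
 * Same shape as the quoted loop: a first attempt limited by 'max', then
 * retries over the full range. Returns the number of retries that were
 * needed, or -1 if 'max_retries' retries all failed.
 */
static int toy_get_cid_bounded(struct toy_mm *mm, unsigned int max,
			       int max_retries, unsigned int *out)
{
	int retries = 0;
	unsigned int cid = toy_get_cid(mm, max);

	while (cid == TOY_CID_UNSET) {
		if (retries >= max_retries)
			return -1;
		retries++;
		cid = toy_get_cid(mm, TOY_NR_CIDS);
	}
	*out = cid;
	return retries;
}
```

With all TOY_NR_CIDS slots claimed, toy_get_cid_bounded() returns -1 no
matter how large the bound is. It only succeeds once some other context
calls toy_put_cid(). In the kernel that release would have to come from
another CPU or task, so a spinning CPU that holds a runqueue lock could
stall everybody who waits for that lock.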



* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28 12:58     ` Shrikanth Hegde
@ 2026-01-28 13:56       ` Thomas Gleixner
  2026-01-28 22:24         ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2026-01-28 13:56 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra, Ihor Solodrai, LKML
  Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
	Paul E. McKenney, Gautham R. Shenoy, Florian Weimer, Tim Chen,
	Yury Norov, bpf, sched-ext, Kernel Team, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan, Tejun Heo

On Wed, Jan 28 2026 at 18:28, Shrikanth Hegde wrote:
> On 1/28/26 5:27 PM, Thomas Gleixner wrote:
>   watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
>   watchdog: CPU 23 TB:1434903268401795, last heartbeat TB:1434897252302837 (11750ms ago)
>   NIP [c0000000001b7134] mm_get_cid+0xe8/0x188
>   LR [c0000000001b7154] mm_get_cid+0x108/0x188
>   Call Trace:
>   [c000000004c37db0] [c000000001145d84] cpuidle_enter_state+0xf8/0x6a4 (unreliable)
>   [c000000004c37e00] [c0000000001b95ac] mm_cid_switch_to+0x3c4/0x52c
>   [c000000004c37e60] [c000000001147264] __schedule+0x47c/0x700

So if the above spins in mm_get_cid() then the below is just a consequence.

>   watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
>   watchdog: CPU 11 TB:1434903340004919, last heartbeat TB:1434897249749892 (11895ms ago)
>   NIP [c0000000000f84fc] plpar_hcall_norets_notrace+0x18/0x2c
>   LR [c000000001152588] queued_spin_lock_slowpath+0xd88/0x15d0
>   Call Trace:
>   [c00000056b69fb10] [c00000056b69fba0] 0xc00000056b69fba0 (unreliable)
>   [c00000056b69fc30] [c000000001153ce0] _raw_spin_lock+0x80/0xa0
>   [c00000056b69fc50] [c0000000001b9a34] raw_spin_rq_lock_nested+0x3c/0xf8
>   [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
>   [c00000056b69fd00] [c0000000001bff34] sched_mm_cid_exit+0x108/0x22c
>   [c00000056b69fd40] [c000000000167b08] do_exit+0xf4/0x5d0
>   [c00000056b69fdf0] [c00000000016800c] make_task_dead+0x0/0x178
>   [c00000056b69fe10] [c0000000000316c8] system_call_exception+0x128/0x390
>   [c00000056b69fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

> I am wondering if it is this loop in mm_get_cid(), which may not be getting
> a CID for a long time. Is that possible?

It shouldn't be possible by design, but it seems there is a corner case
lurking somewhere which hasn't been covered. Let me stare at the logic
in the transition functions once more. That's where CPU11 comes from:

>   [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c

The exiting task initiated a transition back from per CPU to per task mode
and that seems to make things unhappy for mysterious reasons.

Thanks,

        tglx


* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28 13:56       ` Thomas Gleixner
@ 2026-01-28 22:24         ` Thomas Gleixner
  2026-01-28 22:33           ` Ihor Solodrai
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2026-01-28 22:24 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra, Ihor Solodrai, LKML
  Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
	Paul E. McKenney, Gautham R. Shenoy, Florian Weimer, Tim Chen,
	Yury Norov, bpf, sched-ext, Kernel Team, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan, Tejun Heo

On Wed, Jan 28 2026 at 14:56, Thomas Gleixner wrote:
> On Wed, Jan 28 2026 at 18:28, Shrikanth Hegde wrote:
>> On 1/28/26 5:27 PM, Thomas Gleixner wrote:
>>   watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
>>   watchdog: CPU 23 TB:1434903268401795, last heartbeat TB:1434897252302837 (11750ms ago)
>>   NIP [c0000000001b7134] mm_get_cid+0xe8/0x188
>>   LR [c0000000001b7154] mm_get_cid+0x108/0x188
>>   Call Trace:
>>   [c000000004c37db0] [c000000001145d84] cpuidle_enter_state+0xf8/0x6a4 (unreliable)
>>   [c000000004c37e00] [c0000000001b95ac] mm_cid_switch_to+0x3c4/0x52c
>>   [c000000004c37e60] [c000000001147264] __schedule+0x47c/0x700
>
> So if the above spins in mm_get_cid() then the below is just a consequence.
>
>>   watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
>>   watchdog: CPU 11 TB:1434903340004919, last heartbeat TB:1434897249749892 (11895ms ago)
>>   NIP [c0000000000f84fc] plpar_hcall_norets_notrace+0x18/0x2c
>>   LR [c000000001152588] queued_spin_lock_slowpath+0xd88/0x15d0
>>   Call Trace:
>>   [c00000056b69fb10] [c00000056b69fba0] 0xc00000056b69fba0 (unreliable)
>>   [c00000056b69fc30] [c000000001153ce0] _raw_spin_lock+0x80/0xa0
>>   [c00000056b69fc50] [c0000000001b9a34] raw_spin_rq_lock_nested+0x3c/0xf8
>>   [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
>>   [c00000056b69fd00] [c0000000001bff34] sched_mm_cid_exit+0x108/0x22c
>>   [c00000056b69fd40] [c000000000167b08] do_exit+0xf4/0x5d0
>>   [c00000056b69fdf0] [c00000000016800c] make_task_dead+0x0/0x178
>>   [c00000056b69fe10] [c0000000000316c8] system_call_exception+0x128/0x390
>>   [c00000056b69fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>
>> I am wondering if it is this loop in mm_get_cid, which may not be getting a cid
>> for a long time? Is that possible?
>
> It shouldn't be possible by design, but it seems there is a corner case
> lurking somewhere which hasn't been covered. Let me stare at the logic
> in the transition functions once more. That's where CPU11 comes from:
>
>>   [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
>
> The exiting task initiated a transition back from per CPU to per task mode
> and that seems to make things unhappy for mysterious reasons.

I stared at it for a while and found the below stupidity. But when I
actually sat down after a while away from the keyboard and tried to
write a concise changelog explaining the root cause I failed to come up
with a coherent explanation why this would prevent the above scenario,
which hints at a situation of MMCID exhaustion.

@Ihor: Is the BPF CI fallout reproducible? If so, can you please provide
       a reproducer?

Thanks,

        tglx
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10664,8 +10664,14 @@ void sched_mm_cid_exit(struct task_struc
 			scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
 				if (!__sched_mm_cid_exit(t))
 					return;
-				/* Mode change required. Transfer currents CID */
-				mm_cid_transit_to_task(current, this_cpu_ptr(mm->mm_cid.pcpu));
+				/*
+				 * Mode change. The task has the CID unset
+				 * already. The CPU CID is still valid and
+				 * does not have MM_CID_TRANSIT set as the
+				 * mode change has just taken effect under
+				 * mm::mm_cid::lock. Drop it.
+				 */
+				mm_drop_cid_on_cpu(mm, this_cpu_ptr(mm->mm_cid.pcpu));
 			}
 			mm_cid_fixup_cpus_to_tasks(mm);
 			return;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28 22:24         ` Thomas Gleixner
@ 2026-01-28 22:33           ` Ihor Solodrai
  2026-01-28 23:08             ` Ihor Solodrai
  0 siblings, 1 reply; 9+ messages in thread
From: Ihor Solodrai @ 2026-01-28 22:33 UTC (permalink / raw)
  To: Thomas Gleixner, Shrikanth Hegde, Peter Zijlstra, LKML
  Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
	Paul E. McKenney, Gautham R. Shenoy, Florian Weimer, Tim Chen,
	Yury Norov, bpf, sched-ext, Kernel Team, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan, Tejun Heo

On 1/28/26 2:24 PM, Thomas Gleixner wrote:
> On Wed, Jan 28 2026 at 14:56, Thomas Gleixner wrote:
>> On Wed, Jan 28 2026 at 18:28, Shrikanth Hegde wrote:
>>> On 1/28/26 5:27 PM, Thomas Gleixner wrote:
>>>   watchdog: CPU 23 self-detected hard LOCKUP @ mm_get_cid+0xe8/0x188
>>>   watchdog: CPU 23 TB:1434903268401795, last heartbeat TB:1434897252302837 (11750ms ago)
>>>   NIP [c0000000001b7134] mm_get_cid+0xe8/0x188
>>>   LR [c0000000001b7154] mm_get_cid+0x108/0x188
>>>   Call Trace:
>>>   [c000000004c37db0] [c000000001145d84] cpuidle_enter_state+0xf8/0x6a4 (unreliable)
>>>   [c000000004c37e00] [c0000000001b95ac] mm_cid_switch_to+0x3c4/0x52c
>>>   [c000000004c37e60] [c000000001147264] __schedule+0x47c/0x700
>>
>> So if the above spins in mm_get_cid() then the below is just a consequence.
>>
>>>   watchdog: CPU 11 self-detected hard LOCKUP @ plpar_hcall_norets_notrace+0x18/0x2c
>>>   watchdog: CPU 11 TB:1434903340004919, last heartbeat TB:1434897249749892 (11895ms ago)
>>>   NIP [c0000000000f84fc] plpar_hcall_norets_notrace+0x18/0x2c
>>>   LR [c000000001152588] queued_spin_lock_slowpath+0xd88/0x15d0
>>>   Call Trace:
>>>   [c00000056b69fb10] [c00000056b69fba0] 0xc00000056b69fba0 (unreliable)
>>>   [c00000056b69fc30] [c000000001153ce0] _raw_spin_lock+0x80/0xa0
>>>   [c00000056b69fc50] [c0000000001b9a34] raw_spin_rq_lock_nested+0x3c/0xf8
>>>   [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
>>>   [c00000056b69fd00] [c0000000001bff34] sched_mm_cid_exit+0x108/0x22c
>>>   [c00000056b69fd40] [c000000000167b08] do_exit+0xf4/0x5d0
>>>   [c00000056b69fdf0] [c00000000016800c] make_task_dead+0x0/0x178
>>>   [c00000056b69fe10] [c0000000000316c8] system_call_exception+0x128/0x390
>>>   [c00000056b69fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
>>
>>> I am wondering if it is this loop in mm_get_cid, which may not be getting a cid
>>> for a long time? Is that possible?
>>
>> It shouldn't be possible by design, but it seems there is a corner case
>> lurking somewhere which hasn't been covered. Let me stare at the logic
>> in the transition functions once more. That's where CPU11 comes from:
>>
>>>   [c00000056b69fc80] [c0000000001b9bb8] mm_cid_fixup_cpus_to_tasks+0xc8/0x28c
>>
>> The exiting task initiated a transition back from per CPU to per task mode
>> and that seems to make things unhappy for mysterious reasons.
> 
> I stared at it for a while and found the below stupidity. But when I
> actually sat down after a while away from the keyboard and tried to
> write a concise changelog explaining the root cause I failed to come up
> with a coherent explanation why this would prevent the above scenario,
> which hints at a situation of MMCID exhaustion.
> 
> @Ihor: Is the BPF CI fallout reproducible? If so, can you please provide
>        a reproducer?

Not reliably, unfortunately. I saw it at least twice (out of 100+
runs) this week.

I added `hardlockup_all_cpu_backtrace=1` to get more logs.  If there
is anything else I could set up (kconfigs, debug switches) that may be
helpful, let me know.

We have a steady stream of jobs running, so if it's not a one-off it's
likely to happen again. I'll share if we get anything.

Thank you for investigating!


> 
> Thanks,
> 
>         tglx
> ---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10664,8 +10664,14 @@ void sched_mm_cid_exit(struct task_struc
>  			scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
>  				if (!__sched_mm_cid_exit(t))
>  					return;
> -				/* Mode change required. Transfer currents CID */
> -				mm_cid_transit_to_task(current, this_cpu_ptr(mm->mm_cid.pcpu));
> +				/*
> +				 * Mode change. The task has the CID unset
> +				 * already. The CPU CID is still valid and
> +				 * does not have MM_CID_TRANSIT set as the
> +				 * mode change has just taken effect under
> +				 * mm::mm_cid::lock. Drop it.
> +				 */
> +				mm_drop_cid_on_cpu(mm, this_cpu_ptr(mm->mm_cid.pcpu));
>  			}
>  			mm_cid_fixup_cpus_to_tasks(mm);
>  			return;


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28 22:33           ` Ihor Solodrai
@ 2026-01-28 23:08             ` Ihor Solodrai
  2026-01-29 17:06               ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Ihor Solodrai @ 2026-01-28 23:08 UTC (permalink / raw)
  To: Thomas Gleixner, Shrikanth Hegde, Peter Zijlstra, LKML
  Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
	Paul E. McKenney, Gautham R. Shenoy, Florian Weimer, Tim Chen,
	Yury Norov, bpf, sched-ext, Kernel Team, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan, Tejun Heo

On 1/28/26 2:33 PM, Ihor Solodrai wrote:
> [...]
> 
> We have a steady stream of jobs running, so if it's not a one-off it's
> likely to happen again. I'll share if we get anything.

Here is another one, with backtraces of other CPUs:

[   59.133878] watchdog: CPU2: Watchdog detected hard LOCKUP on cpu 2
[   59.133886] Modules linked in: bpf_testmod(OE)
[   59.133892] irq event stamp: 687092
[   59.133893] hardirqs last  enabled at (687091): [<ffffffff8fbfbf78>] _raw_spin_unlock_irq+0x28/0x50
[   59.133908] hardirqs last disabled at (687092): [<ffffffff8fbfbd11>] _raw_spin_lock_irqsave+0x51/0x60
[   59.133912] softirqs last  enabled at (687006): [<ffffffff8d345e2a>] fpu_clone+0xda/0x4f0
[   59.133918] softirqs last disabled at (687004): [<ffffffff8d345dd2>] fpu_clone+0x82/0x4f0
[   59.133925] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
[   59.133930] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   59.133932] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   59.133935] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
[   59.133943] Code: 00 00 85 c0 74 3d 0f b6 03 84 c0 74 36 48 b8 00 00 00 00 00 fc ff df 49 89 dc 49 89 dd 49 c1 ec 03 41 83 e5 07 49 01 c4 f3 90 <41> 0f b6 04 24 44 38 e8 7f 08 84 c0 0f 85 9f 05 00 00 0f b6 03 84
[   59.133945] RSP: 0018:ffffc900012df750 EFLAGS: 00000002
[   59.133950] RAX: 0000000000000001 RBX: ffff8881520ba000 RCX: 0000000000000001
[   59.133952] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8881520ba000
[   59.133954] RBP: 1ffff9200025beec R08: ffffffff8fbfcb69 R09: ffffed102a417400
[   59.133956] R10: ffffed102a417401 R11: 0000000000000004 R12: ffffed102a417400
[   59.133958] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff8881520ba000
[   59.133960] FS:  00007f7230740e00(0000) GS:ffff8881bf8db000(0000) knlGS:0000000000000000
[   59.133964] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.133966] CR2: 00007f722f1a6d58 CR3: 000000010ed2f001 CR4: 0000000000770ef0
[   59.133968] PKRU: 55555554
[   59.133969] Call Trace:
[   59.133973]  <TASK>
[   59.133977]  ? __pfx_queued_spin_lock_slowpath+0x10/0x10
[   59.133985]  do_raw_spin_lock+0x1d9/0x270
[   59.133991]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   59.133994]  ? __pfx___might_resched+0x10/0x10
[   59.134001]  task_rq_lock+0xcf/0x3c0
[   59.134007]  mm_cid_fixup_task_to_cpu+0xb0/0x460
[   59.134011]  ? __pfx_mm_cid_fixup_task_to_cpu+0x10/0x10
[   59.134015]  ? lock_acquire+0x14e/0x2b0
[   59.134020]  ? mark_held_locks+0x40/0x70
[   59.134025]  sched_mm_cid_fork+0x6da/0xc20
[   59.134030]  ? __pfx_sched_mm_cid_fork+0x10/0x10
[   59.134032]  ? copy_process+0x217b/0x6950
[   59.134037]  copy_process+0x2bce/0x6950
[   59.134044]  ? __pfx_copy_process+0x10/0x10
[   59.134046]  ? find_held_lock+0x2b/0x80
[   59.134051]  ? _copy_from_user+0x53/0xa0
[   59.134058]  kernel_clone+0xce/0x600
[   59.134061]  ? __pfx_kernel_clone+0x10/0x10
[   59.134066]  ? __lock_acquire+0x481/0x2590
[   59.134071]  __do_sys_clone3+0x16e/0x1b0
[   59.134074]  ? __pfx___do_sys_clone3+0x10/0x10
[   59.134077]  ? lock_acquire+0x14e/0x2b0
[   59.134080]  ? __might_fault+0x9b/0x140
[   59.134089]  ? _copy_to_user+0x5c/0x70
[   59.134092]  ? __x64_sys_rt_sigprocmask+0x258/0x400
[   59.134099]  ? do_user_addr_fault+0x4c2/0xa40
[   59.134103]  ? lockdep_hardirqs_on_prepare+0xd7/0x180
[   59.134107]  do_syscall_64+0x6b/0x3a0
[   59.134111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   59.134116] RIP: 0033:0x7f7230c42c5d
[   59.134120] Code: 79 14 0e 00 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
[   59.134122] RSP: 002b:00007ffe90d4e1f8 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
[   59.134126] RAX: ffffffffffffffda RBX: 00007f7230bb5720 RCX: 00007f7230c42c5d
[   59.134128] RDX: 00007f7230bb5720 RSI: 0000000000000058 RDI: 00007ffe90d4e250
[   59.134129] RBP: 00007ffe90d4e230 R08: 00007f722f1a66c0 R09: 00007ffe90d4e357
[   59.134131] R10: 0000000000000008 R11: 0000000000000202 R12: 00007f722f1a66c0
[   59.134133] R13: ffffffffffffff08 R14: 0000000000000000 R15: 00007ffe90d4e250
[   59.134139]  </TASK>
[   59.134141] Sending NMI from CPU 2 to CPUs 0-1,3:
[   59.134168] NMI backtrace for cpu 3
[   59.134176] CPU: 3 UID: 0 PID: 67 Comm: kworker/3:1 Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
[   59.134181] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   59.134183] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   59.134186] Workqueue: events drain_vmap_area_work
[   59.134194] RIP: 0010:smp_call_function_many_cond+0x772/0xe60
[   59.134200] Code: 38 c8 7c 08 84 c9 0f 85 92 05 00 00 8b 43 08 a8 01 74 2e 48 89 f1 49 89 f5 48 c1 e9 03 41 83 e5 07 4c 01 f1 41 83 c5 03 f3 90 <0f> b6 01 41 38 c5 7c 08 84 c0 0f 85 c1 04 00 00 8b 43 08 a8 01 75
[   59.134203] RSP: 0018:ffffc90000587948 EFLAGS: 00000202
[   59.134206] RAX: 0000000000000011 RBX: ffff8881520c1ac0 RCX: ffffed102a418359
[   59.134208] RDX: 0000000000000001 RSI: ffff8881520c1ac8 RDI: ffffffff90713be8
[   59.134210] RBP: ffffed102a437680 R08: ffff8881521bb408 R09: 0000000000000000
[   59.134212] R10: 1ffff1102a437681 R11: ffff888103aa8bb0 R12: ffff8881521bb408
[   59.134213] R13: 0000000000000003 R14: dffffc0000000000 R15: ffff8881521bb400
[   59.134215] FS:  0000000000000000(0000) GS:ffff8881bf95b000(0000) knlGS:0000000000000000
[   59.134219] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.134221] CR2: 00007fc9ae7762a0 CR3: 000000010c435001 CR4: 0000000000770ef0
[   59.134223] PKRU: 55555554
[   59.134224] Call Trace:
[   59.134226]  <TASK>
[   59.134230]  ? __pfx_do_flush_tlb_all+0x10/0x10
[   59.134238]  ? __pfx_smp_call_function_many_cond+0x10/0x10
[   59.134242]  ? __pfx___apply_to_page_range+0x10/0x10
[   59.134245]  ? mark_held_locks+0x40/0x70
[   59.134250]  on_each_cpu_cond_mask+0x24/0x40
[   59.134254]  flush_tlb_kernel_range+0x402/0x6b0
[   59.134259]  ? __kasan_release_vmalloc+0xd6/0x110
[   59.134265]  purge_vmap_node+0x1db/0x9c0
[   59.134270]  ? __pfx_smp_call_function_many_cond+0x10/0x10
[   59.134275]  ? __pfx_purge_vmap_node+0x10/0x10
[   59.134280]  __purge_vmap_area_lazy+0x6ea/0xac0
[   59.134286]  drain_vmap_area_work+0x27/0x40
[   59.134289]  process_one_work+0x800/0x13e0
[   59.134296]  ? __pfx_process_one_work+0x10/0x10
[   59.134298]  ? lock_acquire+0x14e/0x2b0
[   59.134302]  ? lock_is_held_type+0x87/0xf0
[   59.134307]  ? assign_work+0x156/0x390
[   59.134313]  worker_thread+0x5c8/0xfa0
[   59.134319]  ? __pfx_worker_thread+0x10/0x10
[   59.134322]  kthread+0x3bd/0x780
[   59.134327]  ? do_raw_spin_lock+0x128/0x270
[   59.134332]  ? __pfx_kthread+0x10/0x10
[   59.134335]  ? __pfx_kthread+0x10/0x10
[   59.134340]  ? ret_from_fork+0x6e/0x590
[   59.134344]  ? lock_release+0xd4/0x2c0
[   59.134348]  ? __pfx_kthread+0x10/0x10
[   59.134351]  ret_from_fork+0x48c/0x590
[   59.134355]  ? __pfx_ret_from_fork+0x10/0x10
[   59.134359]  ? __pfx_kthread+0x10/0x10
[   59.134363]  ret_from_fork_asm+0x1a/0x30
[   59.134371]  </TASK>
[   59.134374] NMI backtrace for cpu 1
[   59.134380] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
[   59.134385] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   59.134386] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   59.134388] RIP: 0010:_find_first_zero_bit+0x50/0x90
[   59.134394] Code: 48 39 c1 73 25 48 89 fa 48 c1 ea 03 80 3c 32 00 75 26 48 8b 17 48 83 f2 ff 74 dd f3 48 0f bc d2 48 01 d1 48 39 c8 48 0f 47 c1 <48> 83 c4 18 c3 cc cc cc cc c3 cc cc cc cc 48 89 44 24 10 48 89 4c
[   59.134396] RSP: 0018:ffffc9000014fd58 EFLAGS: 00000046
[   59.134400] RAX: 0000000000000004 RBX: ffff888100d3a440 RCX: 0000000000000004
[   59.134402] RDX: 0000000000000004 RSI: dffffc0000000000 RDI: ffff88810e9d22a0
[   59.134403] RBP: ffffc9000014fe60 R08: ffff88810e9d1840 R09: ffff8881396e0000
[   59.134405] R10: 0000000080000000 R11: 0000000000000004 R12: ffff88810e9d1840
[   59.134407] R13: ffff8881520ba000 R14: ffff88810e9d22a0 R15: ffff8881396e0000
[   59.134409] FS:  0000000000000000(0000) GS:ffff8881bf85b000(0000) knlGS:0000000000000000
[   59.134413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.134414] CR2: 00007f72301a8d58 CR3: 000000010ed2f005 CR4: 0000000000770ef0
[   59.134416] PKRU: 55555554
[   59.134417] Call Trace:
[   59.134420]  <TASK>
[   59.134423]  __schedule+0x3312/0x4390
[   59.134430]  ? __pfx___schedule+0x10/0x10
[   59.134434]  ? trace_rcu_watching+0x105/0x150
[   59.134440]  schedule_idle+0x59/0x90
[   59.134443]  do_idle+0x26b/0x4d0
[   59.134449]  ? __pfx_do_idle+0x10/0x10
[   59.134452]  ? do_idle+0x278/0x4d0
[   59.134456]  cpu_startup_entry+0x53/0x70
[   59.134459]  start_secondary+0x1b9/0x230
[   59.134463]  common_startup_64+0x12c/0x138
[   59.134472]  </TASK>
[   59.134474] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
[   59.135160] Kernel panic - not syncing: Hard LOCKUP
[   59.135163] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
[   59.135167] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   59.135169] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   59.135170] Call Trace:
[   59.135173]  <NMI>
[   59.135174]  dump_stack_lvl+0x5d/0x80
[   59.135179]  vpanic+0x133/0x3f0
[   59.135185]  panic+0xce/0xce
[   59.135188]  ? __pfx_panic+0x10/0x10
[   59.135193]  ? _printk+0xc7/0x100
[   59.135198]  ? nmi_panic+0x91/0x130
[   59.135202]  nmi_panic.cold+0x14/0x14
[   59.135206]  ? __pfx_nmi_panic+0x10/0x10
[   59.135209]  ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10
[   59.135214]  watchdog_hardlockup_check.cold+0x12a/0x1c5
[   59.135220]  __perf_event_overflow+0x2fe/0xeb0
[   59.135226]  ? __pfx___perf_event_overflow+0x10/0x10
[   59.135229]  ? __pfx_x86_perf_event_set_period+0x10/0x10
[   59.135235]  handle_pmi_common+0x405/0x920
[   59.135240]  ? __pfx_handle_pmi_common+0x10/0x10
[   59.135253]  ? __pfx_intel_bts_interrupt+0x10/0x10
[   59.135259]  intel_pmu_handle_irq+0x1c5/0x5d0
[   59.135263]  ? lock_acquire+0x1e9/0x2b0
[   59.135266]  ? nmi_handle.part.0+0x2f/0x370
[   59.135271]  perf_event_nmi_handler+0x3e/0x70
[   59.135275]  nmi_handle.part.0+0x13f/0x370
[   59.135278]  ? trace_rcu_watching+0x105/0x150
[   59.135283]  default_do_nmi+0x3b/0x110
[   59.135287]  ? irqentry_nmi_enter+0x6f/0x80
[   59.135291]  exc_nmi+0xe3/0x110
[   59.135294]  end_repeat_nmi+0xf/0x53
[   59.135297] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
[   59.135301] Code: 00 00 85 c0 74 3d 0f b6 03 84 c0 74 36 48 b8 00 00 00 00 00 fc ff df 49 89 dc 49 89 dd 49 c1 ec 03 41 83 e5 07 49 01 c4 f3 90 <41> 0f b6 04 24 44 38 e8 7f 08 84 c0 0f 85 9f 05 00 00 0f b6 03 84
[   59.135303] RSP: 0018:ffffc900012df750 EFLAGS: 00000002
[   59.135305] RAX: 0000000000000001 RBX: ffff8881520ba000 RCX: 0000000000000001
[   59.135307] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8881520ba000
[   59.135309] RBP: 1ffff9200025beec R08: ffffffff8fbfcb69 R09: ffffed102a417400
[   59.135311] R10: ffffed102a417401 R11: 0000000000000004 R12: ffffed102a417400
[   59.135313] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff8881520ba000
[   59.135316]  ? queued_spin_lock_slowpath+0x339/0xac0
[   59.135321]  ? queued_spin_lock_slowpath+0x3a9/0xac0
[   59.135325]  ? queued_spin_lock_slowpath+0x3a9/0xac0
[   59.135329]  </NMI>
[   59.135330]  <TASK>
[   59.135332]  ? __pfx_queued_spin_lock_slowpath+0x10/0x10
[   59.135338]  do_raw_spin_lock+0x1d9/0x270
[   59.135342]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   59.135346]  ? __pfx___might_resched+0x10/0x10
[   59.135350]  task_rq_lock+0xcf/0x3c0
[   59.135355]  mm_cid_fixup_task_to_cpu+0xb0/0x460
[   59.135359]  ? __pfx_mm_cid_fixup_task_to_cpu+0x10/0x10
[   59.135364]  ? lock_acquire+0x14e/0x2b0
[   59.135368]  ? mark_held_locks+0x40/0x70
[   59.135372]  sched_mm_cid_fork+0x6da/0xc20
[   59.135376]  ? __pfx_sched_mm_cid_fork+0x10/0x10
[   59.135379]  ? copy_process+0x217b/0x6950
[   59.135383]  copy_process+0x2bce/0x6950
[   59.135389]  ? __pfx_copy_process+0x10/0x10
[   59.135391]  ? find_held_lock+0x2b/0x80
[   59.135396]  ? _copy_from_user+0x53/0xa0
[   59.135401]  kernel_clone+0xce/0x600
[   59.135404]  ? __pfx_kernel_clone+0x10/0x10
[   59.135409]  ? __lock_acquire+0x481/0x2590
[   59.135414]  __do_sys_clone3+0x16e/0x1b0
[   59.135417]  ? __pfx___do_sys_clone3+0x10/0x10
[   59.135419]  ? lock_acquire+0x14e/0x2b0
[   59.135422]  ? __might_fault+0x9b/0x140
[   59.135429]  ? _copy_to_user+0x5c/0x70
[   59.135432]  ? __x64_sys_rt_sigprocmask+0x258/0x400
[   59.135438]  ? do_user_addr_fault+0x4c2/0xa40
[   59.135441]  ? lockdep_hardirqs_on_prepare+0xd7/0x180
[   59.135445]  do_syscall_64+0x6b/0x3a0
[   59.135448]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   59.135451] RIP: 0033:0x7f7230c42c5d
[   59.135453] Code: 79 14 0e 00 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
[   59.135455] RSP: 002b:00007ffe90d4e1f8 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
[   59.135458] RAX: ffffffffffffffda RBX: 00007f7230bb5720 RCX: 00007f7230c42c5d
[   59.135459] RDX: 00007f7230bb5720 RSI: 0000000000000058 RDI: 00007ffe90d4e250
[   59.135461] RBP: 00007ffe90d4e230 R08: 00007f722f1a66c0 R09: 00007ffe90d4e357
[   59.135462] R10: 0000000000000008 R11: 0000000000000202 R12: 00007f722f1a66c0
[   59.135464] R13: ffffffffffffff08 R14: 0000000000000000 R15: 00007ffe90d4e250
[   59.135470]  </TASK>
[   60.170882]
[   60.170886] ================================
[   60.170888] WARNING: inconsistent lock state
[   60.170890] 6.19.0-rc5-gbe9790cb9e63-dirty #1 Tainted: G           OE
[   60.170893] --------------------------------
[   60.170894] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   60.170895] test_progs/127 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   60.170899] ffffffff90eace78 (&nmi_desc[NMI_LOCAL].lock){....}-{2:2}, at: __register_nmi_handler+0x83/0x350
[   60.170912] {INITIAL USE} state was registered at:
[   60.170913]   lock_acquire+0x14e/0x2b0
[   60.170918]   _raw_spin_lock_irqsave+0x39/0x60
[   60.170921]   __register_nmi_handler+0x83/0x350
[   60.170924]   init_hw_perf_events+0x1d0/0x850
[   60.170929]   do_one_initcall+0xd0/0x3a0
[   60.170934]   kernel_init_freeable+0x34c/0x580
[   60.170937]   kernel_init+0x1c/0x150
[   60.170939]   ret_from_fork+0x48c/0x590
[   60.170942]   ret_from_fork_asm+0x1a/0x30
[   60.170945] irq event stamp: 687092
[   60.170946] hardirqs last  enabled at (687091): [<ffffffff8fbfbf78>] _raw_spin_unlock_irq+0x28/0x50
[   60.170950] hardirqs last disabled at (687092): [<ffffffff8fbfbd11>] _raw_spin_lock_irqsave+0x51/0x60
[   60.170952] softirqs last  enabled at (687006): [<ffffffff8d345e2a>] fpu_clone+0xda/0x4f0
[   60.170956] softirqs last disabled at (687004): [<ffffffff8d345dd2>] fpu_clone+0x82/0x4f0
[   60.170959]
[   60.170959] other info that might help us debug this:
[   60.170961]  Possible unsafe locking scenario:
[   60.170961]
[   60.170962]        CPU0
[   60.170963]        ----
[   60.170963]   lock(&nmi_desc[NMI_LOCAL].lock);
[   60.170965]   <Interrupt>
[   60.170966]     lock(&nmi_desc[NMI_LOCAL].lock);
[   60.170968]
[   60.170968]  *** DEADLOCK ***
[   60.170968]
[   60.170969] 5 locks held by test_progs/127:
[   60.170970]  #0: ffffffff90f49790 (scx_fork_rwsem){.+.+}-{0:0}, at: sched_fork+0xf9/0x6b0
[   60.170978]  #1: ffff88810e9d1968 (&mm->mm_cid.mutex){+.+.}-{4:4}, at: sched_mm_cid_fork+0xdf/0xc20
[   60.170983]  #2: ffffffff91671a80 (rcu_read_lock){....}-{1:3}, at: sched_mm_cid_fork+0x692/0xc20
[   60.170989]  #3: ffff88810cfbaed0 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x3c0
[   60.170995]  #4: ffff8881520ba018 (&rq->__lock){-.-.}-{2:2}, at: task_rq_lock+0xcf/0x3c0
[   60.171001]
[   60.171001] stack backtrace:
[   60.171004] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
[   60.171009] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[   60.171011] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   60.171013] Call Trace:
[   60.171016]  <NMI>
[   60.171020]  dump_stack_lvl+0x5d/0x80
[   60.171024]  print_usage_bug.part.0+0x22b/0x2c0
[   60.171029]  lock_acquire+0x272/0x2b0
[   60.171032]  ? __register_nmi_handler+0x83/0x350
[   60.171037]  _raw_spin_lock_irqsave+0x39/0x60
[   60.171040]  ? __register_nmi_handler+0x83/0x350
[   60.171043]  __register_nmi_handler+0x83/0x350
[   60.171048]  native_stop_other_cpus+0x31c/0x460
[   60.171052]  ? __pfx_native_stop_other_cpus+0x10/0x10
[   60.171057]  vpanic+0x1c5/0x3f0
[   60.171060]  panic+0xce/0xce
[   60.171064]  ? __pfx_panic+0x10/0x10
[   60.171068]  ? _printk+0xc7/0x100
[   60.171072]  ? nmi_panic+0x91/0x130
[   60.171075]  nmi_panic.cold+0x14/0x14
[   60.171078]  ? __pfx_nmi_panic+0x10/0x10
[   60.171081]  ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10
[   60.171085]  watchdog_hardlockup_check.cold+0x12a/0x1c5
[   60.171090]  __perf_event_overflow+0x2fe/0xeb0
[   60.171094]  ? __pfx___perf_event_overflow+0x10/0x10
[   60.171097]  ? __pfx_x86_perf_event_set_period+0x10/0x10
[   60.171102]  handle_pmi_common+0x405/0x920
[   60.171105]  ? __pfx_handle_pmi_common+0x10/0x10
[   60.171115]  ? __pfx_intel_bts_interrupt+0x10/0x10
[   60.171120]  intel_pmu_handle_irq+0x1c5/0x5d0
[   60.171123]  ? lock_acquire+0x1e9/0x2b0
[   60.171127]  ? nmi_handle.part.0+0x2f/0x370
[   60.171130]  perf_event_nmi_handler+0x3e/0x70
[   60.171133]  nmi_handle.part.0+0x13f/0x370
[   60.171135]  ? trace_rcu_watching+0x105/0x150
[   60.171141]  default_do_nmi+0x3b/0x110
[   60.171144]  ? irqentry_nmi_enter+0x6f/0x80
[   60.171147]  exc_nmi+0xe3/0x110
[   60.171150]  end_repeat_nmi+0xf/0x53
[   60.171154] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
[   60.171158] Code: 00 00 85 c0 74 3d 0f b6 03 84 c0 74 36 48 b8 00 00 00 00 00 fc ff df 49 89 dc 49 89 dd 49 c1 ec 03 41 83 e5 07 49 01 c4 f3 90 <41> 0f b6 04 24 44 38 e8 7f 08 84 c0 0f 85 9f 05 00 00 0f b6 03 84
[   60.171160] RSP: 0018:ffffc900012df750 EFLAGS: 00000002
[   60.171163] RAX: 0000000000000001 RBX: ffff8881520ba000 RCX: 0000000000000001
[   60.171165] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8881520ba000
[   60.171167] RBP: 1ffff9200025beec R08: ffffffff8fbfcb69 R09: ffffed102a417400
[   60.171168] R10: ffffed102a417401 R11: 0000000000000004 R12: ffffed102a417400
[   60.171170] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff8881520ba000
[   60.171173]  ? queued_spin_lock_slowpath+0x339/0xac0
[   60.171178]  ? queued_spin_lock_slowpath+0x3a9/0xac0
[   60.171181]  ? queued_spin_lock_slowpath+0x3a9/0xac0
[   60.171184]  </NMI>
[   60.171185]  <TASK>
[   60.171187]  ? __pfx_queued_spin_lock_slowpath+0x10/0x10
[   60.171192]  do_raw_spin_lock+0x1d9/0x270
[   60.171197]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   60.171200]  ? __pfx___might_resched+0x10/0x10
[   60.171204]  task_rq_lock+0xcf/0x3c0
[   60.171209]  mm_cid_fixup_task_to_cpu+0xb0/0x460
[   60.171212]  ? __pfx_mm_cid_fixup_task_to_cpu+0x10/0x10
[   60.171216]  ? lock_acquire+0x14e/0x2b0
[   60.171220]  ? mark_held_locks+0x40/0x70
[   60.171224]  sched_mm_cid_fork+0x6da/0xc20
[   60.171227]  ? __pfx_sched_mm_cid_fork+0x10/0x10
[   60.171230]  ? copy_process+0x217b/0x6950
[   60.171233]  copy_process+0x2bce/0x6950
[   60.171238]  ? __pfx_copy_process+0x10/0x10
[   60.171241]  ? find_held_lock+0x2b/0x80
[   60.171245]  ? _copy_from_user+0x53/0xa0
[   60.171251]  kernel_clone+0xce/0x600
[   60.171254]  ? __pfx_kernel_clone+0x10/0x10
[   60.171258]  ? __lock_acquire+0x481/0x2590
[   60.171262]  __do_sys_clone3+0x16e/0x1b0
[   60.171265]  ? __pfx___do_sys_clone3+0x10/0x10
[   60.171267]  ? lock_acquire+0x14e/0x2b0
[   60.171270]  ? __might_fault+0x9b/0x140
[   60.171276]  ? _copy_to_user+0x5c/0x70
[   60.171280]  ? __x64_sys_rt_sigprocmask+0x258/0x400
[   60.171285]  ? do_user_addr_fault+0x4c2/0xa40
[   60.171289]  ? lockdep_hardirqs_on_prepare+0xd7/0x180
[   60.171292]  do_syscall_64+0x6b/0x3a0
[   60.171295]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   60.171298] RIP: 0033:0x7f7230c42c5d
[   60.171300] Code: 79 14 0e 00 c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ea ff ff ff 48 85 ff 74 28 48 85 d2 74 23 49 89 c8 b8 b3 01 00 00 0f 05 <48> 85 c0 7c 14 74 01 c3 31 ed 4c 89 c7 ff d2 48 89 c7 b8 3c 00 00
[   60.171302] RSP: 002b:00007ffe90d4e1f8 EFLAGS: 00000202 ORIG_RAX: 00000000000001b3
[   60.171305] RAX: ffffffffffffffda RBX: 00007f7230bb5720 RCX: 00007f7230c42c5d
[   60.171307] RDX: 00007f7230bb5720 RSI: 0000000000000058 RDI: 00007ffe90d4e250
[   60.171309] RBP: 00007ffe90d4e230 R08: 00007f722f1a66c0 R09: 00007ffe90d4e357
[   60.171310] R10: 0000000000000008 R11: 0000000000000202 R12: 00007f722f1a66c0
[   60.171312] R13: ffffffffffffff08 R14: 0000000000000000 R15: 00007ffe90d4e250
[   60.171316]  </TASK>
[   60.171319] Shutting down cpus with NMI
[   60.171381] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)



> 
> Thank you for investigating!
> 
> 
>>
>> Thanks,
>>
>>         tglx
>> ---
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -10664,8 +10664,14 @@ void sched_mm_cid_exit(struct task_struc
>>  			scoped_guard(raw_spinlock_irq, &mm->mm_cid.lock) {
>>  				if (!__sched_mm_cid_exit(t))
>>  					return;
>> -				/* Mode change required. Transfer currents CID */
>> -				mm_cid_transit_to_task(current, this_cpu_ptr(mm->mm_cid.pcpu));
>> +				/*
>> +				 * Mode change. The task has the CID unset
>> +				 * already. The CPU CID is still valid and
>> +				 * does not have MM_CID_TRANSIT set as the
>> +				 * mode change has just taken effect under
>> +				 * mm::mm_cid::lock. Drop it.
>> +				 */
>> +				mm_drop_cid_on_cpu(mm, this_cpu_ptr(mm->mm_cid.pcpu));
>>  			}
>>  			mm_cid_fixup_cpus_to_tasks(mm);
>>  			return;
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch V5 00/20] sched: Rewrite MM CID management
  2026-01-28 23:08             ` Ihor Solodrai
@ 2026-01-29 17:06               ` Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2026-01-29 17:06 UTC (permalink / raw)
  To: Ihor Solodrai, Shrikanth Hegde, Peter Zijlstra, LKML
  Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
	Paul E. McKenney, Gautham R. Shenoy, Florian Weimer, Tim Chen,
	Yury Norov, bpf, sched-ext, Kernel Team, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan, Tejun Heo

On Wed, Jan 28 2026 at 15:08, Ihor Solodrai wrote:
> On 1/28/26 2:33 PM, Ihor Solodrai wrote:
>> [...]
>> 
>> We have a steady stream of jobs running, so if it's not a one-off it's
>> likely to happen again. I'll share if we get anything.
>
> Here is another one, with backtraces of other CPUs:
>
> [   59.133925] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.133935] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
> [   59.133985]  do_raw_spin_lock+0x1d9/0x270
> [   59.134001]  task_rq_lock+0xcf/0x3c0
> [   59.134007]  mm_cid_fixup_task_to_cpu+0xb0/0x460
> [   59.134025]  sched_mm_cid_fork+0x6da/0xc20

Compared to Shrikanth's splat this is the reverse situation, i.e. fork()
reached the point where it needs to switch to per CPU mode and the fixup
function is stuck on a runqueue lock.

> [   59.134176] CPU: 3 UID: 0 PID: 67 Comm: kworker/3:1 Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.134186] Workqueue: events drain_vmap_area_work
> [   59.134194] RIP: 0010:smp_call_function_many_cond+0x772/0xe60
> [   59.134250]  on_each_cpu_cond_mask+0x24/0x40
> [   59.134254]  flush_tlb_kernel_range+0x402/0x6b0

CPU3 is unrelated as it does not hold a runqueue lock.

> [   59.134374] NMI backtrace for cpu 1
> [   59.134388] RIP: 0010:_find_first_zero_bit+0x50/0x90
> [   59.134423]  __schedule+0x3312/0x4390
> [   59.134430]  ? __pfx___schedule+0x10/0x10
> [   59.134434]  ? trace_rcu_watching+0x105/0x150
> [   59.134440]  schedule_idle+0x59/0x90

CPU1 holds runqueue lock and find_first_zero_bit() suggests that this
comes from mm_get_cid(), but w/o decoding the return address it's hard
to tell for sure.

> [   59.134474] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20

CPU0 is idle and not involved at all.

So the situation is:

test_progs creates the 4th child, which exceeds the number of CPUs, so
it switches to per CPU mode.

At this point each task of test_prog has a CID associated. Let's
assume thread creation order assignment for simplicity.

   T0 (main thread)       CID0  runs fork()
   T1 (1st child)	  CID1
   T2 (2nd child)	  CID2
   T3 (3rd child)	  CID3
   T4 (4th child)         ---   is about to be forked and causes the
                                mode switch

T0 sets mm_cid::percpu = true
   transfers the CID from T0 to CPU2

   Starts the fixup which walks through the threads

During that, T1 - T3 are free to schedule in and out before the fixup
has caught up with them. I played through all possible permutations with
a Python script and came up with the following snafu:

   T1 schedules in on CPU3 and observes percpu == true, so it transfers
      its CID to CPU3

   T1 is migrated to CPU1 and on schedule in observes percpu == true, but
      CPU1 does not have a CID associated and T1 transferred its own to
      CPU3

      So it has to allocate one with CPU1 runqueue lock held, but the
      pool is empty, so it keeps looping.

Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held. ---> Livelock
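
The sequence above can be replayed in a toy model. This is only a sketch
under stated assumptions: struct toy_mm and the toy_*() helpers are
hypothetical names, not the kernel data structures, and the allocator
makes a single attempt where the kernel's mm_get_cid() keeps retrying.

```c
#include <assert.h>

#define NR_CIDS 4       /* 4 possible CPUs, as in the report */
#define NO_CID  (-1)

/* Hypothetical toy model, not the kernel's mm_cid structures. */
struct toy_mm {
	unsigned int used;      /* bitmap of allocated CIDs */
	int cpu_cid[NR_CIDS];   /* CID owned by each CPU, or NO_CID */
	int task_cid[NR_CIDS];  /* CID owned by T0..T3, or NO_CID */
};

/* T0..T3 own CID0..CID3 in thread creation order, so the pool is empty. */
static void toy_init(struct toy_mm *mm)
{
	mm->used = 0;
	for (int i = 0; i < NR_CIDS; i++) {
		mm->cpu_cid[i] = NO_CID;
		mm->task_cid[i] = i;
		mm->used |= 1u << i;
	}
}

/* One allocation attempt. Returns NO_CID when the pool is exhausted. */
static int toy_try_get_cid(struct toy_mm *mm)
{
	for (int cid = 0; cid < NR_CIDS; cid++) {
		if (!(mm->used & (1u << cid))) {
			mm->used |= 1u << cid;
			return cid;
		}
	}
	return NO_CID;
}

/* @task schedules in on @cpu after percpu mode has been enabled. */
static int toy_sched_in(struct toy_mm *mm, int task, int cpu)
{
	assert(task >= 0 && task < NR_CIDS && cpu >= 0 && cpu < NR_CIDS);

	if (mm->cpu_cid[cpu] != NO_CID)
		return mm->cpu_cid[cpu];

	if (mm->task_cid[task] != NO_CID) {
		/* Hand the task's CID over to the CPU. */
		mm->cpu_cid[cpu] = mm->task_cid[task];
		mm->task_cid[task] = NO_CID;
		return mm->cpu_cid[cpu];
	}

	/* Neither the CPU nor the task has a CID: allocate from the pool. */
	mm->cpu_cid[cpu] = toy_try_get_cid(mm);
	return mm->cpu_cid[cpu];
}

/* Replays the scenario and returns what T1 gets on CPU1. */
static int toy_replay(void)
{
	struct toy_mm mm;

	toy_init(&mm);
	toy_sched_in(&mm, 0, 2);        /* T0 transfers CID0 to CPU2 */
	toy_sched_in(&mm, 1, 3);        /* T1 transfers CID1 to CPU3 */
	return toy_sched_in(&mm, 1, 1); /* T1 migrated to CPU1 */
}
```

In this model toy_replay() returns NO_CID: CID0 and CID1 sit on CPU2 and
CPU3, CID2 and CID3 are still owned by T2 and T3, so the allocation for
CPU1 has nothing to hand out. In the kernel that single failed attempt
is a retry loop under the runqueue lock.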

So this side needs the same MM_CID_TRANSIT treatment as the other side,
which brings me back to the splat Shrikanth observed.

I used the same script to run through all possible permutations on that
side too, but nothing showed up there. Yesterday's finding is harmless
because it only creates slightly inconsistent state: the task is already
marked CID inactive, but the CID has the MM_CID_TRANSIT bit set, so the
CID is dropped back into the pool when the exiting task schedules out
via preemption or the final schedule().
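
A minimal sketch of that drop-on-schedule-out behaviour, assuming a
transit flag encoded in the CID value. TOY_CID_TRANSIT, TOY_CID_UNSET,
struct toy_pool and toy_sched_out() are hypothetical; the bit positions
are made up for illustration and are not taken from the kernel headers.

```c
#include <assert.h>

#define TOY_CID_TRANSIT (1u << 30)  /* assumed flag position */
#define TOY_CID_UNSET   (1u << 31)  /* assumed "no CID" marker */

/* Hypothetical pool: bit N set means CID N is allocated. */
struct toy_pool {
	unsigned int used;
};

/*
 * Schedule-out side: a CID flagged as in transit goes back to the pool,
 * a plain CID is kept. Returns the CID value the task holds afterwards.
 */
static unsigned int toy_sched_out(struct toy_pool *pool, unsigned int cid)
{
	if (cid & TOY_CID_UNSET)
		return cid;

	if (cid & TOY_CID_TRANSIT) {
		unsigned int nr = cid & ~TOY_CID_TRANSIT;

		/* The toy bitmap only covers CIDs 0..29. */
		assert(nr < 30 && (pool->used & (1u << nr)));
		pool->used &= ~(1u << nr);
		return TOY_CID_UNSET;
	}

	return cid;
}
```

With CID2 allocated and flagged, toy_sched_out(&pool, 2 | TOY_CID_TRANSIT)
clears bit 2 in the pool and leaves the task without a CID, which is why
the flagged CID cannot leak in the exit case.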

So I scratched my head some more and stared at the code with two things
in mind:

   1) It seems to be hard to reproduce
   2) It happened on a weakly ordered architecture

and indeed there is an opportunity to get this wrong:

The mode switch does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, ....);

sched_in() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
       ...
       cid |= READ_ONCE(mm->mm_cid.transit);

so it can observe percpu == false and transit == 0 even if the fixup
function has not yet completed. As a consequence the task will not drop
the CID when scheduling out before the fixup is completed, which means
the CID space can be exhausted and the next task scheduling in will loop
in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.
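
For illustration, one generic way to order such a pair of accesses is a
release store paired with an acquire load, sketched here in userspace
C11 atomics. This is an assumption about how the problem could be
constrained, not the fix from the announced series; struct toy_mode,
TOY_CID_TRANSIT and both functions are hypothetical, and the relaxed
accesses only roughly correspond to READ_ONCE()/WRITE_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_CID_TRANSIT (1u << 30)  /* assumed flag position */

/* Hypothetical stand-in for the two fields discussed above. */
struct toy_mode {
	atomic_uint transit;
	atomic_bool percpu;
};

/* Mode switch back to per task mode: publish transit before percpu. */
static void toy_switch_to_task_mode(struct toy_mode *m)
{
	atomic_store_explicit(&m->transit, TOY_CID_TRANSIT,
			      memory_order_relaxed);
	/* Release: the transit store above is visible before percpu. */
	atomic_store_explicit(&m->percpu, false, memory_order_release);
}

/* Schedule-in side: returns @cid, flagged if a transition is pending. */
static unsigned int toy_sched_in_cid(struct toy_mode *m, unsigned int cid)
{
	/* Acquire: seeing the new percpu implies seeing the new transit. */
	if (!atomic_load_explicit(&m->percpu, memory_order_acquire))
		cid |= atomic_load_explicit(&m->transit,
					    memory_order_relaxed);
	return cid;
}
```

With this pairing a reader that observes the new percpu == false cannot
observe the stale transit == 0, so the CID it picks up carries the
transit flag and is dropped on schedule out. Folding both values into a
single word that is read and written atomically would be another way to
avoid the window.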

I'll send out a series to address all of that later this evening when
tests have completed and changelogs are polished.

Thanks,

        tglx
