* DEBUG_ATOMIC_SLEEP spew in cgroup_task_dead() on next-20251104
@ 2025-11-04 18:11 Calvin Owens
2025-11-04 19:30 ` Tejun Heo (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Calvin Owens @ 2025-11-04 18:11 UTC (permalink / raw)
To: linux-kernel, Tejun Heo; +Cc: Dan Schatzberg, Peter Zijlstra
Hi Tejun,
The following spews constantly for me on next-20251104 (w/ PREEMPT_RT):
[ 1.246079] [ T0] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 1.246079] [ T0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/1
[ 1.246079] [ T0] preempt_count: 1, expected: 0
[ 1.246079] [ T0] RCU nest depth: 0, expected: 0
[ 1.246079] [ T0] 1 lock held by swapper/1/0:
[ 1.246079] [ T0] #0: ffffffff827d0060 (css_set_lock){+.+.}-{3:3}, at: cgroup_task_dead+0x18/0x23b
[ 1.246079] [ T0] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc4-next-20251104-x86-hardened #1 PREEMPT_{RT,LAZY}
[ 1.246079] [ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-20240910_120124-localhost 04/01/2014
[ 1.246079] [ T0] Call Trace:
[ 1.246079] [ T0] <TASK>
[ 1.246079] [ T0] dump_stack_lvl+0x94/0xdb
[ 1.246079] [ T0] __might_resched+0x21b/0x240
[ 1.246079] [ T0] rt_spin_lock+0x62/0x1d0
[ 1.246079] [ T0] cgroup_task_dead+0x18/0x23b
[ 1.246079] [ T0] finish_task_switch+0x1a6/0x295
[ 1.246079] [ T0] __schedule+0x7ee/0xbc5
[ 1.246079] [ T0] schedule_idle+0x1a/0x30
[ 1.246079] [ T0] do_idle+0x1aa/0x1e5
[ 1.246079] [ T0] cpu_startup_entry+0x21/0x30
[ 1.246079] [ T0] start_secondary+0xc4/0xdb
[ 1.246079] [ T0] common_startup_64+0x13b/0x157
[ 1.246079] [ T0] </TASK>
Full dmesg is here: https://gist.githubusercontent.com/jcalvinowens/e1ec1153ddbff10cee5a96ad58f65205/raw/445966a3bfefd94cfca0cd17bcb76c07d31cb33e/gistfile1.txt
Kconfig is here: https://gist.githubusercontent.com/jcalvinowens/8dd543985e25e3d0329339c0d041f1f2/raw/197555de8482a1ffd0482c88dede0c13a42b9a69/gistfile1.txt
I'm guessing this is related to d245698d727a ("cgroup: Defer task cgroup
unlink until after the task is done switching out")? Is there any other
useful info I can provide?
Thanks,
Calvin
^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: DEBUG_ATOMIC_SLEEP spew in cgroup_task_dead() on next-20251104
  2025-11-04 18:11 DEBUG_ATOMIC_SLEEP spew in cgroup_task_dead() on next-20251104 Calvin Owens
@ 2025-11-04 19:30 ` Tejun Heo
  2025-11-05 15:16   ` Calvin Owens
  2025-11-04 19:32 ` [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t Tejun Heo
  2025-11-04 19:32 ` [PATCH cgroup/for-6.19 2/2] cgroup: Convert css_set_lock locking to use cleanup guards Tejun Heo
  2 siblings, 1 reply; 41+ messages in thread

From: Tejun Heo @ 2025-11-04 19:30 UTC (permalink / raw)
To: Calvin Owens; +Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra

Hello, Calvin! How are you?

On Tue, Nov 04, 2025 at 10:11:14AM -0800, Calvin Owens wrote:
> I'm guessing this is related to d245698d727a ("cgroup: Defer task cgroup
> unlink until after the task is done switching out")? Is there any other
> useful info I can provide?

Ah, I need to make css_set_lock a raw one. I'll reply with patches.

Thanks.

--
tejun
* Re: DEBUG_ATOMIC_SLEEP spew in cgroup_task_dead() on next-20251104
From: Calvin Owens @ 2025-11-05 15:16 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra

On Tuesday 11/04 at 09:30 -1000, Tejun Heo wrote:
> Hello, Calvin! How are you?

Still here. Sorry to appear only with problems, not solutions :)

I wonder... how much work would it be to make the debugging
instrumentation throw a splat for this class of problem without actually
having to compile the kernel with PREEMPT_RT?

> On Tue, Nov 04, 2025 at 10:11:14AM -0800, Calvin Owens wrote:
> > I'm guessing this is related to d245698d727a ("cgroup: Defer task cgroup
> > unlink until after the task is done switching out")? Is there any other
> > useful info I can provide?
>
> Ah, I need to make css_set_lock a raw one. I'll reply with patches.
>
> Thanks.
>
> --
> tejun
* [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Tejun Heo @ 2025-11-05 19:03 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra, Sebastian Andrzej Siewior

cgroup_task_dead() is called from finish_task_switch() which runs with
preemption disabled and doesn't allow scheduling even on PREEMPT_RT. The
function needs to acquire css_set_lock which is a regular spinlock that can
sleep on RT kernels, leading to "sleeping function called from invalid
context" warnings.

css_set_lock is too large in scope to convert to a raw_spinlock. However,
the unlinking operations don't need to run synchronously - they just need
to complete after the task is done running.

On PREEMPT_RT, defer the work through irq_work.

Fixes: d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org
Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello,

Calvin, this seems to work fine here but can you please try it out?

Sebastian, Peter, does this look okay to you guys?

Thanks.
 include/linux/sched.h  |  5 +++-
 kernel/cgroup/cgroup.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 56 insertions(+), 2 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1324,7 +1324,10 @@ struct task_struct {
 	struct css_set __rcu *cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
 	struct list_head cg_list;
-#endif
+#ifdef CONFIG_PREEMPT_RT
+	struct llist_node cg_dead_lnode;
+#endif	/* CONFIG_PREEMPT_RT */
+#endif	/* CONFIG_CGROUPS */
 #ifdef CONFIG_X86_CPU_RESCTRL
 	u32 closid;
 	u32 rmid;
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -290,6 +290,7 @@ static void kill_css(struct cgroup_subsy
 static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			      struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
+static void cgroup_rt_init(void);

 #ifdef CONFIG_DEBUG_CGROUP_REF
 #define CGROUP_REF_FN_ATTRS noinline
@@ -6360,6 +6361,7 @@ int __init cgroup_init(void)
 	BUG_ON(ss_rstat_init(NULL));

 	get_user_ns(init_cgroup_ns.user_ns);
+	cgroup_rt_init();

 	cgroup_lock();
@@ -6990,7 +6992,7 @@ void cgroup_task_exit(struct task_struct
 	} while_each_subsys_mask();
 }

-void cgroup_task_dead(struct task_struct *tsk)
+static void do_cgroup_task_dead(struct task_struct *tsk)
 {
 	struct css_set *cset;
 	unsigned long flags;
@@ -7016,6 +7018,55 @@ void cgroup_task_dead(struct task_struct
 	spin_unlock_irqrestore(&css_set_lock, flags);
 }

+#ifdef CONFIG_PREEMPT_RT
+/*
+ * cgroup_task_dead() is called from finish_task_switch() which doesn't allow
+ * scheduling even in RT. As the task_dead path requires grabbing css_set_lock,
+ * this leads to a sleeping-in-invalid-context warning. css_set_lock is too
+ * big to become a raw_spinlock. The task_dead path doesn't need to run
+ * synchronously. Bounce through irq_work instead.
+ */
+static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
+static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
+
+static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
+{
+	struct llist_node *lnode;
+	struct task_struct *task, *next;
+
+	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
+	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
+		do_cgroup_task_dead(task);
+		put_task_struct(task);
+	}
+}
+
+static void __init cgroup_rt_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
+		init_irq_work(per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu),
+			      cgrp_dead_tasks_iwork_fn);
+	}
+}
+
+void cgroup_task_dead(struct task_struct *task)
+{
+	get_task_struct(task);
+	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
+	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+}
+#else	/* CONFIG_PREEMPT_RT */
+static void __init cgroup_rt_init(void) {}
+
+void cgroup_task_dead(struct task_struct *task)
+{
+	do_cgroup_task_dead(task);
+}
+#endif	/* CONFIG_PREEMPT_RT */
+
 void cgroup_task_release(struct task_struct *task)
 {
 	struct cgroup_subsys *ss;
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Calvin Owens @ 2025-11-06  1:15 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra, Sebastian Andrzej Siewior

On Wednesday 11/05 at 09:03 -1000, Tejun Heo wrote:
> cgroup_task_dead() is called from finish_task_switch() which runs with
> preemption disabled and doesn't allow scheduling even on PREEMPT_RT. The
> function needs to acquire css_set_lock which is a regular spinlock that can
> sleep on RT kernels, leading to "sleeping function called from invalid
> context" warnings.
>
> css_set_lock is too large in scope to convert to a raw_spinlock. However,
> the unlinking operations don't need to run synchronously - they just need
> to complete after the task is done running.
>
> On PREEMPT_RT, defer the work through irq_work.
>
> Fixes: d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
> Reported-by: Calvin Owens <calvin@wbinvd.org>
> Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> Hello,
>
> Calvin, this seems to work fine here but can you please try it out?

Works for me, no splats with that on top of next-20251104.
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Tejun Heo @ 2025-11-06 17:36 UTC (permalink / raw)
To: Calvin Owens
Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra, Sebastian Andrzej Siewior

On Wed, Nov 05, 2025 at 05:15:50PM -0800, Calvin Owens wrote:
> Works for me, no splats with that on top of next-20251104.

Thanks for testing!

--
tejun
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2025-11-06 15:07 UTC (permalink / raw)
To: Tejun Heo; +Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra

On 2025-11-05 09:03:55 [-1000], Tejun Heo wrote:
> +#ifdef CONFIG_PREEMPT_RT
> +/*
> + * cgroup_task_dead() is called from finish_task_switch() which doesn't allow
> + * scheduling even in RT. As the task_dead path requires grabbing css_set_lock,
> + * this leads to a sleeping-in-invalid-context warning. css_set_lock is too
> + * big to become a raw_spinlock. The task_dead path doesn't need to run
> + * synchronously. Bounce through irq_work instead.
> + */
> +static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
> +static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
> +
> +static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
> +{
> +	struct llist_node *lnode;
> +	struct task_struct *task, *next;
> +
> +	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
> +	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
> +		do_cgroup_task_dead(task);
> +		put_task_struct(task);
> +	}
> +}
> +
> +static void __init cgroup_rt_init(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
> +		init_irq_work(per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu),
> +			      cgrp_dead_tasks_iwork_fn);

How important is it that it happens right away? Written as-is, this
leads to an interrupt which then wakes the irq_work/$cpu thread which
then runs this callback. That thread runs as SCHED_FIFO-1.

This means the termination of a SCHED_OTHER task on a single CPU will
run as follows:
- TASK_DEAD
  schedule()
- queue IRQ_WORK
  -> INTERRUPT
  -> WAKE irq_work
  -> preempt to irq_work/
  -> handle one callback
  schedule() back to next TASK_DEAD

So cgrp_dead_tasks_iwork_fn() will never have the opportunity to batch.
Unless the exiting task's priority is > 1. Then it will be delayed until
all RT tasks are done.

My proposal would be to init the irq_work item with

	*per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu) = IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);

instead, which won't raise an IRQ immediately and delays the callback
until the next timer tick. So it could batch multiple tasks.

[ queue_work() should work, too, but the overhead to schedule is greater
  imho so this makes sense ]

Sebastian
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Tejun Heo @ 2025-11-06 17:37 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra

Hello,

On Thu, Nov 06, 2025 at 04:07:17PM +0100, Sebastian Andrzej Siewior wrote:
> How important is it that it happens right away? Written as-is, this

Not important at all.

> leads to an interrupt which then wakes the irq_work/$cpu thread which
> then runs this callback. That thread runs as SCHED_FIFO-1. This means
> the termination of a SCHED_OTHER task on a single CPU will run as follows:
> - TASK_DEAD
>   schedule()
> - queue IRQ_WORK
>   -> INTERRUPT
>   -> WAKE irq_work
>   -> preempt to irq_work/
>   -> handle one callback
>   schedule() back to next TASK_DEAD
>
> So cgrp_dead_tasks_iwork_fn() will never have the opportunity to batch.
> Unless the exiting task's priority is > 1. Then it will be delayed
> until all RT tasks are done.
>
> My proposal would be to init the irq_work item with
> 	*per_cpu_ptr(&cgrp_dead_tasks_iwork, cpu) = IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
>
> instead which won't raise an IRQ immediately and delay the callback
> until the next timer tick. So it could batch multiple tasks.
>
> [ queue_work() should work, too but the overhead to schedule is greater
>   imho so this makes sense ]

Will switch to IRQ_WORK_INIT_LAZY.

Thanks for the review.

--
tejun
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2025-11-06 17:46 UTC (permalink / raw)
To: Tejun Heo; +Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra

On 2025-11-06 07:37:16 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> On Thu, Nov 06, 2025 at 04:07:17PM +0100, Sebastian Andrzej Siewior wrote:
> > How important is it that it happens right away? Written as-is, this
>
> Not important at all.
> …
> Will switch to IRQ_WORK_INIT_LAZY.

Quick question: Since it is not important at all, would it work to have
it in the task's RCU callback, __put_task_struct()?

> Thanks for the review.

Sebastian
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Tejun Heo @ 2025-11-06 17:55 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra

Hello,

On Thu, Nov 06, 2025 at 06:46:14PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-06 07:37:16 [-1000], Tejun Heo wrote:
> > Will switch to IRQ_WORK_INIT_LAZY.
>
> Quick question: Since it is not important at all, would it work to have
> it in the task's RCU callback, __put_task_struct()?

It doesn't have to run right away but it had better run in some definite
time frame, because at this point the task is not otherwise visible from
userspace (it doesn't have a pid) but is still pinning the cgroup, so
we're in this limbo state where reading cgroup.procs should return empty
(there may be a bug here right now; I think the code will try to deref a
NULL pid pointer) but the cgroup is not empty. This window is not really
broken in itself because cgroup empty state is tracked and notified
separately. However, a task_struct can be pinned and can linger for an
indefinite amount of time after being dead, and that would become an
actual problem.

So, to add a bit of qualifier: while it's okay to run it with some amount
of delay that's not very significant to human perception, we definitely
don't want to allow delaying it indefinitely.

Thanks.

--
tejun
* Re: [PATCH cgroup/for-6.19] cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2025-11-06 18:06 UTC (permalink / raw)
To: Tejun Heo; +Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra

On 2025-11-06 07:55:02 [-1000], Tejun Heo wrote:
> Hello,
Hi,

> So, to add a bit of qualifier: while it's okay to run it with some amount
> of delay that's not very significant to human perception, we definitely
> don't want to allow delaying it indefinitely.

Okay. This is some arguing that would justify the additional extension of
task_struct :) Understood.

> Thanks.

Sebastian
* ~90s reboot delay with v6.19 and PREEMPT_RT
From: Bert Karwatzki @ 2026-02-19 16:46 UTC (permalink / raw)
To: Tejun Heo
Cc: Bert Karwatzki, Sebastian Andrzej Siewior, Thomas Gleixner, calvin, dschatzberg, peterz, linux-kernel, linux-rt-devel

Since linux v6.19 I noticed that rebooting my MSI Alpha 15 laptop
would hang for ~90s before rebooting. I bisected this (from v6.18 to
v6.19) and got this as the first bad commit:

9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")

Reverting this commit fixes the reboot delay but introduces these errors
(which also seem to make the system crash sometimes):

[ T0] BUG: scheduling while atomic: swapper/0/0/0x00000002

Compiling the kernel without PREEMPT_RT also fixes the issue.

I've tested these commits from "git log --oneline v6.19 kernel/cgroup":

This commit shows the reboot delay:
9311e6c29b34 cgroup: Fix sleeping from invalid context warning on PREEMPT_RT

These commits show no reboot delay, but "scheduling while atomic"
warnings and instability:
be04e96ba911 cgroup/cpuset: Globally track isolated_cpus update
b1034a690129 cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition
6cfeddbf4ade cgroup/cpuset: Move up prstate_housekeeping_conflict() helper
103b08709e8a cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
55939cf28a48 cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
b66c7af4d86d cgroup: use credential guards in cgroup_attach_permissions()
d245698d727a cgroup: Defer task cgroup unlink until after the task is done switching out

This commit shows neither the reboot delay nor the "scheduling while
atomic" problem:
260fbcb92bbe cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()

Bert Karwatzki
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
From: Calvin Owens @ 2026-02-19 20:53 UTC (permalink / raw)
To: Bert Karwatzki
Cc: Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel

On Thursday 02/19 at 17:46 +0100, Bert Karwatzki wrote:
> Since linux v6.19 I noticed that rebooting my MSI Alpha 15 Laptop
> would hang for about ~90s before rebooting. I bisected this (from
> v6.18 to v6.19) and got this as the first bad commit:
> 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
>
> Reverting this commit fixes the reboot delay but introduces these error (which
> also seem to make the system crash sometimes):
> [ T0] BUG: scheduling while atomic: swapper/0/0/0x00000002
>
> Compiling the kernel without PREEMPT_RT also fixes the issue.

Hi Bert,

I'm not seeing this behavior, there must be more to it than just this.

Could you share your kconfig? The dmesg with initcall_debug will be more
verbose during shutdown too, that may also help.

Thanks,
Calvin
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
From: Bert Karwatzki @ 2026-02-19 23:10 UTC (permalink / raw)
To: Calvin Owens
Cc: Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel, spasswolf

On Thursday, 19.02.2026 at 12:53 -0800, Calvin Owens wrote:
> On Thursday 02/19 at 17:46 +0100, Bert Karwatzki wrote:
> > Since linux v6.19 I noticed that rebooting my MSI Alpha 15 Laptop
> > would hang for about ~90s before rebooting. I bisected this (from
> > v6.18 to v6.19) and got this as the first bad commit:
> > 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
> >
> > Reverting this commit fixes the reboot delay but introduces these error (which
> > also seem to make the system crash sometimes):
> > [ T0] BUG: scheduling while atomic: swapper/0/0/0x00000002
> >
> > Compiling the kernel without PREEMPT_RT also fixes the issue.
>
> Hi Bert,
>
> I'm not seeing this behavior, there must be more to it than just this.
>
> Could you share your kconfig? The dmesg with initcall_debug will be more
> verbose during shutdown too, that may also help.

I tested v6.19 with initcall_debug and I'm seeing these messages (on
startup; there is no extra message during the 90s reboot delay). I put
the .config as an attachment in a gitlab issue (Or should I put it in
the mail? It's ~6500 lines):
https://gitlab.freedesktop.org/-/project/26509/uploads/8d1c04bbe0ab121945be7c898d08e1b6/config-6.19.0-stable

2026-02-19T23:37:50.140437+01:00 lisa kernel: [ T16] BUG: sleeping function called from invalid context at kernel/printk/printk.c:3377
2026-02-19T23:37:50.140439+01:00 lisa kernel: [ T16] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 16, name: pr/legacy
2026-02-19T23:37:50.140441+01:00 lisa kernel: [ T16] preempt_count: 0, expected: 0
2026-02-19T23:37:50.140443+01:00 lisa kernel: [ T16] RCU nest depth: 1, expected: 0
2026-02-19T23:37:50.140445+01:00 lisa kernel: [ T16] CPU: 1 UID: 0 PID: 16 Comm: pr/legacy Tainted: G W 6.19.0-stable #1159 PREEMPT_{RT,(full)}
2026-02-19T23:37:50.140446+01:00 lisa kernel: [ T16] Tainted: [W]=WARN
2026-02-19T23:37:50.140448+01:00 lisa kernel: [ T16] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.10F 11/11/2024
2026-02-19T23:37:50.140456+01:00 lisa kernel: [ T16] Call Trace:
2026-02-19T23:37:50.140458+01:00 lisa kernel: [ T16] <TASK>
2026-02-19T23:37:50.140460+01:00 lisa kernel: [ T16] dump_stack_lvl+0x4b/0x70
2026-02-19T23:37:50.140461+01:00 lisa kernel: [ T16] __might_resched.cold+0xaf/0xbd
2026-02-19T23:37:50.140463+01:00 lisa kernel: [ T16] console_conditional_schedule+0x26/0x30
2026-02-19T23:37:50.140464+01:00 lisa kernel: [ T16] fbcon_redraw+0x9b/0x240
2026-02-19T23:37:50.140466+01:00 lisa kernel: [ T16] ? get_color+0x21/0x130
2026-02-19T23:37:50.140468+01:00 lisa kernel: [ T16] fbcon_scroll+0x165/0x1c0
2026-02-19T23:37:50.140470+01:00 lisa kernel: [ T16] con_scroll+0xf6/0x200
2026-02-19T23:37:50.140472+01:00 lisa kernel: [ T16] ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-19T23:37:50.140474+01:00 lisa kernel: [ T16] lf+0x9f/0xb0
2026-02-19T23:37:50.140475+01:00 lisa kernel: [ T16] vt_console_print+0x2ff/0x460
2026-02-19T23:37:50.140477+01:00 lisa kernel: [ T16] console_flush_one_record+0x21c/0x3e0
2026-02-19T23:37:50.140479+01:00 lisa kernel: [ T16] ? console_flush_one_record+0x3e0/0x3e0
2026-02-19T23:37:50.140481+01:00 lisa kernel: [ T16] legacy_kthread_func+0xc7/0x1a0
2026-02-19T23:37:50.140482+01:00 lisa kernel: [ T16] ? housekeeping_affine+0x30/0x30
2026-02-19T23:37:50.140484+01:00 lisa kernel: [ T16] kthread+0xf7/0x1e0
2026-02-19T23:37:50.140486+01:00 lisa kernel: [ T16] ? kthreads_online_cpu+0x100/0x100
2026-02-19T23:37:50.140487+01:00 lisa kernel: [ T16] ? kthreads_online_cpu+0x100/0x100
2026-02-19T23:37:50.140489+01:00 lisa kernel: [ T16] ret_from_fork+0x20e/0x240
2026-02-19T23:37:50.140491+01:00 lisa kernel: [ T16] ? kthreads_online_cpu+0x100/0x100
2026-02-19T23:37:50.140492+01:00 lisa kernel: [ T16] ret_from_fork_asm+0x11/0x20
2026-02-19T23:37:50.140494+01:00 lisa kernel: [ T16] </TASK>

and

2026-02-19T23:37:50.338878+01:00 lisa kernel: [ T16] BUG: sleeping function called from invalid context at kernel/printk/printk.c:3377
2026-02-19T23:37:50.338880+01:00 lisa kernel: [ T16] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 16, name: pr/legacy
2026-02-19T23:37:50.338882+01:00 lisa kernel: [ T16] preempt_count: 0, expected: 0
2026-02-19T23:37:50.338884+01:00 lisa kernel: [ T16] RCU nest depth: 1, expected: 0
2026-02-19T23:37:50.338893+01:00 lisa kernel: [ T16] CPU: 7 UID: 0 PID: 16 Comm: pr/legacy Tainted: G W 6.19.0-stable #1159 PREEMPT_{RT,(full)}
2026-02-19T23:37:50.338895+01:00 lisa kernel: [ T16] Tainted: [W]=WARN
2026-02-19T23:37:50.338896+01:00 lisa kernel: [ T16] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.10F 11/11/2024
2026-02-19T23:37:50.338897+01:00 lisa kernel: [ T16] Call Trace:
2026-02-19T23:37:50.338899+01:00 lisa kernel: [ T16] <TASK>
2026-02-19T23:37:50.338901+01:00 lisa kernel: [ T16] dump_stack_lvl+0x4b/0x70
2026-02-19T23:37:50.338906+01:00 lisa kernel: [ T16] __might_resched.cold+0xaf/0xbd
2026-02-19T23:37:50.338908+01:00 lisa kernel: [ T16] console_conditional_schedule+0x26/0x30
2026-02-19T23:37:50.338910+01:00 lisa kernel: [ T16] fbcon_redraw+0x9b/0x240
2026-02-19T23:37:50.338916+01:00 lisa kernel: [ T16] ? get_color+0x21/0x130
2026-02-19T23:37:50.338917+01:00 lisa kernel: [ T16] fbcon_scroll+0x165/0x1c0
2026-02-19T23:37:50.338919+01:00 lisa kernel: [ T16] con_scroll+0xf6/0x200
2026-02-19T23:37:50.338924+01:00 lisa kernel: [ T16] ? srso_alias_return_thunk+0x5/0xfbef5
2026-02-19T23:37:50.338935+01:00 lisa kernel: [ T16] lf+0x9f/0xb0
2026-02-19T23:37:50.338937+01:00 lisa kernel: [ T16] vt_console_print+0x2ff/0x460
2026-02-19T23:37:50.338957+01:00 lisa kernel: [ T16] console_flush_one_record+0x21c/0x3e0
2026-02-19T23:37:50.338960+01:00 lisa kernel: [ T16] ? console_flush_one_record+0x3e0/0x3e0
2026-02-19T23:37:50.338962+01:00 lisa kernel: [ T16] legacy_kthread_func+0xc7/0x1a0
2026-02-19T23:37:50.338964+01:00 lisa kernel: [ T16] ? housekeeping_affine+0x30/0x30
2026-02-19T23:37:50.338965+01:00 lisa kernel: [ T16] kthread+0xf7/0x1e0
2026-02-19T23:37:50.338966+01:00 lisa kernel: [ T16] ? kthreads_online_cpu+0x100/0x100
2026-02-19T23:37:50.338968+01:00 lisa kernel: [ T16] ? kthreads_online_cpu+0x100/0x100
2026-02-19T23:37:50.338972+01:00 lisa kernel: [ T16] ret_from_fork+0x20e/0x240
2026-02-19T23:37:50.338976+01:00 lisa kernel: [ T16] ? kthreads_online_cpu+0x100/0x100
2026-02-19T23:37:50.338978+01:00 lisa kernel: [ T16] ret_from_fork_asm+0x11/0x20
2026-02-19T23:37:50.338980+01:00 lisa kernel: [ T16] </TASK>

These seem to be the same type of messages that commit
9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
is supposed to fix. When I compile a kernel with 9311e6c29b34 as HEAD I
do not get these messages, so I guess I have to do another bisection here.

Bert Karwatzki
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
From: Steven Rostedt @ 2026-02-20 0:58 UTC (permalink / raw)
To: Bert Karwatzki
Cc: Calvin Owens, Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel

On Fri, 20 Feb 2026 00:10:41 +0100
Bert Karwatzki <spasswolf@web.de> wrote:

> These seem to be the same type of messages that commit
> 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
> is supposed to fix.
> When I compile a kernel with 9311e6c29b34 as HEAD I do not get these messages, so
> I guess I have to do another bisection here.

Can you add to the kernel command line:

	trace_event=sched_switch traceoff_after_boot

And then look at /sys/kernel/tracing/trace

It will enable the sched_switch event, and traceoff_after_boot will
disable tracing right before running init. That way you can see what is
running during boot that is taking the 90 seconds.

-- Steve
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-20  0:58 ` Steven Rostedt
@ 2026-02-20  9:15 ` Bert Karwatzki
  2026-02-20 15:44   ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-20  9:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: Calvin Owens, Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner,
	dschatzberg, peterz, linux-kernel, linux-rt-devel, spasswolf

On Thursday, 2026-02-19 at 19:58 -0500, Steven Rostedt wrote:
> On Fri, 20 Feb 2026 00:10:41 +0100
> Bert Karwatzki <spasswolf@web.de> wrote:
> 
> > These seem to be the same type of messages that commit
> > 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
> > is supposed to fix.
> > When I compile a kernel with 9311e6c29b34 as HEAD I do not get these messages, so
> > I guess I have to do another bisection here.
> 
> Can you add to the kernel command line:
> 
> trace_event=sched_switch traceoff_after_boot
> 
> And then look at /sys/kernel/tracing/trace
> 
> It will enable sched_switch event and the traceoff_after_boot will
> disable tracing right before running init. That way you can see what is
> running during boot that is taking the 90 seconds.
> 
> -- Steve

I think there's a misunderstanding here: the 90s delay happens on shutdown, i.e.
when using either reboot or shutdown. I've changed the subject accordingly.

Bert Karwatzki

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-20  9:15 ` ~90s shutdown " Bert Karwatzki
@ 2026-02-20 15:44 ` Steven Rostedt
  2026-02-23  0:35   ` Bert Karwatzki
  0 siblings, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2026-02-20 15:44 UTC (permalink / raw)
To: Bert Karwatzki
Cc: Calvin Owens, Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner,
	dschatzberg, peterz, linux-kernel, linux-rt-devel

On Fri, 20 Feb 2026 10:15:43 +0100
Bert Karwatzki <spasswolf@web.de> wrote:

> > It will enable sched_switch event and the traceoff_after_boot will
> > disable tracing right before running init. That way you can see what is
> > running during boot that is taking the 90 seconds.
> > 
> > -- Steve
> 
> I think there's a misunderstanding here, the 90s delay happens on shutdown, i.e.
> when using either reboot or shutdown. I've changed the subject accordingly.

In that case you should be using the persistent ring buffer ;-)

https://docs.kernel.org/trace/debugging.html

Add to the kernel command line:

  reserve_mem=20M:2M:trace trace_instance=boot_map@trace

And then before rebooting:

  echo 1 > /sys/kernel/tracing/instances/boot_map/events/sched/sched_switch/enable
  echo 1 > /sys/kernel/tracing/instances/boot_map/tracing_on

Then look at the trace after the reboot:

  cat /sys/kernel/tracing/instances/boot_map/trace

If your laptop doesn't clear memory during a reboot, you should have the trace.

If that's not enough to debug the situation, you can enable other events,
or enable function or function graph tracing. That should all work with the
persistent buffer.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
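[A trace captured this way can be checked quickly for unusual scheduling contexts by tallying the latency-flags column (the "d..2."-style field before the timestamp). A rough Python sketch, assuming the default one-event-per-line ftrace layout; `count_flags` and the sample lines are made up for illustration.]

```python
from collections import Counter

def count_flags(lines):
    """Tally the latency-flags field of sched_switch events.

    Assumes the default ftrace layout:
        TASK-PID  CPU#  FLAGS  TIMESTAMP:  FUNCTION ...
    so the flags (e.g. "d..2." or "D..22") are the third
    whitespace-separated field.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 5 and "sched_switch:" in parts:
            counts[parts[2]] += 1
    return counts

# Abbreviated fake lines in the shape of a sched_switch trace.
sample = [
    "<...>-2657 [013] d..2. 62.492813: sched_switch: prev_comm=bash",
    "<...>-1584 [011] D..22 62.779670: sched_switch: prev_comm=ntpd",
    "<...>-1584 [011] D..22 64.779027: sched_switch: prev_comm=ntpd",
]
print(count_flags(sample))
```

[Run over a full trace file, e.g. `count_flags(open("trace.txt"))`, this breaks the event count down per flag value, which makes odd states stand out without hand-grepping.]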
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-20 15:44 ` Steven Rostedt
@ 2026-02-23  0:35 ` Bert Karwatzki
  2026-02-23  8:22   ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-23  0:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: Calvin Owens, Tejun Heo, spasswolf, Sebastian Andrzej Siewior,
	Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel

On Friday, 2026-02-20 at 10:44 -0500, Steven Rostedt wrote:
> On Fri, 20 Feb 2026 10:15:43 +0100
> Bert Karwatzki <spasswolf@web.de> wrote:
> 
> > > It will enable sched_switch event and the traceoff_after_boot will
> > > disable tracing right before running init. That way you can see what is
> > > running during boot that is taking the 90 seconds.
> > > 
> > > -- Steve
> > 
> > I think there's a misunderstanding here, the 90s delay happens on shutdown, i.e.
> > when using either reboot or shutdown. I've changed the subject accordingly.
> 
> In that case you should be using the persistent ring buffer ;-)
> 
> https://docs.kernel.org/trace/debugging.html
> 
> Add to the kernel command line:
> 
> reserve_mem=20M:2M:trace trace_instance=boot_map@trace
> 
> And then before rebooting:
> 
> echo 1 > /sys/kernel/tracing/instances/boot_map/events/sched/sched_switch/enable
> echo 1 > /sys/kernel/tracing/instances/boot_map/tracing_on
> 
> Then look at the trace after the reboot:
> 
> cat /sys/kernel/tracing/instances/boot_map/trace
> 
> If your laptop doesn't clear memory during a reboot, you should have the trace.
> 
> If that's not enough to debug the situation, you can enable other events,
> or enable function or function graph tracing. That should all work with the
> persistent buffer.
> 
> -- Steve

Thank you, I tried this with v6.19 and PREEMPT_RT and got this during a reboot:

# tracer: nop
#
# entries-in-buffer/entries-written: 265126/265126   #P:16
#
#                                  _-----=> irqs-off/BH-disabled
#                                 / _----=> need-resched
#                                | / _---=> hardirq/softirq
#                                || / _--=> preempt-depth
#                                ||| / _-=> migrate-disable
#                                |||| /     delay
#           TASK-PID       CPU#  |||||  TIMESTAMP  FUNCTION
#              | |           |   |||||     |         |
<...>-2657 [013] d..2. 62.492813: sched_switch: prev_comm=bash prev_pid=0xa61 (2657) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=ksoftirqd/13 next_pid=0x86 (134) next_prio=0x78 (120)
<...>-134 [013] d..2. 62.492817: sched_switch: prev_comm=ksoftirqd/13 prev_pid=0x86 (134) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/13 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [007] d..2. 62.492824: sched_switch: prev_comm=swapper/7 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=xfce4-terminal next_pid=0x7ca (1994) next_prio=0x78 (120)
<...>-1994 [007] d..2. 62.493103: sched_switch: prev_comm=xfce4-terminal prev_pid=0x7ca (1994) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/7 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [007] d..2. 62.493135: sched_switch: prev_comm=swapper/7 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=xfce4-terminal next_pid=0x7ca (1994) next_prio=0x78 (120)
<idle>-0 [009] d..2. 62.493181: sched_switch: prev_comm=swapper/9 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=gdbus next_pid=0x7dc (2012) next_prio=0x78 (120)
[...]
<...>-22 [000] d..2. 183.258021: sched_switch: prev_comm=migration/0 prev_pid=0x16 (22) prev_prio=0x0 (0) prev_state=0x1 (1) next_comm=swapper/0 next_pid=0x0 (0) next_prio=0x78 (120)
<...>-86 [001] d..2. 183.258037: sched_switch: prev_comm=ksoftirqd/1 prev_pid=0x56 (86) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/1 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [004] d..2.
183.258160: sched_switch: prev_comm=swapper/4 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=systemd-shutdow next_pid=0x1 (1) next_prio=0x78 (120)
<...>-1 [004] d..2. 183.258236: sched_switch: prev_comm=systemd-shutdow prev_pid=0x1 (1) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=migration/4 next_pid=0x23 (35) next_prio=0x0 (0)
<...>-35 [004] d..2. 183.258254: sched_switch: prev_comm=migration/4 prev_pid=0x23 (35) prev_prio=0x0 (0) prev_state=0x1 (1) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [000] d..2. 183.258281: sched_switch: prev_comm=swapper/0 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=systemd-shutdow next_pid=0x1 (1) next_prio=0x78 (120)

while on v6.19 without PREEMPT_RT I get this:

# tracer: nop
#
# entries-in-buffer/entries-written: 32550/32550   #P:16
#
#                                  _-----=> irqs-off/BH-disabled
#                                 / _----=> need-resched
#                                | / _---=> hardirq/softirq
#                                || / _--=> preempt-depth
#                                ||| / _-=> migrate-disable
#                                |||| /     delay
#           TASK-PID       CPU#  |||||  TIMESTAMP  FUNCTION
#              | |           |   |||||     |         |
<idle>-0 [010] d..2. 22.559681: sched_switch: prev_comm=swapper/10 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=kworker/u64:15 next_pid=0x6e (110) next_prio=0x78 (120)
<...>-2767 [004] d..2. 22.559687: sched_switch: prev_comm=bash prev_pid=0xacf (2767) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120)
<...>-110 [010] d..2. 22.559691: sched_switch: prev_comm=kworker/u64:15 prev_pid=0x6e (110) prev_prio=0x78 (120) prev_state=0x80 (128) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [002] d..2. 22.559702: sched_switch: prev_comm=swapper/2 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=xfce4-terminal next_pid=0x724 (1828) next_prio=0x78 (120)
<...>-1828 [002] d..2.
22.559951: sched_switch: prev_comm=xfce4-terminal prev_pid=0x724 (1828) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/2 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [008] d..2. 22.560343: sched_switch: prev_comm=swapper/8 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=rcu_preempt next_pid=0xf (15) next_prio=0x78 (120)
<...>-15 [008] d..2. 22.560347: sched_switch: prev_comm=rcu_preempt prev_pid=0xf (15) prev_prio=0x78 (120) prev_state=0x80 (128) next_comm=swapper/8 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [002] d..2. 22.561724: sched_switch: prev_comm=swapper/2 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=xfce4-terminal next_pid=0x724 (1828) next_prio=0x78 (120)
<...>-1828 [002] d..2. 22.561822: sched_switch: prev_comm=xfce4-terminal prev_pid=0x724 (1828) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/2 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [008] d..2. 22.561826: sched_switch: prev_comm=swapper/8 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=Xorg next_pid=0x5cb (1483) next_prio=0x78 (120)
[...]
<...>-252 [006] d..2. 29.512070: sched_switch: prev_comm=kworker/6:2 prev_pid=0xfc (252) prev_prio=0x78 (120) prev_state=0x80 (128) next_comm=swapper/6 next_pid=0x0 (0) next_prio=0x78 (120)
<...>-153 [012] d..2. 29.512074: sched_switch: prev_comm=kworker/12:1 prev_pid=0x99 (153) prev_prio=0x78 (120) prev_state=0x80 (128) next_comm=swapper/12 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [011] d..2. 29.541390: sched_switch: prev_comm=swapper/11 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=systemd-shutdow next_pid=0x1 (1) next_prio=0x78 (120)
<...>-1 [011] d..2. 29.541912: sched_switch: prev_comm=systemd-shutdow prev_pid=0x1 (1) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=migration/11 next_pid=0x51 (81) next_prio=0x0 (0)
<...>-81 [011] d..2.
29.541933: sched_switch: prev_comm=migration/11 prev_pid=0x51 (81) prev_prio=0x0 (0) prev_state=0x1 (1) next_comm=swapper/11 next_pid=0x0 (0) next_prio=0x78 (120)
<idle>-0 [000] d..2. 29.541966: sched_switch: prev_comm=swapper/0 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=systemd-shutdow next_pid=0x1 (1) next_prio=0x78 (120)
<idle>-0 [012] d..2. 29.542130: sched_switch: prev_comm=swapper/12 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=rcu_preempt next_pid=0xf (15) next_prio=0x78 (120)
<...>-15 [012] d..2. 29.542142: sched_switch: prev_comm=rcu_preempt prev_pid=0xf (15) prev_prio=0x78 (120) prev_state=0x80 (128) next_comm=swapper/12 next_pid=0x0 (0) next_prio=0x78 (120)

So the time to shut down is ~120s with PREEMPT_RT and ~7s without.
The interesting difference between these two traces is that the second one
only contains messages with "status" d..2. while the first also contains
some with a different status (191 of 265126). Could these be the reason
for the delay?

$ grep -v d..2.
trace.txt # tracer: nop # # entries-in-buffer/entries-written: 265126/265126 #P:16 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <...>-1584 [011] D..22 62.779670: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [011] D..22 62.779702: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [011] D..22 64.779027: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [011] D..22 64.779052: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1653 [003] D..22 64.810070: sched_switch: prev_comm=Xorg prev_pid=0x675 (1653) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/3 next_pid=0x5c (92) next_prio=0x62 (98) <...>-1584 [011] D..22 66.778764: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [011] D..22 66.778786: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-150 [009] D..22 67.793707: sched_switch: prev_comm=kworker/u64:5 prev_pid=0x96 (150) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/9 next_pid=0x74 (116) next_prio=0x62 (98) <...>-148 [003] D..22 68.793596: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/3 
next_pid=0x5c (92) next_prio=0x62 (98) <...>-148 [003] D..22 70.794233: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/3 next_pid=0x5c (92) next_prio=0x62 (98) <...>-669 [009] D..21 70.794361: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/9 next_pid=0x0 (0) next_prio=0x78 (120) <...>-669 [009] D..21 71.214115: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/9 next_pid=0x74 (116) next_prio=0x62 (98) <...>-148 [003] D..21 71.794951: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=ktimers/3 next_pid=0x5d (93) next_prio=0x62 (98) <...>-148 [003] D..23 71.795947: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/3 next_pid=0x5c (92) next_prio=0x62 (98) <...>-148 [012] D..22 75.489983: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/12 next_pid=0x44 (68) next_prio=0x62 (98) <...>-1584 [012] D..22 76.776598: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [012] D..22 76.776629: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [012] D..22 78.776540: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-1584 [012] D..22 78.776588: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-671 [010] D..21 78.907165: 
sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=ktimers/10 next_pid=0x3d (61) next_prio=0x62 (98) <...>-1372 [012] D..22 80.920002: sched_switch: prev_comm=avahi-daemon prev_pid=0x55c (1372) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-669 [009] D..21 80.935518: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/9 next_pid=0x0 (0) next_prio=0x78 (120) <...>-669 [009] D..21 80.935533: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/9 next_pid=0x0 (0) next_prio=0x78 (120) <...>-100 [005] D..21 81.549584: sched_switch: prev_comm=rcuc/5 prev_pid=0x64 (100) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/108-amdgpu next_pid=0x1aa (426) next_prio=0x31 (49) <...>-148 [012] D..23 81.791530: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/12 next_pid=0x44 (68) next_prio=0x62 (98) <...>-671 [004] D..21 82.179446: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/4 next_pid=0x24 (36) next_prio=0x62 (98) <...>-150 [002] D..22 83.169248: sched_switch: prev_comm=kworker/u64:5 prev_pid=0x96 (150) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/2 next_pid=0x1c (28) next_prio=0x62 (98) <...>-669 [012] D..23 83.247178: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-152 [010] D..21 84.177041: sched_switch: prev_comm=kworker/u64:7 prev_pid=0x98 (152) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/10 next_pid=0x3c (60) next_prio=0x62 (98) <...>-159 [006] D..22 84.577962: sched_switch: prev_comm=kworker/u64:14 prev_pid=0x9f (159) 
prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/6 next_pid=0x2c (44) next_prio=0x62 (98) <...>-159 [006] D..21 84.579215: sched_switch: prev_comm=kworker/u64:14 prev_pid=0x9f (159) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=irq/110-mt7921e next_pid=0x29c (668) next_prio=0x31 (49) <...>-152 [004] D..22 84.789922: sched_switch: prev_comm=kworker/u64:7 prev_pid=0x98 (152) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/4 next_pid=0x24 (36) next_prio=0x62 (98) <...>-10 [000] D..21 86.208639: sched_switch: prev_comm=kworker/0:1 prev_pid=0xa (10) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/0 next_pid=0x14 (20) next_prio=0x62 (98) <...>-116 [009] D..21 86.254673: sched_switch: prev_comm=rcuc/9 prev_pid=0x74 (116) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=firefox- esr next_pid=0x7c5 (1989) next_prio=0x62 (98) <...>-76 [014] D..23 86.270830: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=irq_work/14 next_pid=0x4a (74) next_prio=0x62 (98) <...>-44 [006] D..23 86.270839: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/6:7 next_pid=0xc83 (3203) next_prio=0x78 (120) <...>-76 [014] D..23 86.270857: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=xdg- desktop-por next_pid=0x754 (1876) next_prio=0x78 (120) <...>-44 [006] D..23 86.270867: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.270880: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/14:4 next_pid=0x89e (2206) next_prio=0x78 (120) <...>-44 [006] D..23 86.270891: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 
[014] D..23 86.270903: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=xdg- desktop-por next_pid=0x754 (1876) next_prio=0x78 (120) <...>-44 [006] D..23 86.270911: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.270925: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Xorg next_pid=0x675 (1653) next_prio=0x78 (120) <...>-44 [006] D..23 86.270935: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.270949: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=xdg- desktop-por next_pid=0x754 (1876) next_prio=0x78 (120) <...>-44 [006] D..23 86.270961: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.270971: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Xorg next_pid=0x675 (1653) next_prio=0x78 (120) <...>-44 [006] D..23 86.270981: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.270997: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=xdg- desktop-por next_pid=0x754 (1876) next_prio=0x78 (120) <...>-44 [006] D..23 86.271009: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.271021: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Xorg 
next_pid=0x675 (1653) next_prio=0x78 (120) <...>-44 [006] D..23 86.271032: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.271043: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=xdg- desktop-por next_pid=0x754 (1876) next_prio=0x78 (120) <...>-44 [006] D..23 86.271062: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.271081: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Xorg next_pid=0x675 (1653) next_prio=0x78 (120) <...>-44 [006] D..23 86.271098: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.271110: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=xdg- desktop-por next_pid=0x754 (1876) next_prio=0x78 (120) <...>-44 [006] D..23 86.271124: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=dbus- daemon next_pid=0x65d (1629) next_prio=0x78 (120) <...>-76 [014] D..23 86.271145: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Xorg next_pid=0x675 (1653) next_prio=0x78 (120) <...>-76 [014] D..21 86.274667: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/14:4 next_pid=0x89e (2206) next_prio=0x78 (120) <...>-108 [007] D..21 86.274668: sched_switch: prev_comm=rcuc/7 prev_pid=0x6c (108) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/7:4 next_pid=0xd1e (3358) next_prio=0x78 (120) <...>-28 [002] D..21 86.274678: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) 
prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/2:2 next_pid=0x161 (353) next_prio=0x78 (120) <...>-76 [014] D..21 86.274682: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/14:4 next_pid=0x89e (2206) next_prio=0x78 (120) <...>-92 [003] D..21 86.274687: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/3:1 next_pid=0xe9 (233) next_prio=0x78 (120) <...>-28 [002] D..21 86.274691: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/2:2 next_pid=0x161 (353) next_prio=0x78 (120) <...>-76 [014] D..21 86.274695: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/u64:4 next_pid=0x95 (149) next_prio=0x78 (120) <...>-92 [003] D..21 86.274698: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/3:1 next_pid=0xe9 (233) next_prio=0x78 (120) <...>-28 [002] D..21 86.274702: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/2 next_pid=0x1e (30) next_prio=0x78 (120) <...>-76 [014] D..21 86.274705: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/u64:4 next_pid=0x95 (149) next_prio=0x78 (120) <...>-92 [003] D..21 86.274709: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/3 next_pid=0x5e (94) next_prio=0x78 (120) <...>-28 [002] D..21 86.274714: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Socket Thread next_pid=0x96f (2415) next_prio=0x78 (120) <...>-76 [014] D..21 86.274717: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/u64:4 next_pid=0x95 (149) next_prio=0x78 (120) <...>-92 [003] D..21 86.274721: 
sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-28 [002] D..21 86.274726: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Socket Thread next_pid=0x96f (2415) next_prio=0x78 (120) <...>-76 [014] D..21 86.274730: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/u64:4 next_pid=0x95 (149) next_prio=0x78 (120) <...>-140 [015] D..21 86.274732: sched_switch: prev_comm=rcuc/15 prev_pid=0x8c (140) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=systemd next_pid=0x1 (1) next_prio=0x78 (120) <...>-92 [003] D..21 86.274735: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-76 [014] D..21 86.274744: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/14 next_pid=0x0 (0) next_prio=0x78 (120) <...>-140 [015] D..21 86.274747: sched_switch: prev_comm=rcuc/15 prev_pid=0x8c (140) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/15:1 next_pid=0xce (206) next_prio=0x78 (120) <...>-92 [003] D..21 86.274754: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-76 [014] D..21 86.274761: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/14 next_pid=0x0 (0) next_prio=0x78 (120) <...>-140 [015] D..21 86.274767: sched_switch: prev_comm=rcuc/15 prev_pid=0x8c (140) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=systemd next_pid=0x1 (1) next_prio=0x78 (120) <...>-92 [003] D..21 86.274772: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-28 [002] D..21 
86.274775: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Socket Thread next_pid=0x96f (2415) next_prio=0x78 (120) <...>-76 [014] D..21 86.274779: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/14 next_pid=0x0 (0) next_prio=0x78 (120) <...>-140 [015] D..21 86.274787: sched_switch: prev_comm=rcuc/15 prev_pid=0x8c (140) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/15:1 next_pid=0xce (206) next_prio=0x78 (120) <...>-92 [003] D..21 86.274790: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-28 [002] D..21 86.274795: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Socket Thread next_pid=0x96f (2415) next_prio=0x78 (120) <...>-76 [014] D..21 86.274801: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/14 next_pid=0x0 (0) next_prio=0x78 (120) <...>-140 [015] D..21 86.274804: sched_switch: prev_comm=rcuc/15 prev_pid=0x8c (140) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=systemd next_pid=0x1 (1) next_prio=0x78 (120) <...>-92 [003] D..21 86.274812: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-28 [002] D..21 86.274816: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Socket Thread next_pid=0x96f (2415) next_prio=0x78 (120) <...>-76 [014] D..21 86.274822: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/14 next_pid=0x0 (0) next_prio=0x78 (120) <...>-92 [003] D..21 86.274834: sched_switch: prev_comm=rcuc/3 prev_pid=0x5c (92) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-28 
[002] D..21 86.274837: sched_switch: prev_comm=rcuc/2 prev_pid=0x1c (28) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=Socket Thread next_pid=0x96f (2415) next_prio=0x78 (120) <...>-76 [014] D..21 86.274843: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/u64:4 next_pid=0x95 (149) next_prio=0x78 (120) <...>-44 [006] D..23 86.276634: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=irq_work/6 next_pid=0x2a (42) next_prio=0x62 (98) <...>-60 [010] D..23 86.276634: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/10 next_pid=0x3e (62) next_prio=0x78 (120) <...>-44 [006] D..23 86.276657: sched_switch: prev_comm=rcuc/6 prev_pid=0x2c (44) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/6 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276672: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276762: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/4:3 next_pid=0xceb (3307) next_prio=0x78 (120) <...>-60 [010] D..23 86.276773: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276784: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/4:3 next_pid=0xceb (3307) next_prio=0x78 (120) <...>-60 [010] D..23 86.276797: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276807: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=WebKitNetworkPr next_pid=0xc68 (3176) 
next_prio=0x78 (120) <...>-60 [010] D..23 86.276818: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276833: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276842: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276857: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276867: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276880: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276892: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276903: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=pool- spawner next_pid=0xbc8 (3016) next_prio=0x78 (120) <...>-60 [010] D..23 86.276921: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276933: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276948: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) 
next_prio=0x78 (120) <...>-36 [004] D..23 86.276958: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276973: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.276982: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.276996: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277007: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277021: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277031: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=gmain next_pid=0x8f6 (2294) next_prio=0x78 (120) <...>-60 [010] D..23 86.277042: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277055: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277066: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277077: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 
(120) <...>-60 [010] D..23 86.277091: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277101: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277111: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277122: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277130: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277139: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277149: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277158: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277166: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277175: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277184: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 
[004] D..23 86.277192: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277201: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277210: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277219: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277228: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-60 [010] D..23 86.277239: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/10 next_pid=0x0 (0) next_prio=0x78 (120) <...>-36 [004] D..23 86.277248: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/4 next_pid=0x0 (0) next_prio=0x78 (120) <...>-52 [008] D..21 86.308651: sched_switch: prev_comm=rcuc/8 prev_pid=0x34 (52) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/8 next_pid=0x36 (54) next_prio=0x78 (120) <...>-68 [012] D..21 86.308652: sched_switch: prev_comm=rcuc/12 prev_pid=0x44 (68) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=systemd- logind next_pid=0x56c (1388) next_prio=0x78 (120) <...>-52 [008] D..21 86.308658: sched_switch: prev_comm=rcuc/8 prev_pid=0x34 (52) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/8:1 next_pid=0xaf (175) next_prio=0x78 (120) <...>-20 [000] D..21 86.311699: sched_switch: prev_comm=rcuc/0 prev_pid=0x14 (20) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/6- AMDI0010: next_pid=0x159 (345) next_prio=0x31 
(49) <...>-1535 [000] D..21 86.333765: sched_switch: prev_comm=isc-loop-0001 prev_pid=0x5ff (1535) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/0 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1538 [001] D..21 86.333767: sched_switch: prev_comm=isc-loop-0004 prev_pid=0x602 (1538) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/1 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1541 [007] D..21 86.333768: sched_switch: prev_comm=isc-loop-0005 prev_pid=0x605 (1541) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/7 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1547 [013] D..21 86.333780: sched_switch: prev_comm=isc-loop-0009 prev_pid=0x60b (1547) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/13 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1550 [005] D..21 86.333789: sched_switch: prev_comm=isc-loop-0011 prev_pid=0x60e (1550) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/5 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1517 [009] D..21 86.333796: sched_switch: prev_comm=named prev_pid=0x5ed (1517) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/9 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1549 [003] D..21 86.334055: sched_switch: prev_comm=isc-loop-0010 prev_pid=0x60d (1549) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/3 next_pid=0x0 (0) next_prio=0x78 (120) <...>-1552 [012] D..21 86.334064: sched_switch: prev_comm=isc-loop-0012 prev_pid=0x610 (1552) prev_prio=0x78 (120) prev_state=0x2 (2) next_comm=swapper/12 next_pid=0x0 (0) next_prio=0x78 (120) <...>-124 [011] D..21 86.339629: sched_switch: prev_comm=rcuc/11 prev_pid=0x7c (124) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=named next_pid=0x5ed (1517) next_prio=0x62 (98) <...>-669 [009] D..21 86.384605: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/9 next_pid=0x74 (116) next_prio=0x62 (98) <...>-669 [009] D..21 86.422599: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) 
prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/9 next_pid=0x74 (116) next_prio=0x62 (98) <...>-669 [009] D..22 86.435597: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/9 next_pid=0x74 (116) next_prio=0x62 (98) <...>-669 [009] D..24 86.599564: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29d (669) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/9 next_pid=0x74 (116) next_prio=0x62 (98) <...>-671 [010] D..23 86.642556: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/10 next_pid=0x3c (60) next_prio=0x62 (98) <...>-100 [005] D..22 87.291482: sched_switch: prev_comm=rcuc/5 prev_pid=0x64 (100) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/108-amdgpu next_pid=0x1aa (426) next_prio=0x31 (49) <...>-108 [007] D..21 87.291485: sched_switch: prev_comm=rcuc/7 prev_pid=0x6c (108) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/7 next_pid=0x0 (0) next_prio=0x78 (120) <...>-100 [005] D..21 87.291494: sched_switch: prev_comm=rcuc/5 prev_pid=0x64 (100) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/5 next_pid=0x0 (0) next_prio=0x78 (120) <...>-100 [005] D..21 87.291513: sched_switch: prev_comm=rcuc/5 prev_pid=0x64 (100) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/108-amdgpu next_pid=0x1aa (426) next_prio=0x31 (49) <...>-671 [010] D..23 87.871456: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=ktimers/10 next_pid=0x3d (61) next_prio=0x62 (98) <...>-68 [012] D..21 88.253434: sched_switch: prev_comm=rcuc/12 prev_pid=0x44 (68) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=irq_work/12 next_pid=0x42 (66) next_prio=0x62 (98) <...>-76 [014] D..21 88.253437: sched_switch: prev_comm=rcuc/14 prev_pid=0x4c (76) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=swapper/14 next_pid=0x0 (0) next_prio=0x78 (120) <...>-68 [012] 
D..21 88.253469: sched_switch: prev_comm=rcuc/12 prev_pid=0x44 (68) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/12:4 next_pid=0xa58 (2648) next_prio=0x78 (120) <...>-100 [005] D..21 88.253472: sched_switch: prev_comm=rcuc/5 prev_pid=0x64 (100) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/5 next_pid=0x66 (102) next_prio=0x78 (120) <...>-116 [009] D..21 88.256459: sched_switch: prev_comm=rcuc/9 prev_pid=0x74 (116) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=kworker/9:1 next_pid=0xb0 (176) next_prio=0x78 (120) <...>-100 [005] D..21 88.256462: sched_switch: prev_comm=rcuc/5 prev_pid=0x64 (100) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/5 next_pid=0x66 (102) next_prio=0x78 (120) <...>-36 [004] D..21 88.277433: sched_switch: prev_comm=rcuc/4 prev_pid=0x24 (36) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=systemd next_pid=0x63f (1599) next_prio=0x78 (120) <...>-671 [012] D..23 88.383431: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/12 next_pid=0x44 (68) next_prio=0x62 (98) <...>-3500 [014] D..22 134.060263: sched_switch: prev_comm=kworker/u64:29 prev_pid=0xdac (3500) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/14 next_pid=0x4c (76) next_prio=0x62 (98) <...>-148 [012] D..23 156.035218: sched_switch: prev_comm=kworker/u64:3 prev_pid=0x94 (148) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/12 next_pid=0x44 (68) next_prio=0x62 (98) <...>-671 [011] D..23 156.374210: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/11 next_pid=0x7c (124) next_prio=0x62 (98) <...>-671 [012] D..22 156.886187: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/12 next_pid=0x44 (68) next_prio=0x62 (98) <...>-578 [008] D..21 178.359189: sched_switch: prev_comm=kworker/8:2 prev_pid=0x242 (578) prev_prio=0x78 (120) 
prev_state=0x100 (256) next_comm=kworker/8:1 next_pid=0xaf (175) next_prio=0x78 (120) <...>-143 [015] D..21 178.470232: sched_switch: prev_comm=kworker/15:0 prev_pid=0x8f (143) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=kworker/15:1 next_pid=0xce (206) next_prio=0x78 (120) <...>-671 [012] D..21 178.523343: sched_switch: prev_comm=napi/phy0-0 prev_pid=0x29f (671) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) <...>-20 [000] D..21 178.891146: sched_switch: prev_comm=rcuc/0 prev_pid=0x14 (20) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=ksoftirqd/0 next_pid=0xe (14) next_prio=0x78 (120) <...>-116 [009] D..21 178.938182: sched_switch: prev_comm=rcuc/9 prev_pid=0x74 (116) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=umount next_pid=0xe27 (3623) next_prio=0x62 (98) <...>-116 [009] D..21 178.947159: sched_switch: prev_comm=rcuc/9 prev_pid=0x74 (116) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=umount next_pid=0xe27 (3623) next_prio=0x62 (98) <...>-68 [012] D..21 179.004152: sched_switch: prev_comm=rcuc/12 prev_pid=0x44 (68) prev_prio=0x62 (98) prev_state=0x2 (2) next_comm=umount next_pid=0xe27 (3623) next_prio=0x62 (98) <...>-132 [013] D..21 179.070272: sched_switch: prev_comm=rcuc/13 prev_pid=0x84 (132) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/87-nvme0q14 next_pid=0x181 (385) next_prio=0x31 (49) <...>-60 [010] D..21 181.208158: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/83-nvme0q11 next_pid=0x17b (379) next_prio=0x31 (49) <...>-60 [010] D..21 181.321060: sched_switch: prev_comm=rcuc/10 prev_pid=0x3c (60) prev_prio=0x62 (98) prev_state=0x100 (256) next_comm=irq/83-nvme0q11 next_pid=0x17b (379) next_prio=0x31 (49) <...>-1 [002] D..22 182.672954: sched_switch: prev_comm=systemd-shutdow prev_pid=0x1 (1) prev_prio=0x78 (120) prev_state=0x100 (256) next_comm=rcuc/2 next_pid=0x1c (28) next_prio=0x62 (98) Bert 
Karwatzki ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-23  0:35 ` Bert Karwatzki
@ 2026-02-23  8:22 ` Steven Rostedt
  2026-02-23 13:36 ` Bert Karwatzki
  0 siblings, 1 reply; 41+ messages in thread
From: Steven Rostedt @ 2026-02-23 8:22 UTC (permalink / raw)
To: Bert Karwatzki
Cc: Calvin Owens, Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner,
    dschatzberg, peterz, linux-kernel, linux-rt-devel

On Mon, 23 Feb 2026 01:35:36 +0100
Bert Karwatzki <spasswolf@web.de> wrote:

> So the time to shutdown is ~120s with PREEMPT_RT and 7s without.
> 
> The interesting difference between these two traces is that the second one only
> contains messages with "status" d..2. while the first also contains some with different status
> (191 of 265126). Could these be the reason for the delay?
> 
> $ grep -v d..2. trace.txt
> 
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 265126/265126 #P:16
> #
> # _-----=> irqs-off/BH-disabled
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / _-=> migrate-disable
> # |||| / delay
> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> # | | | ||||| | |
> <...>-1584 [011] D..22 62.779670: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256)
> next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98)

The 'D' means both interrupts 'd' and softirqs 'b' are disabled.

The last number is the migrate-disable count, which means the task is
pinned to a CPU. That may be an issue if the system is trying to take
down a CPU and there's a task pinned to it.

Now that we know that the persistent ring buffer works, we can add even
more debugging. We could see where things are stuck...

cd /sys/kernel/tracing/instances/boot_map
echo 'stacktrace if prev_state & 3' > events/sched/sched_switch/trigger

That will do a stacktrace at every location that schedules out in a
non-running state. That way we can see what is waiting for something to
finish.
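The trigger condition above can be sketched in C. The bit values below follow the task state flags in include/linux/sched.h (TASK_INTERRUPTIBLE is 'S', TASK_UNINTERRUPTIBLE is 'D'); they can drift between kernel versions, and the names here are illustrative rather than the kernel's own macros:

```c
/* Illustrative state bits, modeled on include/linux/sched.h: */
#define STATE_INTERRUPTIBLE   0x0001 /* 'S': sleeping, signals wake it   */
#define STATE_UNINTERRUPTIBLE 0x0002 /* 'D': sleeping, signals ignored   */
#define STATE_DEAD            0x0080 /* final schedule of an exiting task */

/* Matches 'stacktrace if prev_state & 3': fires only when the previous
 * task scheduled out while sleeping (S or D), not when it was preempted
 * while runnable (prev_state == 0) or when it was exiting. */
static int trigger_fires(unsigned int prev_state)
{
	return (prev_state & (STATE_INTERRUPTIBLE | STATE_UNINTERRUPTIBLE)) != 0;
}
```

So the prev_state=0x80 and prev_state=0x100 records in the traces above would not fire this trigger; only genuinely sleeping tasks do.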
Then in a separate boot, we may want to see where things are pinned.

echo 'stacktrace if common_flags & 0xf00' > events/sched/sched_switch/trigger

That will do a stacktrace every time a task schedules out with migration
disabled.

-- Steve
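The second trigger can be sketched the same way. The assumption here, taken from the description above rather than from kernel headers, is that the migrate-disable count sits in bits 8-11 (the 0xf00 nibble) of the event's common_flags; verify against the running kernel's event format file before relying on it:

```c
/* Extract the migrate-disable depth tested by
 * 'stacktrace if common_flags & 0xf00' (nibble placement assumed
 * from the explanation above). */
static unsigned int migrate_disable_depth(unsigned int common_flags)
{
	return (common_flags & 0xf00) >> 8;
}

/* The trigger fires whenever the depth is non-zero, i.e. the task
 * scheduled out while pinned to its current CPU. */
static int pinned_at_switch(unsigned int common_flags)
{
	return migrate_disable_depth(common_flags) != 0;
}
```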
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT 2026-02-23 8:22 ` Steven Rostedt @ 2026-02-23 13:36 ` Bert Karwatzki 2026-02-23 23:36 ` Bert Karwatzki 0 siblings, 1 reply; 41+ messages in thread From: Bert Karwatzki @ 2026-02-23 13:36 UTC (permalink / raw) To: Steven Rostedt Cc: Calvin Owens, Tejun Heo, spasswolf, Sebastian Andrzej Siewior, Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel Am Montag, dem 23.02.2026 um 03:22 -0500 schrieb Steven Rostedt: > On Mon, 23 Feb 2026 01:35:36 +0100 > Bert Karwatzki <spasswolf@web.de> wrote: > > > So the time to was is ~120s with PREEMPT_RT and 7s without. > > > > The interesting difference between these two traces is that the second one only > > contains messages with "status" d..2. while the first also contains some with different status > > (191 of 265126). Could these be the reason for the delay. > > > > $ grep -v d..2. trace.txt > > > > # tracer: nop > > # > > # entries-in-buffer/entries-written: 265126/265126 #P:16 > > # > > # _-----=> irqs-off/BH-disabled > > # / _----=> need-resched > > # | / _---=> hardirq/softirq > > # || / _--=> preempt-depth > > # ||| / _-=> migrate-disable > > # |||| / delay > > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > > # | | | ||||| | | > > <...>-1584 [011] D..22 62.779670: sched_switch: prev_comm=ntpd prev_pid=0x630 (1584) prev_prio=0x78 (120) prev_state=0x100 (256) > > next_comm=mt76-tx phy0 next_pid=0x5fb (1531) next_prio=0x62 (98) > > The 'D' means both interrupts 'd' and softirqs 'b' are disabled. > > The last number is migrate disable which means the task is pinned to a > CPU. That may be an issue if the system is trying to take down a CPU > and there's a task pinned to it. > > Now that we know that the persistent ring buffer works, we can add even > more debugging. We could see where things are stuck... 
> 
> cd /sys/kernel/tracing/instances/boot_map
> echo 'stacktrace if prev_state & 3' > events/sched/sched_switch/trigger
> 
> That will do a stacktrace at every location that schedules out in a
> non-running state. That way we can see what is waiting for something to
> finish.

I tried that twice and got these results:

Commands:

echo 1 > /sys/kernel/tracing/instances/boot_map/events/sched/sched_switch/enable
echo 'stacktrace if prev_state & 3' > /sys/kernel/tracing/instances/boot_map/events/sched/sched_switch/trigger
echo 1 > /sys/kernel/tracing/instances/boot_map/tracing_on
reboot

The first time (despite the 265000 entries) the reboot happened at normal
speed, even though the timestamps suggest something different:

# entries-in-buffer/entries-written: 265170/265170 #P:16
[...]
<...>-158 [009] d..2. 98.680157: sched_switch: prev_comm=kworker/u64:13 prev_pid=0x9e (158) prev_prio=0x78 (120) prev_state=0x80 (128) next_comm=swapper/9 next_pid=0x0 (0) next_prio=0x78 (120)
[...]
<...>-70 [012] d..2. 166.155040: sched_switch: prev_comm=ksoftirqd/12 prev_pid=0x46 (70) prev_prio=0x78 (120) prev_state=0x1 (1) next_comm=swapper/12 next_pid=0x0 (0) next_prio=0x78 (120)

Here are the messages from /var/log/kern.log that show the last messages of
the old kernel and the first message of the new:

2026-02-23T13:57:38.441240+01:00 lisa kernel: [ T156] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
2026-02-23T13:58:04.876308+01:00 lisa kernel: [ T0] Linux version 6.19.0-trace (bert@lisa) (gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #1160 SMP PREEMPT_RT Sat Feb 21 00:22:46 CET 2026

So while the timestamps in the trace suggest that the reboot took at least
68 seconds, /var/log/kern.log shows that it took only 26s at most. As I
observed the whole thing I'm sure kern.log is correct here.
Then I repeated the process above and this time the shutdown got stuck
again, but the trace was actually shorter:

# entries-in-buffer/entries-written: 195123/195123 #P:16
[...]
<idle>-0 [002] d..2. 135.554543: sched_switch: prev_comm=swapper/2 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=xfce4-terminal next_pid=0x7b7 (1975) next_prio=0x78 (120)
[...]
<idle>-0 [000] d..2. 242.301841: sched_switch: prev_comm=swapper/0 prev_pid=0x0 (0) prev_prio=0x78 (120) prev_state=0x0 (0) next_comm=systemd-shutdow next_pid=0x1 (1) next_prio=0x78 (120)

These are the lines from kern.log that show the longer delay:

2026-02-23T13:59:19.348451+01:00 lisa kernel: [ T146] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
2026-02-23T14:02:13.817354+01:00 lisa kernel: [ T0] Linux version 6.19.0-trace (bert@lisa) (gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #1160 SMP PREEMPT_RT Sat Feb 21 00:22:46 CET 2026

So this time we have a trace that took about 107s and also a longer delay
in kern.log. The kern.log delay here is not the time it took for the
reboot, though, as there might have been a delay between the last log
message of the old kernel and the reboot command. From now on I'll reboot
like this to get a proper timestamp in /var/log/kern.log:

# echo reboot > /dev/kmsg && reboot

Bert Karwatzki
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-23 13:36 ` Bert Karwatzki
@ 2026-02-23 23:36 ` Bert Karwatzki
  2026-02-24 12:44 ` Bert Karwatzki
  2026-02-24 14:20 ` Steven Rostedt
  0 siblings, 2 replies; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-23 23:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Calvin Owens, Tejun Heo, spasswolf, Sebastian Andrzej Siewior,
    Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel

As the bisection suggested that commit 9311e6c29b34 ("cgroup: Fix sleeping
from invalid context warning on PREEMPT_RT") is somehow causing the
problem, I put some printk()s in the code changed by this commit and
captured the output via netconsole (I tried using trace_printk() to use
the persistent ringbuffer but got no output).

This is the debug patch (for v6.19):

commit 655ad0d7ce2d03b1b8bfbc2a3e6c36b46a4604c5
Author: Bert Karwatzki <spasswolf@web.de>
Date:   Mon Feb 23 21:49:47 2026 +0100

    cgroup: add printk() cgroup_task_dead for PREEMPT_RT

    Signed-off-by: Bert Karwatzki <spasswolf@web.de>

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f0d33b04910..16bb40df5dd0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7029,6 +7029,7 @@ static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
 {
 	struct llist_node *lnode;
 	struct task_struct *task, *next;
+	printk(KERN_INFO "%s:\n", __func__);
 
 	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
 	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
@@ -7050,6 +7051,7 @@ static void __init cgroup_rt_init(void)
 
 void cgroup_task_dead(struct task_struct *task)
 {
+	printk(KERN_INFO "%s: task = %s\n", __func__, task->comm);
 	get_task_struct(task);
 	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
 	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));

This is the output captured via netconsole after rebooting via

# echo reboot > /dev/kmsg && reboot

2026-02-24T00:17:38.836655+01:00 localhost
12,5335,211298089,-,caller=T3211;reboot 2026-02-24T00:17:38.848658+01:00 localhost 6,5336,211309667,-,caller=T0;cgroup_task_dead: task = reboot 2026-02-24T00:17:38.848658+01:00 localhost 6,5337,211310223,-,caller=T50;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:17:38.852694+01:00 localhost 6,5338,211313770,-,caller=T0;cgroup_task_dead: task = gmain 2026-02-24T00:17:38.852694+01:00 localhost 6,5339,211313776,-,caller=T0;cgroup_task_dead: task = xfce4-sess:cs0 2026-02-24T00:17:38.852694+01:00 localhost 6,5340,211313798,-,caller=T0;cgroup_task_dead: task = xfce4-sess:sh0 2026-02-24T00:17:38.852694+01:00 localhost 6,5341,211313810,-,caller=T0;cgroup_task_dead: task = xfce4-:sh_opt0 2026-02-24T00:17:38.852694+01:00 localhost 6,5342,211313830,-,caller=T0;cgroup_task_dead: task = xfce4-:traceq0 2026-02-24T00:17:38.852694+01:00 localhost 6,5343,211313840,-,caller=T0;cgroup_task_dead: task = pool-spawner 2026-02-24T00:17:38.852694+01:00 localhost 6,5344,211313839,-,caller=T0;cgroup_task_dead: task = xfce4-:traceq0 2026-02-24T00:17:38.852694+01:00 localhost 6,5345,211313868,-,caller=T0;cgroup_task_dead: task = xfce4-session 2026-02-24T00:17:38.852694+01:00 localhost 6,5346,211313871,-,caller=T0;cgroup_task_dead: task = xfce4-:traceq0 2026-02-24T00:17:38.852694+01:00 localhost 6,5347,211313874,-,caller=T0;cgroup_task_dead: task = xfce4-s:disk$0 [...] 
Here comes the part which shows the delay (I did not remove any message here): 2026-02-24T00:17:40.964180+01:00 localhost 6,6838,213423568,-,caller=T0;cgroup_task_dead: task = (sd-close) 2026-02-24T00:17:40.964244+01:00 localhost 6,6839,213423756,-,caller=T0;cgroup_task_dead: task = systemctl 2026-02-24T00:17:40.964576+01:00 localhost 6,6840,213424221,-,caller=T34;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:17:40.964649+01:00 localhost 6,6841,213424221,-,caller=T23;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:17:40.964969+01:00 localhost 6,6842,213424435,-,caller=T0;cgroup_task_dead: task = (sd-close) 2026-02-24T00:17:40.965057+01:00 localhost 6,6843,213425219,-,caller=T50;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:17:44.025094+01:00 localhost 6,6844,216485806,-,caller=T0;cgroup_task_dead: task = (udev-worker) 2026-02-24T00:17:44.025094+01:00 localhost 6,6845,216485810,-,caller=T0;cgroup_task_dead: task = (udev-worker) 2026-02-24T00:17:44.025387+01:00 localhost 6,6846,216485816,-,caller=T0;cgroup_task_dead: task = (udev-worker) 2026-02-24T00:17:44.025642+01:00 localhost 6,6847,216486228,-,caller=T66;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:17:44.025718+01:00 localhost 6,6848,216486239,-,caller=T122;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:17:44.025923+01:00 localhost 6,6849,216486239,-,caller=T82;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:19:11.070444+01:00 localhost 6,6850,303512419,-,caller=T0;cgroup_task_dead: task = systemd 2026-02-24T00:19:11.070444+01:00 localhost 6,6851,303513238,-,caller=T34;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:19:11.070725+01:00 localhost 6,6852,303513728,-,caller=T0;cgroup_task_dead: task = (sd-pam) 2026-02-24T00:19:11.070888+01:00 localhost 6,6853,303514010,-,caller=T0;cgroup_task_dead: task = (sd-close) 2026-02-24T00:19:11.071192+01:00 localhost 6,6854,303514221,-,caller=T90;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:19:11.071291+01:00 localhost 6,6855,303514238,-,caller=T50;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:19:11.071626+01:00 localhost 
6,6856,303514629,-,caller=T0;cgroup_task_dead: task = (sd-close) 2026-02-24T00:19:11.072139+01:00 localhost 6,6857,303515237,-,caller=T50;cgrp_dead_tasks_iwork_fn: 2026-02-24T00:19:11.107473+01:00 localhost 6,6858,303549400,-,caller=T0;cgroup_task_dead: task = psimon 2026-02-24T00:19:11.107473+01:00 localhost 6,6859,303550244,-,caller=T66;cgrp_dead_tasks_iwork_fn: [...] I've tried this several times and the function calls just before the delay are always the same. Bert Karwatzki ^ permalink raw reply related [flat|nested] 41+ messages in thread
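To make the instrumented path easier to follow: on PREEMPT_RT, cgroup_task_dead() only pushes the dead task onto a per-CPU lock-less list and queues an irq_work, and the actual cleanup (do_cgroup_task_dead()) runs later from cgrp_dead_tasks_iwork_fn(). A minimal user-space approximation of that producer/drain split, with a plain singly-linked list standing in for the kernel's llist/irq_work primitives (all names here are illustrative):

```c
#include <stddef.h>

/* Stand-in for the llist node embedded in task_struct. */
struct dead_node {
	struct dead_node *next;
};

/* Producer side, as in cgroup_task_dead(): push is O(1) and does not
 * take the lock the deferred handler will later need (css_set_lock in
 * the kernel), which is why it is safe from finish_task_switch(). */
static void dead_list_add(struct dead_node **head, struct dead_node *n)
{
	n->next = *head;
	*head = n;
}

/* Consumer side, as in cgrp_dead_tasks_iwork_fn(): detach the whole
 * list at once, then process each entry.  Returns how many entries
 * were drained. */
static int dead_list_drain(struct dead_node **head)
{
	struct dead_node *n = *head;
	int count = 0;

	*head = NULL;
	while (n) {
		struct dead_node *next = n->next;
		/* do_cgroup_task_dead(task) would run here */
		n = next;
		count++;
	}
	return count;
}
```

The log above matches this split: each "cgroup_task_dead: task = ..." line is a push from the dying task's final schedule, and each "cgrp_dead_tasks_iwork_fn:" line is a later drain on that CPU.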
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-23 23:36 ` Bert Karwatzki
@ 2026-02-24 12:44 ` Bert Karwatzki
  2026-02-24 12:58 ` Bert Karwatzki
  2026-02-24 14:20 ` Steven Rostedt
  0 siblings, 1 reply; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-24 12:44 UTC (permalink / raw)
To: Steven Rostedt
Cc: Calvin Owens, spasswolf, Tejun Heo, Sebastian Andrzej Siewior,
    Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel

I've done some more monitoring with this debug patch, which monitors
cgroup_task_dead() and the function that calls it, finish_task_switch().
To avoid too many messages, some printk()s are filtered by command name
(a previous patch showed systemd to be the problematic process):

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f0d33b04910..7bb6931a4d86 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6990,25 +6990,32 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 {
 	struct css_set *cset;
 	unsigned long flags;
+	printk(KERN_INFO "%s 0: task = %px\n", __func__, tsk);
 
 	spin_lock_irqsave(&css_set_lock, flags);
+	printk(KERN_INFO "%s 1: task = %px\n", __func__, tsk);
 
 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
+	printk(KERN_INFO "%s 2: task = %px\n", __func__, tsk);
 	css_set_move_task(tsk, cset, NULL, false);
+	printk(KERN_INFO "%s 3: task = %px\n", __func__, tsk);
 	cset->nr_tasks--;
 
 	/* matches the signal->live check in css_task_iter_advance() */
 	if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
 		list_add_tail(&tsk->cg_list, &cset->dying_tasks);
+	printk(KERN_INFO "%s 4: task = %px\n", __func__, tsk);
 
 	if (dl_task(tsk))
 		dec_dl_tasks_cs(tsk);
+	printk(KERN_INFO "%s 5: task = %px\n", __func__, tsk);
 
 	WARN_ON_ONCE(cgroup_task_frozen(tsk));
 	if (unlikely(!(tsk->flags & PF_KTHREAD) &&
 		     test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
 		cgroup_update_frozen(task_dfl_cgroup(tsk));
+	printk(KERN_INFO "%s 6: task = %px\n", __func__, tsk);
 	spin_unlock_irqrestore(&css_set_lock, flags);
 }

@@ -7029,9 +7036,11 @@ static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
 {
 	struct llist_node *lnode;
 	struct task_struct *task, *next;
+	printk(KERN_INFO "%s:\n", __func__);

 	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
 	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
+		printk(KERN_INFO "%s: %px %s", __func__, task, task->comm);
 		do_cgroup_task_dead(task);
 		put_task_struct(task);
 	}
@@ -7050,6 +7059,7 @@ static void __init cgroup_rt_init(void)

 void cgroup_task_dead(struct task_struct *task)
 {
+	printk(KERN_INFO "%s: task = %px (%s)\n", __func__, task, task->comm);
 	get_task_struct(task);
 	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
 	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
@@ -7059,6 +7069,7 @@ static void __init cgroup_rt_init(void) {}

 void cgroup_task_dead(struct task_struct *task)
 {
+	printk(KERN_INFO "%s: task = %px (%s)\n", __func__, task, task->comm);
 	do_cgroup_task_dead(task);
 }
 #endif /* CONFIG_PREEMPT_RT */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 854984967fe2..73e477d8cf1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5078,6 +5078,8 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	struct rq *rq = this_rq();
 	struct mm_struct *mm = rq->prev_mm;
 	unsigned int prev_state;
+	if (!strcmp(prev->comm, "systemd"))
+		printk(KERN_INFO "%s 0: %px\n", __func__, prev);

 	/*
 	 * The previous task will have left us with a preempt_count of 2
@@ -5144,15 +5146,18 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	}

 	if (unlikely(prev_state == TASK_DEAD)) {
+		printk(KERN_INFO "%s 1: %px (%s)\n", __func__, prev, prev->comm);
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);

+		printk(KERN_INFO "%s 2: %px (%s)\n", __func__, prev, prev->comm);
 		/*
 		 * sched_ext_dead() must come before cgroup_task_dead() to
 		 * prevent cgroups from being removed while its member tasks are
 		 * visible to SCX schedulers.
 		 */
 		sched_ext_dead(prev);
+		printk(KERN_INFO "%s 3: %px (%s)\n", __func__, prev, prev->comm);
 		cgroup_task_dead(prev);

 		/* Task is done with its stack. */
@@ -5202,6 +5207,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
+	if (!strcmp(prev->comm, "systemd"))
+		printk(KERN_INFO "%s 0: %px\n", __func__, prev);
 	prepare_task_switch(rq, prev, next);

 	/*

I also tried monitoring schedule_tail(), which is one of the two
functions calling finish_task_switch(), but that did get rid of the
delay.

The result from this is:

2026-02-24T13:13:21.739185+01:00 localhost 12,32039,34364889,-,caller=T2955;reboot
[...]
Here the delay section begins (all messages here have comm == "systemd"):
2026-02-24T13:13:38.124013+01:00 localhost 6,45456,50748843,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:13:38.124234+01:00 localhost 6,45457,50748851,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:13:38.271984+01:00 localhost 6,45458,50896237,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:13:38.272255+01:00 localhost 6,45459,50896244,-,caller=T52;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:13:38.272556+01:00 localhost 6,45460,50896287,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:13:38.272759+01:00 localhost 6,45461,50896289,-,caller=T51;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:13:38.273225+01:00 localhost 6,45462,50896303,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:13:38.273406+01:00 localhost 6,45463,50896305,-,caller=T51;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:13:38.273819+01:00 localhost 6,45464,50896481,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:13:38.274025+01:00 localhost 6,45465,50896484,-,caller=T0;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:13:38.274465+01:00 localhost 6,45466,50896801,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:13:38.274635+01:00 localhost 6,45467,50896804,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:13:38.275034+01:00 localhost 6,45468,50897328,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:13:38.275254+01:00 localhost 6,45469,50897331,-,caller=T2077;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:13:38.275680+01:00 localhost 6,45470,50897600,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:13:38.275886+01:00 localhost 6,45471,50897602,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:13:38.276300+01:00 localhost 6,45472,50897936,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:13:38.276487+01:00 localhost 6,45473,50897939,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:13:38.276907+01:00 localhost 6,45474,50898225,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:13:38.279144+01:00 localhost 6,45475,50898228,-,caller=T68;finish_task_switch 0: ffff97fd40884300

[...] 1505 similar lines removed

2026-02-24T13:14:54.000427+01:00 localhost 6,46981,126614226,-,caller=T44;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:14:54.000967+01:00 localhost 6,46982,126614420,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:14:54.001060+01:00 localhost 6,46983,126614422,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:14:54.001562+01:00 localhost 6,46984,126614462,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:14:54.001666+01:00 localhost 6,46985,126614463,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:14:54.002182+01:00 localhost 6,46986,126614533,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:14:54.002296+01:00 localhost 6,46987,126614534,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:14:54.002811+01:00 localhost 6,46988,126615219,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:14:54.002910+01:00 localhost 6,46989,126615221,-,caller=T42;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:14:54.003402+01:00 localhost 6,46990,126616376,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:14:54.003510+01:00 localhost 6,46991,126616378,-,caller=T0;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:14:54.004034+01:00 localhost 6,46992,126621884,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:14:54.004125+01:00 localhost 6,46993,126621887,-,caller=T0;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:14:54.004645+01:00 localhost 6,46994,126625311,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:14:54.004734+01:00 localhost 6,46995,126625314,-,caller=T235;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:14:54.013405+01:00 localhost 6,46996,126637601,-,caller=T1;context_switch 0: ffff97fd40884300
2026-02-24T13:14:54.013614+01:00 localhost 6,46997,126637603,-,caller=T0;finish_task_switch 0: ffff97fd40884300
2026-02-24T13:14:54.014165+01:00 localhost 6,46998,126638229,-,caller=T1573;context_switch 0: ffff97fd46050000
2026-02-24T13:14:54.014266+01:00 localhost 6,46999,126638232,-,caller=T44;finish_task_switch 0: ffff97fd46050000
2026-02-24T13:14:54.014774+01:00 localhost 6,47000,126638469,-,caller=T1573;context_switch 0: ffff97fd46050000
END

So there's something strange going on here with the scheduler.

Bert Karwatzki

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-24 12:44 ` Bert Karwatzki
@ 2026-02-24 12:58   ` Bert Karwatzki
  0 siblings, 0 replies; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-24 12:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Calvin Owens, spasswolf@web.de, Tejun Heo, Sebastian Andrzej Siewior,
	Thomas Gleixner, dschatzberg, peterz, linux-kernel, linux-rt-devel

Am Dienstag, dem 24.02.2026 um 13:44 +0100 schrieb Bert Karwatzki:
> I've done some more monitoring with this debug patch which monitors
> cgroup_task_dead() and the function which calls it, finish_task_switch().
> To avoid too many messages some printk()s are filtered by command name
> (a previous patch showed systemd to be the problematic process):
>
> [...]
>
> I also tried monitoring schedule_tail(), which is one of the two
> functions calling finish_task_switch(), but that did get rid of the
> delay.

This is incorrect. Monitoring schedule_tail() does not undo the delayed
shutdown, but the delay probably has only 95% reproducibility.

> The result from this is:
>
> [...]
>
> So there's something strange going on here with the scheduler.
>
> Bert Karwatzki

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s shutdown delay with v6.19 and PREEMPT_RT
  2026-02-23 23:36 ` Bert Karwatzki
  2026-02-24 12:44 ` Bert Karwatzki
@ 2026-02-24 14:20 ` Steven Rostedt
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2026-02-24 14:20 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Calvin Owens, Tejun Heo, Sebastian Andrzej Siewior, Thomas Gleixner,
	dschatzberg, peterz, linux-kernel, linux-rt-devel

On Tue, 24 Feb 2026 00:36:27 +0100
Bert Karwatzki <spasswolf@web.de> wrote:

> As the bisection suggested that commit 9311e6c29b34 ("cgroup: Fix
> sleeping from invalid context warning on PREEMPT_RT") is somehow causing
> the problem, I put some printk()s in the code changed by this commit and
> captured the output via netconsole (I tried using trace_printk() to use
> the persistent ringbuffer but got no output).

It's described in the document I posted:

  https://docs.kernel.org/trace/debugging.html

Using trace_printk() in the boot instance

  By default, the content of trace_printk() goes into the top level
  tracing instance. But this instance is never preserved across boots.
  To have the trace_printk() content, and some other internal tracing go
  to the preserved buffer (like dump stacks), either set the instance to
  be the trace_printk() destination from the kernel command line, or set
  it after boot up via the trace_printk_dest option.

  After boot up:

    echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest

  From the kernel command line:

    reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace

  If setting it from the kernel command line, it is recommended to also
  disable tracing with the "traceoff" flag, and enable tracing after boot
  up. Otherwise the trace from the most recent boot will be mixed with
  the trace from the previous boot, and may make it confusing to read.

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-19 23:10 ` Bert Karwatzki
  2026-02-20  0:58 ` Steven Rostedt
@ 2026-02-24 15:45 ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-24 15:45 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Calvin Owens, Tejun Heo, Thomas Gleixner, dschatzberg, peterz,
	linux-kernel, linux-rt-devel

On 2026-02-20 00:10:41 [+0100], Bert Karwatzki wrote:
> BUG: sleeping function called from invalid context at kernel/printk/printk.c:3377
> in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 16, name: pr/legacy
…
> <TASK>
> dump_stack_lvl+0x4b/0x70
> __might_resched.cold+0xaf/0xbd
> console_conditional_schedule+0x26/0x30
…

This is addressed by commit 8e9bf8b9e8c0a ("printk, vt, fbcon: Remove
console_conditional_schedule()")

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-19 16:46 ` ~90s reboot delay with v6.19 and PREEMPT_RT Bert Karwatzki
  2026-02-19 20:53 ` Calvin Owens
@ 2026-02-25 15:43 ` Sebastian Andrzej Siewior
  2026-02-25 16:37   ` Bert Karwatzki
  1 sibling, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-25 15:43 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel

On 2026-02-19 17:46:47 [+0100], Bert Karwatzki wrote:
> Since linux v6.19 I noticed that rebooting my MSI Alpha 15 Laptop
> would hang for about ~90s before rebooting. I bisected this (from
> v6.18 to v6.19) and got this as the first bad commit:
> 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
…

I'm on it. It looks like we free the task after sched_process_wait() but
before it is entirely gone there is a wait() on its pid. Some of them do
come back but one seems to be stuck and I need to figure out which one.
If we get rid of the LAZY then it happens "quick" enough so it works.

> Bert Karwatzki

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-25 15:43 ` Sebastian Andrzej Siewior
@ 2026-02-25 16:37 ` Bert Karwatzki
  2026-02-25 16:59   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-25 16:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Tejun Heo, Thomas Gleixner, spasswolf, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt

Am Mittwoch, dem 25.02.2026 um 16:43 +0100 schrieb Sebastian Andrzej Siewior:
> On 2026-02-19 17:46:47 [+0100], Bert Karwatzki wrote:
> > Since linux v6.19 I noticed that rebooting my MSI Alpha 15 Laptop
> > would hang for about ~90s before rebooting. I bisected this (from
> > v6.18 to v6.19) and got this as the first bad commit:
> > 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
> …
>
> I'm on it. It looks like we free the task after sched_process_wait() but
> before it is entirely gone there is a wait() on its pid. Some of them do
> come back but one seems to be stuck and I need to figure out which one.
> If we get rid of the LAZY then it happens "quick" enough so it works.
>
> > Bert Karwatzki
>
> Sebastian

I've done two testruns with this debug patch (the persistent log buffer
works now, thanks again to Steven Rostedt):

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f0d33b04910..b750aa284b89 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6990,6 +6990,7 @@ static void do_cgroup_task_dead(struct task_struct *tsk)
 {
 	struct css_set *cset;
 	unsigned long flags;
+	trace_printk(KERN_INFO "%s 0: task = %px\n", __func__, tsk);

 	spin_lock_irqsave(&css_set_lock, flags);

@@ -7029,9 +7030,11 @@ static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
 {
 	struct llist_node *lnode;
 	struct task_struct *task, *next;
+	trace_printk(KERN_INFO "%s:\n", __func__);

 	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
 	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
+		trace_printk(KERN_INFO "%s: %px %s", __func__, task, task->comm);
 		do_cgroup_task_dead(task);
 		put_task_struct(task);
 	}
@@ -7050,6 +7053,7 @@ static void __init cgroup_rt_init(void)

 void cgroup_task_dead(struct task_struct *task)
 {
+	trace_printk(KERN_INFO "%s: task = %px (%s)\n", __func__, task, task->comm);
 	get_task_struct(task);
 	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
 	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
@@ -7059,6 +7063,7 @@ static void __init cgroup_rt_init(void) {}

 void cgroup_task_dead(struct task_struct *task)
 {
+	trace_printk(KERN_INFO "%s: task = %px (%s)\n", __func__, task, task->comm);
 	do_cgroup_task_dead(task);
 }
 #endif /* CONFIG_PREEMPT_RT */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 854984967fe2..19b130b831bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5078,6 +5078,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	struct rq *rq = this_rq();
 	struct mm_struct *mm = rq->prev_mm;
 	unsigned int prev_state;
+	trace_printk(KERN_INFO "%s 0: %px (%s)\n", __func__, prev, prev->comm);

 	/*
 	 * The previous task will have left us with a preempt_count of 2
@@ -5153,6 +5154,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * visible to SCX schedulers.
 	 */
 	sched_ext_dead(prev);
+	trace_printk(KERN_INFO "%s 1: %px (%s)\n", __func__, prev, prev->comm);
 	cgroup_task_dead(prev);

 	/* Task is done with its stack. */
@@ -5202,6 +5204,7 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
+	trace_printk(KERN_INFO "%s 0: %px (%s)\n", __func__, prev, prev->comm);
 	prepare_task_switch(rq, prev, next);

 	/*

This is from the PREEMPT_RT log; there's a long pause in which
cgroup_task_dead() is not called:

59366: <...>-3209 [001] d..2. 33.110392: 0xffffffffa36c309b: 6context_switch 0: ffff933eb264a180 (reboot)
[...]
112455: <idle>-0 [006] ...1. 40.503766: 0xffffffffa2da570c: 6cgroup_task_dead: task = ffff933f1885c300 ((udev-worker))
[...] no call to cgroup_task_dead() here, just finish_task_switch() and context_switch()
217571: <idle>-0 [010] ...1. 125.282118: 0xffffffffa2da570c: 6cgroup_task_dead: task = ffff933e94fae480 (systemd)
[...]
274103: <idle>-0 [014] d..2. 130.157472: 0xffffffffa2cef125: 6finish_task_switch 0: ffff933e815e10c0 (ksoftirqd/14)

This is the other log (no pause here; just the first message after the
reboot is initiated and the last message, to show the duration of the
shutdown):

58029: <...>-2975 [003] d..2. 33.564291: 0xffffffff934b3e89: 6context_switch 0: ffff88e1e0302180 (reboot)
[...]
107700: <...>-1 [000] d..2. 37.352191: 0xffffffff92aee9c5: 6finish_task_switch 0: ffffffff93c12980 (swapper/0)

The complete logs are here:
https://gitlab.freedesktop.org/spasswolf/pastebin/-/issues/2

Bert Karwatzki

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-25 16:37 ` Bert Karwatzki
@ 2026-02-25 16:59 ` Sebastian Andrzej Siewior
  2026-02-25 22:31   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-25 16:59 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt

On 2026-02-25 17:37:56 [+0100], Bert Karwatzki wrote:
> I've done two testruns with this debug patch (the persistent log buffer
> works now, thanks again to Steven Rostedt):
…
> This is from the PREEMPT_RT log; there's a long pause in which
> cgroup_task_dead() is not called

Yeah, I don't know why. The irq-work is invoked slightly delayed so that
part is working.

| irq_work-24   0.....  1501734us : cgrp_dead_tasks_iwork_fn: Kill ffff888101588000 (sd-close)-1964
| systemd--1    1.....  1502194us : sched_process_wait: comm=systemd pid=1963 prio=120
| systemd--1    1.....  1502547us : sched_process_wait: comm=systemd pid=0 prio=120
| systemd--1    1.....  1502771us : sched_process_wait: comm=systemd pid=1964 prio=120
| systemd--1    1.....  1502819us : sched_process_wait: comm=systemd pid=0 prio=120
| systemd--1    1.....  1502835us : sched_process_wait: comm=systemd pid=0 prio=120
| rcuc/1-31     1b...1  1524116us : sched_process_free: comm=(sd-close) pid=1964 prio=120

the gap

| systemd--1    3.....  90610397us : sched_process_wait: comm=systemd pid=0 prio=120
| systemd--1    3.....  90611469us : sched_process_wait: comm=systemd pid=1829 prio=120

So for some reason systemd stops killing tasks. It complains about still
running gpg-agent and ssh-agent but both are long gone.

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-25 16:59 ` Sebastian Andrzej Siewior
@ 2026-02-25 22:31 ` Sebastian Andrzej Siewior
  2026-02-26 13:24   ` Bert Karwatzki
  2026-02-27 14:13   ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-25 22:31 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt

On 2026-02-25 17:59:55 [+0100], To Bert Karwatzki wrote:
> On 2026-02-25 17:37:56 [+0100], Bert Karwatzki wrote:
> > I've done two testruns with this debug patch (the persistent log buffer
> > works now, thanks again to Steven Rostedt):
> …
> > This is from the PREEMPT_RT log; there's a long pause in which
> > cgroup_task_dead() is not called
>
> Yeah, I don't know why. The irq-work is invoked slightly delayed so that
> part is working.
…

In the good case I have

| systemd-1818  3....2   605751us : cgroup_notify_populated: root=0 id=2382 level=5 path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket val=1
|systemct-1911  1....1   620046us : cgroup_attach_task: dst_root=0 dst_id=2382 dst_level=5 dst_path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket pid=1911 comm=(ystemctl)
|systemct-1911  1.....   642680us : sched_process_exit: comm=systemctl pid=1911 prio=120 group_dead=true
|systemct-1911  1....2   643423us : signal_generate: sig=17 errno=0 code=1 comm=systemd pid=1818 grp=1 res=0
|systemct-1911  1d..2.   643432us : sched_switch: prev_comm=systemctl prev_pid=1911 prev_prio=120 prev_state=Z ==> next_comm=systemd next_pid=1818 next_prio=120
|irq_work-29    1....2   643450us : cgroup_notify_populated: root=0 id=2382 level=5 path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket val=0
|irq_work-29    1....2   643457us : cgroup_notify_populated: root=0 id=2229 level=4 path=/user.slice/user-0.slice/user@0.service/app.slice val=0
| systemd-1818  1....1   644548us : cgroup_rmdir: root=0 id=2382 level=5 path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket
| systemd-1818  1.....   644784us : sched_process_wait: comm=systemd pid=1911 prio=120

and in the bad case

| systemd-1828  3....2   312877us : cgroup_notify_populated: root=0 id=2419 level=5 path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket val=1
|systemct-1929  2....1   321916us : cgroup_attach_task: dst_root=0 dst_id=2419 dst_level=5 dst_path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket pid=1929 comm=(ystemctl)
|systemct-1929  1.....   341432us : sched_process_exit: comm=systemctl pid=1929 prio=120 group_dead=true
|systemct-1929  1.l..2   342623us : signal_generate: sig=17 errno=0 code=1 comm=systemd pid=1828 grp=1 res=0
|systemct-1929  1d..2.   342637us : sched_switch: prev_comm=systemctl prev_pid=1929 prev_prio=120 prev_state=Z ==> next_comm=systemd next_pid=1828 next_prio=120
| systemd-1828  1....1   343099us : signal_generate: sig=15 errno=0 code=0 comm=systemctl pid=1929 grp=1 res=1
| systemd-1828  1....1   343102us : signal_generate: sig=18 errno=0 code=0 comm=systemctl pid=1929 grp=1 res=1
| systemd-1828  1.....   343292us : sched_process_wait: comm=systemd pid=1929 prio=120
| systemd-1828  1.....   343442us : sched_process_wait: comm=systemd pid=0 prio=120
|irq_work-29    1....2   343725us : cgroup_notify_populated: root=0 id=2419 level=5 path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket val=0
| systemd-1828  1....1  90457054us : cgroup_rmdir: root=0 id=2419 level=5 path=/user.slice/user-0.slice/user@0.service/app.slice/ssh-agent.socket

Until the sched_switch, everything is the same. But then systemd-1828
(the one with the cgroup_notify_populated event) seems to get impatient
and sends a SIGTERM+SIGCONT. It gets the exit code, the
cgroup_notify_populated event is there later and just once. The
app.slice notify is missing. And the rmdir gets in much later.

Did systemd kill the app.slice at level=4? Is this relevant? I don't see
any immediate wake up or signal from within irq_work or
kernfs_notify_workfn() later on.

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-25 22:31 ` Sebastian Andrzej Siewior
@ 2026-02-26 13:24 ` Bert Karwatzki
  2026-02-26 13:46   ` Sebastian Andrzej Siewior
  2026-02-26 16:37   ` Steven Rostedt
  1 sibling, 2 replies; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-26 13:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt, spasswolf

I think I've found the reason for the stall. I was looking at commit
9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on
PREEMPT_RT"), and noticed this:

+static void __init cgroup_rt_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
+		per_cpu(cgrp_dead_tasks_iwork, cpu) =
+			IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+	}

IRQ_WORK_INIT_LAZY() expands to __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
(in include/linux/irq_work.h). IRQ_WORK_LAZY is declared in
include/linux/smp-types.h:

	IRQ_WORK_LAZY		= 0x04, /* No IPI, wait for tick */

The "wait for tick" gave me an idea, as I'm also using
CONFIG_NO_HZ_FULL=y, so I tried

commit b7453da1c7de288235234eb9265c3a2d661c1a2d (HEAD -> reboot_delay_debug_0)
Author: Bert Karwatzki <spasswolf@web.de>
Date:   Thu Feb 26 14:01:41 2026 +0100

    cgroup: use IRQ_WORK_INIT in cgroup_rt_init()

    Signed-off-by: Bert Karwatzki <spasswolf@web.de>

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b750aa284b89..ed22842eb5d3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -7047,7 +7047,7 @@ static void __init cgroup_rt_init(void)
 	for_each_possible_cpu(cpu) {
 		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
 		per_cpu(cgrp_dead_tasks_iwork, cpu) =
-			IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+			IRQ_WORK_INIT(cgrp_dead_tasks_iwork_fn);
 	}
 }

and this seems to have fixed the issue for me (five successful reboots
in a row; needs more testing).

Bert Karwatzki

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-26 13:24 ` Bert Karwatzki
@ 2026-02-26 13:46 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-26 13:46 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt

On 2026-02-26 14:24:34 [+0100], Bert Karwatzki wrote:
> I think I've found the reason for the stall. I was looking at commit
> 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on
> PREEMPT_RT"), and noticed this:
…

I did say that removing LAZY "fixes" it in
https://lore.kernel.org/all/20260225154341.32AjXoVi@linutronix.de/

I wouldn't describe this as a reason. The delay leads to different
behaviour and I am not sure if systemd behaves wrongly or if there is a
missing wake up. Tejun did explain in
https://lore.kernel.org/all/aQzg9kcnCsdRQiB4@slm.duckdns.org/ that a
slight delay is okay. Something is odd here.

> and this seems to have fixed the issue for me (five successful reboots
> in a row; needs more testing).
> Bert Karwatzki

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-26 13:24 ` Bert Karwatzki
  2026-02-26 13:46 ` Sebastian Andrzej Siewior
@ 2026-02-26 16:37 ` Steven Rostedt
  1 sibling, 0 replies; 41+ messages in thread
From: Steven Rostedt @ 2026-02-26 16:37 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Sebastian Andrzej Siewior, Tejun Heo, Thomas Gleixner, calvin,
	dschatzberg, peterz, linux-kernel, linux-rt-devel

On Thu, 26 Feb 2026 14:24:34 +0100
Bert Karwatzki <spasswolf@web.de> wrote:

> I think I've found the reason for the stall. I was looking at commit
> 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on
> PREEMPT_RT"), and noticed this:
>
> +static void __init cgroup_rt_init(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
> +		per_cpu(cgrp_dead_tasks_iwork, cpu) =
> +			IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
> +	}
>
> IRQ_WORK_INIT_LAZY() expands to __IRQ_WORK_INIT(_func, IRQ_WORK_LAZY)
> (in include/linux/irq_work.h). IRQ_WORK_LAZY is declared in
> include/linux/smp-types.h:
>
> 	IRQ_WORK_LAZY		= 0x04, /* No IPI, wait for tick */
>
> The "wait for tick" gave me an idea, as I'm also using
> CONFIG_NO_HZ_FULL=y

If you disable NO_HZ_FULL, does the problem also go away?

I wonder if you add irq events and trace a good and bad boot to see if it
definitely shows the delayed tick being an issue.

  echo 1 > /sys/kernel/tracing/instances/boot_map/events/irq/enable

-- Steve

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-25 22:31         ` Sebastian Andrzej Siewior
  2026-02-26 13:24           ` Bert Karwatzki
@ 2026-02-27 14:13           ` Sebastian Andrzej Siewior
  2026-02-27 22:57             ` Bert Karwatzki
  1 sibling, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-27 14:13 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt

On 2026-02-25 23:31:36 [+0100], To Bert Karwatzki wrote:
…
> Until the sched_switch, everything is the same. But then systemd-1828
> (the one with the cgroup_notify_populated event) seems to get impatient
> and sends a SIGTERM+SIGCONT. It gets the exit code, the
> cgroup_notify_populated event is there later and just once. The
> app.slice notify is missing. And the rmdir gets in much later.
…

I'm so proud of myself. Bert, can you confirm that this works?

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5283,6 +5283,11 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
+	struct task_struct *tsk = v;
+
+	if (READ_ONCE(tsk->__state) & TASK_DEAD)
+		return 0;
+
 	seq_printf(s, "%d\n", task_pid_vnr(v));
 	return 0;
 }
-- 
2.51.0

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-27 14:13           ` Sebastian Andrzej Siewior
@ 2026-02-27 22:57             ` Bert Karwatzki
  2026-03-02 11:15               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 41+ messages in thread
From: Bert Karwatzki @ 2026-02-27 22:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt, spasswolf

On Friday, 2026-02-27 at 15:13 +0100, Sebastian Andrzej Siewior wrote:
> On 2026-02-25 23:31:36 [+0100], To Bert Karwatzki wrote:
> …
> > Until the sched_switch, everything is the same. But then systemd-1828
> > (the one with the cgroup_notify_populated event) seems to get impatient
> > and sends a SIGTERM+SIGCONT. It gets the exit code, the
> > cgroup_notify_populated event is there later and just once. The
> > app.slice notify is missing. And the rmdir gets in much later.
> …
>
> I'm so proud of myself. Bert, can you confirm that this works?
>
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5283,6 +5283,11 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
>
>  static int cgroup_procs_show(struct seq_file *s, void *v)
>  {
> +	struct task_struct *tsk = v;
> +
> +	if (READ_ONCE(tsk->__state) & TASK_DEAD)
> +		return 0;
> +
>  	seq_printf(s, "%d\n", task_pid_vnr(v));
>  	return 0;
>  }

I tested this with 10 reboots; it worked 9 times. One reboot was delayed
(perhaps for a different reason; sometimes these delays occur on non-RT
or older kernels, too).

Bert Karwatzki

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: ~90s reboot delay with v6.19 and PREEMPT_RT
  2026-02-27 22:57             ` Bert Karwatzki
@ 2026-03-02 11:15               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-02 11:15 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Tejun Heo, Thomas Gleixner, calvin, dschatzberg, peterz,
	linux-kernel, linux-rt-devel, Steven Rostedt

On 2026-02-27 23:57:09 [+0100], Bert Karwatzki wrote:
> I tested this with 10 reboots, it worked 9 times, one reboot was delayed
> (perhaps for a different reason, sometimes these delays occur on non-RT
> or older kernels, too)

Okay, thank you. I read this as a Tested-by.

> Bert Karwatzki

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t
  2025-11-04 18:11 DEBUG_ATOMIC_SLEEP spew in cgroup_task_dead() on next-20251104 Calvin Owens
  2025-11-04 19:30 ` Tejun Heo
@ 2025-11-04 19:32 ` Tejun Heo
  2025-11-05  7:30   ` Sebastian Andrzej Siewior
  2025-11-05  8:50   ` Peter Zijlstra
  2025-11-04 19:32 ` [PATCH cgroup/for-6.19 2/2] cgroup: Convert css_set_lock locking to use cleanup guards Tejun Heo
  2 siblings, 2 replies; 41+ messages in thread
From: Tejun Heo @ 2025-11-04 19:32 UTC (permalink / raw)
  To: Calvin Owens
  Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra, Johannes Weiner,
	Michal Koutný, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, cgroups, linux-rt-devel

Convert css_set_lock from spinlock_t to raw_spinlock_t to address RT-related
scheduling constraints. cgroup_task_dead() is called from finish_task_switch(),
which cannot schedule even in PREEMPT_RT kernels, requiring css_set_lock to be
a raw spinlock to avoid sleeping in a non-preemptible context.
Fixes: d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") Reported-by: Calvin Owens <calvin@wbinvd.org> Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org Signed-off-by: Tejun Heo <tj@kernel.org> kernel/cgroup/cgroup.c | 130 ++++++++++++++++++++-------------------- kernel/cgroup/debug.c | 12 +-- kernel/cgroup/freezer.c | 16 ++-- kernel/cgroup/namespace.c | 4 - 7 files changed, 91 insertions(+), 89 deletions(-) --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -76,7 +76,7 @@ enum cgroup_lifetime_events { extern struct file_system_type cgroup_fs_type; extern struct cgroup_root cgrp_dfl_root; extern struct css_set init_css_set; -extern spinlock_t css_set_lock; +extern raw_spinlock_t css_set_lock; extern struct blocking_notifier_head cgroup_lifetime_notifier; #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys; --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -208,9 +208,9 @@ static inline void put_css_set(struct cs if (refcount_dec_not_one(&cset->refcount)) return; - spin_lock_irqsave(&css_set_lock, flags); + raw_spin_lock_irqsave(&css_set_lock, flags); put_css_set_locked(cset); - spin_unlock_irqrestore(&css_set_lock, flags); + raw_spin_unlock_irqrestore(&css_set_lock, flags); } /* --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -73,9 +73,9 @@ int cgroup_attach_task_all(struct task_s for_each_root(root) { struct cgroup *from_cgrp; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); from_cgrp = task_cgroup_from_root(from, root); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); retval = cgroup_attach_task(from_cgrp, tsk, false); if (retval) @@ -121,10 +121,10 @@ int cgroup_transfer_tasks(struct cgroup cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); /* all tasks in @from are being moved, all csets are source */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); 
list_for_each_entry(link, &from->cset_links, cset_link) cgroup_migrate_add_src(link->cset, to, &mgctx); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); ret = cgroup_migrate_prepare_dst(&mgctx); if (ret) @@ -1308,11 +1308,11 @@ struct cgroup *task_get_cgroup1(struct t continue; if (root->hierarchy_id != hierarchy_id) continue; - spin_lock_irqsave(&css_set_lock, flags); + raw_spin_lock_irqsave(&css_set_lock, flags); cgrp = task_cgroup_from_root(tsk, root); if (!cgrp || !cgroup_tryget(cgrp)) cgrp = ERR_PTR(-ENOENT); - spin_unlock_irqrestore(&css_set_lock, flags); + raw_spin_unlock_irqrestore(&css_set_lock, flags); break; } rcu_read_unlock(); --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -83,13 +83,15 @@ * hierarchy must be performed while holding it. * * css_set_lock protects task->cgroups pointer, the list of css_set - * objects, and the chain of tasks off each css_set. + * objects, and the chain of tasks off each css_set. This needs to be + * a raw spinlock as cgroup_task_dead() which grabs the lock is called + * from finish_task_switch() which can't schedule even in RT. * * These locks are exported if CONFIG_PROVE_RCU so that accessors in * cgroup.h can use them for lockdep annotations. 
*/ DEFINE_MUTEX(cgroup_mutex); -DEFINE_SPINLOCK(css_set_lock); +DEFINE_RAW_SPINLOCK(css_set_lock); #if (defined CONFIG_PROVE_RCU || defined CONFIG_LOCKDEP) EXPORT_SYMBOL_GPL(cgroup_mutex); @@ -666,9 +668,9 @@ int cgroup_task_count(const struct cgrou { int count; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); count = __cgroup_task_count(cgrp); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); return count; } @@ -1236,11 +1238,11 @@ static struct css_set *find_css_set(stru /* First see if we already have a cgroup group that matches * the desired set */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cset = find_existing_css_set(old_cset, cgrp, template); if (cset) get_css_set(cset); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); if (cset) return cset; @@ -1272,7 +1274,7 @@ static struct css_set *find_css_set(stru * find_existing_css_set() */ memcpy(cset->subsys, template, sizeof(cset->subsys)); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); /* Add reference counts and links from the new css_set. 
*/ list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) { struct cgroup *c = link->cgrp; @@ -1298,7 +1300,7 @@ static struct css_set *find_css_set(stru css_get(css); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * If @cset should be threaded, look up the matching dom_cset and @@ -1315,11 +1317,11 @@ static struct css_set *find_css_set(stru return NULL; } - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cset->dom_cset = dcset; list_add_tail(&cset->threaded_csets_node, &dcset->threaded_csets); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); } return cset; @@ -1412,7 +1414,7 @@ static void cgroup_destroy_root(struct c * Release all the links from cset_links to this hierarchy's * root cgroup */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) { list_del(&link->cset_link); @@ -1420,7 +1422,7 @@ static void cgroup_destroy_root(struct c kfree(link); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); WARN_ON_ONCE(list_empty(&root->root_list)); list_del_rcu(&root->root_list); @@ -1917,7 +1919,7 @@ int rebind_subsystems(struct cgroup_root rcu_assign_pointer(dcgrp->subsys[ssid], css); ss->root = dst_root; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); css->cgroup = dcgrp; WARN_ON(!list_empty(&dcgrp->e_csets[ss->id])); list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id], @@ -1935,7 +1937,7 @@ int rebind_subsystems(struct cgroup_root if (it->cset_head == &scgrp->e_csets[ss->id]) it->cset_head = &dcgrp->e_csets[ss->id]; } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* default hierarchy doesn't enable controllers by default */ dst_root->subsys_mask |= 1 << ssid; @@ -1971,10 +1973,10 @@ int cgroup_show_path(struct seq_file *sf if (!buf) return -ENOMEM; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); ns_cgroup 
= current_cgns_cgroup_from_root(kf_cgroot); len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); if (len == -E2BIG) len = -ERANGE; @@ -2230,13 +2232,13 @@ int cgroup_setup_root(struct cgroup_root * Link the root cgroup in this hierarchy into all the css_set * objects. */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); hash_for_each(css_set_table, i, cset, hlist) { link_css_set(&tmp_links, cset, root_cgrp); if (css_set_populated(cset)) cgroup_update_populated(root_cgrp, true); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); BUG_ON(!list_empty(&root_cgrp->self.children)); BUG_ON(atomic_read(&root->nr_cgrps) != 1); @@ -2280,11 +2282,11 @@ int cgroup_do_get_tree(struct fs_context struct cgroup *cgrp; cgroup_lock(); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); cgroup_unlock(); nsdentry = kernfs_node_dentry(cgrp->kn, sb); @@ -2496,11 +2498,11 @@ int cgroup_path_ns(struct cgroup *cgrp, int ret; cgroup_lock(); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); ret = cgroup_path_ns_locked(cgrp, buf, buflen, ns); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); cgroup_unlock(); return ret; @@ -2719,7 +2721,7 @@ static int cgroup_migrate_execute(struct * the new cgroup. There are no failure cases after here, so this * is the commit point. 
*/ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_for_each_entry(cset, &tset->src_csets, mg_node) { list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) { struct css_set *from_cset = task_css_set(task); @@ -2739,7 +2741,7 @@ static int cgroup_migrate_execute(struct } } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * Migration is committed, all target tasks are now on dst_csets. @@ -2772,13 +2774,13 @@ out_cancel_attach: } while_each_subsys_mask(); } out_release_tset: - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_splice_init(&tset->dst_csets, &tset->src_csets); list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) { list_splice_tail_init(&cset->mg_tasks, &cset->tasks); list_del_init(&cset->mg_node); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * Re-initialize the cgroup_taskset structure in case it is reused @@ -2836,7 +2838,7 @@ void cgroup_migrate_finish(struct cgroup lockdep_assert_held(&cgroup_mutex); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_for_each_entry_safe(cset, tmp_cset, &mgctx->preloaded_src_csets, mg_src_preload_node) { @@ -2856,7 +2858,7 @@ void cgroup_migrate_finish(struct cgroup put_css_set_locked(cset); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); } /** @@ -2999,14 +3001,14 @@ int cgroup_migrate(struct task_struct *l * section to prevent tasks from being freed while taking the snapshot. * spin_lock_irq() implies RCU critical section here. 
*/ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); task = leader; do { cgroup_migrate_add_task(task, mgctx); if (!threadgroup) break; } while_each_thread(leader, task); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); return cgroup_migrate_execute(mgctx); } @@ -3027,14 +3029,14 @@ int cgroup_attach_task(struct cgroup *ds int ret = 0; /* look up all src csets */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); task = leader; do { cgroup_migrate_add_src(task_css_set(task), dst_cgrp, &mgctx); if (!threadgroup) break; } while_each_thread(leader, task); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* prepare dst csets and commit */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3191,7 +3193,7 @@ static int cgroup_update_dfl_csses(struc lockdep_assert_held(&cgroup_mutex); /* look up all csses currently attached to @cgrp's subtree */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) { struct cgrp_cset_link *link; @@ -3207,7 +3209,7 @@ static int cgroup_update_dfl_csses(struc list_for_each_entry(link, &dsct->cset_links, cset_link) cgroup_migrate_add_src(link->cset, dsct, &mgctx); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * We need to write-lock threadgroup_rwsem while migrating tasks. 
@@ -3229,7 +3231,7 @@ static int cgroup_update_dfl_csses(struc if (ret) goto out_finish; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_for_each_entry(src_cset, &mgctx.preloaded_src_csets, mg_src_preload_node) { struct task_struct *task, *ntask; @@ -3238,7 +3240,7 @@ static int cgroup_update_dfl_csses(struc list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list) cgroup_migrate_add_task(task, &mgctx); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); ret = cgroup_migrate_execute(&mgctx); out_finish: @@ -4186,9 +4188,9 @@ static void __cgroup_kill(struct cgroup lockdep_assert_held(&cgroup_mutex); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cgrp->kill_seq++; - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it); while ((task = css_task_iter_next(&it))) { @@ -5146,7 +5148,7 @@ void css_task_iter_start(struct cgroup_s memset(it, 0, sizeof(*it)); - spin_lock_irqsave(&css_set_lock, irqflags); + raw_spin_lock_irqsave(&css_set_lock, irqflags); it->ss = css->ss; it->flags = flags; @@ -5160,7 +5162,7 @@ void css_task_iter_start(struct cgroup_s css_task_iter_advance(it); - spin_unlock_irqrestore(&css_set_lock, irqflags); + raw_spin_unlock_irqrestore(&css_set_lock, irqflags); } /** @@ -5180,7 +5182,7 @@ struct task_struct *css_task_iter_next(s it->cur_task = NULL; } - spin_lock_irqsave(&css_set_lock, irqflags); + raw_spin_lock_irqsave(&css_set_lock, irqflags); /* @it may be half-advanced by skips, finish advancing */ if (it->flags & CSS_TASK_ITER_SKIPPED) @@ -5193,7 +5195,7 @@ struct task_struct *css_task_iter_next(s css_task_iter_advance(it); } - spin_unlock_irqrestore(&css_set_lock, irqflags); + raw_spin_unlock_irqrestore(&css_set_lock, irqflags); return it->cur_task; } @@ -5209,10 +5211,10 @@ void css_task_iter_end(struct css_task_i unsigned long irqflags; if (it->cur_cset) { - 
spin_lock_irqsave(&css_set_lock, irqflags); + raw_spin_lock_irqsave(&css_set_lock, irqflags); list_del(&it->iters_node); put_css_set_locked(it->cur_cset); - spin_unlock_irqrestore(&css_set_lock, irqflags); + raw_spin_unlock_irqrestore(&css_set_lock, irqflags); } if (it->cur_dcset) @@ -5378,9 +5380,9 @@ static ssize_t __cgroup_procs_write(stru goto out_unlock; /* find the source cgroup */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * Process and thread migrations follow same delegation rule. Check @@ -5667,11 +5669,11 @@ static void css_release_work_fn(struct w css_rstat_flush(&cgrp->self); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) tcgrp->nr_dying_descendants--; - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * There are two control paths which try to determine @@ -5922,7 +5924,7 @@ static struct cgroup *cgroup_create(stru goto out_psi_free; /* allocation complete, commit to creation */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); for (i = 0; i < level; i++) { tcgrp = cgrp->ancestors[i]; tcgrp->nr_descendants++; @@ -5935,7 +5937,7 @@ static struct cgroup *cgroup_create(stru if (cgrp->freezer.e_freeze) tcgrp->freezer.nr_frozen_descendants++; } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children); atomic_inc(&root->nr_cgrps); @@ -6181,10 +6183,10 @@ static int cgroup_destroy_locked(struct */ cgrp->self.flags &= ~CSS_ONLINE; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_for_each_entry(link, &cgrp->cset_links, cset_link) link->cset->dead = true; - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* initiate massacre of all css's */ 
for_each_css(css, ssid, cgrp) @@ -6197,7 +6199,7 @@ static int cgroup_destroy_locked(struct if (cgroup_is_threaded(cgrp)) parent->nr_threaded_children--; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); for (tcgrp = parent; tcgrp; tcgrp = cgroup_parent(tcgrp)) { tcgrp->nr_descendants--; tcgrp->nr_dying_descendants++; @@ -6208,7 +6210,7 @@ static int cgroup_destroy_locked(struct if (test_bit(CGRP_FROZEN, &cgrp->flags)) tcgrp->freezer.nr_frozen_descendants--; } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); cgroup1_check_for_release(parent); @@ -6557,7 +6559,7 @@ int proc_cgroup_show(struct seq_file *m, goto out; rcu_read_lock(); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); for_each_root(root) { struct cgroup_subsys *ss; @@ -6612,7 +6614,7 @@ int proc_cgroup_show(struct seq_file *m, retval = 0; out_unlock: - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); rcu_read_unlock(); kfree(buf); out: @@ -6700,14 +6702,14 @@ static int cgroup_css_set_fork(struct ke cgroup_threadgroup_change_begin(current); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cset = task_css_set(current); get_css_set(cset); if (kargs->cgrp) kargs->kill_seq = kargs->cgrp->kill_seq; else kargs->kill_seq = cset->dfl_cgrp->kill_seq; - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); if (!(kargs->flags & CLONE_INTO_CGROUP)) { kargs->cset = cset; @@ -6897,7 +6899,7 @@ void cgroup_post_fork(struct task_struct cset = kargs->cset; kargs->cset = NULL; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); /* init tasks are special, only link regular threads */ if (likely(child->pid)) { @@ -6945,7 +6947,7 @@ void cgroup_post_fork(struct task_struct kill = kargs->kill_seq != cgrp_kill_seq; } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); /* * Call ss->fork(). 
This must happen after @child is linked on @@ -6995,7 +6997,7 @@ void cgroup_task_dead(struct task_struct struct css_set *cset; unsigned long flags; - spin_lock_irqsave(&css_set_lock, flags); + raw_spin_lock_irqsave(&css_set_lock, flags); WARN_ON_ONCE(list_empty(&tsk->cg_list)); cset = task_css_set(tsk); @@ -7013,7 +7015,7 @@ void cgroup_task_dead(struct task_struct test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags))) cgroup_update_frozen(task_dfl_cgroup(tsk)); - spin_unlock_irqrestore(&css_set_lock, flags); + raw_spin_unlock_irqrestore(&css_set_lock, flags); } void cgroup_task_release(struct task_struct *task) @@ -7031,10 +7033,10 @@ void cgroup_task_free(struct task_struct struct css_set *cset = task_css_set(task); if (!list_empty(&task->cg_list)) { - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); css_set_skip_task_iters(task_css_set(task), task); list_del_init(&task->cg_list); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); } put_css_set(cset); --- a/kernel/cgroup/debug.c +++ b/kernel/cgroup/debug.c @@ -48,7 +48,7 @@ static int current_css_set_read(struct s if (!cgroup_kn_lock_live(of->kn, false)) return -ENODEV; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cset = task_css_set(current); refcnt = refcount_read(&cset->refcount); seq_printf(seq, "css_set %pK %d", cset, refcnt); @@ -66,7 +66,7 @@ static int current_css_set_read(struct s seq_printf(seq, "%2d: %-4s\t- %p[%d]\n", ss->id, ss->name, css, css->id); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); cgroup_kn_unlock(of->kn); return 0; } @@ -92,7 +92,7 @@ static int current_css_set_cg_links_read if (!name_buf) return -ENOMEM; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cset = task_css_set(current); list_for_each_entry(link, &cset->cgrp_links, cgrp_link) { struct cgroup *c = link->cgrp; @@ -101,7 +101,7 @@ static int current_css_set_cg_links_read seq_printf(seq, "Root %d group %s\n", 
c->root->hierarchy_id, name_buf); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); kfree(name_buf); return 0; } @@ -113,7 +113,7 @@ static int cgroup_css_links_read(struct struct cgrp_cset_link *link; int dead_cnt = 0, extra_refs = 0, threaded_csets = 0; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); list_for_each_entry(link, &css->cgroup->cset_links, cset_link) { struct css_set *cset = link->cset; @@ -180,7 +180,7 @@ static int cgroup_css_links_read(struct WARN_ON(count != cset->nr_tasks); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); if (!dead_cnt && !extra_refs && !threaded_csets) return 0; --- a/kernel/cgroup/freezer.c +++ b/kernel/cgroup/freezer.c @@ -108,12 +108,12 @@ void cgroup_enter_frozen(void) if (current->frozen) return; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); current->frozen = true; cgrp = task_dfl_cgroup(current); cgroup_inc_frozen_cnt(cgrp); cgroup_update_frozen(cgrp); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); } /* @@ -129,7 +129,7 @@ void cgroup_leave_frozen(bool always_lea { struct cgroup *cgrp; - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cgrp = task_dfl_cgroup(current); if (always_leave || !test_bit(CGRP_FREEZE, &cgrp->flags)) { cgroup_dec_frozen_cnt(cgrp); @@ -142,7 +142,7 @@ void cgroup_leave_frozen(bool always_lea set_thread_flag(TIF_SIGPENDING); spin_unlock(¤t->sighand->siglock); } - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); } /* @@ -178,7 +178,7 @@ static void cgroup_do_freeze(struct cgro lockdep_assert_held(&cgroup_mutex); - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); write_seqcount_begin(&cgrp->freezer.freeze_seq); if (freeze) { set_bit(CGRP_FREEZE, &cgrp->flags); @@ -189,7 +189,7 @@ static void cgroup_do_freeze(struct cgro cgrp->freezer.freeze_start_nsec); } write_seqcount_end(&cgrp->freezer.freeze_seq); - spin_unlock_irq(&css_set_lock); 
+ raw_spin_unlock_irq(&css_set_lock); if (freeze) TRACE_CGROUP_PATH(freeze, cgrp); @@ -212,10 +212,10 @@ static void cgroup_do_freeze(struct cgro * Cgroup state should be revisited here to cover empty leaf cgroups * and cgroups which descendants are already in the desired state. */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); if (cgrp->nr_descendants == cgrp->freezer.nr_frozen_descendants) cgroup_update_frozen(cgrp); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); } /* --- a/kernel/cgroup/namespace.c +++ b/kernel/cgroup/namespace.c @@ -70,10 +70,10 @@ struct cgroup_namespace *copy_cgroup_ns( return ERR_PTR(-ENOSPC); /* It is not safe to take cgroup_mutex here */ - spin_lock_irq(&css_set_lock); + raw_spin_lock_irq(&css_set_lock); cset = task_css_set(current); get_css_set(cset); - spin_unlock_irq(&css_set_lock); + raw_spin_unlock_irq(&css_set_lock); new_ns = alloc_cgroup_ns(); if (IS_ERR(new_ns)) { ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t
  2025-11-04 19:32 ` [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t Tejun Heo
@ 2025-11-05  7:30   ` Sebastian Andrzej Siewior
  2025-11-05 16:19     ` Tejun Heo
  2025-11-05  8:50   ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-05 7:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra,
	Johannes Weiner, Michal Koutný, Clark Williams, Steven Rostedt,
	cgroups, linux-rt-devel

On 2025-11-04 09:32:24 [-1000], Tejun Heo wrote:
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -208,9 +208,9 @@ static inline void put_css_set(struct cs
>  	if (refcount_dec_not_one(&cset->refcount))
>  		return;
>
> -	spin_lock_irqsave(&css_set_lock, flags);
> +	raw_spin_lock_irqsave(&css_set_lock, flags);
>  	put_css_set_locked(cset);

This one has a kfree(link), and kfree() takes a spinlock_t, so this is
not going to work.

> -	spin_unlock_irqrestore(&css_set_lock, flags);
> +	raw_spin_unlock_irqrestore(&css_set_lock, flags);
>  }
>
>  /*
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1272,7 +1274,7 @@ static struct css_set *find_css_set(stru
>  	 * find_existing_css_set() */
>  	memcpy(cset->subsys, template, sizeof(cset->subsys));
>
> -	spin_lock_irq(&css_set_lock);
> +	raw_spin_lock_irq(&css_set_lock);
>  	/* Add reference counts and links from the new css_set. */
>  	list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) {
>  		struct cgroup *c = link->cgrp;

I am also a bit worried about all these list iterations which happen
under the lock. There is no upper limit, meaning the list can grow
without limit, affecting the time the lock is held.

Sebastian

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t
  2025-11-05  7:30   ` Sebastian Andrzej Siewior
@ 2025-11-05 16:19     ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2025-11-05 16:19 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Peter Zijlstra,
	Johannes Weiner, Michal Koutný, Clark Williams, Steven Rostedt,
	cgroups, linux-rt-devel

On Wed, Nov 05, 2025 at 08:30:09AM +0100, Sebastian Andrzej Siewior wrote:
> This one has a kfree(link), and kfree() takes a spinlock_t, so this is
> not going to work.
...
> I am also a bit worried about all these list iterations which happen
> under the lock. There is no upper limit, meaning the list can grow
> without limit, affecting the time the lock is held.

Good points. Let me think of something else. Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t
  2025-11-04 19:32 ` [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t Tejun Heo
  2025-11-05  7:30   ` Sebastian Andrzej Siewior
@ 2025-11-05  8:50   ` Peter Zijlstra
  2025-11-05 16:20     ` Tejun Heo
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2025-11-05 8:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Johannes Weiner,
	Michal Koutný, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, cgroups, linux-rt-devel

On Tue, Nov 04, 2025 at 09:32:24AM -1000, Tejun Heo wrote:
> Convert css_set_lock from spinlock_t to raw_spinlock_t to address RT-related
> scheduling constraints. cgroup_task_dead() is called from finish_task_switch()
> which cannot schedule even in PREEMPT_RT kernels, requiring css_set_lock to be
> a raw spinlock to avoid sleeping in a non-preemptible context.

The constraint for doing so is that each critical section is actually
bounded in time. The below seems to contain list iteration. I'm thinking
it is unbounded, since userspace is in control of the cgroup hierarchy.
> Fixes: d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") > Reported-by: Calvin Owens <calvin@wbinvd.org> > Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org > Signed-off-by: Tejun Heo <tj@kernel.org> > kernel/cgroup/cgroup.c | 130 ++++++++++++++++++++-------------------- > kernel/cgroup/debug.c | 12 +-- > kernel/cgroup/freezer.c | 16 ++-- > kernel/cgroup/namespace.c | 4 - > 7 files changed, 91 insertions(+), 89 deletions(-) > > --- a/include/linux/cgroup.h > +++ b/include/linux/cgroup.h > @@ -76,7 +76,7 @@ enum cgroup_lifetime_events { > extern struct file_system_type cgroup_fs_type; > extern struct cgroup_root cgrp_dfl_root; > extern struct css_set init_css_set; > -extern spinlock_t css_set_lock; > +extern raw_spinlock_t css_set_lock; > extern struct blocking_notifier_head cgroup_lifetime_notifier; > > #define SUBSYS(_x) extern struct cgroup_subsys _x ## _cgrp_subsys; > --- a/kernel/cgroup/cgroup-internal.h > +++ b/kernel/cgroup/cgroup-internal.h > @@ -208,9 +208,9 @@ static inline void put_css_set(struct cs > if (refcount_dec_not_one(&cset->refcount)) > return; > > - spin_lock_irqsave(&css_set_lock, flags); > + raw_spin_lock_irqsave(&css_set_lock, flags); > put_css_set_locked(cset); > - spin_unlock_irqrestore(&css_set_lock, flags); > + raw_spin_unlock_irqrestore(&css_set_lock, flags); > } > > /* > --- a/kernel/cgroup/cgroup-v1.c > +++ b/kernel/cgroup/cgroup-v1.c > @@ -73,9 +73,9 @@ int cgroup_attach_task_all(struct task_s > for_each_root(root) { > struct cgroup *from_cgrp; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > from_cgrp = task_cgroup_from_root(from, root); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > retval = cgroup_attach_task(from_cgrp, tsk, false); > if (retval) > @@ -121,10 +121,10 @@ int cgroup_transfer_tasks(struct cgroup > cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); > > /* all tasks in @from are being moved, 
all csets are source */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > list_for_each_entry(link, &from->cset_links, cset_link) > cgroup_migrate_add_src(link->cset, to, &mgctx); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > ret = cgroup_migrate_prepare_dst(&mgctx); > if (ret) > @@ -1308,11 +1308,11 @@ struct cgroup *task_get_cgroup1(struct t > continue; > if (root->hierarchy_id != hierarchy_id) > continue; > - spin_lock_irqsave(&css_set_lock, flags); > + raw_spin_lock_irqsave(&css_set_lock, flags); > cgrp = task_cgroup_from_root(tsk, root); > if (!cgrp || !cgroup_tryget(cgrp)) > cgrp = ERR_PTR(-ENOENT); > - spin_unlock_irqrestore(&css_set_lock, flags); > + raw_spin_unlock_irqrestore(&css_set_lock, flags); > break; > } > rcu_read_unlock(); > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -83,13 +83,15 @@ > * hierarchy must be performed while holding it. > * > * css_set_lock protects task->cgroups pointer, the list of css_set > - * objects, and the chain of tasks off each css_set. > + * objects, and the chain of tasks off each css_set. This needs to be > + * a raw spinlock as cgroup_task_dead() which grabs the lock is called > + * from finish_task_switch() which can't schedule even in RT. > * > * These locks are exported if CONFIG_PROVE_RCU so that accessors in > * cgroup.h can use them for lockdep annotations. 
> */ > DEFINE_MUTEX(cgroup_mutex); > -DEFINE_SPINLOCK(css_set_lock); > +DEFINE_RAW_SPINLOCK(css_set_lock); > > #if (defined CONFIG_PROVE_RCU || defined CONFIG_LOCKDEP) > EXPORT_SYMBOL_GPL(cgroup_mutex); > @@ -666,9 +668,9 @@ int cgroup_task_count(const struct cgrou > { > int count; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > count = __cgroup_task_count(cgrp); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > return count; > } > @@ -1236,11 +1238,11 @@ static struct css_set *find_css_set(stru > > /* First see if we already have a cgroup group that matches > * the desired set */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cset = find_existing_css_set(old_cset, cgrp, template); > if (cset) > get_css_set(cset); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > if (cset) > return cset; > @@ -1272,7 +1274,7 @@ static struct css_set *find_css_set(stru > * find_existing_css_set() */ > memcpy(cset->subsys, template, sizeof(cset->subsys)); > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > /* Add reference counts and links from the new css_set. 
*/ > list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) { > struct cgroup *c = link->cgrp; > @@ -1298,7 +1300,7 @@ static struct css_set *find_css_set(stru > css_get(css); > } > > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * If @cset should be threaded, look up the matching dom_cset and > @@ -1315,11 +1317,11 @@ static struct css_set *find_css_set(stru > return NULL; > } > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cset->dom_cset = dcset; > list_add_tail(&cset->threaded_csets_node, > &dcset->threaded_csets); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > } > > return cset; > @@ -1412,7 +1414,7 @@ static void cgroup_destroy_root(struct c > * Release all the links from cset_links to this hierarchy's > * root cgroup > */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > > list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) { > list_del(&link->cset_link); > @@ -1420,7 +1422,7 @@ static void cgroup_destroy_root(struct c > kfree(link); > } > > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > WARN_ON_ONCE(list_empty(&root->root_list)); > list_del_rcu(&root->root_list); > @@ -1917,7 +1919,7 @@ int rebind_subsystems(struct cgroup_root > rcu_assign_pointer(dcgrp->subsys[ssid], css); > ss->root = dst_root; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > css->cgroup = dcgrp; > WARN_ON(!list_empty(&dcgrp->e_csets[ss->id])); > list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id], > @@ -1935,7 +1937,7 @@ int rebind_subsystems(struct cgroup_root > if (it->cset_head == &scgrp->e_csets[ss->id]) > it->cset_head = &dcgrp->e_csets[ss->id]; > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* default hierarchy doesn't enable controllers by default */ > dst_root->subsys_mask |= 1 << ssid; > @@ -1971,10 +1973,10 @@ int 
cgroup_show_path(struct seq_file *sf > if (!buf) > return -ENOMEM; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot); > len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > if (len == -E2BIG) > len = -ERANGE; > @@ -2230,13 +2232,13 @@ int cgroup_setup_root(struct cgroup_root > * Link the root cgroup in this hierarchy into all the css_set > * objects. > */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > hash_for_each(css_set_table, i, cset, hlist) { > link_css_set(&tmp_links, cset, root_cgrp); > if (css_set_populated(cset)) > cgroup_update_populated(root_cgrp, true); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > BUG_ON(!list_empty(&root_cgrp->self.children)); > BUG_ON(atomic_read(&root->nr_cgrps) != 1); > @@ -2280,11 +2282,11 @@ int cgroup_do_get_tree(struct fs_context > struct cgroup *cgrp; > > cgroup_lock(); > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > > cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root); > > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > cgroup_unlock(); > > nsdentry = kernfs_node_dentry(cgrp->kn, sb); > @@ -2496,11 +2498,11 @@ int cgroup_path_ns(struct cgroup *cgrp, > int ret; > > cgroup_lock(); > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > > ret = cgroup_path_ns_locked(cgrp, buf, buflen, ns); > > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > cgroup_unlock(); > > return ret; > @@ -2719,7 +2721,7 @@ static int cgroup_migrate_execute(struct > * the new cgroup. There are no failure cases after here, so this > * is the commit point. 
> */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > list_for_each_entry(cset, &tset->src_csets, mg_node) { > list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) { > struct css_set *from_cset = task_css_set(task); > @@ -2739,7 +2741,7 @@ static int cgroup_migrate_execute(struct > > } > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * Migration is committed, all target tasks are now on dst_csets. > @@ -2772,13 +2774,13 @@ out_cancel_attach: > } while_each_subsys_mask(); > } > out_release_tset: > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > list_splice_init(&tset->dst_csets, &tset->src_csets); > list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) { > list_splice_tail_init(&cset->mg_tasks, &cset->tasks); > list_del_init(&cset->mg_node); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * Re-initialize the cgroup_taskset structure in case it is reused > @@ -2836,7 +2838,7 @@ void cgroup_migrate_finish(struct cgroup > > lockdep_assert_held(&cgroup_mutex); > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > > list_for_each_entry_safe(cset, tmp_cset, &mgctx->preloaded_src_csets, > mg_src_preload_node) { > @@ -2856,7 +2858,7 @@ void cgroup_migrate_finish(struct cgroup > put_css_set_locked(cset); > } > > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > } > > /** > @@ -2999,14 +3001,14 @@ int cgroup_migrate(struct task_struct *l > * section to prevent tasks from being freed while taking the snapshot. > * spin_lock_irq() implies RCU critical section here. 
> */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > task = leader; > do { > cgroup_migrate_add_task(task, mgctx); > if (!threadgroup) > break; > } while_each_thread(leader, task); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > return cgroup_migrate_execute(mgctx); > } > @@ -3027,14 +3029,14 @@ int cgroup_attach_task(struct cgroup *ds > int ret = 0; > > /* look up all src csets */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > task = leader; > do { > cgroup_migrate_add_src(task_css_set(task), dst_cgrp, &mgctx); > if (!threadgroup) > break; > } while_each_thread(leader, task); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* prepare dst csets and commit */ > ret = cgroup_migrate_prepare_dst(&mgctx); > @@ -3191,7 +3193,7 @@ static int cgroup_update_dfl_csses(struc > lockdep_assert_held(&cgroup_mutex); > > /* look up all csses currently attached to @cgrp's subtree */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) { > struct cgrp_cset_link *link; > > @@ -3207,7 +3209,7 @@ static int cgroup_update_dfl_csses(struc > list_for_each_entry(link, &dsct->cset_links, cset_link) > cgroup_migrate_add_src(link->cset, dsct, &mgctx); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * We need to write-lock threadgroup_rwsem while migrating tasks. 
> @@ -3229,7 +3231,7 @@ static int cgroup_update_dfl_csses(struc > if (ret) > goto out_finish; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > list_for_each_entry(src_cset, &mgctx.preloaded_src_csets, > mg_src_preload_node) { > struct task_struct *task, *ntask; > @@ -3238,7 +3240,7 @@ static int cgroup_update_dfl_csses(struc > list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list) > cgroup_migrate_add_task(task, &mgctx); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > ret = cgroup_migrate_execute(&mgctx); > out_finish: > @@ -4186,9 +4188,9 @@ static void __cgroup_kill(struct cgroup > > lockdep_assert_held(&cgroup_mutex); > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cgrp->kill_seq++; > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it); > while ((task = css_task_iter_next(&it))) { > @@ -5146,7 +5148,7 @@ void css_task_iter_start(struct cgroup_s > > memset(it, 0, sizeof(*it)); > > - spin_lock_irqsave(&css_set_lock, irqflags); > + raw_spin_lock_irqsave(&css_set_lock, irqflags); > > it->ss = css->ss; > it->flags = flags; > @@ -5160,7 +5162,7 @@ void css_task_iter_start(struct cgroup_s > > css_task_iter_advance(it); > > - spin_unlock_irqrestore(&css_set_lock, irqflags); > + raw_spin_unlock_irqrestore(&css_set_lock, irqflags); > } > > /** > @@ -5180,7 +5182,7 @@ struct task_struct *css_task_iter_next(s > it->cur_task = NULL; > } > > - spin_lock_irqsave(&css_set_lock, irqflags); > + raw_spin_lock_irqsave(&css_set_lock, irqflags); > > /* @it may be half-advanced by skips, finish advancing */ > if (it->flags & CSS_TASK_ITER_SKIPPED) > @@ -5193,7 +5195,7 @@ struct task_struct *css_task_iter_next(s > css_task_iter_advance(it); > } > > - spin_unlock_irqrestore(&css_set_lock, irqflags); > + raw_spin_unlock_irqrestore(&css_set_lock, irqflags); > > return 
it->cur_task; > } > @@ -5209,10 +5211,10 @@ void css_task_iter_end(struct css_task_i > unsigned long irqflags; > > if (it->cur_cset) { > - spin_lock_irqsave(&css_set_lock, irqflags); > + raw_spin_lock_irqsave(&css_set_lock, irqflags); > list_del(&it->iters_node); > put_css_set_locked(it->cur_cset); > - spin_unlock_irqrestore(&css_set_lock, irqflags); > + raw_spin_unlock_irqrestore(&css_set_lock, irqflags); > } > > if (it->cur_dcset) > @@ -5378,9 +5380,9 @@ static ssize_t __cgroup_procs_write(stru > goto out_unlock; > > /* find the source cgroup */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * Process and thread migrations follow same delegation rule. Check > @@ -5667,11 +5669,11 @@ static void css_release_work_fn(struct w > > css_rstat_flush(&cgrp->self); > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > for (tcgrp = cgroup_parent(cgrp); tcgrp; > tcgrp = cgroup_parent(tcgrp)) > tcgrp->nr_dying_descendants--; > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * There are two control paths which try to determine > @@ -5922,7 +5924,7 @@ static struct cgroup *cgroup_create(stru > goto out_psi_free; > > /* allocation complete, commit to creation */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > for (i = 0; i < level; i++) { > tcgrp = cgrp->ancestors[i]; > tcgrp->nr_descendants++; > @@ -5935,7 +5937,7 @@ static struct cgroup *cgroup_create(stru > if (cgrp->freezer.e_freeze) > tcgrp->freezer.nr_frozen_descendants++; > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children); > atomic_inc(&root->nr_cgrps); > @@ -6181,10 +6183,10 @@ static int cgroup_destroy_locked(struct > */ > cgrp->self.flags &= ~CSS_ONLINE; > > - 
spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > list_for_each_entry(link, &cgrp->cset_links, cset_link) > link->cset->dead = true; > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* initiate massacre of all css's */ > for_each_css(css, ssid, cgrp) > @@ -6197,7 +6199,7 @@ static int cgroup_destroy_locked(struct > if (cgroup_is_threaded(cgrp)) > parent->nr_threaded_children--; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > for (tcgrp = parent; tcgrp; tcgrp = cgroup_parent(tcgrp)) { > tcgrp->nr_descendants--; > tcgrp->nr_dying_descendants++; > @@ -6208,7 +6210,7 @@ static int cgroup_destroy_locked(struct > if (test_bit(CGRP_FROZEN, &cgrp->flags)) > tcgrp->freezer.nr_frozen_descendants--; > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > cgroup1_check_for_release(parent); > > @@ -6557,7 +6559,7 @@ int proc_cgroup_show(struct seq_file *m, > goto out; > > rcu_read_lock(); > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > > for_each_root(root) { > struct cgroup_subsys *ss; > @@ -6612,7 +6614,7 @@ int proc_cgroup_show(struct seq_file *m, > > retval = 0; > out_unlock: > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > rcu_read_unlock(); > kfree(buf); > out: > @@ -6700,14 +6702,14 @@ static int cgroup_css_set_fork(struct ke > > cgroup_threadgroup_change_begin(current); > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cset = task_css_set(current); > get_css_set(cset); > if (kargs->cgrp) > kargs->kill_seq = kargs->cgrp->kill_seq; > else > kargs->kill_seq = cset->dfl_cgrp->kill_seq; > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > if (!(kargs->flags & CLONE_INTO_CGROUP)) { > kargs->cset = cset; > @@ -6897,7 +6899,7 @@ void cgroup_post_fork(struct task_struct > cset = kargs->cset; > kargs->cset = NULL; > > - spin_lock_irq(&css_set_lock); > + 
raw_spin_lock_irq(&css_set_lock); > > /* init tasks are special, only link regular threads */ > if (likely(child->pid)) { > @@ -6945,7 +6947,7 @@ void cgroup_post_fork(struct task_struct > kill = kargs->kill_seq != cgrp_kill_seq; > } > > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > /* > * Call ss->fork(). This must happen after @child is linked on > @@ -6995,7 +6997,7 @@ void cgroup_task_dead(struct task_struct > struct css_set *cset; > unsigned long flags; > > - spin_lock_irqsave(&css_set_lock, flags); > + raw_spin_lock_irqsave(&css_set_lock, flags); > > WARN_ON_ONCE(list_empty(&tsk->cg_list)); > cset = task_css_set(tsk); > @@ -7013,7 +7015,7 @@ void cgroup_task_dead(struct task_struct > test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags))) > cgroup_update_frozen(task_dfl_cgroup(tsk)); > > - spin_unlock_irqrestore(&css_set_lock, flags); > + raw_spin_unlock_irqrestore(&css_set_lock, flags); > } > > void cgroup_task_release(struct task_struct *task) > @@ -7031,10 +7033,10 @@ void cgroup_task_free(struct task_struct > struct css_set *cset = task_css_set(task); > > if (!list_empty(&task->cg_list)) { > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > css_set_skip_task_iters(task_css_set(task), task); > list_del_init(&task->cg_list); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > } > > put_css_set(cset); > --- a/kernel/cgroup/debug.c > +++ b/kernel/cgroup/debug.c > @@ -48,7 +48,7 @@ static int current_css_set_read(struct s > if (!cgroup_kn_lock_live(of->kn, false)) > return -ENODEV; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cset = task_css_set(current); > refcnt = refcount_read(&cset->refcount); > seq_printf(seq, "css_set %pK %d", cset, refcnt); > @@ -66,7 +66,7 @@ static int current_css_set_read(struct s > seq_printf(seq, "%2d: %-4s\t- %p[%d]\n", ss->id, ss->name, > css, css->id); > } > - spin_unlock_irq(&css_set_lock); > + 
raw_spin_unlock_irq(&css_set_lock); > cgroup_kn_unlock(of->kn); > return 0; > } > @@ -92,7 +92,7 @@ static int current_css_set_cg_links_read > if (!name_buf) > return -ENOMEM; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cset = task_css_set(current); > list_for_each_entry(link, &cset->cgrp_links, cgrp_link) { > struct cgroup *c = link->cgrp; > @@ -101,7 +101,7 @@ static int current_css_set_cg_links_read > seq_printf(seq, "Root %d group %s\n", > c->root->hierarchy_id, name_buf); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > kfree(name_buf); > return 0; > } > @@ -113,7 +113,7 @@ static int cgroup_css_links_read(struct > struct cgrp_cset_link *link; > int dead_cnt = 0, extra_refs = 0, threaded_csets = 0; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > > list_for_each_entry(link, &css->cgroup->cset_links, cset_link) { > struct css_set *cset = link->cset; > @@ -180,7 +180,7 @@ static int cgroup_css_links_read(struct > > WARN_ON(count != cset->nr_tasks); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > if (!dead_cnt && !extra_refs && !threaded_csets) > return 0; > --- a/kernel/cgroup/freezer.c > +++ b/kernel/cgroup/freezer.c > @@ -108,12 +108,12 @@ void cgroup_enter_frozen(void) > if (current->frozen) > return; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > current->frozen = true; > cgrp = task_dfl_cgroup(current); > cgroup_inc_frozen_cnt(cgrp); > cgroup_update_frozen(cgrp); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > } > > /* > @@ -129,7 +129,7 @@ void cgroup_leave_frozen(bool always_lea > { > struct cgroup *cgrp; > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cgrp = task_dfl_cgroup(current); > if (always_leave || !test_bit(CGRP_FREEZE, &cgrp->flags)) { > cgroup_dec_frozen_cnt(cgrp); > @@ -142,7 +142,7 @@ void cgroup_leave_frozen(bool always_lea > 
set_thread_flag(TIF_SIGPENDING); > spin_unlock(&current->sighand->siglock); > } > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > } > > /* > @@ -178,7 +178,7 @@ static void cgroup_do_freeze(struct cgro > > lockdep_assert_held(&cgroup_mutex); > > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > write_seqcount_begin(&cgrp->freezer.freeze_seq); > if (freeze) { > set_bit(CGRP_FREEZE, &cgrp->flags); > @@ -189,7 +189,7 @@ static void cgroup_do_freeze(struct cgro > cgrp->freezer.freeze_start_nsec); > } > write_seqcount_end(&cgrp->freezer.freeze_seq); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > if (freeze) > TRACE_CGROUP_PATH(freeze, cgrp); > @@ -212,10 +212,10 @@ static void cgroup_do_freeze(struct cgro > * Cgroup state should be revisited here to cover empty leaf cgroups > * and cgroups which descendants are already in the desired state. > */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > if (cgrp->nr_descendants == cgrp->freezer.nr_frozen_descendants) > cgroup_update_frozen(cgrp); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > } > > /* > --- a/kernel/cgroup/namespace.c > +++ b/kernel/cgroup/namespace.c > @@ -70,10 +70,10 @@ struct cgroup_namespace *copy_cgroup_ns( > return ERR_PTR(-ENOSPC); > > /* It is not safe to take cgroup_mutex here */ > - spin_lock_irq(&css_set_lock); > + raw_spin_lock_irq(&css_set_lock); > cset = task_css_set(current); > get_css_set(cset); > - spin_unlock_irq(&css_set_lock); > + raw_spin_unlock_irq(&css_set_lock); > > new_ns = alloc_cgroup_ns(); > if (IS_ERR(new_ns)) { ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t 2025-11-05 8:50 ` Peter Zijlstra @ 2025-11-05 16:20 ` Tejun Heo 0 siblings, 0 replies; 41+ messages in thread From: Tejun Heo @ 2025-11-05 16:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Calvin Owens, linux-kernel, Dan Schatzberg, Johannes Weiner, Michal Koutný, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, cgroups, linux-rt-devel Hello, On Wed, Nov 05, 2025 at 09:50:36AM +0100, Peter Zijlstra wrote: > On Tue, Nov 04, 2025 at 09:32:24AM -1000, Tejun Heo wrote: > > Convert css_set_lock from spinlock_t to raw_spinlock_t to address RT-related > > scheduling constraints. cgroup_task_dead() is called from finish_task_switch() > > which cannot schedule even in PREEMPT_RT kernels, requiring css_set_lock to be > > a raw spinlock to avoid sleeping in a non-preemptible context. > > The constraint for doing so is that each critical section is actually > bounded in time. The below seems to contain list iteration. I'm thinking > it is unbounded since userspace is in control of the cgroup hierarchy. Right, along with the problems Sebastian pointed out, doesn't look like this is the way to go. This doesn't need to happen in line. It just needs to happen after the last task switch. I'll bounce it out and run it asynchronously after the rq lock is dropped. Thanks. -- tejun ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH cgroup/for-6.19 2/2] cgroup: Convert css_set_lock locking to use cleanup guards 2025-11-04 18:11 DEBUG_ATOMIC_SLEEP spew in cgroup_task_dead() on next-20251104 Calvin Owens 2025-11-04 19:30 ` Tejun Heo 2025-11-04 19:32 ` [PATCH cgroup/for-6.19 1/2] cgroup: Convert css_set_lock from spinlock_t to raw_spinlock_t Tejun Heo @ 2025-11-04 19:32 ` Tejun Heo 2 siblings, 0 replies; 41+ messages in thread From: Tejun Heo @ 2025-11-04 19:32 UTC (permalink / raw) To: Calvin Owens Cc: linux-kernel, Dan Schatzberg, Peter Zijlstra, Johannes Weiner, Michal Koutný, Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt, cgroups, linux-rt-devel Convert most css_set_lock critical sections to use cleanup guards (guard(), scoped_guard()) for automatic lock management. This reduces the amount of manual lock/unlock pairing and eliminates several 'unsigned long flags' variables. cgroup_css_links_read() in debug.c is left unconverted as it would require excessive indentation. Signed-off-by: Tejun Heo <tj@kernel.org> kernel/cgroup/freezer.c | 38 +- kernel/cgroup/namespace.c | 8 6 files changed, 302 insertions(+), 346 deletions(-) --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -198,8 +198,6 @@ void put_css_set_locked(struct css_set * static inline void put_css_set(struct css_set *cset) { - unsigned long flags; - /* * Ensure that the refcount doesn't hit zero while any readers * can see it. 
Similar to atomic_dec_and_lock(), but for an @@ -208,9 +206,9 @@ static inline void put_css_set(struct cs if (refcount_dec_not_one(&cset->refcount)) return; - raw_spin_lock_irqsave(&css_set_lock, flags); + guard(raw_spinlock_irqsave)(&css_set_lock); + put_css_set_locked(cset); - raw_spin_unlock_irqrestore(&css_set_lock, flags); } /* --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -73,9 +73,8 @@ int cgroup_attach_task_all(struct task_s for_each_root(root) { struct cgroup *from_cgrp; - raw_spin_lock_irq(&css_set_lock); - from_cgrp = task_cgroup_from_root(from, root); - raw_spin_unlock_irq(&css_set_lock); + scoped_guard (raw_spinlock_irq, &css_set_lock) + from_cgrp = task_cgroup_from_root(from, root); retval = cgroup_attach_task(from_cgrp, tsk, false); if (retval) @@ -121,10 +120,10 @@ int cgroup_transfer_tasks(struct cgroup cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); /* all tasks in @from are being moved, all csets are source */ - raw_spin_lock_irq(&css_set_lock); - list_for_each_entry(link, &from->cset_links, cset_link) - cgroup_migrate_add_src(link->cset, to, &mgctx); - raw_spin_unlock_irq(&css_set_lock); + scoped_guard (raw_spinlock_irq, &css_set_lock) { + list_for_each_entry(link, &from->cset_links, cset_link) + cgroup_migrate_add_src(link->cset, to, &mgctx); + } ret = cgroup_migrate_prepare_dst(&mgctx); if (ret) @@ -1299,7 +1298,6 @@ struct cgroup *task_get_cgroup1(struct t { struct cgroup *cgrp = ERR_PTR(-ENOENT); struct cgroup_root *root; - unsigned long flags; rcu_read_lock(); for_each_root(root) { @@ -1308,11 +1306,11 @@ struct cgroup *task_get_cgroup1(struct t continue; if (root->hierarchy_id != hierarchy_id) continue; - raw_spin_lock_irqsave(&css_set_lock, flags); - cgrp = task_cgroup_from_root(tsk, root); - if (!cgrp || !cgroup_tryget(cgrp)) - cgrp = ERR_PTR(-ENOENT); - raw_spin_unlock_irqrestore(&css_set_lock, flags); + scoped_guard (raw_spinlock_irqsave, &css_set_lock) { + cgrp = task_cgroup_from_root(tsk, root); + if (!cgrp || 
!cgroup_tryget(cgrp)) + cgrp = ERR_PTR(-ENOENT); + } break; } rcu_read_unlock(); --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -666,13 +666,9 @@ int __cgroup_task_count(const struct cgr */ int cgroup_task_count(const struct cgroup *cgrp) { - int count; + guard(raw_spinlock_irq)(&css_set_lock); - raw_spin_lock_irq(&css_set_lock); - count = __cgroup_task_count(cgrp); - raw_spin_unlock_irq(&css_set_lock); - - return count; + return __cgroup_task_count(cgrp); } static struct cgroup *kn_priv(struct kernfs_node *kn) @@ -1238,11 +1234,11 @@ static struct css_set *find_css_set(stru /* First see if we already have a cgroup group that matches * the desired set */ - raw_spin_lock_irq(&css_set_lock); - cset = find_existing_css_set(old_cset, cgrp, template); - if (cset) - get_css_set(cset); - raw_spin_unlock_irq(&css_set_lock); + scoped_guard (raw_spinlock_irq, &css_set_lock) { + cset = find_existing_css_set(old_cset, cgrp, template); + if (cset) + get_css_set(cset); + } if (cset) return cset; @@ -1274,34 +1270,33 @@ static struct css_set *find_css_set(stru * find_existing_css_set() */ memcpy(cset->subsys, template, sizeof(cset->subsys)); - raw_spin_lock_irq(&css_set_lock); - /* Add reference counts and links from the new css_set. */ - list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) { - struct cgroup *c = link->cgrp; + scoped_guard (raw_spinlock_irq, &css_set_lock) { + /* Add reference counts and links from the new css_set. 
*/ + list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) { + struct cgroup *c = link->cgrp; - if (c->root == cgrp->root) - c = cgrp; - link_css_set(&tmp_links, cset, c); - } + if (c->root == cgrp->root) + c = cgrp; + link_css_set(&tmp_links, cset, c); + } - BUG_ON(!list_empty(&tmp_links)); + BUG_ON(!list_empty(&tmp_links)); - css_set_count++; + css_set_count++; - /* Add @cset to the hash table */ - key = css_set_hash(cset->subsys); - hash_add(css_set_table, &cset->hlist, key); + /* Add @cset to the hash table */ + key = css_set_hash(cset->subsys); + hash_add(css_set_table, &cset->hlist, key); - for_each_subsys(ss, ssid) { - struct cgroup_subsys_state *css = cset->subsys[ssid]; + for_each_subsys(ss, ssid) { + struct cgroup_subsys_state *css = cset->subsys[ssid]; - list_add_tail(&cset->e_cset_node[ssid], - &css->cgroup->e_csets[ssid]); - css_get(css); + list_add_tail(&cset->e_cset_node[ssid], + &css->cgroup->e_csets[ssid]); + css_get(css); + } } - raw_spin_unlock_irq(&css_set_lock); - /* * If @cset should be threaded, look up the matching dom_cset and * link them up. 
	 * We first fully initialize @cset then look for the
@@ -1317,11 +1312,11 @@ static struct css_set *find_css_set(stru
 			return NULL;
 		}

-		raw_spin_lock_irq(&css_set_lock);
-		cset->dom_cset = dcset;
-		list_add_tail(&cset->threaded_csets_node,
-			      &dcset->threaded_csets);
-		raw_spin_unlock_irq(&css_set_lock);
+		scoped_guard (raw_spinlock_irq, &css_set_lock) {
+			cset->dom_cset = dcset;
+			list_add_tail(&cset->threaded_csets_node,
+				      &dcset->threaded_csets);
+		}
 	}

 	return cset;
@@ -1414,16 +1409,14 @@ static void cgroup_destroy_root(struct c
 	 * Release all the links from cset_links to this hierarchy's
 	 * root cgroup
 	 */
-	raw_spin_lock_irq(&css_set_lock);
-
-	list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) {
-		list_del(&link->cset_link);
-		list_del(&link->cgrp_link);
-		kfree(link);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) {
+			list_del(&link->cset_link);
+			list_del(&link->cgrp_link);
+			kfree(link);
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);
-
 	WARN_ON_ONCE(list_empty(&root->root_list));
 	list_del_rcu(&root->root_list);
 	cgroup_root_count--;
@@ -1919,25 +1912,27 @@ int rebind_subsystems(struct cgroup_root
 		rcu_assign_pointer(dcgrp->subsys[ssid], css);
 		ss->root = dst_root;

-		raw_spin_lock_irq(&css_set_lock);
-		css->cgroup = dcgrp;
-		WARN_ON(!list_empty(&dcgrp->e_csets[ss->id]));
-		list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id],
-					 e_cset_node[ss->id]) {
-			list_move_tail(&cset->e_cset_node[ss->id],
-				       &dcgrp->e_csets[ss->id]);
-			/*
-			 * all css_sets of scgrp together in same order to dcgrp,
-			 * patch in-flight iterators to preserve correct iteration.
-			 * since the iterator is always advanced right away and
-			 * finished when it->cset_pos meets it->cset_head, so only
-			 * update it->cset_head is enough here.
-			 */
-			list_for_each_entry(it, &cset->task_iters, iters_node)
-				if (it->cset_head == &scgrp->e_csets[ss->id])
-					it->cset_head = &dcgrp->e_csets[ss->id];
+		scoped_guard (raw_spinlock_irq, &css_set_lock) {
+			css->cgroup = dcgrp;
+			WARN_ON(!list_empty(&dcgrp->e_csets[ss->id]));
+			list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id],
+						 e_cset_node[ss->id]) {
+				list_move_tail(&cset->e_cset_node[ss->id],
+					       &dcgrp->e_csets[ss->id]);
+				/*
+				 * all css_sets of scgrp together in same order
+				 * to dcgrp, patch in-flight iterators to
+				 * preserve correct iteration. since the
+				 * iterator is always advanced right away and
+				 * finished when it->cset_pos meets
+				 * it->cset_head, so only update it->cset_head
+				 * is enough here.
+				 */
+				list_for_each_entry(it, &cset->task_iters, iters_node)
+					if (it->cset_head == &scgrp->e_csets[ss->id])
+						it->cset_head = &dcgrp->e_csets[ss->id];
+			}
 		}
-		raw_spin_unlock_irq(&css_set_lock);

 		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
@@ -1973,10 +1968,10 @@ int cgroup_show_path(struct seq_file *sf
 	if (!buf)
 		return -ENOMEM;

-	raw_spin_lock_irq(&css_set_lock);
-	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
-	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
+		len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
+	}

 	if (len == -E2BIG)
 		len = -ERANGE;
@@ -2232,13 +2227,13 @@ int cgroup_setup_root(struct cgroup_root
 	 * Link the root cgroup in this hierarchy into all the css_set
 	 * objects.
 	 */
-	raw_spin_lock_irq(&css_set_lock);
-	hash_for_each(css_set_table, i, cset, hlist) {
-		link_css_set(&tmp_links, cset, root_cgrp);
-		if (css_set_populated(cset))
-			cgroup_update_populated(root_cgrp, true);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		hash_for_each(css_set_table, i, cset, hlist) {
+			link_css_set(&tmp_links, cset, root_cgrp);
+			if (css_set_populated(cset))
+				cgroup_update_populated(root_cgrp, true);
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	BUG_ON(!list_empty(&root_cgrp->self.children));
 	BUG_ON(atomic_read(&root->nr_cgrps) != 1);
@@ -2282,11 +2277,8 @@ int cgroup_do_get_tree(struct fs_context
 		struct cgroup *cgrp;

 		cgroup_lock();
-		raw_spin_lock_irq(&css_set_lock);
-
-		cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
-
-		raw_spin_unlock_irq(&css_set_lock);
+		scoped_guard (raw_spinlock_irq, &css_set_lock)
+			cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
 		cgroup_unlock();

 		nsdentry = kernfs_node_dentry(cgrp->kn, sb);
@@ -2498,11 +2490,8 @@ int cgroup_path_ns(struct cgroup *cgrp,
 	int ret;

 	cgroup_lock();
-	raw_spin_lock_irq(&css_set_lock);
-
-	ret = cgroup_path_ns_locked(cgrp, buf, buflen, ns);
-
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock)
+		ret = cgroup_path_ns_locked(cgrp, buf, buflen, ns);
 	cgroup_unlock();

 	return ret;
@@ -2721,27 +2710,27 @@ static int cgroup_migrate_execute(struct
 	 * the new cgroup. There are no failure cases after here, so this
 	 * is the commit point.
 	 */
-	raw_spin_lock_irq(&css_set_lock);
-	list_for_each_entry(cset, &tset->src_csets, mg_node) {
-		list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
-			struct css_set *from_cset = task_css_set(task);
-			struct css_set *to_cset = cset->mg_dst_cset;
-
-			get_css_set(to_cset);
-			to_cset->nr_tasks++;
-			css_set_move_task(task, from_cset, to_cset, true);
-			from_cset->nr_tasks--;
-			/*
-			 * If the source or destination cgroup is frozen,
-			 * the task might require to change its state.
-			 */
-			cgroup_freezer_migrate_task(task, from_cset->dfl_cgrp,
-						    to_cset->dfl_cgrp);
-			put_css_set_locked(from_cset);
-
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		list_for_each_entry(cset, &tset->src_csets, mg_node) {
+			list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
+				struct css_set *from_cset = task_css_set(task);
+				struct css_set *to_cset = cset->mg_dst_cset;
+
+				get_css_set(to_cset);
+				to_cset->nr_tasks++;
+				css_set_move_task(task, from_cset, to_cset, true);
+				from_cset->nr_tasks--;
+				/*
+				 * If the source or destination cgroup is
+				 * frozen, the task might require to change its
+				 * state.
+				 */
+				cgroup_freezer_migrate_task(task, from_cset->dfl_cgrp,
+							    to_cset->dfl_cgrp);
+				put_css_set_locked(from_cset);
+			}
 		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	/*
 	 * Migration is committed, all target tasks are now on dst_csets.
@@ -2774,13 +2763,13 @@ out_cancel_attach:
 		} while_each_subsys_mask();
 	}
 out_release_tset:
-	raw_spin_lock_irq(&css_set_lock);
-	list_splice_init(&tset->dst_csets, &tset->src_csets);
-	list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) {
-		list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
-		list_del_init(&cset->mg_node);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		list_splice_init(&tset->dst_csets, &tset->src_csets);
+		list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) {
+			list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
+			list_del_init(&cset->mg_node);
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	/*
 	 * Re-initialize the cgroup_taskset structure in case it is reused
@@ -2838,7 +2827,7 @@ void cgroup_migrate_finish(struct cgroup

 	lockdep_assert_held(&cgroup_mutex);

-	raw_spin_lock_irq(&css_set_lock);
+	guard(raw_spinlock_irq)(&css_set_lock);

 	list_for_each_entry_safe(cset, tmp_cset, &mgctx->preloaded_src_csets,
 				 mg_src_preload_node) {
@@ -2857,8 +2846,6 @@ void cgroup_migrate_finish(struct cgroup
 		list_del_init(&cset->mg_dst_preload_node);
 		put_css_set_locked(cset);
 	}
-
-	raw_spin_unlock_irq(&css_set_lock);
 }

 /**
@@ -2994,21 +2981,19 @@ int cgroup_migrate_prepare_dst(struct cg
 int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 		   struct cgroup_mgctx *mgctx)
 {
-	struct task_struct *task;
-
 	/*
 	 * The following thread iteration should be inside an RCU critical
 	 * section to prevent tasks from being freed while taking the snapshot.
 	 * spin_lock_irq() implies RCU critical section here.
 	 */
-	raw_spin_lock_irq(&css_set_lock);
-	task = leader;
-	do {
-		cgroup_migrate_add_task(task, mgctx);
-		if (!threadgroup)
-			break;
-	} while_each_thread(leader, task);
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		struct task_struct *task = leader;
+		do {
+			cgroup_migrate_add_task(task, mgctx);
+			if (!threadgroup)
+				break;
+		} while_each_thread(leader, task);
+	}

 	return cgroup_migrate_execute(mgctx);
 }
@@ -3025,18 +3010,17 @@ int cgroup_attach_task(struct cgroup *ds
 		       bool threadgroup)
 {
 	DEFINE_CGROUP_MGCTX(mgctx);
-	struct task_struct *task;
 	int ret = 0;

 	/* look up all src csets */
-	raw_spin_lock_irq(&css_set_lock);
-	task = leader;
-	do {
-		cgroup_migrate_add_src(task_css_set(task), dst_cgrp, &mgctx);
-		if (!threadgroup)
-			break;
-	} while_each_thread(leader, task);
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		struct task_struct *task = leader;
+		do {
+			cgroup_migrate_add_src(task_css_set(task), dst_cgrp, &mgctx);
+			if (!threadgroup)
+				break;
+		} while_each_thread(leader, task);
+	}

 	/* prepare dst csets and commit */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3193,23 +3177,23 @@ static int cgroup_update_dfl_csses(struc
 	lockdep_assert_held(&cgroup_mutex);

 	/* look up all csses currently attached to @cgrp's subtree */
-	raw_spin_lock_irq(&css_set_lock);
-	cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
-		struct cgrp_cset_link *link;
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+			struct cgrp_cset_link *link;

-		/*
-		 * As cgroup_update_dfl_csses() is only called by
-		 * cgroup_apply_control(). The csses associated with the
-		 * given cgrp will not be affected by changes made to
-		 * its subtree_control file. We can skip them.
-		 */
-		if (dsct == cgrp)
-			continue;
+			/*
+			 * As cgroup_update_dfl_csses() is only called by
+			 * cgroup_apply_control(). The csses associated with the
+			 * given cgrp will not be affected by changes made to
+			 * its subtree_control file. We can skip them.
+			 */
+			if (dsct == cgrp)
+				continue;

-		list_for_each_entry(link, &dsct->cset_links, cset_link)
-			cgroup_migrate_add_src(link->cset, dsct, &mgctx);
+			list_for_each_entry(link, &dsct->cset_links, cset_link)
+				cgroup_migrate_add_src(link->cset, dsct, &mgctx);
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	/*
 	 * We need to write-lock threadgroup_rwsem while migrating tasks.
@@ -3231,16 +3215,16 @@ static int cgroup_update_dfl_csses(struc
 	if (ret)
 		goto out_finish;

-	raw_spin_lock_irq(&css_set_lock);
-	list_for_each_entry(src_cset, &mgctx.preloaded_src_csets,
-			    mg_src_preload_node) {
-		struct task_struct *task, *ntask;
-
-		/* all tasks in src_csets need to be migrated */
-		list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
-			cgroup_migrate_add_task(task, &mgctx);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		list_for_each_entry(src_cset, &mgctx.preloaded_src_csets,
+				    mg_src_preload_node) {
+			struct task_struct *task, *ntask;
+
+			/* all tasks in src_csets need to be migrated */
+			list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
+				cgroup_migrate_add_task(task, &mgctx);
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
@@ -4188,9 +4172,8 @@ static void __cgroup_kill(struct cgroup

 	lockdep_assert_held(&cgroup_mutex);

-	raw_spin_lock_irq(&css_set_lock);
-	cgrp->kill_seq++;
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock)
+		cgrp->kill_seq++;

 	css_task_iter_start(&cgrp->self,
 			    CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
 	while ((task = css_task_iter_next(&it))) {
@@ -5144,11 +5127,9 @@ repeat:
 void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 			 struct css_task_iter *it)
 {
-	unsigned long irqflags;
-
 	memset(it, 0, sizeof(*it));

-	raw_spin_lock_irqsave(&css_set_lock, irqflags);
+	guard(raw_spinlock_irqsave)(&css_set_lock);

 	it->ss = css->ss;
 	it->flags = flags;
@@ -5161,8 +5142,6 @@ void css_task_iter_start(struct cgroup_s
 	it->cset_head = it->cset_pos;

 	css_task_iter_advance(it);
-
-	raw_spin_unlock_irqrestore(&css_set_lock, irqflags);
 }

 /**
@@ -5175,14 +5154,12 @@ void css_task_iter_start(struct cgroup_s
 */
 struct task_struct *css_task_iter_next(struct css_task_iter *it)
 {
-	unsigned long irqflags;
-
 	if (it->cur_task) {
 		put_task_struct(it->cur_task);
 		it->cur_task = NULL;
 	}

-	raw_spin_lock_irqsave(&css_set_lock, irqflags);
+	guard(raw_spinlock_irqsave)(&css_set_lock);

 	/* @it may be half-advanced by skips, finish advancing */
 	if (it->flags & CSS_TASK_ITER_SKIPPED)
@@ -5195,8 +5172,6 @@ struct task_struct *css_task_iter_next(s
 		css_task_iter_advance(it);
 	}

-	raw_spin_unlock_irqrestore(&css_set_lock, irqflags);
-
 	return it->cur_task;
 }

@@ -5208,13 +5183,11 @@ struct task_struct *css_task_iter_next(s
 */
 void css_task_iter_end(struct css_task_iter *it)
 {
-	unsigned long irqflags;
-
 	if (it->cur_cset) {
-		raw_spin_lock_irqsave(&css_set_lock, irqflags);
-		list_del(&it->iters_node);
-		put_css_set_locked(it->cur_cset);
-		raw_spin_unlock_irqrestore(&css_set_lock, irqflags);
+		scoped_guard (raw_spinlock_irqsave, &css_set_lock) {
+			list_del(&it->iters_node);
+			put_css_set_locked(it->cur_cset);
+		}
 	}

 	if (it->cur_dcset)
@@ -5380,9 +5353,8 @@ static ssize_t __cgroup_procs_write(stru
 		goto out_unlock;

 	/* find the source cgroup */
-	raw_spin_lock_irq(&css_set_lock);
-	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock)
+		src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);

 	/*
 	 * Process and thread migrations follow same delegation rule. Check
@@ -5669,11 +5641,11 @@ static void css_release_work_fn(struct w

 	css_rstat_flush(&cgrp->self);

-	raw_spin_lock_irq(&css_set_lock);
-	for (tcgrp = cgroup_parent(cgrp); tcgrp;
-	     tcgrp = cgroup_parent(tcgrp))
-		tcgrp->nr_dying_descendants--;
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		for (tcgrp = cgroup_parent(cgrp); tcgrp;
+		     tcgrp = cgroup_parent(tcgrp))
+			tcgrp->nr_dying_descendants--;
+	}

 	/*
 	 * There are two control paths which try to determine
@@ -5924,20 +5896,20 @@ static struct cgroup *cgroup_create(stru
 		goto out_psi_free;

 	/* allocation complete, commit to creation */
-	raw_spin_lock_irq(&css_set_lock);
-	for (i = 0; i < level; i++) {
-		tcgrp = cgrp->ancestors[i];
-		tcgrp->nr_descendants++;
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		for (i = 0; i < level; i++) {
+			tcgrp = cgrp->ancestors[i];
+			tcgrp->nr_descendants++;

-		/*
-		 * If the new cgroup is frozen, all ancestor cgroups get a new
-		 * frozen descendant, but their state can't change because of
-		 * this.
-		 */
-		if (cgrp->freezer.e_freeze)
-			tcgrp->freezer.nr_frozen_descendants++;
+			/*
+			 * If the new cgroup is frozen, all ancestor cgroups get
+			 * a new frozen descendant, but their state can't change
+			 * because of this.
+			 */
+			if (cgrp->freezer.e_freeze)
+				tcgrp->freezer.nr_frozen_descendants++;
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
 	atomic_inc(&root->nr_cgrps);
@@ -6183,10 +6155,10 @@ static int cgroup_destroy_locked(struct
 	 */
 	cgrp->self.flags &= ~CSS_ONLINE;

-	raw_spin_lock_irq(&css_set_lock);
-	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-		link->cset->dead = true;
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		list_for_each_entry(link, &cgrp->cset_links, cset_link)
+			link->cset->dead = true;
+	}

 	/* initiate massacre of all css's */
 	for_each_css(css, ssid, cgrp)
@@ -6199,18 +6171,18 @@ static int cgroup_destroy_locked(struct
 	if (cgroup_is_threaded(cgrp))
 		parent->nr_threaded_children--;

-	raw_spin_lock_irq(&css_set_lock);
-	for (tcgrp = parent; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
-		tcgrp->nr_descendants--;
-		tcgrp->nr_dying_descendants++;
-		/*
-		 * If the dying cgroup is frozen, decrease frozen descendants
-		 * counters of ancestor cgroups.
-		 */
-		if (test_bit(CGRP_FROZEN, &cgrp->flags))
-			tcgrp->freezer.nr_frozen_descendants--;
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		for (tcgrp = parent; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
+			tcgrp->nr_descendants--;
+			tcgrp->nr_dying_descendants++;
+			/*
+			 * If the dying cgroup is frozen, decrease frozen
+			 * descendants counters of ancestor cgroups.
+			 */
+			if (test_bit(CGRP_FROZEN, &cgrp->flags))
+				tcgrp->freezer.nr_frozen_descendants--;
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);

 	cgroup1_check_for_release(parent);

@@ -6549,17 +6521,14 @@ EXPORT_SYMBOL_GPL(cgroup_get_from_id);
 int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
 		     struct pid *pid, struct task_struct *tsk)
 {
-	char *buf;
-	int retval;
 	struct cgroup_root *root;

-	retval = -ENOMEM;
-	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	char *buf __free(kfree) = kmalloc(PATH_MAX, GFP_KERNEL);
 	if (!buf)
-		goto out;
+		return -ENOMEM;

-	rcu_read_lock();
-	raw_spin_lock_irq(&css_set_lock);
+	guard(rcu)();
+	guard(raw_spinlock_irq)(&css_set_lock);

 	for_each_root(root) {
 		struct cgroup_subsys *ss;
@@ -6594,12 +6563,12 @@ int proc_cgroup_show(struct seq_file *m,
 		 * " (deleted)" is appended to the cgroup path.
 		 */
 		if (cgroup_on_dfl(cgrp) || !(tsk->flags & PF_EXITING)) {
-			retval = cgroup_path_ns_locked(cgrp, buf, PATH_MAX,
-						current->nsproxy->cgroup_ns);
+			int retval = cgroup_path_ns_locked(cgrp, buf, PATH_MAX,
+						current->nsproxy->cgroup_ns);
 			if (retval == -E2BIG)
 				retval = -ENAMETOOLONG;
 			if (retval < 0)
-				goto out_unlock;
+				return retval;

 			seq_puts(m, buf);
 		} else {
@@ -6612,13 +6581,7 @@ int proc_cgroup_show(struct seq_file *m,
 		seq_putc(m, '\n');
 	}

-	retval = 0;
-out_unlock:
-	raw_spin_unlock_irq(&css_set_lock);
-	rcu_read_unlock();
-	kfree(buf);
-out:
-	return retval;
+	return 0;
 }

 /**
@@ -6702,14 +6665,14 @@ static int cgroup_css_set_fork(struct ke

 	cgroup_threadgroup_change_begin(current);

-	raw_spin_lock_irq(&css_set_lock);
-	cset = task_css_set(current);
-	get_css_set(cset);
-	if (kargs->cgrp)
-		kargs->kill_seq = kargs->cgrp->kill_seq;
-	else
-		kargs->kill_seq = cset->dfl_cgrp->kill_seq;
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard(raw_spinlock_irq, &css_set_lock) {
+		cset = task_css_set(current);
+		get_css_set(cset);
+		if (kargs->cgrp)
+			kargs->kill_seq = kargs->cgrp->kill_seq;
+		else
+			kargs->kill_seq = cset->dfl_cgrp->kill_seq;
+	}

 	if (!(kargs->flags & CLONE_INTO_CGROUP)) {
 		kargs->cset = cset;
@@ -6899,56 +6862,56 @@ void cgroup_post_fork(struct task_struct
 	cset = kargs->cset;
 	kargs->cset = NULL;

-	raw_spin_lock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		/* init tasks are special, only link regular threads */
+		if (likely(child->pid)) {
+			if (kargs->cgrp) {
+				cgrp_flags = kargs->cgrp->flags;
+				cgrp_kill_seq = kargs->cgrp->kill_seq;
+			} else {
+				cgrp_flags = cset->dfl_cgrp->flags;
+				cgrp_kill_seq = cset->dfl_cgrp->kill_seq;
+			}

-	/* init tasks are special, only link regular threads */
-	if (likely(child->pid)) {
-		if (kargs->cgrp) {
-			cgrp_flags = kargs->cgrp->flags;
-			cgrp_kill_seq = kargs->cgrp->kill_seq;
+			WARN_ON_ONCE(!list_empty(&child->cg_list));
+			cset->nr_tasks++;
+			css_set_move_task(child, NULL, cset, false);
 		} else {
-			cgrp_flags = cset->dfl_cgrp->flags;
-			cgrp_kill_seq = cset->dfl_cgrp->kill_seq;
+			put_css_set(cset);
+			cset = NULL;
 		}

-		WARN_ON_ONCE(!list_empty(&child->cg_list));
-		cset->nr_tasks++;
-		css_set_move_task(child, NULL, cset, false);
-	} else {
-		put_css_set(cset);
-		cset = NULL;
-	}
-
-	if (!(child->flags & PF_KTHREAD)) {
-		if (unlikely(test_bit(CGRP_FREEZE, &cgrp_flags))) {
-			/*
-			 * If the cgroup has to be frozen, the new task has
-			 * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
-			 * get the task into the frozen state.
-			 */
-			spin_lock(&child->sighand->siglock);
-			WARN_ON_ONCE(child->frozen);
-			child->jobctl |= JOBCTL_TRAP_FREEZE;
-			spin_unlock(&child->sighand->siglock);
+		if (!(child->flags & PF_KTHREAD)) {
+			if (unlikely(test_bit(CGRP_FREEZE, &cgrp_flags))) {
+				/*
+				 * If the cgroup has to be frozen, the new task
+				 * has too. Let's set the JOBCTL_TRAP_FREEZE
+				 * jobctl bit to get the task into the frozen
+				 * state.
+				 */
+				spin_lock(&child->sighand->siglock);
+				WARN_ON_ONCE(child->frozen);
+				child->jobctl |= JOBCTL_TRAP_FREEZE;
+				spin_unlock(&child->sighand->siglock);
+
+				/*
+				 * Calling cgroup_update_frozen() isn't required
+				 * here, because it will be called anyway a bit
+				 * later from do_freezer_trap(). So we avoid
+				 * cgroup's transient switch from the frozen
+				 * state and back.
+				 */
+			}

 			/*
-			 * Calling cgroup_update_frozen() isn't required here,
-			 * because it will be called anyway a bit later from
-			 * do_freezer_trap(). So we avoid cgroup's transient
-			 * switch from the frozen state and back.
+			 * If the cgroup is to be killed notice it now and take
+			 * the child down right after we finished preparing it
+			 * for userspace.
 			 */
+			kill = kargs->kill_seq != cgrp_kill_seq;
 		}
-
-		/*
-		 * If the cgroup is to be killed notice it now and take the
-		 * child down right after we finished preparing it for
-		 * userspace.
-		 */
-		kill = kargs->kill_seq != cgrp_kill_seq;
 	}
-	raw_spin_unlock_irq(&css_set_lock);
-
 	/*
 	 * Call ss->fork(). This must happen after @child is linked on
 	 * css_set; otherwise, @child might change state between ->fork()
@@ -6995,9 +6958,8 @@ void cgroup_task_exit(struct task_struct
 void cgroup_task_dead(struct task_struct *tsk)
 {
 	struct css_set *cset;
-	unsigned long flags;

-	raw_spin_lock_irqsave(&css_set_lock, flags);
+	guard(raw_spinlock_irqsave)(&css_set_lock);

 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
@@ -7014,8 +6976,6 @@ void cgroup_task_dead(struct task_struct
 	if (unlikely(!(tsk->flags & PF_KTHREAD) &&
 		     test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
 		cgroup_update_frozen(task_dfl_cgroup(tsk));
-
-	raw_spin_unlock_irqrestore(&css_set_lock, flags);
 }

 void cgroup_task_release(struct task_struct *task)
@@ -7033,10 +6993,10 @@ void cgroup_task_free(struct task_struct
 	struct css_set *cset = task_css_set(task);

 	if (!list_empty(&task->cg_list)) {
-		raw_spin_lock_irq(&css_set_lock);
-		css_set_skip_task_iters(task_css_set(task), task);
-		list_del_init(&task->cg_list);
-		raw_spin_unlock_irq(&css_set_lock);
+		scoped_guard (raw_spinlock_irq, &css_set_lock) {
+			css_set_skip_task_iters(task_css_set(task), task);
+			list_del_init(&task->cg_list);
+		}
 	}

 	put_css_set(cset);
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -40,33 +40,34 @@ static u64 debug_taskcount_read(struct c
 static int current_css_set_read(struct seq_file *seq, void *v)
 {
 	struct kernfs_open_file *of = seq->private;
-	struct css_set *cset;
-	struct cgroup_subsys *ss;
-	struct cgroup_subsys_state *css;
-	int i, refcnt;

 	if (!cgroup_kn_lock_live(of->kn, false))
 		return -ENODEV;

-	raw_spin_lock_irq(&css_set_lock);
-	cset = task_css_set(current);
-	refcnt = refcount_read(&cset->refcount);
-	seq_printf(seq, "css_set %pK %d", cset, refcnt);
-	if (refcnt > cset->nr_tasks)
-		seq_printf(seq, " +%d", refcnt - cset->nr_tasks);
-	seq_puts(seq, "\n");
-
-	/*
-	 * Print the css'es stored in the current css_set.
-	 */
-	for_each_subsys(ss, i) {
-		css = cset->subsys[ss->id];
-		if (!css)
-			continue;
-		seq_printf(seq, "%2d: %-4s\t- %p[%d]\n", ss->id, ss->name,
-			   css, css->id);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		struct css_set *cset = task_css_set(current);
+		struct cgroup_subsys *ss;
+		struct cgroup_subsys_state *css;
+		int i, refcnt;
+
+		refcnt = refcount_read(&cset->refcount);
+		seq_printf(seq, "css_set %pK %d", cset, refcnt);
+		if (refcnt > cset->nr_tasks)
+			seq_printf(seq, " +%d", refcnt - cset->nr_tasks);
+		seq_puts(seq, "\n");
+
+		/*
+		 * Print the css'es stored in the current css_set.
+		 */
+		for_each_subsys(ss, i) {
+			css = cset->subsys[ss->id];
+			if (!css)
+				continue;
+			seq_printf(seq, "%2d: %-4s\t- %p[%d]\n", ss->id, ss->name,
+				   css, css->id);
+		}
 	}
-	raw_spin_unlock_irq(&css_set_lock);
+
 	cgroup_kn_unlock(of->kn);
 	return 0;
 }
@@ -86,13 +87,13 @@ static int current_css_set_cg_links_read
 {
 	struct cgrp_cset_link *link;
 	struct css_set *cset;
-	char *name_buf;

-	name_buf = kmalloc(NAME_MAX + 1, GFP_KERNEL);
+	char *name_buf __free(kfree) = kmalloc(NAME_MAX + 1, GFP_KERNEL);
 	if (!name_buf)
 		return -ENOMEM;

-	raw_spin_lock_irq(&css_set_lock);
+	guard(raw_spinlock_irq)(&css_set_lock);
+
 	cset = task_css_set(current);
 	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 		struct cgroup *c = link->cgrp;
@@ -101,8 +102,7 @@ static int current_css_set_cg_links_read
 		seq_printf(seq, "Root %d group %s\n",
 			   c->root->hierarchy_id, name_buf);
 	}
-	raw_spin_unlock_irq(&css_set_lock);
-	kfree(name_buf);
+
 	return 0;
 }
--- a/kernel/cgroup/freezer.c
+++ b/kernel/cgroup/freezer.c
@@ -108,12 +108,12 @@ void cgroup_enter_frozen(void)
 	if (current->frozen)
 		return;

-	raw_spin_lock_irq(&css_set_lock);
+	guard (raw_spinlock_irq)(&css_set_lock);
+
 	current->frozen = true;
 	cgrp = task_dfl_cgroup(current);
 	cgroup_inc_frozen_cnt(cgrp);
 	cgroup_update_frozen(cgrp);
-	raw_spin_unlock_irq(&css_set_lock);
 }

 /*
@@ -129,7 +129,8 @@ void cgroup_leave_frozen(bool always_lea
 {
 	struct cgroup *cgrp;

-	raw_spin_lock_irq(&css_set_lock);
+	guard (raw_spinlock_irq)(&css_set_lock);
+
 	cgrp = task_dfl_cgroup(current);
 	if (always_leave || !test_bit(CGRP_FREEZE, &cgrp->flags)) {
 		cgroup_dec_frozen_cnt(cgrp);
@@ -142,7 +143,6 @@ void cgroup_leave_frozen(bool always_lea
 		set_thread_flag(TIF_SIGPENDING);
 		spin_unlock(&current->sighand->siglock);
 	}
-	raw_spin_unlock_irq(&css_set_lock);
 }

 /*
@@ -178,18 +178,18 @@ static void cgroup_do_freeze(struct cgro

 	lockdep_assert_held(&cgroup_mutex);

-	raw_spin_lock_irq(&css_set_lock);
-	write_seqcount_begin(&cgrp->freezer.freeze_seq);
-	if (freeze) {
-		set_bit(CGRP_FREEZE, &cgrp->flags);
-		cgrp->freezer.freeze_start_nsec = ts_nsec;
-	} else {
-		clear_bit(CGRP_FREEZE, &cgrp->flags);
-		cgrp->freezer.frozen_nsec += (ts_nsec -
-					      cgrp->freezer.freeze_start_nsec);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		write_seqcount_begin(&cgrp->freezer.freeze_seq);
+		if (freeze) {
+			set_bit(CGRP_FREEZE, &cgrp->flags);
+			cgrp->freezer.freeze_start_nsec = ts_nsec;
+		} else {
+			clear_bit(CGRP_FREEZE, &cgrp->flags);
+			cgrp->freezer.frozen_nsec +=
+				(ts_nsec - cgrp->freezer.freeze_start_nsec);
+		}
+		write_seqcount_end(&cgrp->freezer.freeze_seq);
 	}
-	write_seqcount_end(&cgrp->freezer.freeze_seq);
-	raw_spin_unlock_irq(&css_set_lock);

 	if (freeze)
 		TRACE_CGROUP_PATH(freeze, cgrp);
@@ -212,10 +212,10 @@ static void cgroup_do_freeze(struct cgro
 	 * Cgroup state should be revisited here to cover empty leaf cgroups
 	 * and cgroups which descendants are already in the desired state.
 	 */
-	raw_spin_lock_irq(&css_set_lock);
-	if (cgrp->nr_descendants == cgrp->freezer.nr_frozen_descendants)
-		cgroup_update_frozen(cgrp);
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		if (cgrp->nr_descendants == cgrp->freezer.nr_frozen_descendants)
+			cgroup_update_frozen(cgrp);
+	}
 }

 /*
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -70,10 +70,10 @@ struct cgroup_namespace *copy_cgroup_ns(
 		return ERR_PTR(-ENOSPC);

 	/* It is not safe to take cgroup_mutex here */
-	raw_spin_lock_irq(&css_set_lock);
-	cset = task_css_set(current);
-	get_css_set(cset);
-	raw_spin_unlock_irq(&css_set_lock);
+	scoped_guard (raw_spinlock_irq, &css_set_lock) {
+		cset = task_css_set(current);
+		get_css_set(cset);
+	}

 	new_ns = alloc_cgroup_ns();
 	if (IS_ERR(new_ns)) {