* [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
@ 2025-09-03 11:11 Yi Tao
  2025-09-03 13:14 ` Waiman Long
  ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread

From: Yi Tao @ 2025-09-03 11:11 UTC (permalink / raw)
To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking so that
writing a pid to cgroup.procs/threads takes the write lock on
signal_struct->group_rwsem instead of the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per-thread-group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()              | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)       | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section           | critical section       | critical section
up_write(&p_rwsem)         | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()            | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.
Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 2 ++ include/linux/sched/signal.h | 4 +++ init/init_task.c | 3 ++ kernel/cgroup/cgroup-internal.h | 6 ++-- kernel/cgroup/cgroup-v1.c | 8 ++--- kernel/cgroup/cgroup.c | 56 ++++++++++++++++++++------------- kernel/fork.c | 4 +++ 7 files changed, 55 insertions(+), 28 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 6b93a64115fe..8e0fdad8a440 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -838,6 +838,7 @@ struct cgroup_of_peak { static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) { percpu_down_read(&cgroup_threadgroup_rwsem); + down_read(&tsk->signal->group_rwsem); } /** @@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) */ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) { + up_read(&tsk->signal->group_rwsem); percpu_up_read(&cgroup_threadgroup_rwsem); } diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 1ef1edbaaf79..86fbc99a9174 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -226,6 +226,10 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif +#ifdef CONFIG_CGROUPS + struct rw_semaphore group_rwsem; +#endif + /* * Thread is the potential origin of an oom condition; kill first on * oom diff --git a/init/init_task.c b/init/init_task.c index e557f622bd90..0450093924a7 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { }, .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, +#ifdef CONFIG_CGROUPS + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), +#endif .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), #ifdef CONFIG_POSIX_TIMERS diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index b14e61c64a34..572c24b7e947 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(bool lock_threadgroup); -void cgroup_attach_unlock(bool lock_threadgroup); +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup, + bool global); +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup, + bool global); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, bool *locked) __acquires(&cgroup_threadgroup_rwsem); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 2a4a387f867a..3e5ead8c5bc5 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(NULL, true, true); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(true); + cgroup_attach_unlock(NULL, true, true); cgroup_unlock(); return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(NULL, true, true); /* all tasks in 
@from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(true); + cgroup_attach_unlock(NULL, true, true); cgroup_unlock(); return ret; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 312c6a8b55bb..4e1d80a2741f 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. */ -void cgroup_attach_lock(bool lock_threadgroup) +void cgroup_attach_lock(struct task_struct *tsk, + bool lock_threadgroup, bool global) { cpus_read_lock(); - if (lock_threadgroup) - percpu_down_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (global) + percpu_down_write(&cgroup_threadgroup_rwsem); + else + down_write(&tsk->signal->group_rwsem); + } } /** * cgroup_attach_unlock - Undo cgroup_attach_lock() * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem */ -void cgroup_attach_unlock(bool lock_threadgroup) +void cgroup_attach_unlock(struct task_struct *tsk, + bool lock_threadgroup, bool global) { - if (lock_threadgroup) - percpu_up_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (global) + percpu_up_write(&cgroup_threadgroup_rwsem); + else + up_write(&tsk->signal->group_rwsem); + } cpus_read_unlock(); } @@ -2970,12 +2980,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, bool *threadgroup_locked) { - struct task_struct *tsk; + struct task_struct *tsk, *tmp_tsk; pid_t pid; if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) return ERR_PTR(-EINVAL); + rcu_read_lock(); + if (pid) { + tsk = find_task_by_vpid(pid); + if (!tsk) { + tsk = ERR_PTR(-ESRCH); + goto out_unlock_rcu; + } + } else { + tsk = current; + } + /* * If we migrate a single thread, we don't care about threadgroup * stability. If the thread is `current`, it won't exit(2) under our @@ -2986,18 +3007,9 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, */ lockdep_assert_held(&cgroup_mutex); *threadgroup_locked = pid || threadgroup; - cgroup_attach_lock(*threadgroup_locked); - rcu_read_lock(); - if (pid) { - tsk = find_task_by_vpid(pid); - if (!tsk) { - tsk = ERR_PTR(-ESRCH); - goto out_unlock_threadgroup; - } - } else { - tsk = current; - } + tmp_tsk = tsk; + cgroup_attach_lock(tmp_tsk, *threadgroup_locked, false); if (threadgroup) tsk = tsk->group_leader; @@ -3017,7 +3029,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, goto out_unlock_rcu; out_unlock_threadgroup: - cgroup_attach_unlock(*threadgroup_locked); + cgroup_attach_unlock(tmp_tsk, *threadgroup_locked, false); *threadgroup_locked = false; out_unlock_rcu: rcu_read_unlock(); @@ -3032,7 +3044,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(threadgroup_locked); + cgroup_attach_unlock(task, threadgroup_locked, false); for_each_subsys(ss, ssid) if (ss->post_attach) @@ -3119,7 +3131,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. 
*/ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - cgroup_attach_lock(has_tasks); + cgroup_attach_lock(NULL, has_tasks, true); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3140,7 +3152,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(has_tasks); + cgroup_attach_unlock(NULL, has_tasks, true); return ret; } diff --git a/kernel/fork.c b/kernel/fork.c index af673856499d..5218f9b93c77 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sched_autogroup_fork(sig); +#ifdef CONFIG_CGROUPS + init_rwsem(&sig->group_rwsem); +#endif + sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
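For context, the operation whose write-side latency is at issue in this thread is a plain write of a PID into cgroup.procs. A minimal userspace sketch of that operation follows; the cgroup path is illustrative and assumed to already exist, and error handling is trimmed.

```
/*
 * Sketch only: migrate a process by writing its PID into cgroup.procs.
 * The cgroup path is illustrative; this write is what takes the
 * threadgroup locking discussed in the patch above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

static int attach_pid(const char *cgrp_dir, pid_t pid)
{
	char path[256], buf[32];
	int fd, len, ret = 0;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgrp_dir);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = snprintf(buf, sizeof(buf), "%d\n", pid);
	if (write(fd, buf, len) != len)
		ret = -1;
	close(fd);
	return ret;
}

int main(void)
{
	/* move the current process into an assumed pre-created test cgroup */
	return attach_pid("/sys/fs/cgroup/test", getpid()) ? EXIT_FAILURE : 0;
}
```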
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao @ 2025-09-03 13:14 ` Waiman Long 2025-09-04 1:35 ` Chen Ridong ` (2 more replies) 2025-09-03 16:53 ` Tejun Heo ` (4 subsequent siblings) 5 siblings, 3 replies; 44+ messages in thread From: Waiman Long @ 2025-09-03 13:14 UTC (permalink / raw) To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel On 9/3/25 7:11 AM, Yi Tao wrote: > As computer hardware advances, modern systems are typically equipped > with many CPU cores and large amounts of memory, enabling the deployment > of numerous applications. On such systems, container creation and > deletion become frequent operations, making cgroup process migration no > longer a cold path. This leads to noticeable contention with common > process operations such as fork, exec, and exit. > > To alleviate the contention between cgroup process migration and > operations like process fork, this patch modifies lock to take the write > lock on signal_struct->group_rwsem when writing pid to > cgroup.procs/threads instead of holding a global write lock. > > Cgroup process migration has historically relied on > signal_struct->group_rwsem to protect thread group integrity. In commit > <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with > a global percpu_rwsem"), this was changed to a global > cgroup_threadgroup_rwsem. The advantage of using a global lock was > simplified handling of process group migrations. This patch retains the > use of the global lock for protecting process group migration, while > reducing contention by using per thread group lock during > cgroup.procs/threads writes. > > The locking behavior is as follows: > > write cgroup.procs/threads | process fork,exec,exit | process group migration > ------------------------------------------------------------------------------ > cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() > down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) > critical section | critical section | critical section > up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) > cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() > > g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes > signal_struct->group_rwsem. > > This patch eliminates contention between cgroup migration and fork > operations for threads that belong to different thread groups, thereby > reducing the long-tail latency of cgroup migrations and lowering system > load. Do you have any performance numbers to showcase how much performance benefit does this change provide? 
> Signed-off-by: Yi Tao <escape@linux.alibaba.com> > --- > include/linux/cgroup-defs.h | 2 ++ > include/linux/sched/signal.h | 4 +++ > init/init_task.c | 3 ++ > kernel/cgroup/cgroup-internal.h | 6 ++-- > kernel/cgroup/cgroup-v1.c | 8 ++--- > kernel/cgroup/cgroup.c | 56 ++++++++++++++++++++------------- > kernel/fork.c | 4 +++ > 7 files changed, 55 insertions(+), 28 deletions(-) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 6b93a64115fe..8e0fdad8a440 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -838,6 +838,7 @@ struct cgroup_of_peak { > static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > { > percpu_down_read(&cgroup_threadgroup_rwsem); > + down_read(&tsk->signal->group_rwsem); > } > > /** > @@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > */ > static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) > { > + up_read(&tsk->signal->group_rwsem); > percpu_up_read(&cgroup_threadgroup_rwsem); > } > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index 1ef1edbaaf79..86fbc99a9174 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -226,6 +226,10 @@ struct signal_struct { > struct tty_audit_buf *tty_audit_buf; > #endif > > +#ifdef CONFIG_CGROUPS > + struct rw_semaphore group_rwsem; > +#endif > + > /* > * Thread is the potential origin of an oom condition; kill first on > * oom > diff --git a/init/init_task.c b/init/init_task.c > index e557f622bd90..0450093924a7 100644 > --- a/init/init_task.c > +++ b/init/init_task.c > @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { > }, > .multiprocess = HLIST_HEAD_INIT, > .rlim = INIT_RLIMITS, > +#ifdef CONFIG_CGROUPS > + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), > +#endif > .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), > .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), > #ifdef CONFIG_POSIX_TIMERS > diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h > index b14e61c64a34..572c24b7e947 100644 > --- a/kernel/cgroup/cgroup-internal.h > +++ b/kernel/cgroup/cgroup-internal.h > @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, > > int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, > bool threadgroup); > -void cgroup_attach_lock(bool lock_threadgroup); > -void cgroup_attach_unlock(bool lock_threadgroup); > +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup, > + bool global); > +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup, > + bool global); > struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > bool *locked) > __acquires(&cgroup_threadgroup_rwsem); > diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c > index 2a4a387f867a..3e5ead8c5bc5 100644 > --- a/kernel/cgroup/cgroup-v1.c > +++ b/kernel/cgroup/cgroup-v1.c > @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > int retval = 0; > > cgroup_lock(); > - cgroup_attach_lock(true); > + cgroup_attach_lock(NULL, true, true); > for_each_root(root) { > struct cgroup *from_cgrp; > > @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > if (retval) > break; > } > - cgroup_attach_unlock(true); > + cgroup_attach_unlock(NULL, true, true); > cgroup_unlock(); > > 
return retval; > @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) > > cgroup_lock(); > > - cgroup_attach_lock(true); > + cgroup_attach_lock(NULL, true, true); > > /* all tasks in @from are being moved, all csets are source */ > spin_lock_irq(&css_set_lock); > @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) > } while (task && !ret); > out_err: > cgroup_migrate_finish(&mgctx); > - cgroup_attach_unlock(true); > + cgroup_attach_unlock(NULL, true, true); > cgroup_unlock(); > return ret; > } > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 312c6a8b55bb..4e1d80a2741f 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); > * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that > * CPU hotplug is disabled on entry. > */ > -void cgroup_attach_lock(bool lock_threadgroup) > +void cgroup_attach_lock(struct task_struct *tsk, > + bool lock_threadgroup, bool global) > { > cpus_read_lock(); > - if (lock_threadgroup) > - percpu_down_write(&cgroup_threadgroup_rwsem); > + if (lock_threadgroup) { > + if (global) > + percpu_down_write(&cgroup_threadgroup_rwsem); > + else > + down_write(&tsk->signal->group_rwsem); > + } > } tsk is only used when global is false. So you can eliminate the redundant global argument and use the presence of tsk to indicate which lock to take. Cheers, Longman ^ permalink raw reply [flat|nested] 44+ messages in thread
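A rough sketch of the simplified interface suggested above, with a NULL task selecting the global lock. This is illustrative only, based on the helpers in the patch, not a tested change.

```
/*
 * Illustrative sketch of the suggestion: a NULL tsk means "take the
 * global cgroup_threadgroup_rwsem", a non-NULL tsk means "take that
 * thread group's signal->group_rwsem".
 */
void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
{
	cpus_read_lock();
	if (lock_threadgroup) {
		if (tsk)
			down_write(&tsk->signal->group_rwsem);
		else
			percpu_down_write(&cgroup_threadgroup_rwsem);
	}
}

void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup)
{
	if (lock_threadgroup) {
		if (tsk)
			up_write(&tsk->signal->group_rwsem);
		else
			percpu_up_write(&cgroup_threadgroup_rwsem);
	}
	cpus_read_unlock();
}
```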
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 13:14 ` Waiman Long @ 2025-09-04 1:35 ` Chen Ridong 2025-09-04 4:59 ` escape 2025-09-04 5:02 ` escape 2 siblings, 0 replies; 44+ messages in thread From: Chen Ridong @ 2025-09-04 1:35 UTC (permalink / raw) To: Waiman Long, Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel On 2025/9/3 21:14, Waiman Long wrote: > > On 9/3/25 7:11 AM, Yi Tao wrote: >> As computer hardware advances, modern systems are typically equipped >> with many CPU cores and large amounts of memory, enabling the deployment >> of numerous applications. On such systems, container creation and >> deletion become frequent operations, making cgroup process migration no >> longer a cold path. This leads to noticeable contention with common >> process operations such as fork, exec, and exit. >> We are facing the same issue. >> To alleviate the contention between cgroup process migration and >> operations like process fork, this patch modifies lock to take the write >> lock on signal_struct->group_rwsem when writing pid to >> cgroup.procs/threads instead of holding a global write lock. >> >> Cgroup process migration has historically relied on >> signal_struct->group_rwsem to protect thread group integrity. In commit >> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with >> a global percpu_rwsem"), this was changed to a global >> cgroup_threadgroup_rwsem. The advantage of using a global lock was >> simplified handling of process group migrations. This patch retains the >> use of the global lock for protecting process group migration, while >> reducing contention by using per thread group lock during >> cgroup.procs/threads writes. >> >> The locking behavior is as follows: >> >> write cgroup.procs/threads | process fork,exec,exit | process group migration >> ------------------------------------------------------------------------------ >> cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() >> down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) >> critical section | critical section | critical section >> up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) >> cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() >> >> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes >> signal_struct->group_rwsem. >> >> This patch eliminates contention between cgroup migration and fork >> operations for threads that belong to different thread groups, thereby >> reducing the long-tail latency of cgroup migrations and lowering system >> load. > Do you have any performance numbers to showcase how much performance benefit does this change provide? Yes, I would like this performance data as well. 
>> Signed-off-by: Yi Tao <escape@linux.alibaba.com> >> --- >> include/linux/cgroup-defs.h | 2 ++ >> include/linux/sched/signal.h | 4 +++ >> init/init_task.c | 3 ++ >> kernel/cgroup/cgroup-internal.h | 6 ++-- >> kernel/cgroup/cgroup-v1.c | 8 ++--- >> kernel/cgroup/cgroup.c | 56 ++++++++++++++++++++------------- >> kernel/fork.c | 4 +++ >> 7 files changed, 55 insertions(+), 28 deletions(-) >> >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 6b93a64115fe..8e0fdad8a440 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -838,6 +838,7 @@ struct cgroup_of_peak { >> static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) >> { >> percpu_down_read(&cgroup_threadgroup_rwsem); >> + down_read(&tsk->signal->group_rwsem); >> } >> /** >> @@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) >> */ >> static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) >> { >> + up_read(&tsk->signal->group_rwsem); >> percpu_up_read(&cgroup_threadgroup_rwsem); >> } >> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h >> index 1ef1edbaaf79..86fbc99a9174 100644 >> --- a/include/linux/sched/signal.h >> +++ b/include/linux/sched/signal.h >> @@ -226,6 +226,10 @@ struct signal_struct { >> struct tty_audit_buf *tty_audit_buf; >> #endif >> +#ifdef CONFIG_CGROUPS >> + struct rw_semaphore group_rwsem; >> +#endif >> + >> /* >> * Thread is the potential origin of an oom condition; kill first on >> * oom >> diff --git a/init/init_task.c b/init/init_task.c >> index e557f622bd90..0450093924a7 100644 >> --- a/init/init_task.c >> +++ b/init/init_task.c >> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { >> }, >> .multiprocess = HLIST_HEAD_INIT, >> .rlim = INIT_RLIMITS, >> +#ifdef CONFIG_CGROUPS >> + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), >> +#endif >> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), >> .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), >> #ifdef CONFIG_POSIX_TIMERS >> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h >> index b14e61c64a34..572c24b7e947 100644 >> --- a/kernel/cgroup/cgroup-internal.h >> +++ b/kernel/cgroup/cgroup-internal.h >> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, >> int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, >> bool threadgroup); >> -void cgroup_attach_lock(bool lock_threadgroup); >> -void cgroup_attach_unlock(bool lock_threadgroup); >> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup, >> + bool global); >> +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup, >> + bool global); >> struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, >> bool *locked) >> __acquires(&cgroup_threadgroup_rwsem); >> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c >> index 2a4a387f867a..3e5ead8c5bc5 100644 >> --- a/kernel/cgroup/cgroup-v1.c >> +++ b/kernel/cgroup/cgroup-v1.c >> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) >> int retval = 0; >> cgroup_lock(); >> - cgroup_attach_lock(true); >> + cgroup_attach_lock(NULL, true, true); >> for_each_root(root) { >> struct cgroup *from_cgrp; >> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) >> if (retval) >> break; >> } >> - 
cgroup_attach_unlock(true); >> + cgroup_attach_unlock(NULL, true, true); >> cgroup_unlock(); >> return retval; >> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) >> cgroup_lock(); >> - cgroup_attach_lock(true); >> + cgroup_attach_lock(NULL, true, true); >> /* all tasks in @from are being moved, all csets are source */ >> spin_lock_irq(&css_set_lock); >> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) >> } while (task && !ret); >> out_err: >> cgroup_migrate_finish(&mgctx); >> - cgroup_attach_unlock(true); >> + cgroup_attach_unlock(NULL, true, true); >> cgroup_unlock(); >> return ret; >> } >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 312c6a8b55bb..4e1d80a2741f 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); >> * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that >> * CPU hotplug is disabled on entry. >> */ >> -void cgroup_attach_lock(bool lock_threadgroup) >> +void cgroup_attach_lock(struct task_struct *tsk, >> + bool lock_threadgroup, bool global) >> { >> cpus_read_lock(); >> - if (lock_threadgroup) >> - percpu_down_write(&cgroup_threadgroup_rwsem); >> + if (lock_threadgroup) { >> + if (global) >> + percpu_down_write(&cgroup_threadgroup_rwsem); >> + else >> + down_write(&tsk->signal->group_rwsem); >> + } >> } > > tsk is only used when global is false. So you can eliminate the redundant global argument and use > the presence of tsk to indicate which lock to take. > > Cheers, > Longman > -- Best regards, Ridong ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 13:14 ` Waiman Long 2025-09-04 1:35 ` Chen Ridong @ 2025-09-04 4:59 ` escape 2025-09-04 5:02 ` escape 2 siblings, 0 replies; 44+ messages in thread From: escape @ 2025-09-04 4:59 UTC (permalink / raw) To: Waiman Long, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel 在 2025/9/3 21:14, Waiman Long 写道: > > On 9/3/25 7:11 AM, Yi Tao wrote: >> As computer hardware advances, modern systems are typically equipped >> with many CPU cores and large amounts of memory, enabling the deployment >> of numerous applications. On such systems, container creation and >> deletion become frequent operations, making cgroup process migration no >> longer a cold path. This leads to noticeable contention with common >> process operations such as fork, exec, and exit. >> >> To alleviate the contention between cgroup process migration and >> operations like process fork, this patch modifies lock to take the write >> lock on signal_struct->group_rwsem when writing pid to >> cgroup.procs/threads instead of holding a global write lock. >> >> Cgroup process migration has historically relied on >> signal_struct->group_rwsem to protect thread group integrity. In commit >> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with >> a global percpu_rwsem"), this was changed to a global >> cgroup_threadgroup_rwsem. The advantage of using a global lock was >> simplified handling of process group migrations. This patch retains the >> use of the global lock for protecting process group migration, while >> reducing contention by using per thread group lock during >> cgroup.procs/threads writes. >> >> The locking behavior is as follows: >> >> write cgroup.procs/threads | process fork,exec,exit | process group >> migration >> ------------------------------------------------------------------------------ >> >> cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() >> down_write(&p_rwsem) | down_read(&p_rwsem) | >> down_write(&g_rwsem) >> critical section | critical section | critical section >> up_write(&p_rwsem) | up_read(&p_rwsem) | >> up_write(&g_rwsem) >> cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() >> >> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes >> signal_struct->group_rwsem. >> >> This patch eliminates contention between cgroup migration and fork >> operations for threads that belong to different thread groups, thereby >> reducing the long-tail latency of cgroup migrations and lowering system >> load. > Do you have any performance numbers to showcase how much performance > benefit does this change provide? Yes, here is the simulation test setup and results. Machine Configuration: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 192 On-line CPU(s) list: 0-191 Thread(s) per core: 2 Core(s) per socket: 48 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel BIOS Vendor ID: Intel(R) Corporation CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Platinum 8475B BIOS Model name: Intel(R) Xeon(R) Platinum 8475B NUMA node0 CPU(s): 0-47,96-143 NUMA node1 CPU(s): 48-95,144-191 Test Program 1: cgroup_attach, four threads each write to cgroup.procs every 10ms, simulating frequent cgroup process migrations. Test Program 2: stress-ng or UnixBench, used to generate CPU and fork-intensive workloads. 
Terminal 1: taskset -ac 0-47 ./cgroup_attach Terminal 2: taskset -ac 48-95 stress-ng --cpu 48 --forkheavy 4 Use bpftrace to monitor the latency of cgroup1_procs_write. Without the patch: The tail latency is 30ms With the patch:The tail latency is 16μs. Replace stress-ng with UnixBench to evaluate fork performance: Terminal 2: taskset -ac 48-95 ./Run spawn Without the patch: Multi-CPU score =15753.9 With the patch: Multi-CPU score = 17111.2 We can observe that the latency of cgroup process migration has been significantly reduced, and the performance of process forking has improved. In production environments running big data computing workloads, without this patch, we observed a large number of processes entering the uninterruptible sleep (D) state due to contention on `cgroup_threadgroup_rwsem`. With the patch applied, the system load average has decreased. cgroup_attach.c ``` #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <pthread.h> #include <sys/types.h> #include <sys/syscall.h> #include <fcntl.h> #include <string.h> #define NUM_THREADS 4 #define SLEEP_US 10000 // 10ms #define CGROUP_ROOT "/sys/fs/cgroup/cpu" pid_t gettid() { return syscall(SYS_gettid); } void* thread_func(void* arg) { char dirname[256]; char procs_file[256]; int fd; pid_t child_pid; FILE* file; pid_t tid = gettid(); snprintf(dirname, sizeof(dirname), "%s/test-%d", CGROUP_ROOT, tid); snprintf(procs_file, sizeof(procs_file), "%s/cgroup.procs", dirname); if (mkdir(dirname, 0755) != 0) { perror(dirname); pthread_exit(NULL); } printf("thread %d: cgroup dir %s\n", tid, dirname); while (1) { child_pid = fork(); if (child_pid < 0) { perror("fork"); continue; } if (child_pid == 0) { usleep(SLEEP_US); exit(0); } fd = open(procs_file, O_WRONLY); if (fd < 0) { perror("open cgroup.procs"); } else { char pid_str[16]; int len = snprintf(pid_str, sizeof(pid_str), "%d\n", child_pid); if (write(fd, pid_str, len) < 0) { perror("write cgroup.procs"); } close(fd); } close(fd); waitpid(child_pid, NULL, NULL); usleep(SLEEP_US); } pthread_exit(NULL); } int main() { pthread_t threads[NUM_THREADS]; int i; for (i = 0; i < NUM_THREADS; i++) { if (pthread_create(&threads[i], NULL, thread_func, NULL) != 0) { perror("pthread_create"); break; } } for (i = 0; i < NUM_THREADS; i++) { pthread_join(threads[i], NULL); } return 0; } ``` >> Signed-off-by: Yi Tao <escape@linux.alibaba.com> >> --- >> include/linux/cgroup-defs.h | 2 ++ >> include/linux/sched/signal.h | 4 +++ >> init/init_task.c | 3 ++ >> kernel/cgroup/cgroup-internal.h | 6 ++-- >> kernel/cgroup/cgroup-v1.c | 8 ++--- >> kernel/cgroup/cgroup.c | 56 ++++++++++++++++++++------------- >> kernel/fork.c | 4 +++ >> 7 files changed, 55 insertions(+), 28 deletions(-) >> >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 6b93a64115fe..8e0fdad8a440 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -838,6 +838,7 @@ struct cgroup_of_peak { >> static inline void cgroup_threadgroup_change_begin(struct >> task_struct *tsk) >> { >> percpu_down_read(&cgroup_threadgroup_rwsem); >> + down_read(&tsk->signal->group_rwsem); >> } >> /** >> @@ -848,6 +849,7 @@ static inline void >> cgroup_threadgroup_change_begin(struct task_struct *tsk) >> */ >> static inline void cgroup_threadgroup_change_end(struct task_struct >> *tsk) >> { >> + up_read(&tsk->signal->group_rwsem); >> percpu_up_read(&cgroup_threadgroup_rwsem); >> } >> diff --git a/include/linux/sched/signal.h >> b/include/linux/sched/signal.h >> index 
1ef1edbaaf79..86fbc99a9174 100644 >> --- a/include/linux/sched/signal.h >> +++ b/include/linux/sched/signal.h >> @@ -226,6 +226,10 @@ struct signal_struct { >> struct tty_audit_buf *tty_audit_buf; >> #endif >> +#ifdef CONFIG_CGROUPS >> + struct rw_semaphore group_rwsem; >> +#endif >> + >> /* >> * Thread is the potential origin of an oom condition; kill >> first on >> * oom >> diff --git a/init/init_task.c b/init/init_task.c >> index e557f622bd90..0450093924a7 100644 >> --- a/init/init_task.c >> +++ b/init/init_task.c >> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { >> }, >> .multiprocess = HLIST_HEAD_INIT, >> .rlim = INIT_RLIMITS, >> +#ifdef CONFIG_CGROUPS >> + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), >> +#endif >> .cred_guard_mutex = >> __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), >> .exec_update_lock = >> __RWSEM_INITIALIZER(init_signals.exec_update_lock), >> #ifdef CONFIG_POSIX_TIMERS >> diff --git a/kernel/cgroup/cgroup-internal.h >> b/kernel/cgroup/cgroup-internal.h >> index b14e61c64a34..572c24b7e947 100644 >> --- a/kernel/cgroup/cgroup-internal.h >> +++ b/kernel/cgroup/cgroup-internal.h >> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, >> bool threadgroup, >> int cgroup_attach_task(struct cgroup *dst_cgrp, struct >> task_struct *leader, >> bool threadgroup); >> -void cgroup_attach_lock(bool lock_threadgroup); >> -void cgroup_attach_unlock(bool lock_threadgroup); >> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup, >> + bool global); >> +void cgroup_attach_unlock(struct task_struct *tsk, bool >> lock_threadgroup, >> + bool global); >> struct task_struct *cgroup_procs_write_start(char *buf, bool >> threadgroup, >> bool *locked) >> __acquires(&cgroup_threadgroup_rwsem); >> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c >> index 2a4a387f867a..3e5ead8c5bc5 100644 >> --- a/kernel/cgroup/cgroup-v1.c >> +++ b/kernel/cgroup/cgroup-v1.c >> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct >> *from, struct task_struct *tsk) >> int retval = 0; >> cgroup_lock(); >> - cgroup_attach_lock(true); >> + cgroup_attach_lock(NULL, true, true); >> for_each_root(root) { >> struct cgroup *from_cgrp; >> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct >> *from, struct task_struct *tsk) >> if (retval) >> break; >> } >> - cgroup_attach_unlock(true); >> + cgroup_attach_unlock(NULL, true, true); >> cgroup_unlock(); >> return retval; >> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, >> struct cgroup *from) >> cgroup_lock(); >> - cgroup_attach_lock(true); >> + cgroup_attach_lock(NULL, true, true); >> /* all tasks in @from are being moved, all csets are source */ >> spin_lock_irq(&css_set_lock); >> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, >> struct cgroup *from) >> } while (task && !ret); >> out_err: >> cgroup_migrate_finish(&mgctx); >> - cgroup_attach_unlock(true); >> + cgroup_attach_unlock(NULL, true, true); >> cgroup_unlock(); >> return ret; >> } >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 312c6a8b55bb..4e1d80a2741f 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); >> * write-locking cgroup_threadgroup_rwsem. This allows ->attach() >> to assume that >> * CPU hotplug is disabled on entry. 
>> */ >> -void cgroup_attach_lock(bool lock_threadgroup) >> +void cgroup_attach_lock(struct task_struct *tsk, >> + bool lock_threadgroup, bool global) >> { >> cpus_read_lock(); >> - if (lock_threadgroup) >> - percpu_down_write(&cgroup_threadgroup_rwsem); >> + if (lock_threadgroup) { >> + if (global) >> + percpu_down_write(&cgroup_threadgroup_rwsem); >> + else >> + down_write(&tsk->signal->group_rwsem); >> + } >> } > > tsk is only used when global is false. So you can eliminate the > redundant global argument and use the presence of tsk to indicate > which lock to take. > > Cheers, > Longman ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 13:14 ` Waiman Long 2025-09-04 1:35 ` Chen Ridong 2025-09-04 4:59 ` escape @ 2025-09-04 5:02 ` escape 2 siblings, 0 replies; 44+ messages in thread From: escape @ 2025-09-04 5:02 UTC (permalink / raw) To: Waiman Long, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel 在 2025/9/3 21:14, Waiman Long 写道: > > On 9/3/25 7:11 AM, Yi Tao wrote: >> As computer hardware advances, modern systems are typically equipped >> with many CPU cores and large amounts of memory, enabling the deployment >> of numerous applications. On such systems, container creation and >> deletion become frequent operations, making cgroup process migration no >> longer a cold path. This leads to noticeable contention with common >> process operations such as fork, exec, and exit. >> >> To alleviate the contention between cgroup process migration and >> operations like process fork, this patch modifies lock to take the write >> lock on signal_struct->group_rwsem when writing pid to >> cgroup.procs/threads instead of holding a global write lock. >> >> Cgroup process migration has historically relied on >> signal_struct->group_rwsem to protect thread group integrity. In commit >> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with >> a global percpu_rwsem"), this was changed to a global >> cgroup_threadgroup_rwsem. The advantage of using a global lock was >> simplified handling of process group migrations. This patch retains the >> use of the global lock for protecting process group migration, while >> reducing contention by using per thread group lock during >> cgroup.procs/threads writes. >> >> The locking behavior is as follows: >> >> write cgroup.procs/threads | process fork,exec,exit | process group >> migration >> ------------------------------------------------------------------------------ >> >> cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() >> down_write(&p_rwsem) | down_read(&p_rwsem) | >> down_write(&g_rwsem) >> critical section | critical section | critical section >> up_write(&p_rwsem) | up_read(&p_rwsem) | >> up_write(&g_rwsem) >> cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() >> >> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes >> signal_struct->group_rwsem. >> >> This patch eliminates contention between cgroup migration and fork >> operations for threads that belong to different thread groups, thereby >> reducing the long-tail latency of cgroup migrations and lowering system >> load. > Do you have any performance numbers to showcase how much performance > benefit does this change provide? 
>> Signed-off-by: Yi Tao <escape@linux.alibaba.com> >> --- >> include/linux/cgroup-defs.h | 2 ++ >> include/linux/sched/signal.h | 4 +++ >> init/init_task.c | 3 ++ >> kernel/cgroup/cgroup-internal.h | 6 ++-- >> kernel/cgroup/cgroup-v1.c | 8 ++--- >> kernel/cgroup/cgroup.c | 56 ++++++++++++++++++++------------- >> kernel/fork.c | 4 +++ >> 7 files changed, 55 insertions(+), 28 deletions(-) >> >> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h >> index 6b93a64115fe..8e0fdad8a440 100644 >> --- a/include/linux/cgroup-defs.h >> +++ b/include/linux/cgroup-defs.h >> @@ -838,6 +838,7 @@ struct cgroup_of_peak { >> static inline void cgroup_threadgroup_change_begin(struct >> task_struct *tsk) >> { >> percpu_down_read(&cgroup_threadgroup_rwsem); >> + down_read(&tsk->signal->group_rwsem); >> } >> /** >> @@ -848,6 +849,7 @@ static inline void >> cgroup_threadgroup_change_begin(struct task_struct *tsk) >> */ >> static inline void cgroup_threadgroup_change_end(struct task_struct >> *tsk) >> { >> + up_read(&tsk->signal->group_rwsem); >> percpu_up_read(&cgroup_threadgroup_rwsem); >> } >> diff --git a/include/linux/sched/signal.h >> b/include/linux/sched/signal.h >> index 1ef1edbaaf79..86fbc99a9174 100644 >> --- a/include/linux/sched/signal.h >> +++ b/include/linux/sched/signal.h >> @@ -226,6 +226,10 @@ struct signal_struct { >> struct tty_audit_buf *tty_audit_buf; >> #endif >> +#ifdef CONFIG_CGROUPS >> + struct rw_semaphore group_rwsem; >> +#endif >> + >> /* >> * Thread is the potential origin of an oom condition; kill >> first on >> * oom >> diff --git a/init/init_task.c b/init/init_task.c >> index e557f622bd90..0450093924a7 100644 >> --- a/init/init_task.c >> +++ b/init/init_task.c >> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { >> }, >> .multiprocess = HLIST_HEAD_INIT, >> .rlim = INIT_RLIMITS, >> +#ifdef CONFIG_CGROUPS >> + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), >> +#endif >> .cred_guard_mutex = >> __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), >> .exec_update_lock = >> __RWSEM_INITIALIZER(init_signals.exec_update_lock), >> #ifdef CONFIG_POSIX_TIMERS >> diff --git a/kernel/cgroup/cgroup-internal.h >> b/kernel/cgroup/cgroup-internal.h >> index b14e61c64a34..572c24b7e947 100644 >> --- a/kernel/cgroup/cgroup-internal.h >> +++ b/kernel/cgroup/cgroup-internal.h >> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, >> bool threadgroup, >> int cgroup_attach_task(struct cgroup *dst_cgrp, struct >> task_struct *leader, >> bool threadgroup); >> -void cgroup_attach_lock(bool lock_threadgroup); >> -void cgroup_attach_unlock(bool lock_threadgroup); >> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup, >> + bool global); >> +void cgroup_attach_unlock(struct task_struct *tsk, bool >> lock_threadgroup, >> + bool global); >> struct task_struct *cgroup_procs_write_start(char *buf, bool >> threadgroup, >> bool *locked) >> __acquires(&cgroup_threadgroup_rwsem); >> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c >> index 2a4a387f867a..3e5ead8c5bc5 100644 >> --- a/kernel/cgroup/cgroup-v1.c >> +++ b/kernel/cgroup/cgroup-v1.c >> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct >> *from, struct task_struct *tsk) >> int retval = 0; >> cgroup_lock(); >> - cgroup_attach_lock(true); >> + cgroup_attach_lock(NULL, true, true); >> for_each_root(root) { >> struct cgroup *from_cgrp; >> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct >> *from, struct task_struct *tsk) >> if (retval) 
>> break; >> } >> - cgroup_attach_unlock(true); >> + cgroup_attach_unlock(NULL, true, true); >> cgroup_unlock(); >> return retval; >> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, >> struct cgroup *from) >> cgroup_lock(); >> - cgroup_attach_lock(true); >> + cgroup_attach_lock(NULL, true, true); >> /* all tasks in @from are being moved, all csets are source */ >> spin_lock_irq(&css_set_lock); >> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, >> struct cgroup *from) >> } while (task && !ret); >> out_err: >> cgroup_migrate_finish(&mgctx); >> - cgroup_attach_unlock(true); >> + cgroup_attach_unlock(NULL, true, true); >> cgroup_unlock(); >> return ret; >> } >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 312c6a8b55bb..4e1d80a2741f 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); >> * write-locking cgroup_threadgroup_rwsem. This allows ->attach() >> to assume that >> * CPU hotplug is disabled on entry. >> */ >> -void cgroup_attach_lock(bool lock_threadgroup) >> +void cgroup_attach_lock(struct task_struct *tsk, >> + bool lock_threadgroup, bool global) >> { >> cpus_read_lock(); >> - if (lock_threadgroup) >> - percpu_down_write(&cgroup_threadgroup_rwsem); >> + if (lock_threadgroup) { >> + if (global) >> + percpu_down_write(&cgroup_threadgroup_rwsem); >> + else >> + down_write(&tsk->signal->group_rwsem); >> + } >> } > > tsk is only used when global is false. So you can eliminate the > redundant global argument and use the presence of tsk to indicate > which lock to take. > > Cheers, > Longman Thank you for your suggestion. As you pointed out, we can determine which lock to acquire based on whether the task is NULL. I'll incorporate this change in the next version of the patch. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao 2025-09-03 13:14 ` Waiman Long @ 2025-09-03 16:53 ` Tejun Heo 2025-09-03 20:03 ` Michal Koutný 2025-09-04 3:15 ` escape 2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao ` (3 subsequent siblings) 5 siblings, 2 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-03 16:53 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote: > As computer hardware advances, modern systems are typically equipped > with many CPU cores and large amounts of memory, enabling the deployment > of numerous applications. On such systems, container creation and > deletion become frequent operations, making cgroup process migration no > longer a cold path. This leads to noticeable contention with common > process operations such as fork, exec, and exit. If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It disappears completely and CLONE_INTO_CGROUP doesn't need any global locks from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP? Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
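For readers unfamiliar with the interface, CLONE_INTO_CGROUP is passed to clone3() together with a cgroup directory file descriptor. A minimal sketch follows; the cgroup path is illustrative, error handling is trimmed, and it assumes Linux 5.7+ UAPI headers and a cgroup2 hierarchy.

```
/*
 * Sketch only: start a child directly inside a cgroup with
 * clone3(CLONE_INTO_CGROUP), so no cgroup.procs write is needed.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */

static pid_t spawn_into_cgroup(const char *cgrp_dir)
{
	struct clone_args args;
	int cgfd = open(cgrp_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

	if (cgfd < 0)
		return -1;

	memset(&args, 0, sizeof(args));
	args.flags       = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup      = cgfd;

	/* returns 0 in the child (already a member of cgrp_dir), pid in parent */
	return syscall(SYS_clone3, &args, sizeof(args));
}
```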
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 16:53 ` Tejun Heo @ 2025-09-03 20:03 ` Michal Koutný 2025-09-03 20:45 ` Tejun Heo 1 sibling, 1 reply; 44+ messages in thread From: Michal Koutný @ 2025-09-03 20:03 UTC (permalink / raw) To: Tejun Heo; +Cc: Yi Tao, hannes, cgroups, linux-kernel On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote: > If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It > disappears completely and CLONE_INTO_CGROUP doesn't need any global locks > from cgroup side. CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular migration, no? Its effect is atomicity wrt clone. Or, Tejun, what do you mean that it disappears? (I think we cannot give up cgroup_mutex as it ensures synchronization of possible parent's migration.) Michal ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 20:03 ` Michal Koutný @ 2025-09-03 20:45 ` Tejun Heo 2025-09-04 1:40 ` Chen Ridong 0 siblings, 1 reply; 44+ messages in thread From: Tejun Heo @ 2025-09-03 20:45 UTC (permalink / raw) To: Michal Koutný; +Cc: Yi Tao, hannes, cgroups, linux-kernel Hello, Michal. On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote: > On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote: > > If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It > > disappears completely and CLONE_INTO_CGROUP doesn't need any global locks > > from cgroup side. > > CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular > migration, no? Its effect is atomicity wrt clone. > Or, Tejum, what do you mean that it disappears? (I think we cannot give > up cgroup_mutex as it ensures synchronization of possible parent's > migration.) Sorry, I was confused. We no longer need to write lock threadgroup rwsem when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need cgroup_mutex. 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree") Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 20:45 ` Tejun Heo @ 2025-09-04 1:40 ` Chen Ridong 2025-09-04 6:43 ` escape 2025-09-04 6:52 ` Tejun Heo 0 siblings, 2 replies; 44+ messages in thread From: Chen Ridong @ 2025-09-04 1:40 UTC (permalink / raw) To: Tejun Heo, Michal Koutný; +Cc: Yi Tao, hannes, cgroups, linux-kernel On 2025/9/4 4:45, Tejun Heo wrote: > Hello, Michal. > > On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote: >> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote: >>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It >>> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks >>> from cgroup side. >> >> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular >> migration, no? Its effect is atomicity wrt clone. >> Or, Tejum, what do you mean that it disappears? (I think we cannot give >> up cgroup_mutex as it ensures synchronization of possible parent's >> migration.) > > Sorry, I was confused. We no longer need to write lock threadgroup rwsem > when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need > cgroup_mutex. > > 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree") > > Thanks. > I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree") only applies to CSS updates. However, cloning with CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem. cgroup_can_fork cgroup_css_set_fork if (kargs->flags & CLONE_INTO_CGROUP) cgroup_lock(); cgroup_threadgroup_change_begin(current); -- Best regards, Ridong ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 1:40 ` Chen Ridong @ 2025-09-04 6:43 ` escape 2025-09-04 6:52 ` Tejun Heo 1 sibling, 0 replies; 44+ messages in thread From: escape @ 2025-09-04 6:43 UTC (permalink / raw) To: Chen Ridong, Tejun Heo, Michal Koutný; +Cc: hannes, cgroups, linux-kernel 在 2025/9/4 09:40, Chen Ridong 写道: > > On 2025/9/4 4:45, Tejun Heo wrote: >> Hello, Michal. >> >> On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote: >>> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote: >>>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It >>>> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks >>>> from cgroup side. >>> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular >>> migration, no? Its effect is atomicity wrt clone. >>> Or, Tejum, what do you mean that it disappears? (I think we cannot give >>> up cgroup_mutex as it ensures synchronization of possible parent's >>> migration.) >> Sorry, I was confused. We no longer need to write lock threadgroup rwsem >> when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need >> cgroup_mutex. >> >> 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree") >> >> Thanks. >> > I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when > updating csses on an empty subtree") only applies to CSS updates. However, cloning with > CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem. > > cgroup_can_fork > cgroup_css_set_fork > if (kargs->flags & CLONE_INTO_CGROUP) > cgroup_lock(); > cgroup_threadgroup_change_begin(current); > When using CLONE_INTO_CGROUP, there is no need to write cgroup.procs. So although a read lock is required here, there is no contention with the write lock. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 1:40 ` Chen Ridong 2025-09-04 6:43 ` escape @ 2025-09-04 6:52 ` Tejun Heo 1 sibling, 0 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-04 6:52 UTC (permalink / raw) To: Chen Ridong; +Cc: Michal Koutný, Yi Tao, hannes, cgroups, linux-kernel Hello, On Thu, Sep 04, 2025 at 09:40:12AM +0800, Chen Ridong wrote: ... > > Sorry, I was confused. We no longer need to write lock threadgroup rwsem > > when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need > > cgroup_mutex. > > > > 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree") > > > > Thanks. > > > > I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when > updating csses on an empty subtree") only applies to CSS updates. However, cloning with > CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem. > > cgroup_can_fork > cgroup_css_set_fork > if (kargs->flags & CLONE_INTO_CGROUP) > cgroup_lock(); > cgroup_threadgroup_change_begin(current); Ah, yeah, I'm misremembering things, sorry. What got elided in that commit is down_write of threadgroup_rwsem when enabling controllers on empty cgroups, which was the only operation which still needed to down_write the rwsem. Here's an excerpt from the commit message: After this optimization, the usage pattern of creating a cgroup, enabling the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and then removing the cgroup after it becomes empty doesn't need to write-lock threadgroup_rwsem at all. It's true that cgroup_threadgroup_change_begin() down_reads the threadgroup_rwsem but that is a percpu_rwsem whose read operations are percpu inc/dec. This doesn't add any noticeable overhead or has any scalability concerns. So, if you follow the "recommended" workflow, the only remaining possible scalability bottleneck is cgroup_mutex. Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
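Sketching the pattern described above in code; paths are illustrative, and the per-job leaf is seeded with clone3(CLONE_INTO_CGROUP) as in the earlier sketch, so nothing is ever written to cgroup.procs and threadgroup_rwsem is never write-locked.

```
/*
 * Sketch of the workflow described above: create a leaf cgroup, enable
 * controllers on the parent, seed the leaf via CLONE_INTO_CGROUP, and
 * remove the leaf once it is empty. Paths illustrative, errors ignored.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define POOL "/sys/fs/cgroup/pool"	/* assumed pre-existing cgroup2 dir */

static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd >= 0) {
		write(fd, s, strlen(s));
		close(fd);
	}
}

int main(void)
{
	/* enable controllers for children of the pool cgroup */
	write_str(POOL "/cgroup.subtree_control", "+cpu +memory");

	/* create the per-job leaf */
	mkdir(POOL "/job1", 0755);

	/*
	 * ... seed POOL "/job1" with clone3(CLONE_INTO_CGROUP) as in the
	 * earlier sketch, then wait for the job to exit ...
	 */

	/* once empty, the leaf can simply be removed */
	rmdir(POOL "/job1");
	return 0;
}
```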
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 16:53 ` Tejun Heo 2025-09-03 20:03 ` Michal Koutný @ 2025-09-04 3:15 ` escape 2025-09-04 6:38 ` escape 2025-09-04 7:28 ` Tejun Heo 1 sibling, 2 replies; 44+ messages in thread From: escape @ 2025-09-04 3:15 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel 在 2025/9/4 00:53, Tejun Heo 写道: > Hello, > > On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote: >> As computer hardware advances, modern systems are typically equipped >> with many CPU cores and large amounts of memory, enabling the deployment >> of numerous applications. On such systems, container creation and >> deletion become frequent operations, making cgroup process migration no >> longer a cold path. This leads to noticeable contention with common >> process operations such as fork, exec, and exit. > If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It > disappears completely and CLONE_INTO_CGROUP doesn't need any global locks > from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP? > > Thanks. > As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP still requires holding the threadgroup_rwsem, so contention with fork operations persists. CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation and deletion, but its usage comes with significant limitations: 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2 adoption is gradually increasing, many applications have not yet been adapted to cgroup v2, and phasing out cgroup v1 will be a long and gradual process. 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the time of process fork, effectively restricting cgroup migration to the fork stage. This differs significantly from the typical cgroup attach workflow. For example, in Kubernetes, systemd is the recommended cgroup driver; kubelet communicates with systemd via D-Bus, and systemd performs the actual cgroup attachment. In this case, the process being attached typically does not have systemd as its parent. Using CLONE_INTO_CGROUP in such a scenario is impractical and would require coordinated changes to both systemd and kubelet. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 3:15 ` escape @ 2025-09-04 6:38 ` escape 2025-09-04 7:28 ` Tejun Heo 1 sibling, 0 replies; 44+ messages in thread From: escape @ 2025-09-04 6:38 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel 在 2025/9/4 11:15, escape 写道: > > 在 2025/9/4 00:53, Tejun Heo 写道: >> Hello, >> >> On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote: >>> As computer hardware advances, modern systems are typically equipped >>> with many CPU cores and large amounts of memory, enabling the >>> deployment >>> of numerous applications. On such systems, container creation and >>> deletion become frequent operations, making cgroup process migration no >>> longer a cold path. This leads to noticeable contention with common >>> process operations such as fork, exec, and exit. >> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become >> cold. It >> disappears completely and CLONE_INTO_CGROUP doesn't need any global >> locks >> from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP? >> >> Thanks. >> > As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP > still requires holding the threadgroup_rwsem, so contention with fork > operations persists. Sorry, my understanding here was wrong; using CLONE_INTO_CGROUP can indeed avoid the race condition with fork, but the restrictions do exist. Thanks. > > CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation > and deletion, but its usage comes with significant limitations: > > 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2 > adoption is gradually increasing, many applications have not yet been > adapted to cgroup v2, and phasing out cgroup v1 will be a long and > gradual process. > > > 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at > the > time of process fork, effectively restricting cgroup migration to the > fork stage. This differs significantly from the typical cgroup attach > workflow. For example, in Kubernetes, systemd is the recommended cgroup > driver; kubelet communicates with systemd via D-Bus, and systemd > performs the actual cgroup attachment. In this case, the process being > attached typically does not have systemd as its parent. Using > CLONE_INTO_CGROUP in such a scenario is impractical and would require > coordinated changes to both systemd and kubelet. > > Thanks. > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 3:15 ` escape 2025-09-04 6:38 ` escape @ 2025-09-04 7:28 ` Tejun Heo 2025-09-04 8:10 ` escape 1 sibling, 1 reply; 44+ messages in thread From: Tejun Heo @ 2025-09-04 7:28 UTC (permalink / raw) To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote: > 在 2025/9/4 00:53, Tejun Heo 写道: > > Hello, ... > As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP > still requires holding the threadgroup_rwsem, so contention with fork > operations persists. Sorry about my fumbling explanations repeatedly but this isn't true. On cgroup2, if you create a cgroup, enable controllers and then seed it with CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only remaining contention point is cgroup_mutex. > CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation > and deletion, but its usage comes with significant limitations: > > 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2 > adoption is gradually increasing, many applications have not yet been > adapted to cgroup v2, and phasing out cgroup v1 will be a long and > gradual process. > > 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the > time of process fork, effectively restricting cgroup migration to the > fork stage. This differs significantly from the typical cgroup attach > workflow. For example, in Kubernetes, systemd is the recommended cgroup > driver; kubelet communicates with systemd via D-Bus, and systemd > performs the actual cgroup attachment. In this case, the process being > attached typically does not have systemd as its parent. Using > CLONE_INTO_CGROUP in such a scenario is impractical and would require > coordinated changes to both systemd and kubelet. A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup locking to avoid adding overhead to hot paths - fork and exit - because cgroup operations were expected to be a lot colder. Now, threadgroup rwsem is *really* expensive for the writers, so the trade-off could be a bit too extreme for some use cases. However, now that the most common usage pattern doesn't involve threadgroup_rwsem, I don't feel too enthusiastic about adding hot path overhead to work around usage patterns that we want to move away from. Note that dynamic migrations have other more fundamental problems for stateful resources and we generally want to move away from it. Sure, a single rwsem operation in fork/exit isn't a lot of overhead but it isn't nothing either and this will impact everybody. Maybe we can make it a mount option so that use cases that still depend on it can toggle it on? In fact, there's already favordynmods mount option which seems like a good fit. Maybe put the extra locking behind that flag? Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
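[Editor's illustration] Opting in via the existing favordynmods mount option that Tejun points to would look roughly like this from userspace; the mount point and source string are assumptions:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* equivalent to: mount -t cgroup2 -o favordynmods none /sys/fs/cgroup */
        if (mount("none", "/sys/fs/cgroup", "cgroup2", 0, "favordynmods"))
                perror("mount cgroup2 with favordynmods");
        return 0;
}

The appeal of reusing this flag is that it already marks setups that favor dynamic migrations over hot-path performance, so gating the extra per-threadgroup locking behind it keeps the default fork/exit paths untouched.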
* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 7:28 ` Tejun Heo @ 2025-09-04 8:10 ` escape 0 siblings, 0 replies; 44+ messages in thread From: escape @ 2025-09-04 8:10 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel 在 2025/9/4 15:28, Tejun Heo 写道: > Hello, > > On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote: >> 在 2025/9/4 00:53, Tejun Heo 写道: >>> Hello, > ... >> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP >> still requires holding the threadgroup_rwsem, so contention with fork >> operations persists. > Sorry about my fumbling explanations repeatedly but this isn't true. On > cgroup2, if you create a cgroup, enable controllers and then seed it with > CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only > remaining contention point is cgroup_mutex. > >> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation >> and deletion, but its usage comes with significant limitations: >> >> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2 >> adoption is gradually increasing, many applications have not yet been >> adapted to cgroup v2, and phasing out cgroup v1 will be a long and >> gradual process. >> >> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the >> time of process fork, effectively restricting cgroup migration to the >> fork stage. This differs significantly from the typical cgroup attach >> workflow. For example, in Kubernetes, systemd is the recommended cgroup >> driver; kubelet communicates with systemd via D-Bus, and systemd >> performs the actual cgroup attachment. In this case, the process being >> attached typically does not have systemd as its parent. Using >> CLONE_INTO_CGROUP in such a scenario is impractical and would require >> coordinated changes to both systemd and kubelet. > A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup > locking to avoid adding overhead to hot paths - fork and exit - because > cgroup operations were expected to be a lot colder. Now, threadgroup rwsem > is *really* expensive for the writers, so the trade-off could be a bit too > extreme for some use cases. > > However, now that the most common usage pattern doesn't involve > threadgroup_rwsem, I don't feel too enthusiastic about adding hot path > overhead to work around usage patterns that we want to move away from. Note > that dynamic migrations have other more fundamental problems for stateful > resources and we generally want to move away from it. Sure, a single rwsem > operation in fork/exit isn't a lot of overhead but it isn't nothing either > and this will impact everybody. > > Maybe we can make it a mount option so that use cases that still depend on > it can toggle it on? In fact, there's already favordynmods mount option > which seems like a good fit. Maybe put the extra locking behind that flag? > > Thanks. > Thank you for your reply. I agree that mounting with cgroupv2, creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP is an excellent solution. However, we've encountered significant obstacles when applying this approach on some older systems. We are simultaneously working on enabling CLONE_INTO_CGROUP support in runc, systemd, and other components, but this will take some time. This patch aims to alleviate the issue to some extent during this transitional period. 
Regarding the impact of the extra rwsem operations on hot paths, I have conducted performance testing. In cases where there is no contention on down_write, the UnixBench spawn test scores remain unaffected. The suggestion to use the favordynmods flag to control whether the extra rwsem is used is excellent. I will incorporate this condition in the next version of the patch. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
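[Editor's illustration] The kind of interference test mentioned here can be approximated with a small userspace harness: fork/exit churn on the rwsem read side while timing cgroup.procs writes on the write side. A rough sketch under assumed paths and counts, not the actual benchmark used in the thread:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const char *procs = "/sys/fs/cgroup/demo/cgroup.procs"; /* assumed to exist */
        pid_t workers[4];
        char buf[32];

        /* read side: a few children doing nothing but fork/exit */
        for (int i = 0; i < 4; i++) {
                workers[i] = fork();
                if (workers[i] == 0) {
                        for (;;) {
                                pid_t p = fork();
                                if (p == 0)
                                        _exit(0);
                                waitpid(p, NULL, 0);
                        }
                }
        }

        /* write side: time repeated migrations of this process */
        snprintf(buf, sizeof(buf), "%d\n", getpid());
        for (int i = 0; i < 100; i++) {
                struct timespec a, b;
                clock_gettime(CLOCK_MONOTONIC, &a);
                FILE *f = fopen(procs, "w");
                if (f) {
                        fwrite(buf, 1, strlen(buf), f);
                        fclose(f);
                }
                clock_gettime(CLOCK_MONOTONIC, &b);
                printf("attach %d: %ld us\n", i,
                       (b.tv_sec - a.tv_sec) * 1000000L +
                       (b.tv_nsec - a.tv_nsec) / 1000);
        }

        for (int i = 0; i < 4; i++)
                kill(workers[i], SIGKILL);
        return 0;
}

Comparing the printed latencies with and without the fork storm (or with and without the patch) gives a crude view of the long-tail behavior described in the cover letter.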
* [PATCH v2 0/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao 2025-09-03 13:14 ` Waiman Long 2025-09-03 16:53 ` Tejun Heo @ 2025-09-04 11:39 ` Yi Tao 2025-09-04 11:39 ` [PATCH v2 1/1] " Yi Tao 2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao ` (2 subsequent siblings) 5 siblings, 1 reply; 44+ messages in thread From: Yi Tao @ 2025-09-04 11:39 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Changes in v2: - Use favordynmods as the enabling switch. - Determine whether to use the per-thread-group rwsem based on whether the task is NULL. - Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside rcu_read_lock. Yi Tao (1): cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads include/linux/cgroup-defs.h | 6 +++ include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 ++ kernel/cgroup/cgroup-internal.h | 4 +- kernel/cgroup/cgroup-v1.c | 8 ++-- kernel/cgroup/cgroup.c | 72 +++++++++++++++++++-------------- kernel/fork.c | 4 ++ 7 files changed, 64 insertions(+), 37 deletions(-) -- 2.32.0.3.g01195cf9f ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao @ 2025-09-04 11:39 ` Yi Tao 2025-09-04 16:31 ` Tejun Heo ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Yi Tao @ 2025-09-04 11:39 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. To avoid affecting other users, the per-thread-group rwsem is only used when the `favordynmods` flag is enabled. 
Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 6 +++ include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 ++ kernel/cgroup/cgroup-internal.h | 4 +- kernel/cgroup/cgroup-v1.c | 8 ++-- kernel/cgroup/cgroup.c | 72 +++++++++++++++++++-------------- kernel/fork.c | 4 ++ 7 files changed, 64 insertions(+), 37 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 6b93a64115fe..0c068dc3e08d 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -828,6 +828,8 @@ struct cgroup_of_peak { struct list_head list; }; +extern bool have_favordynmods; + /** * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups * @tsk: target task @@ -838,6 +840,8 @@ struct cgroup_of_peak { static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) { percpu_down_read(&cgroup_threadgroup_rwsem); + if (have_favordynmods) + down_read(&tsk->signal->group_rwsem); } /** @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) */ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) { + if (have_favordynmods) + up_read(&tsk->signal->group_rwsem); percpu_up_read(&cgroup_threadgroup_rwsem); } diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 1ef1edbaaf79..86fbc99a9174 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -226,6 +226,10 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif +#ifdef CONFIG_CGROUPS + struct rw_semaphore group_rwsem; +#endif + /* * Thread is the potential origin of an oom condition; kill first on * oom diff --git a/init/init_task.c b/init/init_task.c index e557f622bd90..0450093924a7 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { }, .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, +#ifdef CONFIG_CGROUPS + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), +#endif .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), #ifdef CONFIG_POSIX_TIMERS diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index b14e61c64a34..35005543f0c7 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(bool lock_threadgroup); -void cgroup_attach_unlock(bool lock_threadgroup); +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup); +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, bool *locked) __acquires(&cgroup_threadgroup_rwsem); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 2a4a387f867a..9f1a4d1fc741 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(NULL, true); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(true); + cgroup_attach_unlock(NULL, true); cgroup_unlock(); 
return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(NULL, true); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(true); + cgroup_attach_unlock(NULL, true); cgroup_unlock(); return ret; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 312c6a8b55bb..22b1659b623c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -214,7 +214,7 @@ static u16 have_exit_callback __read_mostly; static u16 have_release_callback __read_mostly; static u16 have_canfork_callback __read_mostly; -static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); +bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); /* cgroup namespace for init task */ struct cgroup_namespace init_cgroup_ns = { @@ -2459,7 +2459,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); /** * cgroup_attach_lock - Lock for ->attach() - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem + * @tsk: thread group to lock + * @lock_threadgroup: whether to down_write rwsem * * cgroup migration sometimes needs to stabilize threadgroups against forks and * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. */ -void cgroup_attach_lock(bool lock_threadgroup) +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup) { cpus_read_lock(); - if (lock_threadgroup) - percpu_down_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (tsk && favor_dynmods) + down_write(&tsk->signal->group_rwsem); + else + percpu_down_write(&cgroup_threadgroup_rwsem); + } } /** * cgroup_attach_unlock - Undo cgroup_attach_lock() - * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem + * @tsk: thread group to lock + * @lock_threadgroup: whether to up_write rwsem */ -void cgroup_attach_unlock(bool lock_threadgroup) +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup) { - if (lock_threadgroup) - percpu_up_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (tsk && favor_dynmods) + up_write(&tsk->signal->group_rwsem); + else + percpu_up_write(&cgroup_threadgroup_rwsem); + } cpus_read_unlock(); } @@ -2976,24 +2986,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) return ERR_PTR(-EINVAL); - /* - * If we migrate a single thread, we don't care about threadgroup - * stability. If the thread is `current`, it won't exit(2) under our - * hands or change PID through exec(2). We exclude - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write - * callers by cgroup_mutex. - * Therefore, we can skip the global lock. 
- */ - lockdep_assert_held(&cgroup_mutex); - *threadgroup_locked = pid || threadgroup; - cgroup_attach_lock(*threadgroup_locked); - rcu_read_lock(); if (pid) { tsk = find_task_by_vpid(pid); if (!tsk) { tsk = ERR_PTR(-ESRCH); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } } else { tsk = current; @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, */ if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { tsk = ERR_PTR(-EINVAL); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } - get_task_struct(tsk); - goto out_unlock_rcu; -out_unlock_threadgroup: - cgroup_attach_unlock(*threadgroup_locked); - *threadgroup_locked = false; + rcu_read_unlock(); + + /* + * If we migrate a single thread, we don't care about threadgroup + * stability. If the thread is `current`, it won't exit(2) under our + * hands or change PID through exec(2). We exclude + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write + * callers by cgroup_mutex. + * Therefore, we can skip the global lock. + */ + lockdep_assert_held(&cgroup_mutex); + *threadgroup_locked = pid || threadgroup; + + cgroup_attach_lock(tsk, *threadgroup_locked); + + return tsk; + out_unlock_rcu: rcu_read_unlock(); return tsk; @@ -3032,7 +3042,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(threadgroup_locked); + cgroup_attach_unlock(task, threadgroup_locked); for_each_subsys(ss, ssid) if (ss->post_attach) @@ -3119,7 +3129,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. */ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - cgroup_attach_lock(has_tasks); + cgroup_attach_lock(NULL, has_tasks); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3140,7 +3150,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(has_tasks); + cgroup_attach_unlock(NULL, has_tasks); return ret; } diff --git a/kernel/fork.c b/kernel/fork.c index af673856499d..5218f9b93c77 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sched_autogroup_fork(sig); +#ifdef CONFIG_CGROUPS + init_rwsem(&sig->group_rwsem); +#endif + sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 11:39 ` [PATCH v2 1/1] " Yi Tao @ 2025-09-04 16:31 ` Tejun Heo 2025-09-05 2:16 ` escape 2025-09-05 2:02 ` Chen Ridong 2025-09-05 13:17 ` kernel test robot 2 siblings, 1 reply; 44+ messages in thread From: Tejun Heo @ 2025-09-04 16:31 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, On Thu, Sep 04, 2025 at 07:39:32PM +0800, Yi Tao wrote: ... > To avoid affecting other users, the per-thread-group rwsem is only used > when the `favordynmods` flag is enabled. Can you please: - Note that this isn't necessary for cgroup2's recommended workflow and is thus gated behind favordynmods. - Include performance numbers briefly. > +extern bool have_favordynmods; > + > /** > * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups > * @tsk: target task > @@ -838,6 +840,8 @@ struct cgroup_of_peak { > static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > { > percpu_down_read(&cgroup_threadgroup_rwsem); > + if (have_favordynmods) > + down_read(&tsk->signal->group_rwsem); > } > > /** > @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > */ > static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) > { > + if (have_favordynmods) > + up_read(&tsk->signal->group_rwsem); > percpu_up_read(&cgroup_threadgroup_rwsem); Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents have_favordynmods flipping while a task is between change_begin and end? > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index 1ef1edbaaf79..86fbc99a9174 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -226,6 +226,10 @@ struct signal_struct { > struct tty_audit_buf *tty_audit_buf; > #endif > > +#ifdef CONFIG_CGROUPS > + struct rw_semaphore group_rwsem; > +#endif Maybe name it more specific - e.g. cgroup_threadgroup_rwsem? > /** > * cgroup_attach_lock - Lock for ->attach() > - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem > + * @tsk: thread group to lock > + * @lock_threadgroup: whether to down_write rwsem > * > * cgroup migration sometimes needs to stabilize threadgroups against forks and > * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() > @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); > * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that > * CPU hotplug is disabled on entry. > */ Please expand the function comment to explain what's going on and why and maybe point to it from a comment on top of favor_dynmods. > -void cgroup_attach_lock(bool lock_threadgroup) > +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup) As @tsk is an optional argument, it'd probably make more sense to put it at the end. > @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > */ > if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { > tsk = ERR_PTR(-EINVAL); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > - > get_task_struct(tsk); > - goto out_unlock_rcu; > > -out_unlock_threadgroup: > - cgroup_attach_unlock(*threadgroup_locked); > - *threadgroup_locked = false; > + rcu_read_unlock(); > + > + /* > + * If we migrate a single thread, we don't care about threadgroup > + * stability. 
If the thread is `current`, it won't exit(2) under our > + * hands or change PID through exec(2). We exclude > + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > + * callers by cgroup_mutex. > + * Therefore, we can skip the global lock. > + */ > + lockdep_assert_held(&cgroup_mutex); > + *threadgroup_locked = pid || threadgroup; > + > + cgroup_attach_lock(tsk, *threadgroup_locked); I'm not sure this relocation is safe. What prevents e.g. @tsk changing its group leader or signal struct before lock is grabbed? Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 16:31 ` Tejun Heo @ 2025-09-05 2:16 ` escape 2025-09-05 2:27 ` Tejun Heo 0 siblings, 1 reply; 44+ messages in thread From: escape @ 2025-09-05 2:16 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel Thanks for your review. 在 2025/9/5 00:31, Tejun Heo 写道: > Hello, > > On Thu, Sep 04, 2025 at 07:39:32PM +0800, Yi Tao wrote: > ... >> To avoid affecting other users, the per-thread-group rwsem is only used >> when the `favordynmods` flag is enabled. > Can you please: > > - Note that this isn't necessary for cgroup2's recommended workflow and is > thus gated behind favordynmods. > > - Include performance numbers briefly. > >> +extern bool have_favordynmods; >> + >> /** >> * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups >> * @tsk: target task >> @@ -838,6 +840,8 @@ struct cgroup_of_peak { >> static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) >> { >> percpu_down_read(&cgroup_threadgroup_rwsem); >> + if (have_favordynmods) >> + down_read(&tsk->signal->group_rwsem); >> } >> >> /** >> @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) >> */ >> static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) >> { >> + if (have_favordynmods) >> + up_read(&tsk->signal->group_rwsem); >> percpu_up_read(&cgroup_threadgroup_rwsem); > Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents > have_favordynmods flipping while a task is between change_begin and end? have_favordynmods is read-only after initialization and will not change during runtime. >> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h >> index 1ef1edbaaf79..86fbc99a9174 100644 >> --- a/include/linux/sched/signal.h >> +++ b/include/linux/sched/signal.h >> @@ -226,6 +226,10 @@ struct signal_struct { >> struct tty_audit_buf *tty_audit_buf; >> #endif >> >> +#ifdef CONFIG_CGROUPS >> + struct rw_semaphore group_rwsem; >> +#endif > Maybe name it more specific - e.g. cgroup_threadgroup_rwsem? > >> /** >> * cgroup_attach_lock - Lock for ->attach() >> - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem >> + * @tsk: thread group to lock >> + * @lock_threadgroup: whether to down_write rwsem >> * >> * cgroup migration sometimes needs to stabilize threadgroups against forks and >> * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() >> @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); >> * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that >> * CPU hotplug is disabled on entry. >> */ > Please expand the function comment to explain what's going on and why and > maybe point to it from a comment on top of favor_dynmods. > >> -void cgroup_attach_lock(bool lock_threadgroup) >> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup) > As @tsk is an optional argument, it'd probably make more sense to put it at > the end. 
> >> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, >> */ >> if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { >> tsk = ERR_PTR(-EINVAL); >> - goto out_unlock_threadgroup; >> + goto out_unlock_rcu; >> } >> - >> get_task_struct(tsk); >> - goto out_unlock_rcu; >> >> -out_unlock_threadgroup: >> - cgroup_attach_unlock(*threadgroup_locked); >> - *threadgroup_locked = false; >> + rcu_read_unlock(); >> + >> + /* >> + * If we migrate a single thread, we don't care about threadgroup >> + * stability. If the thread is `current`, it won't exit(2) under our >> + * hands or change PID through exec(2). We exclude >> + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write >> + * callers by cgroup_mutex. >> + * Therefore, we can skip the global lock. >> + */ >> + lockdep_assert_held(&cgroup_mutex); >> + *threadgroup_locked = pid || threadgroup; >> + >> + cgroup_attach_lock(tsk, *threadgroup_locked); > I'm not sure this relocation is safe. What prevents e.g. @tsk changing its > group leader or signal struct before lock is grabbed? When a non-leader thread in a thread group executes the exec system call, the thread group leader is updated, but the signal_struct remains unchanged, so this part is safe. > > Thanks. I will accept the remaining comments and apply them in the next version of the patch. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
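[Editor's illustration] The exec-from-a-non-leader case being debated here is easy to reproduce from userspace: the process keeps its PID (and, inside the kernel, its signal_struct), but the exec'ing thread takes over as group leader. A small demo, compile with -pthread, purely illustrative:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *non_leader(void *arg)
{
        printf("thread tid %ld calling exec; process pid stays %d\n",
               (long)syscall(SYS_gettid), getpid());
        /*
         * After exec, /proc/self/status shows Pid == Tgid == the original
         * pid, but kernel-side the thread that exec'd is now the leader.
         */
        execlp("cat", "cat", "/proc/self/status", NULL);
        return NULL;
}

int main(void)
{
        pthread_t t;

        printf("group leader pid %d, tid %ld\n",
               getpid(), (long)syscall(SYS_gettid));
        pthread_create(&t, NULL, non_leader, NULL);
        pthread_join(t, NULL);  /* never returns once the exec succeeds */
        return 0;
}

This is why "the signal_struct remains unchanged" is true but, as Tejun notes next, not sufficient: leadership can still move between the RCU lookup and the lock.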
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-05 2:16 ` escape @ 2025-09-05 2:27 ` Tejun Heo 2025-09-05 3:44 ` escape 0 siblings, 1 reply; 44+ messages in thread From: Tejun Heo @ 2025-09-05 2:27 UTC (permalink / raw) To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, On Fri, Sep 05, 2025 at 10:16:30AM +0800, escape wrote: > > > + if (have_favordynmods) > > > + up_read(&tsk->signal->group_rwsem); > > > percpu_up_read(&cgroup_threadgroup_rwsem); > > Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents > > have_favordynmods flipping while a task is between change_begin and end? > > have_favordynmods is read-only after initialization and will not change > during runtime. I don't think that's necessarily true. favordynmods can also be specified as a mount option and mount can race against forks, execs and exits. Also, IIRC, cgroup2 doesn't allow remounts but there's nothing preventing someone from unmounting and mounting it again with different options. > > > @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > > > */ > > > if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { > > > tsk = ERR_PTR(-EINVAL); > > > - goto out_unlock_threadgroup; > > > + goto out_unlock_rcu; > > > } > > > - > > > get_task_struct(tsk); > > > - goto out_unlock_rcu; > > > -out_unlock_threadgroup: > > > - cgroup_attach_unlock(*threadgroup_locked); > > > - *threadgroup_locked = false; > > > + rcu_read_unlock(); > > > + > > > + /* > > > + * If we migrate a single thread, we don't care about threadgroup > > > + * stability. If the thread is `current`, it won't exit(2) under our > > > + * hands or change PID through exec(2). We exclude > > > + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > > > + * callers by cgroup_mutex. > > > + * Therefore, we can skip the global lock. > > > + */ > > > + lockdep_assert_held(&cgroup_mutex); > > > + *threadgroup_locked = pid || threadgroup; > > > + > > > + cgroup_attach_lock(tsk, *threadgroup_locked); > > I'm not sure this relocation is safe. What prevents e.g. @tsk changing its > > group leader or signal struct before lock is grabbed? > > When a non-leader thread in a thread group executes the exec system call, > the thread group leader is updated, but the signal_struct remains unchanged, > so this part is safe. But the leader can change, right? So, we can end up in a situation where threadgroup is set but the task is not the leader which I think can lead to really subtle incorrect behaviors like write succeeding but nothing happening when racing against exec. Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-05 2:27 ` Tejun Heo @ 2025-09-05 3:44 ` escape 2025-09-05 3:48 ` Tejun Heo 0 siblings, 1 reply; 44+ messages in thread From: escape @ 2025-09-05 3:44 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, 在 2025/9/5 10:27, Tejun Heo 写道: > Hello, > > On Fri, Sep 05, 2025 at 10:16:30AM +0800, escape wrote: >>>> + if (have_favordynmods) >>>> + up_read(&tsk->signal->group_rwsem); >>>> percpu_up_read(&cgroup_threadgroup_rwsem); >>> Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents >>> have_favordynmods flipping while a task is between change_begin and end? >> have_favordynmods is read-only after initialization and will not change >> during runtime. > I don't think that's necessarily true. favordynmods can also be specified as > a mount option and mount can race against forks, execs and exits. Also, > IIRC, cgroup2 doesn't allow remounts but there's nothing preventing someone > from unmounting and mounting it again with different options. There are two ways to set favordynmods. The first is through kernel config or the kernel command line parameter cgroup_favordynmods, which sets the value of the variable have_favordynmods and automatically sets the CGRP_ROOT_FAVOR_DYNMODS flag on cgroup_root->flags at mount time. The second is by passing a mount option, which only sets the CGRP_ROOT_FAVOR_DYNMODS flag on cgroup_root->flags during mount. In this patch, the decision whether to use the per-threadgroup rwsem is based on have_favordynmods, not on cgroup_root->flags. An umount & mount affects cgroup_root->flags, but does not change have_favordynmods. Indeed, there is a minor flaw here: if favordynmods is enabled via a mount option rather than the kernel command line, the per-threadgroup rwsem will not take effect. To avoid the complexity of runtime changes to favordynmods, perhaps a better approach would be to introduce a separate control variable, configured via a kernel command line parameter such as cgroup_migration_lock=[global|thread_group] to explicitly govern this behavior. What do you think about this approach? >>>> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, >>>> */ >>>> if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { >>>> tsk = ERR_PTR(-EINVAL); >>>> - goto out_unlock_threadgroup; >>>> + goto out_unlock_rcu; >>>> } >>>> - >>>> get_task_struct(tsk); >>>> - goto out_unlock_rcu; >>>> -out_unlock_threadgroup: >>>> - cgroup_attach_unlock(*threadgroup_locked); >>>> - *threadgroup_locked = false; >>>> + rcu_read_unlock(); >>>> + >>>> + /* >>>> + * If we migrate a single thread, we don't care about threadgroup >>>> + * stability. If the thread is `current`, it won't exit(2) under our >>>> + * hands or change PID through exec(2). We exclude >>>> + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write >>>> + * callers by cgroup_mutex. >>>> + * Therefore, we can skip the global lock. >>>> + */ >>>> + lockdep_assert_held(&cgroup_mutex); >>>> + *threadgroup_locked = pid || threadgroup; >>>> + >>>> + cgroup_attach_lock(tsk, *threadgroup_locked); >>> I'm not sure this relocation is safe. What prevents e.g. @tsk changing its >>> group leader or signal struct before lock is grabbed? 
>> When a non-leader thread in a thread group executes the exec system call, >> the thread group leader is updated, but the signal_struct remains unchanged, >> so this part is safe. > But the leader can change, right? So, we can end up in a situation where > threadgroup is set but the task is not the leader which I think can lead to > really subtle incorrect behaviors like write succeeding but nothing > happening when racing against exec. > > Thanks. > You are right. If the thread group leader changes, the task passed to cgroup_attach_task may not be the leader, which could lead to errors when iterating through all threads in while_each_thread. Similar to commit b78949ebfb56 ("cgroup: simplify double-check locking in cgroup_attach_proc"), it should check whether the leader has changed after acquiring the lock, to determine if a retry is needed. I will fix this in the next version patch. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
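[Editor's illustration] A hedged sketch of the check-after-lock retry being referred to (modeled on the double-check in b78949ebfb56; this is not the actual v3 code and the identifiers follow the v2 prototypes only for readability):

        cgroup_attach_lock(tsk, *threadgroup_locked);

        if (*threadgroup_locked && !thread_group_leader(tsk)) {
                /*
                 * A non-leader thread exec'd after the RCU lookup and took
                 * over as leader; migrating from the stale @tsk would miss
                 * threads in while_each_thread().  Drop the lock, drop the
                 * reference and retry the lookup from the top.
                 */
                cgroup_attach_unlock(tsk, *threadgroup_locked);
                put_task_struct(tsk);
                goto retry_lookup;      /* back to find_task_by_vpid() */
        }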
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-05 3:44 ` escape @ 2025-09-05 3:48 ` Tejun Heo 2025-09-05 4:30 ` escape 0 siblings, 1 reply; 44+ messages in thread From: Tejun Heo @ 2025-09-05 3:48 UTC (permalink / raw) To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, On Fri, Sep 05, 2025 at 11:44:53AM +0800, escape wrote: ... > To avoid the complexity of runtime changes to favordynmods, perhaps a > better approach would be to introduce a separate control variable, > configured via a kernel command line parameter such as > cgroup_migration_lock=[global|thread_group] to explicitly govern this > behavior. What do you think about this approach? Can't you just down_write threadgroup_rwsem when flipping the flag? Wouldn't that provide sufficient synchronization? Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-05 3:48 ` Tejun Heo @ 2025-09-05 4:30 ` escape 0 siblings, 0 replies; 44+ messages in thread From: escape @ 2025-09-05 4:30 UTC (permalink / raw) To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel 在 2025/9/5 11:48, Tejun Heo 写道: > Hello, > > On Fri, Sep 05, 2025 at 11:44:53AM +0800, escape wrote: > ... >> To avoid the complexity of runtime changes to favordynmods, perhaps a >> better approach would be to introduce a separate control variable, >> configured via a kernel command line parameter such as >> cgroup_migration_lock=[global|thread_group] to explicitly govern this >> behavior. What do you think about this approach? > Can't you just down_write threadgroup_rwsem when flipping the flag? Wouldn't > that provide sufficient synchronization? > > Thanks. > You're right, it's clear and simple. I will make the changes in this way. Thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
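[Editor's illustration] The synchronization Tejun suggests can be sketched as follows: flipping the flag itself takes the global rwsem write side, so no task can sit between change_begin and change_end with a mismatched view of the flag and end up unlocking a per-threadgroup rwsem it never locked. Names are illustrative, not the actual follow-up patch:

static void cgroup_enable_per_threadgroup_rwsem(bool enable)
{
        /*
         * The write side excludes every cgroup_threadgroup_change_begin()/
         * end() reader, so no thread can observe the old flag value on
         * begin and the new one on end.
         */
        percpu_down_write(&cgroup_threadgroup_rwsem);
        have_favordynmods = enable;
        percpu_up_write(&cgroup_threadgroup_rwsem);
}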
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 11:39 ` [PATCH v2 1/1] " Yi Tao 2025-09-04 16:31 ` Tejun Heo @ 2025-09-05 2:02 ` Chen Ridong 2025-09-05 13:17 ` kernel test robot 2 siblings, 0 replies; 44+ messages in thread From: Chen Ridong @ 2025-09-05 2:02 UTC (permalink / raw) To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel On 2025/9/4 19:39, Yi Tao wrote: > As computer hardware advances, modern systems are typically equipped > with many CPU cores and large amounts of memory, enabling the deployment > of numerous applications. On such systems, container creation and > deletion become frequent operations, making cgroup process migration no > longer a cold path. This leads to noticeable contention with common > process operations such as fork, exec, and exit. > > To alleviate the contention between cgroup process migration and > operations like process fork, this patch modifies lock to take the write > lock on signal_struct->group_rwsem when writing pid to > cgroup.procs/threads instead of holding a global write lock. > > Cgroup process migration has historically relied on > signal_struct->group_rwsem to protect thread group integrity. In commit > <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with > a global percpu_rwsem"), this was changed to a global > cgroup_threadgroup_rwsem. The advantage of using a global lock was > simplified handling of process group migrations. This patch retains the > use of the global lock for protecting process group migration, while > reducing contention by using per thread group lock during > cgroup.procs/threads writes. > > The locking behavior is as follows: > > write cgroup.procs/threads | process fork,exec,exit | process group migration > ------------------------------------------------------------------------------ > cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() > down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) > critical section | critical section | critical section > up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) > cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() > > g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes > signal_struct->group_rwsem. > > This patch eliminates contention between cgroup migration and fork > operations for threads that belong to different thread groups, thereby > reducing the long-tail latency of cgroup migrations and lowering system > load. > > To avoid affecting other users, the per-thread-group rwsem is only used > when the `favordynmods` flag is enabled. 
> > Signed-off-by: Yi Tao <escape@linux.alibaba.com> > --- > include/linux/cgroup-defs.h | 6 +++ > include/linux/sched/signal.h | 4 ++ > init/init_task.c | 3 ++ > kernel/cgroup/cgroup-internal.h | 4 +- > kernel/cgroup/cgroup-v1.c | 8 ++-- > kernel/cgroup/cgroup.c | 72 +++++++++++++++++++-------------- > kernel/fork.c | 4 ++ > 7 files changed, 64 insertions(+), 37 deletions(-) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 6b93a64115fe..0c068dc3e08d 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -828,6 +828,8 @@ struct cgroup_of_peak { > struct list_head list; > }; > > +extern bool have_favordynmods; > + > /** > * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups > * @tsk: target task > @@ -838,6 +840,8 @@ struct cgroup_of_peak { > static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > { > percpu_down_read(&cgroup_threadgroup_rwsem); > + if (have_favordynmods) > + down_read(&tsk->signal->group_rwsem); > } > > /** > @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > */ > static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) > { > + if (have_favordynmods) > + up_read(&tsk->signal->group_rwsem); > percpu_up_read(&cgroup_threadgroup_rwsem); > } > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index 1ef1edbaaf79..86fbc99a9174 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -226,6 +226,10 @@ struct signal_struct { > struct tty_audit_buf *tty_audit_buf; > #endif > > +#ifdef CONFIG_CGROUPS > + struct rw_semaphore group_rwsem; > +#endif > + > /* > * Thread is the potential origin of an oom condition; kill first on > * oom > diff --git a/init/init_task.c b/init/init_task.c > index e557f622bd90..0450093924a7 100644 > --- a/init/init_task.c > +++ b/init/init_task.c > @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { > }, > .multiprocess = HLIST_HEAD_INIT, > .rlim = INIT_RLIMITS, > +#ifdef CONFIG_CGROUPS > + .group_rwsem = __RWSEM_INITIALIZER(init_signals.group_rwsem), > +#endif > .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), > .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), > #ifdef CONFIG_POSIX_TIMERS > diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h > index b14e61c64a34..35005543f0c7 100644 > --- a/kernel/cgroup/cgroup-internal.h > +++ b/kernel/cgroup/cgroup-internal.h > @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, > > int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, > bool threadgroup); > -void cgroup_attach_lock(bool lock_threadgroup); > -void cgroup_attach_unlock(bool lock_threadgroup); > +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup); > +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup); > struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > bool *locked) > __acquires(&cgroup_threadgroup_rwsem); > diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c > index 2a4a387f867a..9f1a4d1fc741 100644 > --- a/kernel/cgroup/cgroup-v1.c > +++ b/kernel/cgroup/cgroup-v1.c > @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > int retval = 0; > > cgroup_lock(); > - cgroup_attach_lock(true); > + cgroup_attach_lock(NULL, true); > for_each_root(root) { > struct cgroup *from_cgrp; > > 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > if (retval) > break; > } > - cgroup_attach_unlock(true); > + cgroup_attach_unlock(NULL, true); > cgroup_unlock(); > > return retval; > @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) > > cgroup_lock(); > > - cgroup_attach_lock(true); > + cgroup_attach_lock(NULL, true); > > /* all tasks in @from are being moved, all csets are source */ > spin_lock_irq(&css_set_lock); > @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) > } while (task && !ret); > out_err: > cgroup_migrate_finish(&mgctx); > - cgroup_attach_unlock(true); > + cgroup_attach_unlock(NULL, true); > cgroup_unlock(); > return ret; > } > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 312c6a8b55bb..22b1659b623c 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -214,7 +214,7 @@ static u16 have_exit_callback __read_mostly; > static u16 have_release_callback __read_mostly; > static u16 have_canfork_callback __read_mostly; > > -static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); > +bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); > If have_favordynmods is gating the acquisition of tsk->signal->group_rwsem, would it be better to replace it with a static key? > /* cgroup namespace for init task */ > struct cgroup_namespace init_cgroup_ns = { > @@ -2459,7 +2459,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); > > /** > * cgroup_attach_lock - Lock for ->attach() > - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem > + * @tsk: thread group to lock > + * @lock_threadgroup: whether to down_write rwsem > * > * cgroup migration sometimes needs to stabilize threadgroups against forks and > * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() > @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); > * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that > * CPU hotplug is disabled on entry. > */ > -void cgroup_attach_lock(bool lock_threadgroup) > +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup) > { > cpus_read_lock(); > - if (lock_threadgroup) > - percpu_down_write(&cgroup_threadgroup_rwsem); > + if (lock_threadgroup) { > + if (tsk && favor_dynmods) > + down_write(&tsk->signal->group_rwsem); > + else > + percpu_down_write(&cgroup_threadgroup_rwsem); > + } > } > > /** > * cgroup_attach_unlock - Undo cgroup_attach_lock() > - * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem > + * @tsk: thread group to lock > + * @lock_threadgroup: whether to up_write rwsem > */ > -void cgroup_attach_unlock(bool lock_threadgroup) > +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup) > { > - if (lock_threadgroup) > - percpu_up_write(&cgroup_threadgroup_rwsem); > + if (lock_threadgroup) { > + if (tsk && favor_dynmods) > + up_write(&tsk->signal->group_rwsem); > + else > + percpu_up_write(&cgroup_threadgroup_rwsem); > + } > cpus_read_unlock(); > } > > @@ -2976,24 +2986,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) > return ERR_PTR(-EINVAL); > > - /* > - * If we migrate a single thread, we don't care about threadgroup > - * stability. If the thread is `current`, it won't exit(2) under our > - * hands or change PID through exec(2). 
We exclude > - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > - * callers by cgroup_mutex. > - * Therefore, we can skip the global lock. > - */ > - lockdep_assert_held(&cgroup_mutex); > - *threadgroup_locked = pid || threadgroup; > - cgroup_attach_lock(*threadgroup_locked); > - > rcu_read_lock(); > if (pid) { > tsk = find_task_by_vpid(pid); > if (!tsk) { > tsk = ERR_PTR(-ESRCH); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > } else { > tsk = current; > @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > */ > if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { > tsk = ERR_PTR(-EINVAL); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > - > get_task_struct(tsk); > - goto out_unlock_rcu; > > -out_unlock_threadgroup: > - cgroup_attach_unlock(*threadgroup_locked); > - *threadgroup_locked = false; > + rcu_read_unlock(); > + > + /* > + * If we migrate a single thread, we don't care about threadgroup > + * stability. If the thread is `current`, it won't exit(2) under our > + * hands or change PID through exec(2). We exclude > + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > + * callers by cgroup_mutex. > + * Therefore, we can skip the global lock. > + */ > + lockdep_assert_held(&cgroup_mutex); > + *threadgroup_locked = pid || threadgroup; > + > + cgroup_attach_lock(tsk, *threadgroup_locked); > + > + return tsk; > + > out_unlock_rcu: > rcu_read_unlock(); > return tsk; > @@ -3032,7 +3042,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked > /* release reference from cgroup_procs_write_start() */ > put_task_struct(task); > > - cgroup_attach_unlock(threadgroup_locked); > + cgroup_attach_unlock(task, threadgroup_locked); > > for_each_subsys(ss, ssid) > if (ss->post_attach) > @@ -3119,7 +3129,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) > * write-locking can be skipped safely. > */ > has_tasks = !list_empty(&mgctx.preloaded_src_csets); > - cgroup_attach_lock(has_tasks); > + cgroup_attach_lock(NULL, has_tasks); > > /* NULL dst indicates self on default hierarchy */ > ret = cgroup_migrate_prepare_dst(&mgctx); > @@ -3140,7 +3150,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) > ret = cgroup_migrate_execute(&mgctx); > out_finish: > cgroup_migrate_finish(&mgctx); > - cgroup_attach_unlock(has_tasks); > + cgroup_attach_unlock(NULL, has_tasks); > return ret; > } > > diff --git a/kernel/fork.c b/kernel/fork.c > index af673856499d..5218f9b93c77 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) > tty_audit_fork(sig); > sched_autogroup_fork(sig); > > +#ifdef CONFIG_CGROUPS > + init_rwsem(&sig->group_rwsem); > +#endif > + > sig->oom_score_adj = current->signal->oom_score_adj; > sig->oom_score_adj_min = current->signal->oom_score_adj_min; > -- Best regards, Ridong ^ permalink raw reply [flat|nested] 44+ messages in thread
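[Editor's illustration] Ridong's static-key suggestion would look roughly like this; a sketch only, with made-up identifiers, not part of the posted patch. The branch compiles to a patched-out nop until the key is enabled, so the common fork/exit path does not even load a global bool:

#include <linux/jump_label.h>
#include <linux/percpu-rwsem.h>
#include <linux/rwsem.h>
#include <linux/sched/signal.h>

DEFINE_STATIC_KEY_FALSE(cgroup_per_threadgroup_rwsem_key);

static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
        percpu_down_read(&cgroup_threadgroup_rwsem);
        if (static_branch_unlikely(&cgroup_per_threadgroup_rwsem_key))
                down_read(&tsk->signal->group_rwsem);
}

static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
{
        if (static_branch_unlikely(&cgroup_per_threadgroup_rwsem_key))
                up_read(&tsk->signal->group_rwsem);
        percpu_up_read(&cgroup_threadgroup_rwsem);
}

/* flipped once when favordynmods is enabled:
 *      static_branch_enable(&cgroup_per_threadgroup_rwsem_key);
 */

The same begin/end imbalance question raised earlier in the thread still applies when the key is toggled at runtime, so any such toggle would need the write-side synchronization discussed above.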
* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads 2025-09-04 11:39 ` [PATCH v2 1/1] " Yi Tao 2025-09-04 16:31 ` Tejun Heo 2025-09-05 2:02 ` Chen Ridong @ 2025-09-05 13:17 ` kernel test robot 2 siblings, 0 replies; 44+ messages in thread From: kernel test robot @ 2025-09-05 13:17 UTC (permalink / raw) To: Yi Tao, tj, hannes, mkoutny; +Cc: oe-kbuild-all, cgroups, linux-kernel Hi Yi, kernel test robot noticed the following build errors: [auto build test ERROR on tj-cgroup/for-next] [also build test ERROR on kees/for-next/execve akpm-mm/mm-everything tip/sched/core linus/master v6.17-rc4] [cannot apply to next-20250905] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Yi-Tao/cgroup-replace-global-percpu_rwsem-with-signal_struct-group_rwsem-when-writing-cgroup-procs-threads/20250904-194505 base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next patch link: https://lore.kernel.org/r/068d58f1f497bc4971c6ac0bae58bf53b98451fd.1756985260.git.escape%40linux.alibaba.com patch subject: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads config: i386-buildonly-randconfig-004-20250905 (https://download.01.org/0day-ci/archive/20250905/202509052150.4GQ04PJn-lkp@intel.com/config) compiler: gcc-13 (Debian 13.3.0-16) 13.3.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250905/202509052150.4GQ04PJn-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202509052150.4GQ04PJn-lkp@intel.com/ All errors (new ones prefixed by >>): kernel/cgroup/cgroup.c: In function 'cgroup_attach_lock': >> kernel/cgroup/cgroup.c:2511:28: error: 'favor_dynmods' undeclared (first use in this function); did you mean 'Opt_favordynmods'? 2511 | if (tsk && favor_dynmods) | ^~~~~~~~~~~~~ | Opt_favordynmods kernel/cgroup/cgroup.c:2511:28: note: each undeclared identifier is reported only once for each function it appears in kernel/cgroup/cgroup.c: In function 'cgroup_attach_unlock': kernel/cgroup/cgroup.c:2526:28: error: 'favor_dynmods' undeclared (first use in this function); did you mean 'Opt_favordynmods'? 2526 | if (tsk && favor_dynmods) | ^~~~~~~~~~~~~ | Opt_favordynmods vim +2511 kernel/cgroup/cgroup.c 2482 2483 /** 2484 * cgroup_attach_lock - Lock for ->attach() 2485 * @tsk: thread group to lock 2486 * @lock_threadgroup: whether to down_write rwsem 2487 * 2488 * cgroup migration sometimes needs to stabilize threadgroups against forks and 2489 * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() 2490 * implementations (e.g. cpuset), also need to disable CPU hotplug. 2491 * Unfortunately, letting ->attach() operations acquire cpus_read_lock() can 2492 * lead to deadlocks. 2493 * 2494 * Bringing up a CPU may involve creating and destroying tasks which requires 2495 * read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside 2496 * cpus_read_lock(). 
If we call an ->attach() which acquires the cpus lock while 2497 * write-locking threadgroup_rwsem, the locking order is reversed and we end up 2498 * waiting for an on-going CPU hotplug operation which in turn is waiting for 2499 * the threadgroup_rwsem to be released to create new tasks. For more details: 2500 * 2501 * http://lkml.kernel.org/r/20220711174629.uehfmqegcwn2lqzu@wubuntu 2502 * 2503 * Resolve the situation by always acquiring cpus_read_lock() before optionally 2504 * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that 2505 * CPU hotplug is disabled on entry. 2506 */ 2507 void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup) 2508 { 2509 cpus_read_lock(); 2510 if (lock_threadgroup) { > 2511 if (tsk && favor_dynmods) 2512 down_write(&tsk->signal->group_rwsem); 2513 else 2514 percpu_down_write(&cgroup_threadgroup_rwsem); 2515 } 2516 } 2517 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao ` (2 preceding siblings ...) 2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao @ 2025-09-08 10:20 ` Yi Tao 2025-09-08 10:20 ` [PATCH v3 1/1] " Yi Tao 2025-09-09 7:55 ` [PATCH v4 0/3] " Yi Tao 2025-09-10 6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao 5 siblings, 1 reply; 44+ messages in thread From: Yi Tao @ 2025-09-08 10:20 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Changes in v3: - Expand commit log and comments. - Put argument @tsk at end in cgroup_attach_lock/unlock. - down_write global cgroup_threadgroup_rwsem when flipping favordynmods to synchronize with tasks between cgroup_threadgroup_change_begin and end. - Rename group_rwsem to cgroup_threadgroup_rwsem. - Fix bug causing abnormal cgroup migration due to threadgroup leader changes. Changes in v2: - Use favordynmods as the enabling switch. - Determine whether to use the per-thread-group rwsem based on whether the task is NULL. - Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside rcu_read_lock. Yi Tao (1): cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs include/linux/cgroup-defs.h | 12 +++- include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 + kernel/cgroup/cgroup-internal.h | 4 +- kernel/cgroup/cgroup-v1.c | 8 +-- kernel/cgroup/cgroup.c | 105 ++++++++++++++++++++++---------- kernel/fork.c | 4 ++ 7 files changed, 101 insertions(+), 39 deletions(-) -- 2.32.0.3.g01195cf9f ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao @ 2025-09-08 10:20 ` Yi Tao 2025-09-08 16:05 ` Tejun Heo 2025-09-08 19:39 ` Waiman Long 0 siblings, 2 replies; 44+ messages in thread From: Yi Tao @ 2025-09-08 10:20 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the `favordynmods` flag is enabled. 
Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 12 +++- include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 + kernel/cgroup/cgroup-internal.h | 4 +- kernel/cgroup/cgroup-v1.c | 8 +-- kernel/cgroup/cgroup.c | 105 ++++++++++++++++++++++---------- kernel/fork.c | 4 ++ 7 files changed, 101 insertions(+), 39 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 6b93a64115fe..5033e3bdac9e 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -88,7 +88,8 @@ enum { /* * Reduce latencies on dynamic cgroup modifications such as task * migrations and controller on/offs by disabling percpu operation on - * cgroup_threadgroup_rwsem. This makes hot path operations such as + * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when + * writing to cgroup.procs. This makes hot path operations such as * forks and exits into the slow path and more expensive. * * The static usage pattern of creating a cgroup, enabling controllers, @@ -828,16 +829,21 @@ struct cgroup_of_peak { struct list_head list; }; +extern int take_per_threadgroup_rwsem; + /** * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups * @tsk: target task * * Allows cgroup operations to synchronize against threadgroup changes - * using a percpu_rw_semaphore. + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition. */ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) { percpu_down_read(&cgroup_threadgroup_rwsem); + if (take_per_threadgroup_rwsem) + down_read(&tsk->signal->cgroup_threadgroup_rwsem); } /** @@ -848,6 +854,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) */ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) { + if (take_per_threadgroup_rwsem) + up_read(&tsk->signal->cgroup_threadgroup_rwsem); percpu_up_read(&cgroup_threadgroup_rwsem); } diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 1ef1edbaaf79..7d6449982822 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -226,6 +226,10 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif +#ifdef CONFIG_CGROUPS + struct rw_semaphore cgroup_threadgroup_rwsem; +#endif + /* * Thread is the potential origin of an oom condition; kill first on * oom diff --git a/init/init_task.c b/init/init_task.c index e557f622bd90..a55e2189206f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { }, .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, +#ifdef CONFIG_CGROUPS + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem), +#endif .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), #ifdef CONFIG_POSIX_TIMERS diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index b14e61c64a34..318cc7f22e2c 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(bool lock_threadgroup); -void cgroup_attach_unlock(bool lock_threadgroup); +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct 
*tsk); +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, bool *locked) __acquires(&cgroup_threadgroup_rwsem); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 2a4a387f867a..65e9b454780c 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(true, NULL); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(true); + cgroup_attach_unlock(true, NULL); cgroup_unlock(); return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(true, NULL); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(true); + cgroup_attach_unlock(true, NULL); cgroup_unlock(); return ret; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 312c6a8b55bb..8650ec394d0c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root) return root_cgrp->root; } +int take_per_threadgroup_rwsem; + void cgroup_favor_dynmods(struct cgroup_root *root, bool favor) { bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS; - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */ + /* + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition. + * favordynmods can flip while task is between + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end, + * so down_write global cgroup_threadgroup_rwsem to synchronize them. + */ + percpu_down_write(&cgroup_threadgroup_rwsem); if (favor && !favoring) { + take_per_threadgroup_rwsem++; rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); root->flags |= CGRP_ROOT_FAVOR_DYNMODS; } else if (!favor && favoring) { + take_per_threadgroup_rwsem--; rcu_sync_exit(&cgroup_threadgroup_rwsem.rss); root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS; } + percpu_up_write(&cgroup_threadgroup_rwsem); } static int cgroup_init_root_id(struct cgroup_root *root) @@ -2459,7 +2470,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); /** * cgroup_attach_lock - Lock for ->attach() - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem + * @lock_threadgroup: whether to down_write rwsem + * @tsk: thread group to lock * * cgroup migration sometimes needs to stabilize threadgroups against forks and * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() @@ -2479,22 +2491,37 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * Resolve the situation by always acquiring cpus_read_lock() before optionally * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. + * + * When favordynmods is enabled, take per threadgroup rwsem to reduce latencies + * on dynamic cgroup modifications. see the comment above + * CGRP_ROOT_FAVOR_DYNMODS definition. + * + * tsk is not NULL only when writing to cgroup.procs. 
*/ -void cgroup_attach_lock(bool lock_threadgroup) +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk) { cpus_read_lock(); - if (lock_threadgroup) - percpu_down_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (tsk && take_per_threadgroup_rwsem) + down_write(&tsk->signal->cgroup_threadgroup_rwsem); + else + percpu_down_write(&cgroup_threadgroup_rwsem); + } } /** * cgroup_attach_unlock - Undo cgroup_attach_lock() - * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem + * @lock_threadgroup: whether to up_write rwsem + * @tsk: thread group to lock */ -void cgroup_attach_unlock(bool lock_threadgroup) +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk) { - if (lock_threadgroup) - percpu_up_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (tsk && take_per_threadgroup_rwsem) + up_write(&tsk->signal->cgroup_threadgroup_rwsem); + else + percpu_up_write(&cgroup_threadgroup_rwsem); + } cpus_read_unlock(); } @@ -2976,24 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) return ERR_PTR(-EINVAL); - /* - * If we migrate a single thread, we don't care about threadgroup - * stability. If the thread is `current`, it won't exit(2) under our - * hands or change PID through exec(2). We exclude - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write - * callers by cgroup_mutex. - * Therefore, we can skip the global lock. - */ - lockdep_assert_held(&cgroup_mutex); - *threadgroup_locked = pid || threadgroup; - cgroup_attach_lock(*threadgroup_locked); - +retry_find_task: rcu_read_lock(); if (pid) { tsk = find_task_by_vpid(pid); if (!tsk) { tsk = ERR_PTR(-ESRCH); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } } else { tsk = current; @@ -3010,15 +3026,42 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, */ if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { tsk = ERR_PTR(-EINVAL); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } - get_task_struct(tsk); - goto out_unlock_rcu; -out_unlock_threadgroup: - cgroup_attach_unlock(*threadgroup_locked); - *threadgroup_locked = false; + rcu_read_unlock(); + + /* + * If we migrate a single thread, we don't care about threadgroup + * stability. If the thread is `current`, it won't exit(2) under our + * hands or change PID through exec(2). We exclude + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write + * callers by cgroup_mutex. + * Therefore, we can skip the global lock. + */ + lockdep_assert_held(&cgroup_mutex); + *threadgroup_locked = pid || threadgroup; + + cgroup_attach_lock(*threadgroup_locked, tsk); + + if (threadgroup) { + if (!thread_group_leader(tsk)) { + /* + * a race with de_thread from another thread's exec() + * may strip us of our leadership, if this happens, + * there is no choice but to throw this task away and + * try again; this is + * "double-double-toil-and-trouble-check locking". 
+ */ + cgroup_attach_unlock(*threadgroup_locked, tsk); + put_task_struct(tsk); + goto retry_find_task; + } + } + + return tsk; + out_unlock_rcu: rcu_read_unlock(); return tsk; @@ -3032,7 +3075,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(threadgroup_locked); + cgroup_attach_unlock(threadgroup_locked, task); for_each_subsys(ss, ssid) if (ss->post_attach) @@ -3119,7 +3162,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. */ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - cgroup_attach_lock(has_tasks); + cgroup_attach_lock(has_tasks, NULL); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3140,7 +3183,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(has_tasks); + cgroup_attach_unlock(has_tasks, NULL); return ret; } diff --git a/kernel/fork.c b/kernel/fork.c index af673856499d..ae7826815c2c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sched_autogroup_fork(sig); +#ifdef CONFIG_CGROUPS + init_rwsem(&sig->cgroup_threadgroup_rwsem); +#endif + sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
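To make the locking table in the commit message above concrete, the three paths in the v3 scheme can be sketched as below. This is illustrative pseudo-C only, assuming favordynmods is enabled; g_rwsem is the global cgroup_threadgroup_rwsem, p_rwsem is the per thread group signal->cgroup_threadgroup_rwsem, and the cpus_read_lock() taken by cgroup_attach_lock() is omitted as in the table:

    /* 1. echo <pid> > cgroup.procs: migrate one thread group */
    cgroup_lock();                                       /* cgroup_mutex */
    down_write(&tsk->signal->cgroup_threadgroup_rwsem);  /* p_rwsem, write */
    /* ... move tsk's thread group to the destination cgroup ... */
    up_write(&tsk->signal->cgroup_threadgroup_rwsem);
    cgroup_unlock();

    /* 2. fork/exec/exit hot path (cgroup_threadgroup_change_begin/end) */
    percpu_down_read(&cgroup_threadgroup_rwsem);         /* g_rwsem, read */
    down_read(&tsk->signal->cgroup_threadgroup_rwsem);   /* p_rwsem, read */
    /* ... thread group membership changes ... */
    up_read(&tsk->signal->cgroup_threadgroup_rwsem);
    percpu_up_read(&cgroup_threadgroup_rwsem);

    /* 3. whole-hierarchy operations (cgroup_update_dfl_csses, v1 transfer) */
    cgroup_lock();
    percpu_down_write(&cgroup_threadgroup_rwsem);        /* g_rwsem, write */
    /* ... migrate every affected thread group ... */
    percpu_up_write(&cgroup_threadgroup_rwsem);
    cgroup_unlock();

Only path 1 and path 2 can now contend on the same p_rwsem, and only when they concern the same thread group; forks in unrelated thread groups read-lock a different p_rwsem and are unaffected by an ongoing cgroup.procs write.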
* Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-08 10:20 ` [PATCH v3 1/1] " Yi Tao @ 2025-09-08 16:05 ` Tejun Heo 2025-09-08 19:39 ` Waiman Long 1 sibling, 0 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-08 16:05 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel Hello, On Mon, Sep 08, 2025 at 06:20:27PM +0800, Yi Tao wrote: ... > The static usage pattern of creating a cgroup, enabling controllers, > and then seeding it with CLONE_INTO_CGROUP doesn't require write > locking cgroup_threadgroup_rwsem and thus doesn't benefit from this > patch. Please bring this to the top, note that this is the default mode of operation and the mechanism being introduced is thus an optional one. > @@ -88,7 +88,8 @@ enum { > /* > * Reduce latencies on dynamic cgroup modifications such as task > * migrations and controller on/offs by disabling percpu operation on > - * cgroup_threadgroup_rwsem. This makes hot path operations such as > + * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when > + * writing to cgroup.procs. This makes hot path operations such as > * forks and exits into the slow path and more expensive. This comment is pointed to from all other places. Please expand on why per-threadgroup rwsem is beneficial for what use cases. > * The static usage pattern of creating a cgroup, enabling controllers, > @@ -828,16 +829,21 @@ struct cgroup_of_peak { > struct list_head list; > }; > > +extern int take_per_threadgroup_rwsem; I think it needs cgroup in its name. Maybe something like cgroup_enable_per_threadgroup_rwsem? > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 312c6a8b55bb..8650ec394d0c 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root) > return root_cgrp->root; > } > > +int take_per_threadgroup_rwsem; Please put it where other global variables are and also note what it does and that writes protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem and thus reads are protected by either. > void cgroup_favor_dynmods(struct cgroup_root *root, bool favor) > { > bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS; > > - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */ > + /* > + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition. > + * favordynmods can flip while task is between > + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end, > + * so down_write global cgroup_threadgroup_rwsem to synchronize them. Maybe: take_per_threadgroup_rwsem must not be flipped while threads are between cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end. down_write global group_threadgroup_rwsem to exclude them. But does this actually work? It works for turning it on. I don't think it'd work for turning it off, right? Maybe make it enable only and trigger a warning message when people try to turn it off? > + */ > + percpu_down_write(&cgroup_threadgroup_rwsem); > if (favor && !favoring) { > + take_per_threadgroup_rwsem++; Given that favoring is gating the switch, this can be a bool, right? > rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); > root->flags |= CGRP_ROOT_FAVOR_DYNMODS; > } else if (!favor && favoring) { > + take_per_threadgroup_rwsem--; And here, you can trigger a warning that per_threadgroup opreation can't be disabled once enabled instead of actually turning it off. 
Another alternative would be using a task flag to track whether %current is holding per_threadgroup_rwsem and then using that to decide whether to unlock. Maybe that's cleaner but I don't think it really matters here. .. > + * When favordynmods is enabled, take per threadgroup rwsem to reduce latencies > + * on dynamic cgroup modifications. see the comment above > + * CGRP_ROOT_FAVOR_DYNMODS definition. This is more about scalability, right? Maybe just say overhead? > @@ -2976,24 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) > return ERR_PTR(-EINVAL); > > - /* > - * If we migrate a single thread, we don't care about threadgroup > - * stability. If the thread is `current`, it won't exit(2) under our > - * hands or change PID through exec(2). We exclude > - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > - * callers by cgroup_mutex. > - * Therefore, we can skip the global lock. > - */ > - lockdep_assert_held(&cgroup_mutex); > - *threadgroup_locked = pid || threadgroup; > - cgroup_attach_lock(*threadgroup_locked); > - > +retry_find_task: > rcu_read_lock(); > if (pid) { > tsk = find_task_by_vpid(pid); > if (!tsk) { > tsk = ERR_PTR(-ESRCH); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > } else { > tsk = current; > @@ -3010,15 +3026,42 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > */ > if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { > tsk = ERR_PTR(-EINVAL); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > - > get_task_struct(tsk); > - goto out_unlock_rcu; > > -out_unlock_threadgroup: > - cgroup_attach_unlock(*threadgroup_locked); > - *threadgroup_locked = false; > + rcu_read_unlock(); > + > + /* > + * If we migrate a single thread, we don't care about threadgroup > + * stability. If the thread is `current`, it won't exit(2) under our > + * hands or change PID through exec(2). We exclude > + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > + * callers by cgroup_mutex. > + * Therefore, we can skip the global lock. > + */ > + lockdep_assert_held(&cgroup_mutex); > + *threadgroup_locked = pid || threadgroup; > + > + cgroup_attach_lock(*threadgroup_locked, tsk); > + > + if (threadgroup) { > + if (!thread_group_leader(tsk)) { > + /* > + * a race with de_thread from another thread's exec() > + * may strip us of our leadership, if this happens, > + * there is no choice but to throw this task away and > + * try again; this is > + * "double-double-toil-and-trouble-check locking". > + */ > + cgroup_attach_unlock(*threadgroup_locked, tsk); > + put_task_struct(tsk); > + goto retry_find_task; This is subtle. Can you please separate this out to a spearate patch? Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
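The task-flag alternative mentioned in the last paragraph of the reply above could look roughly like this; PF_HELD_CGRP_THREADGROUP_RWSEM is an invented bit purely for illustration and nothing like it exists in the posted series:

    /* Sketch: remember at _begin() time which lock was actually taken, so
     * _end() releases the matching one even if the enable switch flips in
     * between. The flag bit is hypothetical. */
    static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
    {
            percpu_down_read(&cgroup_threadgroup_rwsem);
            if (cgroup_enable_per_threadgroup_rwsem) {
                    down_read(&tsk->signal->cgroup_threadgroup_rwsem);
                    tsk->flags |= PF_HELD_CGRP_THREADGROUP_RWSEM;
            }
    }

    static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
    {
            if (tsk->flags & PF_HELD_CGRP_THREADGROUP_RWSEM) {
                    tsk->flags &= ~PF_HELD_CGRP_THREADGROUP_RWSEM;
                    up_read(&tsk->signal->cgroup_threadgroup_rwsem);
            }
            percpu_up_read(&cgroup_threadgroup_rwsem);
    }

The series instead makes the switch one-way (enable only, with a warning on attempts to disable it), which avoids carrying per-task state.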
* Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-08 10:20 ` [PATCH v3 1/1] " Yi Tao 2025-09-08 16:05 ` Tejun Heo @ 2025-09-08 19:39 ` Waiman Long 1 sibling, 0 replies; 44+ messages in thread From: Waiman Long @ 2025-09-08 19:39 UTC (permalink / raw) To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel On 9/8/25 6:20 AM, Yi Tao wrote: > As computer hardware advances, modern systems are typically equipped > with many CPU cores and large amounts of memory, enabling the deployment > of numerous applications. On such systems, container creation and > deletion become frequent operations, making cgroup process migration no > longer a cold path. This leads to noticeable contention with common > process operations such as fork, exec, and exit. > > To alleviate the contention between cgroup process migration and > operations like process fork, this patch modifies lock to take the write > lock on signal_struct->group_rwsem when writing pid to > cgroup.procs/threads instead of holding a global write lock. > > Cgroup process migration has historically relied on > signal_struct->group_rwsem to protect thread group integrity. In commit > <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with > a global percpu_rwsem"), this was changed to a global > cgroup_threadgroup_rwsem. The advantage of using a global lock was > simplified handling of process group migrations. This patch retains the > use of the global lock for protecting process group migration, while > reducing contention by using per thread group lock during > cgroup.procs/threads writes. > > The locking behavior is as follows: > > write cgroup.procs/threads | process fork,exec,exit | process group migration > ------------------------------------------------------------------------------ > cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() > down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) > critical section | critical section | critical section > up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) > cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() > > g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes > signal_struct->group_rwsem. > > This patch eliminates contention between cgroup migration and fork > operations for threads that belong to different thread groups, thereby > reducing the long-tail latency of cgroup migrations and lowering system > load. > > With this patch, under heavy fork and exec interference, the long-tail > latency of cgroup migration has been reduced from milliseconds to > microseconds. Under heavy cgroup migration interference, the multi-CPU > score of the spawn test case in UnixBench increased by 9%. > > The static usage pattern of creating a cgroup, enabling controllers, > and then seeding it with CLONE_INTO_CGROUP doesn't require write > locking cgroup_threadgroup_rwsem and thus doesn't benefit from this > patch. > > To avoid affecting other users, the per threadgroup rwsem is only used > when the `favordynmods` flag is enabled. 
> > Signed-off-by: Yi Tao <escape@linux.alibaba.com> > --- > include/linux/cgroup-defs.h | 12 +++- > include/linux/sched/signal.h | 4 ++ > init/init_task.c | 3 + > kernel/cgroup/cgroup-internal.h | 4 +- > kernel/cgroup/cgroup-v1.c | 8 +-- > kernel/cgroup/cgroup.c | 105 ++++++++++++++++++++++---------- > kernel/fork.c | 4 ++ > 7 files changed, 101 insertions(+), 39 deletions(-) > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h > index 6b93a64115fe..5033e3bdac9e 100644 > --- a/include/linux/cgroup-defs.h > +++ b/include/linux/cgroup-defs.h > @@ -88,7 +88,8 @@ enum { > /* > * Reduce latencies on dynamic cgroup modifications such as task > * migrations and controller on/offs by disabling percpu operation on > - * cgroup_threadgroup_rwsem. This makes hot path operations such as > + * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when > + * writing to cgroup.procs. This makes hot path operations such as > * forks and exits into the slow path and more expensive. > * > * The static usage pattern of creating a cgroup, enabling controllers, > @@ -828,16 +829,21 @@ struct cgroup_of_peak { > struct list_head list; > }; > > +extern int take_per_threadgroup_rwsem; > + > /** > * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups > * @tsk: target task > * > * Allows cgroup operations to synchronize against threadgroup changes > - * using a percpu_rw_semaphore. > + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when > + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition. > */ > static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > { > percpu_down_read(&cgroup_threadgroup_rwsem); > + if (take_per_threadgroup_rwsem) > + down_read(&tsk->signal->cgroup_threadgroup_rwsem); > } > > /** > @@ -848,6 +854,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) > */ > static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) > { > + if (take_per_threadgroup_rwsem) > + up_read(&tsk->signal->cgroup_threadgroup_rwsem); > percpu_up_read(&cgroup_threadgroup_rwsem); > } > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index 1ef1edbaaf79..7d6449982822 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -226,6 +226,10 @@ struct signal_struct { > struct tty_audit_buf *tty_audit_buf; > #endif > > +#ifdef CONFIG_CGROUPS > + struct rw_semaphore cgroup_threadgroup_rwsem; > +#endif > + > /* > * Thread is the potential origin of an oom condition; kill first on > * oom > diff --git a/init/init_task.c b/init/init_task.c > index e557f622bd90..a55e2189206f 100644 > --- a/init/init_task.c > +++ b/init/init_task.c > @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { > }, > .multiprocess = HLIST_HEAD_INIT, > .rlim = INIT_RLIMITS, > +#ifdef CONFIG_CGROUPS > + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem), > +#endif > .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), > .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), > #ifdef CONFIG_POSIX_TIMERS > diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h > index b14e61c64a34..318cc7f22e2c 100644 > --- a/kernel/cgroup/cgroup-internal.h > +++ b/kernel/cgroup/cgroup-internal.h > @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, > > int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, > 
bool threadgroup); > -void cgroup_attach_lock(bool lock_threadgroup); > -void cgroup_attach_unlock(bool lock_threadgroup); > +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk); > +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk); > struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > bool *locked) > __acquires(&cgroup_threadgroup_rwsem); > diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c > index 2a4a387f867a..65e9b454780c 100644 > --- a/kernel/cgroup/cgroup-v1.c > +++ b/kernel/cgroup/cgroup-v1.c > @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > int retval = 0; > > cgroup_lock(); > - cgroup_attach_lock(true); > + cgroup_attach_lock(true, NULL); > for_each_root(root) { > struct cgroup *from_cgrp; > > @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) > if (retval) > break; > } > - cgroup_attach_unlock(true); > + cgroup_attach_unlock(true, NULL); > cgroup_unlock(); > > return retval; > @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) > > cgroup_lock(); > > - cgroup_attach_lock(true); > + cgroup_attach_lock(true, NULL); > > /* all tasks in @from are being moved, all csets are source */ > spin_lock_irq(&css_set_lock); > @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) > } while (task && !ret); > out_err: > cgroup_migrate_finish(&mgctx); > - cgroup_attach_unlock(true); > + cgroup_attach_unlock(true, NULL); > cgroup_unlock(); > return ret; > } > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 312c6a8b55bb..8650ec394d0c 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root) > return root_cgrp->root; > } > > +int take_per_threadgroup_rwsem; I would suggest adding the "__read_mostly" attribute to avoid the chance of false cacheline sharing. > + > void cgroup_favor_dynmods(struct cgroup_root *root, bool favor) > { > bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS; > > - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */ > + /* > + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition. > + * favordynmods can flip while task is between > + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end, > + * so down_write global cgroup_threadgroup_rwsem to synchronize them. > + */ > + percpu_down_write(&cgroup_threadgroup_rwsem); > if (favor && !favoring) { > + take_per_threadgroup_rwsem++; > rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); > root->flags |= CGRP_ROOT_FAVOR_DYNMODS; > } else if (!favor && favoring) { > + take_per_threadgroup_rwsem--; > rcu_sync_exit(&cgroup_threadgroup_rwsem.rss); > root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS; > } > + percpu_up_write(&cgroup_threadgroup_rwsem); > } > Changing take_per_threadgroup_rwsem inside the cgroup_threadgroup_rwsem critical section will ensure that the flag won't change in between cgroup_threadgroup_change_begin() and cgroup_threadgroup_change_end(). However, it may still change in between cgroup_attach_lock() and cgroup_attach_unlock(). Since cgroup_attach_[un]lock() has already taken a threadgroup_locked argument, we can extend it to a flag word that holds 2 flag bits - one for threadgroup_locked and another one for whether the threadgroup rwsem has been taken or not. So take_per_threadgroup_rwsem will only be checked in cgroup_attach_lock(). 
cgroup_attach_lock() can either return a value to indicate the state or the argument can be changed into a flag pointer with the new flag bit added internally. It should be a separate patch if you decide to do the extension. Cheers, Longman ^ permalink raw reply [flat|nested] 44+ messages in thread
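The flag-word extension suggested above could be sketched roughly as follows; the CGRP_ATTACH_* bits are invented for illustration only (the posted v4 ends up using an explicit lock-mode enum instead, see patch 3/3 below):

    /* Sketch: the enable flag is consulted only in cgroup_attach_lock(); the
     * decision is recorded in *flags so cgroup_attach_unlock() always undoes
     * exactly what was taken, even if the enable flag changes in between. */
    #define CGRP_ATTACH_THREADGROUP_LOCKED  (1U << 0)   /* hypothetical */
    #define CGRP_ATTACH_PER_THREADGROUP     (1U << 1)   /* hypothetical */

    void cgroup_attach_lock(unsigned int *flags, struct task_struct *tsk)
    {
            cpus_read_lock();
            if (!(*flags & CGRP_ATTACH_THREADGROUP_LOCKED))
                    return;
            if (tsk && cgroup_enable_per_threadgroup_rwsem) {
                    *flags |= CGRP_ATTACH_PER_THREADGROUP;
                    down_write(&tsk->signal->cgroup_threadgroup_rwsem);
            } else {
                    percpu_down_write(&cgroup_threadgroup_rwsem);
            }
    }

    void cgroup_attach_unlock(unsigned int flags, struct task_struct *tsk)
    {
            if (flags & CGRP_ATTACH_THREADGROUP_LOCKED) {
                    if (flags & CGRP_ATTACH_PER_THREADGROUP)
                            up_write(&tsk->signal->cgroup_threadgroup_rwsem);
                    else
                            percpu_up_write(&cgroup_threadgroup_rwsem);
            }
            cpus_read_unlock();
    }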
* [PATCH v4 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao ` (3 preceding siblings ...) 2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao @ 2025-09-09 7:55 ` Yi Tao 2025-09-09 7:55 ` [PATCH v4 1/3] " Yi Tao ` (2 more replies) 2025-09-10 6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao 5 siblings, 3 replies; 44+ messages in thread From: Yi Tao @ 2025-09-09 7:55 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Changes in v4: - Adjust commit log and comments. - Rename take_per_threadgroup_rwsem to cgroup_enable_per_threadgroup_rwsem and add attr __read_mostly. - Trigger a warning that per_threadgroup operation can't be disabled once enabled instead of actually turning it off. - Split the code for retrying when the threadgroup leader changes into a separate patch. - Refactor the cgroup_attach_lock code to make it clearer. Changes in v3: - Expand commit log and comments. - Put argument @tsk at end in cgroup_attach_lock/unlock. - down_write global cgroup_threadgroup_rwsem when flipping favordynmods to synchronize with tasks between cgroup_threadgroup_change_begin and end. - Rename group_rwsem to cgroup_threadgroup_rwsem. - Fix bug causing abnormal cgroup migration due to threadgroup leader changes. Changes in v2: - Use favordynmods as the enabling switch. - Determine whether to use the per-thread-group rwsem based on whether the task is NULL. - Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside rcu_read_lock. Yi Tao (3): cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs cgroup: retry find task if threadgroup leader changed cgroup: refactor the cgroup_attach_lock code to make it clearer include/linux/cgroup-defs.h | 25 +++++- include/linux/sched/signal.h | 4 + init/init_task.c | 3 + kernel/cgroup/cgroup-internal.h | 8 +- kernel/cgroup/cgroup-v1.c | 14 +-- kernel/cgroup/cgroup.c | 149 ++++++++++++++++++++++++-------- kernel/fork.c | 4 + 7 files changed, 161 insertions(+), 46 deletions(-) -- 2.32.0.3.g01195cf9f ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v4 1/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-09 7:55 ` [PATCH v4 0/3] " Yi Tao @ 2025-09-09 7:55 ` Yi Tao 2025-09-09 7:55 ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao 2025-09-09 7:55 ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao 2 siblings, 0 replies; 44+ messages in thread From: Yi Tao @ 2025-09-09 7:55 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the favordynmods is enabled. As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. 
Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 14 ++++- include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 ++ kernel/cgroup/cgroup-internal.h | 4 +- kernel/cgroup/cgroup-v1.c | 8 +-- kernel/cgroup/cgroup.c | 96 ++++++++++++++++++++++----------- kernel/fork.c | 4 ++ 7 files changed, 95 insertions(+), 38 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 6b93a64115fe..552050bd2c02 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -91,6 +91,12 @@ enum { * cgroup_threadgroup_rwsem. This makes hot path operations such as * forks and exits into the slow path and more expensive. * + * Alleviate the contention between fork, exec, exit operations and + * writing to cgroup.procs by taking a per threadgroup rwsem instead of + * the global cgroup_threadgroup_rwsem. Fork and other operations + * from threads in different thread groups no longer contend with + * writing to cgroup.procs. + * * The static usage pattern of creating a cgroup, enabling controllers, * and then seeding it with CLONE_INTO_CGROUP doesn't require write * locking cgroup_threadgroup_rwsem and thus doesn't benefit from @@ -822,6 +828,7 @@ struct cgroup_subsys { }; extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem; +extern bool cgroup_enable_per_threadgroup_rwsem; struct cgroup_of_peak { unsigned long value; @@ -833,11 +840,14 @@ struct cgroup_of_peak { * @tsk: target task * * Allows cgroup operations to synchronize against threadgroup changes - * using a percpu_rw_semaphore. + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition. */ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) { percpu_down_read(&cgroup_threadgroup_rwsem); + if (cgroup_enable_per_threadgroup_rwsem) + down_read(&tsk->signal->cgroup_threadgroup_rwsem); } /** @@ -848,6 +858,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) */ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) { + if (cgroup_enable_per_threadgroup_rwsem) + up_read(&tsk->signal->cgroup_threadgroup_rwsem); percpu_up_read(&cgroup_threadgroup_rwsem); } diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 1ef1edbaaf79..7d6449982822 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -226,6 +226,10 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif +#ifdef CONFIG_CGROUPS + struct rw_semaphore cgroup_threadgroup_rwsem; +#endif + /* * Thread is the potential origin of an oom condition; kill first on * oom diff --git a/init/init_task.c b/init/init_task.c index e557f622bd90..a55e2189206f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { }, .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, +#ifdef CONFIG_CGROUPS + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem), +#endif .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), #ifdef CONFIG_POSIX_TIMERS diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index b14e61c64a34..318cc7f22e2c 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int 
cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(bool lock_threadgroup); -void cgroup_attach_unlock(bool lock_threadgroup); +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk); +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, bool *locked) __acquires(&cgroup_threadgroup_rwsem); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 2a4a387f867a..65e9b454780c 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(true, NULL); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(true); + cgroup_attach_unlock(true, NULL); cgroup_unlock(); return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(true, NULL); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(true); + cgroup_attach_unlock(true, NULL); cgroup_unlock(); return ret; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 312c6a8b55bb..b53ae8fd9681 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -216,6 +216,14 @@ static u16 have_canfork_callback __read_mostly; static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); +/* + * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem, + * read protected by either. + * + * Can only be turned on, but not turned off. + */ +bool cgroup_enable_per_threadgroup_rwsem __read_mostly; + /* cgroup namespace for init task */ struct cgroup_namespace init_cgroup_ns = { .ns.count = REFCOUNT_INIT(2), @@ -1302,14 +1310,24 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor) { bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS; - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */ + /* + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition. + * favordynmods can flip while task is between + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end, + * so down_write global cgroup_threadgroup_rwsem to synchronize them. 
+ */ + percpu_down_write(&cgroup_threadgroup_rwsem); if (favor && !favoring) { + cgroup_enable_per_threadgroup_rwsem = true; rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); root->flags |= CGRP_ROOT_FAVOR_DYNMODS; } else if (!favor && favoring) { + if (cgroup_enable_per_threadgroup_rwsem) + WARN_ONCE(1, "cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n"); rcu_sync_exit(&cgroup_threadgroup_rwsem.rss); root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS; } + percpu_up_write(&cgroup_threadgroup_rwsem); } static int cgroup_init_root_id(struct cgroup_root *root) @@ -2459,7 +2477,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); /** * cgroup_attach_lock - Lock for ->attach() - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem + * @lock_threadgroup: whether to down_write rwsem + * @tsk: thread group to lock * * cgroup migration sometimes needs to stabilize threadgroups against forks and * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() @@ -2479,22 +2498,37 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * Resolve the situation by always acquiring cpus_read_lock() before optionally * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. + * + * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead + * on dynamic cgroup modifications. see the comment above + * CGRP_ROOT_FAVOR_DYNMODS definition. + * + * tsk is not NULL only when writing to cgroup.procs. */ -void cgroup_attach_lock(bool lock_threadgroup) +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk) { cpus_read_lock(); - if (lock_threadgroup) - percpu_down_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (tsk && cgroup_enable_per_threadgroup_rwsem) + down_write(&tsk->signal->cgroup_threadgroup_rwsem); + else + percpu_down_write(&cgroup_threadgroup_rwsem); + } } /** * cgroup_attach_unlock - Undo cgroup_attach_lock() - * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem + * @lock_threadgroup: whether to up_write rwsem + * @tsk: thread group to lock */ -void cgroup_attach_unlock(bool lock_threadgroup) +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk) { - if (lock_threadgroup) - percpu_up_write(&cgroup_threadgroup_rwsem); + if (lock_threadgroup) { + if (tsk && cgroup_enable_per_threadgroup_rwsem) + up_write(&tsk->signal->cgroup_threadgroup_rwsem); + else + percpu_up_write(&cgroup_threadgroup_rwsem); + } cpus_read_unlock(); } @@ -2976,24 +3010,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) return ERR_PTR(-EINVAL); - /* - * If we migrate a single thread, we don't care about threadgroup - * stability. If the thread is `current`, it won't exit(2) under our - * hands or change PID through exec(2). We exclude - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write - * callers by cgroup_mutex. - * Therefore, we can skip the global lock. 
- */ - lockdep_assert_held(&cgroup_mutex); - *threadgroup_locked = pid || threadgroup; - cgroup_attach_lock(*threadgroup_locked); - rcu_read_lock(); if (pid) { tsk = find_task_by_vpid(pid); if (!tsk) { tsk = ERR_PTR(-ESRCH); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } } else { tsk = current; @@ -3010,15 +3032,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, */ if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { tsk = ERR_PTR(-EINVAL); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } - get_task_struct(tsk); - goto out_unlock_rcu; -out_unlock_threadgroup: - cgroup_attach_unlock(*threadgroup_locked); - *threadgroup_locked = false; + rcu_read_unlock(); + + /* + * If we migrate a single thread, we don't care about threadgroup + * stability. If the thread is `current`, it won't exit(2) under our + * hands or change PID through exec(2). We exclude + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write + * callers by cgroup_mutex. + * Therefore, we can skip the global lock. + */ + lockdep_assert_held(&cgroup_mutex); + *threadgroup_locked = pid || threadgroup; + + cgroup_attach_lock(*threadgroup_locked, tsk); + + return tsk; + out_unlock_rcu: rcu_read_unlock(); return tsk; @@ -3032,7 +3066,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(threadgroup_locked); + cgroup_attach_unlock(threadgroup_locked, task); for_each_subsys(ss, ssid) if (ss->post_attach) @@ -3119,7 +3153,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. */ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - cgroup_attach_lock(has_tasks); + cgroup_attach_lock(has_tasks, NULL); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3140,7 +3174,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(has_tasks); + cgroup_attach_unlock(has_tasks, NULL); return ret; } diff --git a/kernel/fork.c b/kernel/fork.c index af673856499d..ae7826815c2c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sched_autogroup_fork(sig); +#ifdef CONFIG_CGROUPS + init_rwsem(&sig->cgroup_threadgroup_rwsem); +#endif + sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
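As a side note on the CLONE_INTO_CGROUP pattern the commit message above contrasts against: on cgroup v2 a process can be started directly in its target cgroup and never go through cgroup.procs at all. A minimal userspace sketch, with error handling omitted and the cgroup path purely an example:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <linux/sched.h>        /* struct clone_args, CLONE_INTO_CGROUP */
    #include <linux/types.h>

    /* Returns 0 in the child, the child's PID in the parent, -1 on error. */
    static pid_t spawn_into_cgroup(const char *cgrp_dir)
    {
            int cgfd = open(cgrp_dir, O_DIRECTORY | O_CLOEXEC);
            struct clone_args args;

            memset(&args, 0, sizeof(args));
            args.flags       = CLONE_INTO_CGROUP;
            args.exit_signal = SIGCHLD;
            args.cgroup      = (__u64)cgfd;  /* fd of the target cgroup dir */

            return syscall(SYS_clone3, &args, sizeof(args));
    }

    /* e.g. spawn_into_cgroup("/sys/fs/cgroup/app"); */

This is the static usage pattern the commit message refers to: it does not write-lock cgroup_threadgroup_rwsem at all, so it neither suffers from the old contention nor benefits from the per thread group rwsem.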
* [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed 2025-09-09 7:55 ` [PATCH v4 0/3] " Yi Tao 2025-09-09 7:55 ` [PATCH v4 1/3] " Yi Tao @ 2025-09-09 7:55 ` Yi Tao 2025-09-09 16:59 ` Tejun Heo 2025-09-09 7:55 ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao 2 siblings, 1 reply; 44+ messages in thread From: Yi Tao @ 2025-09-09 7:55 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Between obtaining the threadgroup leader via PID and acquiring the cgroup attach lock, the threadgroup leader may change, which could lead to incorrect cgroup migration. Therefore, after acquiring the cgroup attach lock, we check whether the threadgroup leader has changed, and if so, retry the operation. Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- kernel/cgroup/cgroup.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index b53ae8fd9681..6e90b79e8fa3 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3010,6 +3010,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) return ERR_PTR(-EINVAL); +retry_find_task: rcu_read_lock(); if (pid) { tsk = find_task_by_vpid(pid); @@ -3051,6 +3052,21 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, cgroup_attach_lock(*threadgroup_locked, tsk); + if (threadgroup) { + if (!thread_group_leader(tsk)) { + /* + * a race with de_thread from another thread's exec() + * may strip us of our leadership, if this happens, + * there is no choice but to throw this task away and + * try again; this is + * "double-double-toil-and-trouble-check locking". + */ + cgroup_attach_unlock(*threadgroup_locked, tsk); + put_task_struct(tsk); + goto retry_find_task; + } + } + return tsk; out_unlock_rcu: -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed 2025-09-09 7:55 ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao @ 2025-09-09 16:59 ` Tejun Heo 0 siblings, 0 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-09 16:59 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel On Tue, Sep 09, 2025 at 03:55:29PM +0800, Yi Tao wrote: > Between obtaining the threadgroup leader via PID and acquiring the > cgroup attach lock, the threadgroup leader may change, which could lead > to incorrect cgroup migration. Therefore, after acquiring the cgroup > attach lock, we check whether the threadgroup leader has changed, and if > so, retry the operation. > > Signed-off-by: Yi Tao <escape@linux.alibaba.com> This and thread group lock relocation should come before the patch which requires it, not after. Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer 2025-09-09 7:55 ` [PATCH v4 0/3] " Yi Tao 2025-09-09 7:55 ` [PATCH v4 1/3] " Yi Tao 2025-09-09 7:55 ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao @ 2025-09-09 7:55 ` Yi Tao 2025-09-09 17:00 ` Tejun Heo 2 siblings, 1 reply; 44+ messages in thread From: Yi Tao @ 2025-09-09 7:55 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Dynamic cgroup migration involving threadgroup locks can be in one of three states: no lock held, holding the global lock, or holding a per threadgroup lock. Explicitly declaring the different lock modes to make the code easier to understand. Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 11 ++++ kernel/cgroup/cgroup-internal.h | 8 +-- kernel/cgroup/cgroup-v1.c | 14 ++--- kernel/cgroup/cgroup.c | 91 ++++++++++++++++++++++----------- 4 files changed, 83 insertions(+), 41 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 552050bd2c02..d764eba041c6 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -146,6 +146,17 @@ enum { __CFTYPE_ADDED = (1 << 18), }; +enum { + /* Default */ + CGRP_ATTACH_LOCK_GLOBAL, + + /* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */ + CGRP_ATTACH_LOCK_NONE, + + /* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */ + CGRP_ATTACH_LOCK_PER_THREADGROUP, +}; + /* * cgroup_file is the handle for a file instance created in a cgroup which * is used, for example, to generate file changed notifications. This can diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index 318cc7f22e2c..9de8ab47a335 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,12 +249,12 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk); -void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk); +void cgroup_attach_lock(int lock_mode, struct task_struct *tsk); +void cgroup_attach_unlock(int lock_mode, struct task_struct *tsk); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, - bool *locked) + int *lock_mode) __acquires(&cgroup_threadgroup_rwsem); -void cgroup_procs_write_finish(struct task_struct *task, bool locked) +void cgroup_procs_write_finish(struct task_struct *task, int locke_mode) __releases(&cgroup_threadgroup_rwsem); void cgroup_lock_and_drain_offline(struct cgroup *cgrp); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 65e9b454780c..73d8dafe219f 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(true, NULL); + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(true, NULL); + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); cgroup_unlock(); return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - cgroup_attach_lock(true, NULL); + 
cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(true, NULL); + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); cgroup_unlock(); return ret; } @@ -502,13 +502,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, struct task_struct *task; const struct cred *cred, *tcred; ssize_t ret; - bool locked; + int lock_mode; cgrp = cgroup_kn_lock_live(of->kn, false); if (!cgrp) return -ENODEV; - task = cgroup_procs_write_start(buf, threadgroup, &locked); + task = cgroup_procs_write_start(buf, threadgroup, &lock_mode); ret = PTR_ERR_OR_ZERO(task); if (ret) goto out_unlock; @@ -531,7 +531,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, ret = cgroup_attach_task(cgrp, task, threadgroup); out_finish: - cgroup_procs_write_finish(task, locked); + cgroup_procs_write_finish(task, lock_mode); out_unlock: cgroup_kn_unlock(of->kn); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6e90b79e8fa3..a568629f7ed0 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2477,7 +2477,7 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); /** * cgroup_attach_lock - Lock for ->attach() - * @lock_threadgroup: whether to down_write rwsem + * @lock_mode: whether acquire and acquire which rwsem * @tsk: thread group to lock * * cgroup migration sometimes needs to stabilize threadgroups against forks and @@ -2499,35 +2499,46 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. * - * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead - * on dynamic cgroup modifications. see the comment above - * CGRP_ROOT_FAVOR_DYNMODS definition. - * * tsk is not NULL only when writing to cgroup.procs. 
*/ -void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk) +void cgroup_attach_lock(int lock_mode, struct task_struct *tsk) { cpus_read_lock(); - if (lock_threadgroup) { - if (tsk && cgroup_enable_per_threadgroup_rwsem) - down_write(&tsk->signal->cgroup_threadgroup_rwsem); - else - percpu_down_write(&cgroup_threadgroup_rwsem); + + switch (lock_mode) { + case CGRP_ATTACH_LOCK_NONE: + break; + case CGRP_ATTACH_LOCK_GLOBAL: + percpu_down_write(&cgroup_threadgroup_rwsem); + break; + case CGRP_ATTACH_LOCK_PER_THREADGROUP: + down_write(&tsk->signal->cgroup_threadgroup_rwsem); + break; + default: + pr_warn("cgroup: Unexpected attach lock mode."); + break; } } /** * cgroup_attach_unlock - Undo cgroup_attach_lock() - * @lock_threadgroup: whether to up_write rwsem + * @lock_mode: whether release and release which rwsem * @tsk: thread group to lock */ -void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk) -{ - if (lock_threadgroup) { - if (tsk && cgroup_enable_per_threadgroup_rwsem) - up_write(&tsk->signal->cgroup_threadgroup_rwsem); - else - percpu_up_write(&cgroup_threadgroup_rwsem); +void cgroup_attach_unlock(int lock_mode, struct task_struct *tsk) +{ + switch (lock_mode) { + case CGRP_ATTACH_LOCK_NONE: + break; + case CGRP_ATTACH_LOCK_GLOBAL: + percpu_up_write(&cgroup_threadgroup_rwsem); + break; + case CGRP_ATTACH_LOCK_PER_THREADGROUP: + up_write(&tsk->signal->cgroup_threadgroup_rwsem); + break; + default: + pr_warn("cgroup: Unexpected attach lock mode."); + break; } cpus_read_unlock(); } @@ -3002,7 +3013,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, } struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, - bool *threadgroup_locked) + int *lock_mode) { struct task_struct *tsk; pid_t pid; @@ -3046,11 +3057,24 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write * callers by cgroup_mutex. * Therefore, we can skip the global lock. + * + * When favordynmods is enabled, take per threadgroup rwsem to reduce + * overhead on dynamic cgroup modifications. see the comment above + * CGRP_ROOT_FAVOR_DYNMODS definition. */ lockdep_assert_held(&cgroup_mutex); - *threadgroup_locked = pid || threadgroup; - cgroup_attach_lock(*threadgroup_locked, tsk); + if (pid || threadgroup) { + if (cgroup_enable_per_threadgroup_rwsem) + *lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP; + else + *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; + + } else { + *lock_mode = CGRP_ATTACH_LOCK_NONE; + } + + cgroup_attach_lock(*lock_mode, tsk); if (threadgroup) { if (!thread_group_leader(tsk)) { @@ -3061,7 +3085,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, * try again; this is * "double-double-toil-and-trouble-check locking". 
*/ - cgroup_attach_unlock(*threadgroup_locked, tsk); + cgroup_attach_unlock(*lock_mode, tsk); put_task_struct(tsk); goto retry_find_task; } @@ -3074,7 +3098,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, return tsk; } -void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked) +void cgroup_procs_write_finish(struct task_struct *task, int lock_mode) { struct cgroup_subsys *ss; int ssid; @@ -3082,7 +3106,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(threadgroup_locked, task); + cgroup_attach_unlock(lock_mode, task); for_each_subsys(ss, ssid) if (ss->post_attach) @@ -3139,6 +3163,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) struct cgroup *dsct; struct css_set *src_cset; bool has_tasks; + int lock_mode; int ret; lockdep_assert_held(&cgroup_mutex); @@ -3169,7 +3194,13 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. */ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - cgroup_attach_lock(has_tasks, NULL); + + if (has_tasks) + lock_mode = CGRP_ATTACH_LOCK_GLOBAL; + else + lock_mode = CGRP_ATTACH_LOCK_NONE; + + cgroup_attach_lock(lock_mode, NULL); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3190,7 +3221,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(has_tasks, NULL); + cgroup_attach_unlock(lock_mode, NULL); return ret; } @@ -5291,13 +5322,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, struct task_struct *task; const struct cred *saved_cred; ssize_t ret; - bool threadgroup_locked; + int lock_mode; dst_cgrp = cgroup_kn_lock_live(of->kn, false); if (!dst_cgrp) return -ENODEV; - task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked); + task = cgroup_procs_write_start(buf, threadgroup, &lock_mode); ret = PTR_ERR_OR_ZERO(task); if (ret) goto out_unlock; @@ -5323,7 +5354,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, ret = cgroup_attach_task(dst_cgrp, task, threadgroup); out_finish: - cgroup_procs_write_finish(task, threadgroup_locked); + cgroup_procs_write_finish(task, lock_mode); out_unlock: cgroup_kn_unlock(of->kn); -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
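The refactor above boils down to replacing an implicit boolean-plus-task encoding with an explicit lock mode dispatched through a switch. As a rough, self-contained user-space sketch of the same shape (all names here, such as attach_lock_mode and struct group, are invented for illustration and are not the kernel interfaces):

#include <pthread.h>
#include <stdio.h>

enum attach_lock_mode {
	ATTACH_LOCK_GLOBAL,		/* writer excludes every group */
	ATTACH_LOCK_NONE,		/* no threadgroup stability needed */
	ATTACH_LOCK_PER_GROUP,		/* writer excludes one group only */
};

struct group {
	pthread_rwlock_t rwsem;
};

static pthread_rwlock_t global_rwsem = PTHREAD_RWLOCK_INITIALIZER;

static void attach_lock(enum attach_lock_mode mode, struct group *grp)
{
	switch (mode) {
	case ATTACH_LOCK_NONE:
		break;
	case ATTACH_LOCK_GLOBAL:
		pthread_rwlock_wrlock(&global_rwsem);
		break;
	case ATTACH_LOCK_PER_GROUP:
		pthread_rwlock_wrlock(&grp->rwsem);
		break;
	}
}

static void attach_unlock(enum attach_lock_mode mode, struct group *grp)
{
	switch (mode) {
	case ATTACH_LOCK_NONE:
		break;
	case ATTACH_LOCK_GLOBAL:
		pthread_rwlock_unlock(&global_rwsem);
		break;
	case ATTACH_LOCK_PER_GROUP:
		pthread_rwlock_unlock(&grp->rwsem);
		break;
	}
}

int main(void)
{
	struct group g;

	pthread_rwlock_init(&g.rwsem, NULL);
	attach_lock(ATTACH_LOCK_PER_GROUP, &g);
	printf("holding the per-group lock only\n");
	attach_unlock(ATTACH_LOCK_PER_GROUP, &g);
	return 0;
}

Adding a new mode later only means adding an enum value and a case, which is the extensibility argument made in the commit message.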
* Re: [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer 2025-09-09 7:55 ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao @ 2025-09-09 17:00 ` Tejun Heo 0 siblings, 0 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-09 17:00 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel On Tue, Sep 09, 2025 at 03:55:30PM +0800, Yi Tao wrote: > +enum { Please give it a type name. It makes difference when it gets exported through BTF for debugging and tracing. > + /* Default */ > + CGRP_ATTACH_LOCK_GLOBAL, > + > + /* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */ > + CGRP_ATTACH_LOCK_NONE, > + > + /* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */ > + CGRP_ATTACH_LOCK_PER_THREADGROUP, > +}; And, I'd put this before the actual patch too. Please make it a prep patch. Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
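For context on the BTF point: an anonymous enum is typically emitted into BTF and DWARF without a type name, so tools such as drgn or bpftool cannot look it up by name, while a named enum shows up as a distinct type. In a C-format BTF dump (for example via bpftool btf dump file /sys/kernel/btf/vmlinux format c, exact output shape depends on kernel and tool versions) the difference looks roughly like this:

/* anonymous: present, but nothing to grep or reference by name */
enum {
	CGRP_ATTACH_LOCK_GLOBAL = 0,
	CGRP_ATTACH_LOCK_NONE = 1,
	CGRP_ATTACH_LOCK_PER_THREADGROUP = 2,
};

/* named: a first-class type usable from debugging and tracing tools */
enum cgroup_attach_lock_mode {
	CGRP_ATTACH_LOCK_GLOBAL = 0,
	CGRP_ATTACH_LOCK_NONE = 1,
	CGRP_ATTACH_LOCK_PER_THREADGROUP = 2,
};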
* [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao ` (4 preceding siblings ...) 2025-09-09 7:55 ` [PATCH v4 0/3] " Yi Tao @ 2025-09-10 6:59 ` Yi Tao 2025-09-10 6:59 ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao ` (2 more replies) 5 siblings, 3 replies; 44+ messages in thread From: Yi Tao @ 2025-09-10 6:59 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Changes in v5: - Adjust the order of patches. - Add the type name for enum used by cgroup_attach_lock. Changes in v4: - Adjust commit log and comments. - Rename take_per_threadgroup_rwsem to cgroup_enable_per_threadgroup_rwsem and add attr __read_mostly. - Trigger a warning that per_threadgroup operation can't be disabled once enabled instead of actually turning it off. - Split the code for retrying when the threadgroup leader changes into a separate patch. - Refactor the cgroup_attach_lock code to make it clearer. Changes in v3: - Expand commit log and comments. - Put argument @tsk at end in cgroup_attach_lock/unlock. - down_write global cgroup_thread_rwsem when flipping favordynmods to synchronize with tasks between cgroup_threadgroup_change_begin and end. - Rename group_rwsem to cgroup_threadgroup_rwsem. - Fix bug causing abnormal cgroup migration due to threadgroup leader changes. Changes in v2: - Use favordynmods as the enabling switch. - Determine whether to use the per-thread-group rwsem based on whether the task is NULL. - Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside rcu_read_lock. Yi Tao (3): cgroup: refactor the cgroup_attach_lock code to make it clearer cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs include/linux/cgroup-defs.h | 25 +++++- include/linux/sched/signal.h | 4 + init/init_task.c | 3 + kernel/cgroup/cgroup-internal.h | 11 ++- kernel/cgroup/cgroup-v1.c | 14 +-- kernel/cgroup/cgroup.c | 153 ++++++++++++++++++++++++-------- kernel/fork.c | 4 + 7 files changed, 167 insertions(+), 47 deletions(-) -- 2.32.0.3.g01195cf9f ^ permalink raw reply [flat|nested] 44+ messages in thread
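The last v2 item ("system hang caused by acquiring cgroup_threadgroup_rwsem inside rcu_read_lock") reflects a standard ordering rule: a task looked up under rcu_read_lock() has to be pinned with get_task_struct() before leaving the RCU section, and a sleeping lock such as an rwsem may only be taken after rcu_read_unlock(). A condensed sketch of the ordering the series ends up with (kernel-style fragment, not compilable on its own; error handling omitted):

	rcu_read_lock();
	tsk = find_task_by_vpid(pid);	/* may be NULL in the real code */
	get_task_struct(tsk);		/* pin the task before leaving RCU */
	rcu_read_unlock();

	/* safe to block only from this point on */
	down_write(&tsk->signal->cgroup_threadgroup_rwsem);
	/* migrate the thread group */
	up_write(&tsk->signal->cgroup_threadgroup_rwsem);
	put_task_struct(tsk);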
* [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer 2025-09-10 6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao @ 2025-09-10 6:59 ` Yi Tao 2025-09-10 17:26 ` Tejun Heo 2025-09-10 6:59 ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao 2025-09-10 6:59 ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao 2 siblings, 1 reply; 44+ messages in thread From: Yi Tao @ 2025-09-10 6:59 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Dynamic cgroup migration involving threadgroup locks can be in one of two states: no lock held, or holding the global lock. Explicitly declaring the different lock modes to make the code easier to understand and facilitates future extensions of the lock modes. Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 8 ++++ kernel/cgroup/cgroup-internal.h | 9 +++-- kernel/cgroup/cgroup-v1.c | 14 +++---- kernel/cgroup/cgroup.c | 67 ++++++++++++++++++++++++--------- 4 files changed, 69 insertions(+), 29 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 6b93a64115fe..213b0d01464f 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -140,6 +140,14 @@ enum { __CFTYPE_ADDED = (1 << 18), }; +enum cgroup_attach_lock_mode { + /* Default */ + CGRP_ATTACH_LOCK_GLOBAL, + + /* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */ + CGRP_ATTACH_LOCK_NONE, +}; + /* * cgroup_file is the handle for a file instance created in a cgroup which * is used, for example, to generate file changed notifications. This can diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index b14e61c64a34..a6d6f30b6f65 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,12 +249,13 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(bool lock_threadgroup); -void cgroup_attach_unlock(bool lock_threadgroup); +void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode); +void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, - bool *locked) + enum cgroup_attach_lock_mode *lock_mode) __acquires(&cgroup_threadgroup_rwsem); -void cgroup_procs_write_finish(struct task_struct *task, bool locked) +void cgroup_procs_write_finish(struct task_struct *task, + enum cgroup_attach_lock_mode lock_mode) __releases(&cgroup_threadgroup_rwsem); void cgroup_lock_and_drain_offline(struct cgroup *cgrp); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 2a4a387f867a..f3838888ea6f 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(true); + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL); cgroup_unlock(); return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct 
cgroup *from) cgroup_lock(); - cgroup_attach_lock(true); + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(true); + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL); cgroup_unlock(); return ret; } @@ -502,13 +502,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, struct task_struct *task; const struct cred *cred, *tcred; ssize_t ret; - bool locked; + enum cgroup_attach_lock_mode lock_mode; cgrp = cgroup_kn_lock_live(of->kn, false); if (!cgrp) return -ENODEV; - task = cgroup_procs_write_start(buf, threadgroup, &locked); + task = cgroup_procs_write_start(buf, threadgroup, &lock_mode); ret = PTR_ERR_OR_ZERO(task); if (ret) goto out_unlock; @@ -531,7 +531,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of, ret = cgroup_attach_task(cgrp, task, threadgroup); out_finish: - cgroup_procs_write_finish(task, locked); + cgroup_procs_write_finish(task, lock_mode); out_unlock: cgroup_kn_unlock(of->kn); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 312c6a8b55bb..2b88c7abaa00 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2459,7 +2459,7 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); /** * cgroup_attach_lock - Lock for ->attach() - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem + * @lock_mode: whether to down_write cgroup_threadgroup_rwsem * * cgroup migration sometimes needs to stabilize threadgroups against forks and * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() @@ -2480,21 +2480,39 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. */ -void cgroup_attach_lock(bool lock_threadgroup) +void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode) { cpus_read_lock(); - if (lock_threadgroup) + + switch (lock_mode) { + case CGRP_ATTACH_LOCK_NONE: + break; + case CGRP_ATTACH_LOCK_GLOBAL: percpu_down_write(&cgroup_threadgroup_rwsem); + break; + default: + pr_warn("cgroup: Unexpected attach lock mode."); + break; + } } /** * cgroup_attach_unlock - Undo cgroup_attach_lock() - * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem + * @lock_mode: whether to up_write cgroup_threadgroup_rwsem */ -void cgroup_attach_unlock(bool lock_threadgroup) +void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode) { - if (lock_threadgroup) + switch (lock_mode) { + case CGRP_ATTACH_LOCK_NONE: + break; + case CGRP_ATTACH_LOCK_GLOBAL: percpu_up_write(&cgroup_threadgroup_rwsem); + break; + default: + pr_warn("cgroup: Unexpected attach lock mode."); + break; + } + cpus_read_unlock(); } @@ -2968,7 +2986,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, } struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, - bool *threadgroup_locked) + enum cgroup_attach_lock_mode *lock_mode) { struct task_struct *tsk; pid_t pid; @@ -2985,8 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, * Therefore, we can skip the global lock. 
*/ lockdep_assert_held(&cgroup_mutex); - *threadgroup_locked = pid || threadgroup; - cgroup_attach_lock(*threadgroup_locked); + + if (pid || threadgroup) + *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; + else + *lock_mode = CGRP_ATTACH_LOCK_NONE; + + cgroup_attach_lock(*lock_mode); rcu_read_lock(); if (pid) { @@ -3017,14 +3040,15 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, goto out_unlock_rcu; out_unlock_threadgroup: - cgroup_attach_unlock(*threadgroup_locked); - *threadgroup_locked = false; + cgroup_attach_unlock(*lock_mode); + *lock_mode = CGRP_ATTACH_LOCK_NONE; out_unlock_rcu: rcu_read_unlock(); return tsk; } -void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked) +void cgroup_procs_write_finish(struct task_struct *task, + enum cgroup_attach_lock_mode lock_mode) { struct cgroup_subsys *ss; int ssid; @@ -3032,7 +3056,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(threadgroup_locked); + cgroup_attach_unlock(lock_mode); for_each_subsys(ss, ssid) if (ss->post_attach) @@ -3088,6 +3112,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) struct cgroup_subsys_state *d_css; struct cgroup *dsct; struct css_set *src_cset; + enum cgroup_attach_lock_mode lock_mode; bool has_tasks; int ret; @@ -3119,7 +3144,13 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. */ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - cgroup_attach_lock(has_tasks); + + if (has_tasks) + lock_mode = CGRP_ATTACH_LOCK_GLOBAL; + else + lock_mode = CGRP_ATTACH_LOCK_NONE; + + cgroup_attach_lock(lock_mode); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3140,7 +3171,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(has_tasks); + cgroup_attach_unlock(lock_mode); return ret; } @@ -5241,13 +5272,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, struct task_struct *task; const struct cred *saved_cred; ssize_t ret; - bool threadgroup_locked; + enum cgroup_attach_lock_mode lock_mode; dst_cgrp = cgroup_kn_lock_live(of->kn, false); if (!dst_cgrp) return -ENODEV; - task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked); + task = cgroup_procs_write_start(buf, threadgroup, &lock_mode); ret = PTR_ERR_OR_ZERO(task); if (ret) goto out_unlock; @@ -5273,7 +5304,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, ret = cgroup_attach_task(dst_cgrp, task, threadgroup); out_finish: - cgroup_procs_write_finish(task, threadgroup_locked); + cgroup_procs_write_finish(task, lock_mode); out_unlock: cgroup_kn_unlock(of->kn); -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
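CGRP_ATTACH_LOCK_NONE covers the pid == 0 && threadgroup == false case, i.e. a thread writing "0" to cgroup.threads to move only itself; since the writer cannot exec or exit out from under its own write, no threadgroup rwsem is needed at all. A rough user-space illustration of that exact operation (the path below is an assumption, and the destination must already be a threaded cgroup in the writer's domain):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* assumed path; adjust to an existing threaded cgroup */
	const char *path = "/sys/fs/cgroup/app/worker/cgroup.threads";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* writing "0" moves the writing thread itself */
	if (write(fd, "0", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}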
* Re: [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer 2025-09-10 6:59 ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao @ 2025-09-10 17:26 ` Tejun Heo 0 siblings, 0 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-10 17:26 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel On Wed, Sep 10, 2025 at 02:59:33PM +0800, Yi Tao wrote: > Dynamic cgroup migration involving threadgroup locks can be in one of > two states: no lock held, or holding the global lock. Explicitly > declaring the different lock modes to make the code easier to > understand and facilitates future extensions of the lock modes. > > Signed-off-by: Yi Tao <escape@linux.alibaba.com> Applied to cgroup/for-6.18. There was a conflict that I resolved. Please check the tree to make sure that it looks alright. Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start 2025-09-10 6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao 2025-09-10 6:59 ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao @ 2025-09-10 6:59 ` Yi Tao 2025-09-10 17:31 ` Tejun Heo 2025-09-11 3:22 ` Waiman Long 2025-09-10 6:59 ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao 2 siblings, 2 replies; 44+ messages in thread From: Yi Tao @ 2025-09-10 6:59 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel Later patches will introduce a new parameter `task` to cgroup_attach_lock, thus adjusting the position of cgroup_attach_lock within cgroup_procs_write_start. Between obtaining the threadgroup leader via PID and acquiring the cgroup attach lock, the threadgroup leader may change, which could lead to incorrect cgroup migration. Therefore, after acquiring the cgroup attach lock, we check whether the threadgroup leader has changed, and if so, retry the operation. Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++---------------- 1 file changed, 38 insertions(+), 23 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 2b88c7abaa00..756807164091 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2994,29 +2994,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) return ERR_PTR(-EINVAL); - /* - * If we migrate a single thread, we don't care about threadgroup - * stability. If the thread is `current`, it won't exit(2) under our - * hands or change PID through exec(2). We exclude - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write - * callers by cgroup_mutex. - * Therefore, we can skip the global lock. - */ - lockdep_assert_held(&cgroup_mutex); - - if (pid || threadgroup) - *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; - else - *lock_mode = CGRP_ATTACH_LOCK_NONE; - - cgroup_attach_lock(*lock_mode); - +retry_find_task: rcu_read_lock(); if (pid) { tsk = find_task_by_vpid(pid); if (!tsk) { tsk = ERR_PTR(-ESRCH); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } } else { tsk = current; @@ -3033,15 +3017,46 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, */ if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { tsk = ERR_PTR(-EINVAL); - goto out_unlock_threadgroup; + goto out_unlock_rcu; } get_task_struct(tsk); - goto out_unlock_rcu; + rcu_read_unlock(); + + /* + * If we migrate a single thread, we don't care about threadgroup + * stability. If the thread is `current`, it won't exit(2) under our + * hands or change PID through exec(2). We exclude + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write + * callers by cgroup_mutex. + * Therefore, we can skip the global lock. + */ + lockdep_assert_held(&cgroup_mutex); + + if (pid || threadgroup) + *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; + else + *lock_mode = CGRP_ATTACH_LOCK_NONE; + + cgroup_attach_lock(*lock_mode); + + if (threadgroup) { + if (!thread_group_leader(tsk)) { + /* + * a race with de_thread from another thread's exec() + * may strip us of our leadership, if this happens, + * there is no choice but to throw this task away and + * try again; this is + * "double-double-toil-and-trouble-check locking". 
+ */ + cgroup_attach_unlock(*lock_mode); + put_task_struct(tsk); + goto retry_find_task; + } + } + + return tsk; -out_unlock_threadgroup: - cgroup_attach_unlock(*lock_mode); - *lock_mode = CGRP_ATTACH_LOCK_NONE; out_unlock_rcu: rcu_read_unlock(); return tsk; -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
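The retry loop added here is the classic lookup, pin, lock, revalidate pattern: the leader found under RCU can be replaced by de_thread() before the rwsem is taken, so the leadership check has to be repeated once the lock is held. A generic user-space rendering of the same idea (all names are invented; only the shape of the pattern matters):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct task {
	int tid;
	bool is_leader;
};

struct group {
	pthread_mutex_t lock;
	struct task *leader;
};

/* Migrate @tsk's whole group; retry if leadership changed under us. */
static void migrate_group(struct group *grp, struct task *tsk)
{
retry:
	pthread_mutex_lock(&grp->lock);
	if (!tsk->is_leader) {
		/* raced with a leadership change; drop the lock and redo
		 * the lookup from scratch, as the kernel code does */
		pthread_mutex_unlock(&grp->lock);
		tsk = grp->leader;	/* stands in for find_task_by_vpid() */
		goto retry;
	}
	printf("migrating group led by tid %d\n", tsk->tid);
	pthread_mutex_unlock(&grp->lock);
}

int main(void)
{
	struct task t = { .tid = 42, .is_leader = true };
	struct group g = { .lock = PTHREAD_MUTEX_INITIALIZER, .leader = &t };

	migrate_group(&g, &t);
	return 0;
}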
* Re: [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start 2025-09-10 6:59 ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao @ 2025-09-10 17:31 ` Tejun Heo 2025-09-11 3:22 ` Waiman Long 1 sibling, 0 replies; 44+ messages in thread From: Tejun Heo @ 2025-09-10 17:31 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel On Wed, Sep 10, 2025 at 02:59:34PM +0800, Yi Tao wrote: > Later patches will introduce a new parameter `task` to > cgroup_attach_lock, thus adjusting the position of cgroup_attach_lock > within cgroup_procs_write_start. > > Between obtaining the threadgroup leader via PID and acquiring the > cgroup attach lock, the threadgroup leader may change, which could lead > to incorrect cgroup migration. Therefore, after acquiring the cgroup > attach lock, we check whether the threadgroup leader has changed, and if > so, retry the operation. > > Signed-off-by: Yi Tao <escape@linux.alibaba.com> Applied to cgroup/for-6.18 with minor comment adjustments. Thanks. -- tejun ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start 2025-09-10 6:59 ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao 2025-09-10 17:31 ` Tejun Heo @ 2025-09-11 3:22 ` Waiman Long 1 sibling, 0 replies; 44+ messages in thread From: Waiman Long @ 2025-09-11 3:22 UTC (permalink / raw) To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel On 9/10/25 2:59 AM, Yi Tao wrote: > Later patches will introduce a new parameter `task` to > cgroup_attach_lock, thus adjusting the position of cgroup_attach_lock > within cgroup_procs_write_start. > > Between obtaining the threadgroup leader via PID and acquiring the > cgroup attach lock, the threadgroup leader may change, which could lead > to incorrect cgroup migration. Therefore, after acquiring the cgroup > attach lock, we check whether the threadgroup leader has changed, and if > so, retry the operation. > > Signed-off-by: Yi Tao <escape@linux.alibaba.com> > --- > kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++---------------- > 1 file changed, 38 insertions(+), 23 deletions(-) > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 2b88c7abaa00..756807164091 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2994,29 +2994,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0) > return ERR_PTR(-EINVAL); > > - /* > - * If we migrate a single thread, we don't care about threadgroup > - * stability. If the thread is `current`, it won't exit(2) under our > - * hands or change PID through exec(2). We exclude > - * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > - * callers by cgroup_mutex. > - * Therefore, we can skip the global lock. > - */ > - lockdep_assert_held(&cgroup_mutex); > - > - if (pid || threadgroup) > - *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; > - else > - *lock_mode = CGRP_ATTACH_LOCK_NONE; > - > - cgroup_attach_lock(*lock_mode); > - > +retry_find_task: > rcu_read_lock(); > if (pid) { > tsk = find_task_by_vpid(pid); > if (!tsk) { > tsk = ERR_PTR(-ESRCH); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > } else { > tsk = current; > @@ -3033,15 +3017,46 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, > */ > if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) { > tsk = ERR_PTR(-EINVAL); > - goto out_unlock_threadgroup; > + goto out_unlock_rcu; > } > > get_task_struct(tsk); > - goto out_unlock_rcu; > + rcu_read_unlock(); > + > + /* > + * If we migrate a single thread, we don't care about threadgroup > + * stability. If the thread is `current`, it won't exit(2) under our > + * hands or change PID through exec(2). We exclude > + * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write > + * callers by cgroup_mutex. > + * Therefore, we can skip the global lock. > + */ > + lockdep_assert_held(&cgroup_mutex); > + > + if (pid || threadgroup) > + *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; > + else > + *lock_mode = CGRP_ATTACH_LOCK_NONE; > + > + cgroup_attach_lock(*lock_mode); > + > + if (threadgroup) { > + if (!thread_group_leader(tsk)) { Nit: You can combine the 2 conditions together to avoid excessive indent. if (threadgroup && !thread_group_leader(tsk)) { > + /* > + * a race with de_thread from another thread's exec() Should be "de_thread()" to signal that it is a function. 
> + * may strip us of our leadership, if this happens, > + * there is no choice but to throw this task away and > + * try again; this is > + * "double-double-toil-and-trouble-check locking". This "double-double-toil-and-trouble-check" is a new term in the kernel source tree. I would suggest using something simpler to avoid confusion. Cheers, Longman ^ permalink raw reply [flat|nested] 44+ messages in thread
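For reference, the shape Waiman is suggesting (combined condition, function-style de_thread() reference, plainer wording) would look roughly like the fragment below; this only illustrates the review comments against the v5 2/3 code, it is not the version that was applied:

	if (threadgroup && !thread_group_leader(tsk)) {
		/*
		 * A race with de_thread() from another thread's exec() may
		 * strip us of our leadership; drop everything and retry.
		 */
		cgroup_attach_unlock(*lock_mode);
		put_task_struct(tsk);
		goto retry_find_task;
	}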
* [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-10 6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao 2025-09-10 6:59 ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao 2025-09-10 6:59 ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao @ 2025-09-10 6:59 ` Yi Tao 2025-09-10 17:47 ` Tejun Heo 2 siblings, 1 reply; 44+ messages in thread From: Yi Tao @ 2025-09-10 6:59 UTC (permalink / raw) To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the favordynmods is enabled. As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. 
Signed-off-by: Yi Tao <escape@linux.alibaba.com> --- include/linux/cgroup-defs.h | 17 ++++++++- include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 ++ kernel/cgroup/cgroup-internal.h | 6 ++- kernel/cgroup/cgroup-v1.c | 8 ++-- kernel/cgroup/cgroup.c | 67 +++++++++++++++++++++++++-------- kernel/fork.c | 4 ++ 7 files changed, 87 insertions(+), 22 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 213b0d01464f..fe29152bceff 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -91,6 +91,12 @@ enum { * cgroup_threadgroup_rwsem. This makes hot path operations such as * forks and exits into the slow path and more expensive. * + * Alleviate the contention between fork, exec, exit operations and + * writing to cgroup.procs by taking a per threadgroup rwsem instead of + * the global cgroup_threadgroup_rwsem. Fork and other operations + * from threads in different thread groups no longer contend with + * writing to cgroup.procs. + * * The static usage pattern of creating a cgroup, enabling controllers, * and then seeding it with CLONE_INTO_CGROUP doesn't require write * locking cgroup_threadgroup_rwsem and thus doesn't benefit from @@ -146,6 +152,9 @@ enum cgroup_attach_lock_mode { /* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */ CGRP_ATTACH_LOCK_NONE, + + /* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */ + CGRP_ATTACH_LOCK_PER_THREADGROUP, }; /* @@ -830,6 +839,7 @@ struct cgroup_subsys { }; extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem; +extern bool cgroup_enable_per_threadgroup_rwsem; struct cgroup_of_peak { unsigned long value; @@ -841,11 +851,14 @@ struct cgroup_of_peak { * @tsk: target task * * Allows cgroup operations to synchronize against threadgroup changes - * using a percpu_rw_semaphore. + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition. 
*/ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) { percpu_down_read(&cgroup_threadgroup_rwsem); + if (cgroup_enable_per_threadgroup_rwsem) + down_read(&tsk->signal->cgroup_threadgroup_rwsem); } /** @@ -856,6 +869,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) */ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) { + if (cgroup_enable_per_threadgroup_rwsem) + up_read(&tsk->signal->cgroup_threadgroup_rwsem); percpu_up_read(&cgroup_threadgroup_rwsem); } diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 1ef1edbaaf79..7d6449982822 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -226,6 +226,10 @@ struct signal_struct { struct tty_audit_buf *tty_audit_buf; #endif +#ifdef CONFIG_CGROUPS + struct rw_semaphore cgroup_threadgroup_rwsem; +#endif + /* * Thread is the potential origin of an oom condition; kill first on * oom diff --git a/init/init_task.c b/init/init_task.c index e557f622bd90..a55e2189206f 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -27,6 +27,9 @@ static struct signal_struct init_signals = { }, .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, +#ifdef CONFIG_CGROUPS + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem), +#endif .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), #ifdef CONFIG_POSIX_TIMERS diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h index a6d6f30b6f65..22051b4f1ccb 100644 --- a/kernel/cgroup/cgroup-internal.h +++ b/kernel/cgroup/cgroup-internal.h @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup, int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, bool threadgroup); -void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode); -void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode); +void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode, + struct task_struct *tsk); +void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode, + struct task_struct *tsk); struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, enum cgroup_attach_lock_mode *lock_mode) __acquires(&cgroup_threadgroup_rwsem); diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index f3838888ea6f..db48550bc4cc 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0; cgroup_lock(); - cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL); + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); for_each_root(root) { struct cgroup *from_cgrp; @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) if (retval) break; } - cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL); + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); cgroup_unlock(); return retval; @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL); + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - 
cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL); + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); cgroup_unlock(); return ret; } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 756807164091..344424dd365b 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -216,6 +216,14 @@ static u16 have_canfork_callback __read_mostly; static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); +/* + * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem, + * read protected by either. + * + * Can only be turned on, but not turned off. + */ +bool cgroup_enable_per_threadgroup_rwsem __read_mostly; + /* cgroup namespace for init task */ struct cgroup_namespace init_cgroup_ns = { .ns.count = REFCOUNT_INIT(2), @@ -1302,14 +1310,24 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor) { bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS; - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */ + /* + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition. + * favordynmods can flip while task is between + * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end, + * so down_write global cgroup_threadgroup_rwsem to synchronize them. + */ + percpu_down_write(&cgroup_threadgroup_rwsem); if (favor && !favoring) { + cgroup_enable_per_threadgroup_rwsem = true; rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); root->flags |= CGRP_ROOT_FAVOR_DYNMODS; } else if (!favor && favoring) { + if (cgroup_enable_per_threadgroup_rwsem) + WARN_ONCE(1, "cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n"); rcu_sync_exit(&cgroup_threadgroup_rwsem.rss); root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS; } + percpu_up_write(&cgroup_threadgroup_rwsem); } static int cgroup_init_root_id(struct cgroup_root *root) @@ -2459,7 +2477,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); /** * cgroup_attach_lock - Lock for ->attach() - * @lock_mode: whether to down_write cgroup_threadgroup_rwsem + * @lock_mode: whether acquire and acquire which rwsem + * @tsk: thread group to lock * * cgroup migration sometimes needs to stabilize threadgroups against forks and * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() @@ -2479,8 +2498,15 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns); * Resolve the situation by always acquiring cpus_read_lock() before optionally * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that * CPU hotplug is disabled on entry. + * + * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead + * on dynamic cgroup modifications. see the comment above + * CGRP_ROOT_FAVOR_DYNMODS definition. + * + * tsk is not NULL only when writing to cgroup.procs. 
*/ -void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode) +void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode, + struct task_struct *tsk) { cpus_read_lock(); @@ -2490,6 +2516,9 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode) case CGRP_ATTACH_LOCK_GLOBAL: percpu_down_write(&cgroup_threadgroup_rwsem); break; + case CGRP_ATTACH_LOCK_PER_THREADGROUP: + down_write(&tsk->signal->cgroup_threadgroup_rwsem); + break; default: pr_warn("cgroup: Unexpected attach lock mode."); break; @@ -2498,9 +2527,11 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode) /** * cgroup_attach_unlock - Undo cgroup_attach_lock() - * @lock_mode: whether to up_write cgroup_threadgroup_rwsem + * @lock_mode: whether release and release which rwsem + * @tsk: thread group to lock */ -void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode) +void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode, + struct task_struct *tsk) { switch (lock_mode) { case CGRP_ATTACH_LOCK_NONE: @@ -2508,6 +2539,9 @@ void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode) case CGRP_ATTACH_LOCK_GLOBAL: percpu_up_write(&cgroup_threadgroup_rwsem); break; + case CGRP_ATTACH_LOCK_PER_THREADGROUP: + up_write(&tsk->signal->cgroup_threadgroup_rwsem); + break; default: pr_warn("cgroup: Unexpected attach lock mode."); break; @@ -3019,7 +3053,6 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, tsk = ERR_PTR(-EINVAL); goto out_unlock_rcu; } - get_task_struct(tsk); rcu_read_unlock(); @@ -3033,12 +3066,16 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, */ lockdep_assert_held(&cgroup_mutex); - if (pid || threadgroup) - *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; - else + if (pid || threadgroup) { + if (cgroup_enable_per_threadgroup_rwsem) + *lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP; + else + *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; + } else { *lock_mode = CGRP_ATTACH_LOCK_NONE; + } - cgroup_attach_lock(*lock_mode); + cgroup_attach_lock(*lock_mode, tsk); if (threadgroup) { if (!thread_group_leader(tsk)) { @@ -3049,7 +3086,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, * try again; this is * "double-double-toil-and-trouble-check locking". 
*/ - cgroup_attach_unlock(*lock_mode); + cgroup_attach_unlock(*lock_mode, tsk); put_task_struct(tsk); goto retry_find_task; } @@ -3068,11 +3105,11 @@ void cgroup_procs_write_finish(struct task_struct *task, struct cgroup_subsys *ss; int ssid; + cgroup_attach_unlock(lock_mode, task); + /* release reference from cgroup_procs_write_start() */ put_task_struct(task); - cgroup_attach_unlock(lock_mode); - for_each_subsys(ss, ssid) if (ss->post_attach) ss->post_attach(); @@ -3165,7 +3202,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) else lock_mode = CGRP_ATTACH_LOCK_NONE; - cgroup_attach_lock(lock_mode); + cgroup_attach_lock(lock_mode, NULL); /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -3186,7 +3223,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - cgroup_attach_unlock(lock_mode); + cgroup_attach_unlock(lock_mode, NULL); return ret; } diff --git a/kernel/fork.c b/kernel/fork.c index c4ada32598bd..9a039867ecfd 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) tty_audit_fork(sig); sched_autogroup_fork(sig); +#ifdef CONFIG_CGROUPS + init_rwsem(&sig->cgroup_threadgroup_rwsem); +#endif + sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; -- 2.32.0.3.g01195cf9f ^ permalink raw reply related [flat|nested] 44+ messages in thread
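Since the per threadgroup rwsem is only used once favordynmods is enabled, exercising the new path means remounting cgroup2 with that option and then driving cgroup.procs writes as usual. A rough user-space sketch of both steps (needs root; the mount point and the test cgroup path are assumptions):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
	char buf[32];
	int fd;

	/* turn on favordynmods on an already mounted cgroup2 hierarchy */
	if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT, "favordynmods") < 0)
		perror("remount favordynmods");

	/* migrate this process; with favordynmods on, the kernel now takes
	 * the per threadgroup rwsem instead of the global one */
	fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY);
	if (fd < 0) {
		perror("open cgroup.procs");
		return 1;
	}
	snprintf(buf, sizeof(buf), "%d", getpid());
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write cgroup.procs");
	close(fd);
	return 0;
}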
* Re: [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs 2025-09-10 6:59 ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs Yi Tao @ 2025-09-10 17:47 ` Tejun Heo 2025-09-11 1:52 ` Yi Tao 0 siblings, 1 reply; 44+ messages in thread From: Tejun Heo @ 2025-09-10 17:47 UTC (permalink / raw) To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel Applied to cgroup/for-6.18 with a comment update and s/WARN_ONCE()/pr_warn_once()/ as the backtrace isn't likely helpful here. The applied version follows: ------ 8< ------ From 0568f89d4fb82d98001baeb870e92f43cd1f7317 Mon Sep 17 00:00:00 2001 From: Yi Tao <escape@linux.alibaba.com> Date: Wed, 10 Sep 2025 14:59:35 +0800 Subject: [PATCH] cgroup: replace global percpu_rwsem with per threadgroup resem when writing to cgroup.procs The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP doesn't require write locking cgroup_threadgroup_rwsem and thus doesn't benefit from this patch. To avoid affecting other users, the per threadgroup rwsem is only used when the favordynmods is enabled. As computer hardware advances, modern systems are typically equipped with many CPU cores and large amounts of memory, enabling the deployment of numerous applications. On such systems, container creation and deletion become frequent operations, making cgroup process migration no longer a cold path. This leads to noticeable contention with common process operations such as fork, exec, and exit. To alleviate the contention between cgroup process migration and operations like process fork, this patch modifies lock to take the write lock on signal_struct->group_rwsem when writing pid to cgroup.procs/threads instead of holding a global write lock. Cgroup process migration has historically relied on signal_struct->group_rwsem to protect thread group integrity. In commit <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"), this was changed to a global cgroup_threadgroup_rwsem. The advantage of using a global lock was simplified handling of process group migrations. This patch retains the use of the global lock for protecting process group migration, while reducing contention by using per thread group lock during cgroup.procs/threads writes. The locking behavior is as follows: write cgroup.procs/threads | process fork,exec,exit | process group migration ------------------------------------------------------------------------------ cgroup_lock() | down_read(&g_rwsem) | cgroup_lock() down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem) critical section | critical section | critical section up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem) cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock() g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes signal_struct->group_rwsem. This patch eliminates contention between cgroup migration and fork operations for threads that belong to different thread groups, thereby reducing the long-tail latency of cgroup migrations and lowering system load. With this patch, under heavy fork and exec interference, the long-tail latency of cgroup migration has been reduced from milliseconds to microseconds. Under heavy cgroup migration interference, the multi-CPU score of the spawn test case in UnixBench increased by 9%. tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to pr_warn_once(). 
Signed-off-by: Yi Tao <escape@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/cgroup-defs.h | 17 +++++++- include/linux/sched/signal.h | 4 ++ init/init_task.c | 3 ++ kernel/cgroup/cgroup-internal.h | 6 ++- kernel/cgroup/cgroup-v1.c | 8 ++-- kernel/cgroup/cgroup.c | 73 ++++++++++++++++++++++++++------- kernel/fork.c | 4 ++ 7 files changed, 93 insertions(+), 22 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index ff3c7d0e3e01..93318fce31f3 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -91,6 +91,12 @@ enum { * cgroup_threadgroup_rwsem. This makes hot path operations such as * forks and exits into the slow path and more expensive. * + * Alleviate the contention between fork, exec, exit operations and + * writing to cgroup.procs by taking a per threadgroup rwsem instead of + * the global cgroup_threadgroup_rwsem. Fork and other operations + * from threads in different thread groups no longer contend with + * writing to cgroup.procs. + * * The static usage pattern of creating a cgroup, enabling controllers, * and then seeding it with CLONE_INTO_CGROUP doesn't require write * locking cgroup_threadgroup_rwsem and thus doesn't benefit from @@ -146,6 +152,9 @@ enum cgroup_attach_lock_mode { /* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */ CGRP_ATTACH_LOCK_NONE, + + /* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */ + CGRP_ATTACH_LOCK_PER_THREADGROUP, }; /* @@ -846,6 +855,7 @@ struct cgroup_subsys { }; extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem; +extern bool cgroup_enable_per_threadgroup_rwsem; struct cgroup_of_peak { unsigned long value; @@ -857,11 +867,14 @@ struct cgroup_of_peak { * @tsk: target task * * Allows cgroup operations to synchronize against threadgroup changes - * using a percpu_rw_semaphore. + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition. 
 */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (cgroup_enable_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }
 
 /**
@@ -872,6 +885,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (cgroup_enable_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index a6d6f30b6f65..22051b4f1ccb 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 int cgroup_attach_task(struct cgroup *dst_cgrp,
 		       struct task_struct *leader, bool threadgroup);
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode);
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode);
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk);
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     enum cgroup_attach_lock_mode *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 852ebe7ca3a1..a9e029b570c8 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -69,7 +69,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -81,7 +81,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -118,7 +118,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -154,7 +154,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a6b81b48bb70..fed701df1167 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -239,6 +239,14 @@ static u16 have_canfork_callback __read_mostly;
 static bool have_favordynmods __ro_after_init =
 	IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
+/*
+ * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem,
+ * read protected by either.
+ *
+ * Can only be turned on, but not turned off.
+ */
+bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
+
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
 	.ns.count	= REFCOUNT_INIT(2),
@@ -1325,14 +1333,30 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
 	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
 
-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * See the comment above the CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while a task is between
+	 * cgroup_threadgroup_change_begin() and end(), so write-lock the
+	 * global cgroup_threadgroup_rwsem to synchronize with them.
+	 *
+	 * Once cgroup_enable_per_threadgroup_rwsem is enabled, holding
+	 * cgroup_threadgroup_rwsem doesn't exclude tasks between
+	 * cgroup_threadgroup_change_begin() and end() and thus it's unsafe to
+	 * turn off. As the scenario is unlikely, simply disallow disabling once
+	 * enabled and print out a warning.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
 	if (favor && !favoring) {
+		cgroup_enable_per_threadgroup_rwsem = true;
 		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
 		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 	} else if (!favor && favoring) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			pr_warn_once("cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n");
 		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
 		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
 	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }
 
 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2482,7 +2506,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_mode: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to take a rwsem for the attach and, if so, which one
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2502,8 +2527,15 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take the per threadgroup rwsem to reduce
+ * overhead on dynamic cgroup modifications. See the comment above the
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * @tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk)
 {
 	cpus_read_lock();
@@ -2513,6 +2545,9 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_down_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.");
 		break;
@@ -2521,9 +2556,11 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_mode: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to release a rwsem and, if so, which one
+ * @tsk: thread group to unlock
  */
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk)
 {
 	switch (lock_mode) {
 	case CGRP_ATTACH_LOCK_NONE:
@@ -2531,6 +2568,9 @@ void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_up_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.");
 		break;
@@ -3042,7 +3082,6 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 		tsk = ERR_PTR(-EINVAL);
 		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
 
 	rcu_read_unlock();
@@ -3055,12 +3094,16 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	lockdep_assert_held(&cgroup_mutex);
 
-	if (pid || threadgroup)
-		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
-	else
+	if (pid || threadgroup) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			*lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP;
+		else
+			*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	} else {
 		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+	}
 
-	cgroup_attach_lock(*lock_mode);
+	cgroup_attach_lock(*lock_mode, tsk);
 
 	if (threadgroup) {
 		if (!thread_group_leader(tsk)) {
@@ -3069,7 +3112,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 			 * may strip us of our leadership. If this happens,
 			 * throw this task away and try again.
 			 */
-			cgroup_attach_unlock(*lock_mode);
+			cgroup_attach_unlock(*lock_mode, tsk);
 			put_task_struct(tsk);
 			goto retry_find_task;
 		}
@@ -3085,10 +3128,10 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 void cgroup_procs_write_finish(struct task_struct *task,
 			       enum cgroup_attach_lock_mode lock_mode)
 {
+	cgroup_attach_unlock(lock_mode, task);
+
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
-
-	cgroup_attach_unlock(lock_mode);
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
@@ -3178,7 +3221,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	else
 		lock_mode = CGRP_ATTACH_LOCK_NONE;
 
-	cgroup_attach_lock(lock_mode);
+	cgroup_attach_lock(lock_mode, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3199,7 +3242,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(lock_mode);
+	cgroup_attach_unlock(lock_mode, NULL);
 
 	return ret;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..9a039867ecfd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.51.0
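The operation this series optimizes is the userspace write to cgroup.procs that container runtimes issue when moving a process. The sketch below is a minimal, hypothetical userspace illustration of that write and is not part of the patch; it assumes a cgroup v2 hierarchy mounted with the favordynmods option (which, per the hunk in cgroup_favor_dynmods() above, is what turns cgroup_enable_per_threadgroup_rwsem on), and the mount point and cgroup name are placeholders.

/*
 * Minimal sketch: migrate the calling process into a destination cgroup by
 * writing its PID to that cgroup's cgroup.procs file.
 *
 * Assumes something like:
 *   mount -t cgroup2 -o favordynmods none /sys/fs/cgroup
 * The path below is a placeholder, not taken from the patch.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const char *procs = "/sys/fs/cgroup/example/cgroup.procs";
	int fd = open(procs, O_WRONLY);

	if (fd < 0) {
		perror("open cgroup.procs");
		return EXIT_FAILURE;
	}

	/* Writing a PID to cgroup.procs migrates that PID's whole thread group. */
	if (dprintf(fd, "%d\n", getpid()) < 0) {
		perror("write pid");
		close(fd);
		return EXIT_FAILURE;
	}

	close(fd);
	return EXIT_SUCCESS;
}

With this series applied and favordynmods set, the kernel side of such a write takes the target's per thread group rwsem instead of write-locking the global cgroup_threadgroup_rwsem, so fork, exec and exit in unrelated thread groups are not blocked by the migration.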
* Re: [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-10 17:47 ` Tejun Heo
@ 2025-09-11  1:52 ` Yi Tao
  0 siblings, 0 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-11 1:52 UTC (permalink / raw)
To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel

Thank you for the review and guidance.
end of thread, other threads:[~2025-09-11 3:22 UTC | newest]

Thread overview: 44+ messages
2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
2025-09-03 13:14 ` Waiman Long
2025-09-04  1:35   ` Chen Ridong
2025-09-04  4:59     ` escape
2025-09-04  5:02   ` escape
2025-09-03 16:53 ` Tejun Heo
2025-09-03 20:03   ` Michal Koutný
2025-09-03 20:45     ` Tejun Heo
2025-09-04  1:40       ` Chen Ridong
2025-09-04  6:43         ` escape
2025-09-04  6:52           ` Tejun Heo
2025-09-04  3:15     ` escape
2025-09-04  6:38   ` escape
2025-09-04  7:28     ` Tejun Heo
2025-09-04  8:10       ` escape
2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
2025-09-04 16:31     ` Tejun Heo
2025-09-05  2:16       ` escape
2025-09-05  2:27         ` Tejun Heo
2025-09-05  3:44           ` escape
2025-09-05  3:48             ` Tejun Heo
2025-09-05  4:30               ` escape
2025-09-05  2:02     ` Chen Ridong
2025-09-05 13:17     ` kernel test robot
2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
2025-09-08 10:20   ` [PATCH v3 1/1] " Yi Tao
2025-09-08 16:05     ` Tejun Heo
2025-09-08 19:39       ` Waiman Long
2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
2025-09-09  7:55   ` [PATCH v4 1/3] " Yi Tao
2025-09-09  7:55   ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao
2025-09-09 16:59     ` Tejun Heo
2025-09-09  7:55   ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
2025-09-09 17:00     ` Tejun Heo
2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
2025-09-10  6:59   ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
2025-09-10 17:26     ` Tejun Heo
2025-09-10  6:59   ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao
2025-09-10 17:31     ` Tejun Heo
2025-09-11  3:22     ` Waiman Long
2025-09-10  6:59   ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
2025-09-10 17:47     ` Tejun Heo
2025-09-11  1:52       ` Yi Tao