* [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
@ 2025-09-03 11:11 Yi Tao
  2025-09-03 13:14 ` Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-03 11:11 UTC
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch takes the write lock on
signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. Commit
1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") changed this to the global
cgroup_threadgroup_rwsem, whose advantage was simplified handling of
process group migrations. This patch retains the global lock for
protecting process group migration, while reducing contention by using
a per-thread-group lock during cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.
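
For illustration, the first column corresponds to a userspace sequence
like the following (sketch only; the path and helper name are made up
and error handling is omitted):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Move @pid by writing it to a cgroup.procs file. With this patch,
 * the kernel side of such a write takes cgroup_lock() plus
 * down_write() on the target's signal_struct->group_rwsem, instead
 * of write-locking the global cgroup_threadgroup_rwsem. */
static int move_pid_to_cgroup(const char *procs_path, pid_t pid)
{
	int fd = open(procs_path, O_WRONLY);

	if (fd < 0)
		return -1;
	dprintf(fd, "%d\n", pid);
	close(fd);
	return 0;
}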

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     |  2 ++
 include/linux/sched/signal.h    |  4 +++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  6 ++--
 kernel/cgroup/cgroup-v1.c       |  8 ++---
 kernel/cgroup/cgroup.c          | 56 ++++++++++++++++++++-------------
 kernel/fork.c                   |  4 +++
 7 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..8e0fdad8a440 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -838,6 +838,7 @@ struct cgroup_of_peak {
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	down_read(&tsk->signal->group_rwsem);
 }
 
 /**
@@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	up_read(&tsk->signal->group_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..86fbc99a9174 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore group_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..0450093924a7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.group_rwsem	= __RWSEM_INITIALIZER(init_signals.group_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..572c24b7e947 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup,
+			bool global);
+void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup,
+			  bool global);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..3e5ead8c5bc5 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true, true);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true, true);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true, true);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true, true);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..4e1d80a2741f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(struct task_struct *tsk,
+			       bool lock_threadgroup, bool global)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (global)
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+		else
+			down_write(&tsk->signal->group_rwsem);
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
  * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(struct task_struct *tsk,
+				 bool lock_threadgroup, bool global)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (global)
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+		else
+			up_write(&tsk->signal->group_rwsem);
+	}
 	cpus_read_unlock();
 }
 
@@ -2970,12 +2980,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *threadgroup_locked)
 {
-	struct task_struct *tsk;
+	struct task_struct *tsk, *tmp_tsk;
 	pid_t pid;
 
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
+	rcu_read_lock();
+	if (pid) {
+		tsk = find_task_by_vpid(pid);
+		if (!tsk) {
+			tsk = ERR_PTR(-ESRCH);
+			goto out_unlock_rcu;
+		}
+	} else {
+		tsk = current;
+	}
+
 	/*
 	 * If we migrate a single thread, we don't care about threadgroup
 	 * stability. If the thread is `current`, it won't exit(2) under our
@@ -2986,18 +3007,9 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	lockdep_assert_held(&cgroup_mutex);
 	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
 
-	rcu_read_lock();
-	if (pid) {
-		tsk = find_task_by_vpid(pid);
-		if (!tsk) {
-			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
-		}
-	} else {
-		tsk = current;
-	}
+	tmp_tsk = tsk;
+	cgroup_attach_lock(tmp_tsk, *threadgroup_locked, false);
 
 	if (threadgroup)
 		tsk = tsk->group_leader;
@@ -3017,7 +3029,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	goto out_unlock_rcu;
 
 out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
+	cgroup_attach_unlock(tmp_tsk, *threadgroup_locked, false);
 	*threadgroup_locked = false;
 out_unlock_rcu:
 	rcu_read_unlock();
@@ -3032,7 +3044,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(task, threadgroup_locked, false);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3119,7 +3131,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(NULL, has_tasks, true);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3152,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(NULL, has_tasks, true);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..5218f9b93c77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->group_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
@ 2025-09-03 13:14 ` Waiman Long
  2025-09-04  1:35   ` Chen Ridong
                     ` (2 more replies)
  2025-09-03 16:53 ` Tejun Heo
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 44+ messages in thread
From: Waiman Long @ 2025-09-03 13:14 UTC
  To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel


On 9/3/25 7:11 AM, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.
>
> To alleviate the contention between cgroup process migration and
> operations like process fork, this patch takes the write lock on
> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
> instead of holding the global write lock.
>
> Cgroup process migration has historically relied on
> signal_struct->group_rwsem to protect thread group integrity. Commit
> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem") changed this to the global
> cgroup_threadgroup_rwsem, whose advantage was simplified handling of
> process group migrations. This patch retains the global lock for
> protecting process group migration, while reducing contention by using
> a per-thread-group lock during cgroup.procs/threads writes.
>
> The locking behavior is as follows:
>
> write cgroup.procs/threads  | process fork,exec,exit | process group migration
> ------------------------------------------------------------------------------
> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
> critical section            | critical section       | critical section
> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>
> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
> signal_struct->group_rwsem.
>
> This patch eliminates contention between cgroup migration and fork
> operations for threads that belong to different thread groups, thereby
> reducing the long-tail latency of cgroup migrations and lowering system
> load.
Do you have any performance numbers to show how much performance
benefit this change provides?
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
> ---
>   include/linux/cgroup-defs.h     |  2 ++
>   include/linux/sched/signal.h    |  4 +++
>   init/init_task.c                |  3 ++
>   kernel/cgroup/cgroup-internal.h |  6 ++--
>   kernel/cgroup/cgroup-v1.c       |  8 ++---
>   kernel/cgroup/cgroup.c          | 56 ++++++++++++++++++++-------------
>   kernel/fork.c                   |  4 +++
>   7 files changed, 55 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6b93a64115fe..8e0fdad8a440 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -838,6 +838,7 @@ struct cgroup_of_peak {
>   static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>   {
>   	percpu_down_read(&cgroup_threadgroup_rwsem);
> +	down_read(&tsk->signal->group_rwsem);
>   }
>   
>   /**
> @@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>    */
>   static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>   {
> +	up_read(&tsk->signal->group_rwsem);
>   	percpu_up_read(&cgroup_threadgroup_rwsem);
>   }
>   
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..86fbc99a9174 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
>   	struct tty_audit_buf *tty_audit_buf;
>   #endif
>   
> +#ifdef CONFIG_CGROUPS
> +	struct rw_semaphore group_rwsem;
> +#endif
> +
>   	/*
>   	 * Thread is the potential origin of an oom condition; kill first on
>   	 * oom
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..0450093924a7 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
>   	},
>   	.multiprocess	= HLIST_HEAD_INIT,
>   	.rlim		= INIT_RLIMITS,
> +#ifdef CONFIG_CGROUPS
> +	.group_rwsem	= __RWSEM_INITIALIZER(init_signals.group_rwsem),
> +#endif
>   	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>   	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>   #ifdef CONFIG_POSIX_TIMERS
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index b14e61c64a34..572c24b7e947 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>   
>   int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
>   		       bool threadgroup);
> -void cgroup_attach_lock(bool lock_threadgroup);
> -void cgroup_attach_unlock(bool lock_threadgroup);
> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup,
> +			bool global);
> +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup,
> +			  bool global);
>   struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>   					     bool *locked)
>   	__acquires(&cgroup_threadgroup_rwsem);
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 2a4a387f867a..3e5ead8c5bc5 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>   	int retval = 0;
>   
>   	cgroup_lock();
> -	cgroup_attach_lock(true);
> +	cgroup_attach_lock(NULL, true, true);
>   	for_each_root(root) {
>   		struct cgroup *from_cgrp;
>   
> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>   		if (retval)
>   			break;
>   	}
> -	cgroup_attach_unlock(true);
> +	cgroup_attach_unlock(NULL, true, true);
>   	cgroup_unlock();
>   
>   	return retval;
> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>   
>   	cgroup_lock();
>   
> -	cgroup_attach_lock(true);
> +	cgroup_attach_lock(NULL, true, true);
>   
>   	/* all tasks in @from are being moved, all csets are source */
>   	spin_lock_irq(&css_set_lock);
> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>   	} while (task && !ret);
>   out_err:
>   	cgroup_migrate_finish(&mgctx);
> -	cgroup_attach_unlock(true);
> +	cgroup_attach_unlock(NULL, true, true);
>   	cgroup_unlock();
>   	return ret;
>   }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..4e1d80a2741f 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>    * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>    * CPU hotplug is disabled on entry.
>    */
> -void cgroup_attach_lock(bool lock_threadgroup)
> +void cgroup_attach_lock(struct task_struct *tsk,
> +			       bool lock_threadgroup, bool global)
>   {
>   	cpus_read_lock();
> -	if (lock_threadgroup)
> -		percpu_down_write(&cgroup_threadgroup_rwsem);
> +	if (lock_threadgroup) {
> +		if (global)
> +			percpu_down_write(&cgroup_threadgroup_rwsem);
> +		else
> +			down_write(&tsk->signal->group_rwsem);
> +	}
>   }

tsk is only used when global is false. So you can eliminate the 
redundant global argument and use the presence of tsk to indicate which 
lock to take.
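
I.e., something like the following (untested sketch of the suggested
signature):

void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
{
	cpus_read_lock();
	if (lock_threadgroup) {
		/* a NULL @tsk selects the global lock */
		if (tsk)
			down_write(&tsk->signal->group_rwsem);
		else
			percpu_down_write(&cgroup_threadgroup_rwsem);
	}
}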

Cheers,
Longman



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
  2025-09-03 13:14 ` Waiman Long
@ 2025-09-03 16:53 ` Tejun Heo
  2025-09-03 20:03   ` Michal Koutný
  2025-09-04  3:15   ` escape
  2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-03 16:53 UTC
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.

If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP?
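
For reference, a minimal sketch of what that looks like (assumes cgroup
v2 and a kernel >= 5.7; the helper name is made up and error handling
is trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static pid_t fork_into_cgroup(const char *cgrp_dir)
{
	int cgrp_fd = open(cgrp_dir, O_RDONLY | O_DIRECTORY);
	struct clone_args args;
	pid_t pid;

	if (cgrp_fd < 0)
		return -1;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup = cgrp_fd;

	/* clone3() has no glibc wrapper; invoke it directly. The child
	 * starts life already attached to @cgrp_dir, so no later
	 * cgroup.procs write is needed. */
	pid = syscall(__NR_clone3, &args, sizeof(args));
	close(cgrp_fd);
	return pid;
}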

Thanks.

-- 
tejun


* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 16:53 ` Tejun Heo
@ 2025-09-03 20:03   ` Michal Koutný
  2025-09-03 20:45     ` Tejun Heo
  2025-09-04  3:15   ` escape
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Koutný @ 2025-09-03 20:03 UTC
  To: Tejun Heo; +Cc: Yi Tao, hannes, cgroups, linux-kernel


On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
> from cgroup side.

CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
migration, no? Its effect is atomicity wrt clone.
Or, Tejun, what do you mean that it disappears? (I think we cannot give
up cgroup_mutex as it ensures synchronization of possible parent's
migration.)

Michal



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 20:03   ` Michal Koutný
@ 2025-09-03 20:45     ` Tejun Heo
  2025-09-04  1:40       ` Chen Ridong
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2025-09-03 20:45 UTC
  To: Michal Koutný; +Cc: Yi Tao, hannes, cgroups, linux-kernel

Hello, Michal.

On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote:
> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
> > disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
> > from cgroup side.
> 
> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
> migration, no? Its effect is atomicity wrt clone.
>> Or, Tejun, what do you mean that it disappears? (I think we cannot give
> up cgroup_mutex as it ensures synchronization of possible parent's
> migration.)

Sorry, I was confused. We no longer need to write lock threadgroup rwsem
when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
cgroup_mutex.

  671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")

Thanks.

-- 
tejun


* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 13:14 ` Waiman Long
@ 2025-09-04  1:35   ` Chen Ridong
  2025-09-04  4:59   ` escape
  2025-09-04  5:02   ` escape
  2 siblings, 0 replies; 44+ messages in thread
From: Chen Ridong @ 2025-09-04  1:35 UTC
  To: Waiman Long, Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel



On 2025/9/3 21:14, Waiman Long wrote:
> 
> On 9/3/25 7:11 AM, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
>>

We are facing the same issue.

>> To alleviate the contention between cgroup process migration and
>> operations like process fork, this patch takes the write lock on
>> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
>> instead of holding the global write lock.
>>
>> Cgroup process migration has historically relied on
>> signal_struct->group_rwsem to protect thread group integrity. Commit
>> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
>> a global percpu_rwsem") changed this to the global
>> cgroup_threadgroup_rwsem, whose advantage was simplified handling of
>> process group migrations. This patch retains the global lock for
>> protecting process group migration, while reducing contention by using
>> a per-thread-group lock during cgroup.procs/threads writes.
>>
>> The locking behavior is as follows:
>>
>> write cgroup.procs/threads  | process fork,exec,exit | process group migration
>> ------------------------------------------------------------------------------
>> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
>> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
>> critical section            | critical section       | critical section
>> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
>> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>>
>> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
>> signal_struct->group_rwsem.
>>
>> This patch eliminates contention between cgroup migration and fork
>> operations for threads that belong to different thread groups, thereby
>> reducing the long-tail latency of cgroup migrations and lowering system
>> load.
> Do you have any performance numbers to show how much performance benefit this change provides?

Yes, I would like to see this performance data as well.

>> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
>> ---
>>   include/linux/cgroup-defs.h     |  2 ++
>>   include/linux/sched/signal.h    |  4 +++
>>   init/init_task.c                |  3 ++
>>   kernel/cgroup/cgroup-internal.h |  6 ++--
>>   kernel/cgroup/cgroup-v1.c       |  8 ++---
>>   kernel/cgroup/cgroup.c          | 56 ++++++++++++++++++++-------------
>>   kernel/fork.c                   |  4 +++
>>   7 files changed, 55 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index 6b93a64115fe..8e0fdad8a440 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -838,6 +838,7 @@ struct cgroup_of_peak {
>>   static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>>   {
>>       percpu_down_read(&cgroup_threadgroup_rwsem);
>> +    down_read(&tsk->signal->group_rwsem);
>>   }
>>     /**
>> @@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>>    */
>>   static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>>   {
>> +    up_read(&tsk->signal->group_rwsem);
>>       percpu_up_read(&cgroup_threadgroup_rwsem);
>>   }
>>   diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index 1ef1edbaaf79..86fbc99a9174 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -226,6 +226,10 @@ struct signal_struct {
>>       struct tty_audit_buf *tty_audit_buf;
>>   #endif
>>   +#ifdef CONFIG_CGROUPS
>> +    struct rw_semaphore group_rwsem;
>> +#endif
>> +
>>       /*
>>        * Thread is the potential origin of an oom condition; kill first on
>>        * oom
>> diff --git a/init/init_task.c b/init/init_task.c
>> index e557f622bd90..0450093924a7 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
>>       },
>>       .multiprocess    = HLIST_HEAD_INIT,
>>       .rlim        = INIT_RLIMITS,
>> +#ifdef CONFIG_CGROUPS
>> +    .group_rwsem    = __RWSEM_INITIALIZER(init_signals.group_rwsem),
>> +#endif
>>       .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>>       .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>>   #ifdef CONFIG_POSIX_TIMERS
>> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
>> index b14e61c64a34..572c24b7e947 100644
>> --- a/kernel/cgroup/cgroup-internal.h
>> +++ b/kernel/cgroup/cgroup-internal.h
>> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>>     int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
>>                  bool threadgroup);
>> -void cgroup_attach_lock(bool lock_threadgroup);
>> -void cgroup_attach_unlock(bool lock_threadgroup);
>> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup,
>> +            bool global);
>> +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup,
>> +              bool global);
>>   struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>>                            bool *locked)
>>       __acquires(&cgroup_threadgroup_rwsem);
>> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
>> index 2a4a387f867a..3e5ead8c5bc5 100644
>> --- a/kernel/cgroup/cgroup-v1.c
>> +++ b/kernel/cgroup/cgroup-v1.c
>> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>>       int retval = 0;
>>         cgroup_lock();
>> -    cgroup_attach_lock(true);
>> +    cgroup_attach_lock(NULL, true, true);
>>       for_each_root(root) {
>>           struct cgroup *from_cgrp;
>>   @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>>           if (retval)
>>               break;
>>       }
>> -    cgroup_attach_unlock(true);
>> +    cgroup_attach_unlock(NULL, true, true);
>>       cgroup_unlock();
>>         return retval;
>> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>>         cgroup_lock();
>>   -    cgroup_attach_lock(true);
>> +    cgroup_attach_lock(NULL, true, true);
>>         /* all tasks in @from are being moved, all csets are source */
>>       spin_lock_irq(&css_set_lock);
>> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>>       } while (task && !ret);
>>   out_err:
>>       cgroup_migrate_finish(&mgctx);
>> -    cgroup_attach_unlock(true);
>> +    cgroup_attach_unlock(NULL, true, true);
>>       cgroup_unlock();
>>       return ret;
>>   }
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 312c6a8b55bb..4e1d80a2741f 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>>    * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>>    * CPU hotplug is disabled on entry.
>>    */
>> -void cgroup_attach_lock(bool lock_threadgroup)
>> +void cgroup_attach_lock(struct task_struct *tsk,
>> +                   bool lock_threadgroup, bool global)
>>   {
>>       cpus_read_lock();
>> -    if (lock_threadgroup)
>> -        percpu_down_write(&cgroup_threadgroup_rwsem);
>> +    if (lock_threadgroup) {
>> +        if (global)
>> +            percpu_down_write(&cgroup_threadgroup_rwsem);
>> +        else
>> +            down_write(&tsk->signal->group_rwsem);
>> +    }
>>   }
> 
> tsk is only used when global is false. So you can eliminate the redundant global argument and use
> the presence of tsk to indicate which lock to take.
> 
> Cheers,
> Longman
> 

-- 
Best regards,
Ridong



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 20:45     ` Tejun Heo
@ 2025-09-04  1:40       ` Chen Ridong
  2025-09-04  6:43         ` escape
  2025-09-04  6:52         ` Tejun Heo
  0 siblings, 2 replies; 44+ messages in thread
From: Chen Ridong @ 2025-09-04  1:40 UTC
  To: Tejun Heo, Michal Koutný; +Cc: Yi Tao, hannes, cgroups, linux-kernel



On 2025/9/4 4:45, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote:
>> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
>>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
>>> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
>>> from cgroup side.
>>
>> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
>> migration, no? Its effect is atomicity wrt clone.
>> Or, Tejun, what do you mean that it disappears? (I think we cannot give
>> up cgroup_mutex as it ensures synchronization of possible parent's
>> migration.)
> 
> Sorry, I was confused. We no longer need to write lock threadgroup rwsem
> when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
> cgroup_mutex.
> 
>   671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")
> 
> Thanks.
> 

I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when
updating csses on an empty subtree") only applies to CSS updates. However, cloning with
CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem.

cgroup_can_fork()
  cgroup_css_set_fork()
    if (kargs->flags & CLONE_INTO_CGROUP)
            cgroup_lock();
    cgroup_threadgroup_change_begin(current);

-- 
Best regards,
Ridong



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 16:53 ` Tejun Heo
  2025-09-03 20:03   ` Michal Koutný
@ 2025-09-04  3:15   ` escape
  2025-09-04  6:38     ` escape
  2025-09-04  7:28     ` Tejun Heo
  1 sibling, 2 replies; 44+ messages in thread
From: escape @ 2025-09-04  3:15 UTC
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel


On 2025/9/4 00:53, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
> from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP?
>
> Thanks.
>
As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
still requires holding the threadgroup_rwsem, so contention with fork
operations persists.

CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
and deletion, but its usage comes with significant limitations:

1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
adoption is gradually increasing, many applications have not yet been
adapted to cgroup v2, and phasing out cgroup v1 will be a long and
gradual process.


2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
time of process fork, effectively restricting cgroup migration to the
fork stage. This differs significantly from the typical cgroup attach
workflow. For example, in Kubernetes, systemd is the recommended cgroup
driver; kubelet communicates with systemd via D-Bus, and systemd
performs the actual cgroup attachment. In this case, the process being
attached typically does not have systemd as its parent. Using
CLONE_INTO_CGROUP in such a scenario is impractical and would require
coordinated changes to both systemd and kubelet.

Thanks.



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 13:14 ` Waiman Long
  2025-09-04  1:35   ` Chen Ridong
@ 2025-09-04  4:59   ` escape
  2025-09-04  5:02   ` escape
  2 siblings, 0 replies; 44+ messages in thread
From: escape @ 2025-09-04  4:59 UTC
  To: Waiman Long, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel



On 2025/9/3 21:14, Waiman Long wrote:
>
> On 9/3/25 7:11 AM, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
>>
>> To alleviate the contention between cgroup process migration and
>> operations like process fork, this patch takes the write lock on
>> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
>> instead of holding the global write lock.
>>
>> Cgroup process migration has historically relied on
>> signal_struct->group_rwsem to protect thread group integrity. Commit
>> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
>> a global percpu_rwsem") changed this to the global
>> cgroup_threadgroup_rwsem, whose advantage was simplified handling of
>> process group migrations. This patch retains the global lock for
>> protecting process group migration, while reducing contention by using
>> a per-thread-group lock during cgroup.procs/threads writes.
>>
>> The locking behavior is as follows:
>>
>> write cgroup.procs/threads  | process fork,exec,exit | process group migration
>> ------------------------------------------------------------------------------
>> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
>> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
>> critical section            | critical section       | critical section
>> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
>> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>>
>> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
>> signal_struct->group_rwsem.
>>
>> This patch eliminates contention between cgroup migration and fork
>> operations for threads that belong to different thread groups, thereby
>> reducing the long-tail latency of cgroup migrations and lowering system
>> load.
> Do you have any performance numbers to show how much performance
> benefit this change provides?
Yes, here is the simulation test setup and results.

Machine Configuration:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8475B
BIOS Model name: Intel(R) Xeon(R) Platinum 8475B
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191

Test Program 1: cgroup_attach, four threads each write to cgroup.procs
every 10ms, simulating frequent cgroup process migrations.
Test Program 2: stress-ng or UnixBench, used to generate CPU- and
fork-intensive workloads.

Terminal 1: taskset -ac 0-47 ./cgroup_attach
Terminal 2: taskset -ac 48-95 stress-ng --cpu 48 --forkheavy 4

We used bpftrace to monitor the latency of cgroup1_procs_write.
Without the patch: the tail latency is 30ms.
With the patch: the tail latency is 16μs.

Replace stress-ng with UnixBench to evaluate fork performance:

Terminal 2: taskset -ac 48-95 ./Run spawn

Without the patch: Multi-CPU score = 15753.9
With the patch: Multi-CPU score = 17111.2

We can observe that the latency of cgroup process migration has been
significantly reduced, and the performance of process forking has
improved.

In production environments running big data computing workloads, without
this patch, we observed a large number of processes entering the
uninterruptible sleep (D) state due to contention on
`cgroup_threadgroup_rwsem`. With the patch applied, the system load
average has decreased.

cgroup_attach.c
```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/stat.h>     /* mkdir() */
#include <sys/wait.h>     /* waitpid() */
#include <sys/syscall.h>
#include <fcntl.h>
#include <string.h>

#define NUM_THREADS 4
#define SLEEP_US 10000  /* 10ms */
#define CGROUP_ROOT "/sys/fs/cgroup/cpu"

pid_t gettid(void)
{
    return syscall(SYS_gettid);
}

/* Each thread creates its own cgroup, then repeatedly forks a
 * short-lived child and writes the child's pid into that cgroup,
 * simulating frequent cgroup process migrations. */
void* thread_func(void* arg)
{
    char dirname[256];
    char procs_file[256];
    int fd;
    pid_t child_pid;

    pid_t tid = gettid();

    snprintf(dirname, sizeof(dirname), "%s/test-%d", CGROUP_ROOT, tid);
    snprintf(procs_file, sizeof(procs_file), "%s/cgroup.procs", dirname);

    if (mkdir(dirname, 0755) != 0) {
        perror(dirname);
        pthread_exit(NULL);
    }

    printf("thread %d: cgroup dir %s\n", tid, dirname);

    while (1) {
        child_pid = fork();

        if (child_pid < 0) {
            perror("fork");
            continue;
        }

        if (child_pid == 0) {
            usleep(SLEEP_US);
            exit(0);
        }

        fd = open(procs_file, O_WRONLY);
        if (fd < 0) {
            perror("open cgroup.procs");
        } else {
            char pid_str[16];
            int len = snprintf(pid_str, sizeof(pid_str), "%d\n", child_pid);

            if (write(fd, pid_str, len) < 0)
                perror("write cgroup.procs");
            close(fd);
        }
        waitpid(child_pid, NULL, 0);
        usleep(SLEEP_US);
    }

    pthread_exit(NULL);
}

int main() {
    pthread_t threads[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, thread_func, NULL) != 0) {
            perror("pthread_create");
            break;
        }
    }

    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    return 0;
}
```

>> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
>> ---
>>   include/linux/cgroup-defs.h     |  2 ++
>>   include/linux/sched/signal.h    |  4 +++
>>   init/init_task.c                |  3 ++
>>   kernel/cgroup/cgroup-internal.h |  6 ++--
>>   kernel/cgroup/cgroup-v1.c       |  8 ++---
>>   kernel/cgroup/cgroup.c          | 56 ++++++++++++++++++++-------------
>>   kernel/fork.c                   |  4 +++
>>   7 files changed, 55 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index 6b93a64115fe..8e0fdad8a440 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -838,6 +838,7 @@ struct cgroup_of_peak {
>>   static inline void cgroup_threadgroup_change_begin(struct 
>> task_struct *tsk)
>>   {
>>       percpu_down_read(&cgroup_threadgroup_rwsem);
>> +    down_read(&tsk->signal->group_rwsem);
>>   }
>>     /**
>> @@ -848,6 +849,7 @@ static inline void 
>> cgroup_threadgroup_change_begin(struct task_struct *tsk)
>>    */
>>   static inline void cgroup_threadgroup_change_end(struct task_struct 
>> *tsk)
>>   {
>> +    up_read(&tsk->signal->group_rwsem);
>>       percpu_up_read(&cgroup_threadgroup_rwsem);
>>   }
>>   diff --git a/include/linux/sched/signal.h 
>> b/include/linux/sched/signal.h
>> index 1ef1edbaaf79..86fbc99a9174 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -226,6 +226,10 @@ struct signal_struct {
>>       struct tty_audit_buf *tty_audit_buf;
>>   #endif
>>   +#ifdef CONFIG_CGROUPS
>> +    struct rw_semaphore group_rwsem;
>> +#endif
>> +
>>       /*
>>        * Thread is the potential origin of an oom condition; kill 
>> first on
>>        * oom
>> diff --git a/init/init_task.c b/init/init_task.c
>> index e557f622bd90..0450093924a7 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
>>       },
>>       .multiprocess    = HLIST_HEAD_INIT,
>>       .rlim        = INIT_RLIMITS,
>> +#ifdef CONFIG_CGROUPS
>> +    .group_rwsem    = __RWSEM_INITIALIZER(init_signals.group_rwsem),
>> +#endif
>>       .cred_guard_mutex = 
>> __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>>       .exec_update_lock = 
>> __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>>   #ifdef CONFIG_POSIX_TIMERS
>> diff --git a/kernel/cgroup/cgroup-internal.h 
>> b/kernel/cgroup/cgroup-internal.h
>> index b14e61c64a34..572c24b7e947 100644
>> --- a/kernel/cgroup/cgroup-internal.h
>> +++ b/kernel/cgroup/cgroup-internal.h
>> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, 
>> bool threadgroup,
>>     int cgroup_attach_task(struct cgroup *dst_cgrp, struct 
>> task_struct *leader,
>>                  bool threadgroup);
>> -void cgroup_attach_lock(bool lock_threadgroup);
>> -void cgroup_attach_unlock(bool lock_threadgroup);
>> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup,
>> +            bool global);
>> +void cgroup_attach_unlock(struct task_struct *tsk, bool 
>> lock_threadgroup,
>> +              bool global);
>>   struct task_struct *cgroup_procs_write_start(char *buf, bool 
>> threadgroup,
>>                            bool *locked)
>>       __acquires(&cgroup_threadgroup_rwsem);
>> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
>> index 2a4a387f867a..3e5ead8c5bc5 100644
>> --- a/kernel/cgroup/cgroup-v1.c
>> +++ b/kernel/cgroup/cgroup-v1.c
>> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct 
>> *from, struct task_struct *tsk)
>>       int retval = 0;
>>         cgroup_lock();
>> -    cgroup_attach_lock(true);
>> +    cgroup_attach_lock(NULL, true, true);
>>       for_each_root(root) {
>>           struct cgroup *from_cgrp;
>>   @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct 
>> *from, struct task_struct *tsk)
>>           if (retval)
>>               break;
>>       }
>> -    cgroup_attach_unlock(true);
>> +    cgroup_attach_unlock(NULL, true, true);
>>       cgroup_unlock();
>>         return retval;
>> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, 
>> struct cgroup *from)
>>         cgroup_lock();
>>   -    cgroup_attach_lock(true);
>> +    cgroup_attach_lock(NULL, true, true);
>>         /* all tasks in @from are being moved, all csets are source */
>>       spin_lock_irq(&css_set_lock);
>> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, 
>> struct cgroup *from)
>>       } while (task && !ret);
>>   out_err:
>>       cgroup_migrate_finish(&mgctx);
>> -    cgroup_attach_unlock(true);
>> +    cgroup_attach_unlock(NULL, true, true);
>>       cgroup_unlock();
>>       return ret;
>>   }
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 312c6a8b55bb..4e1d80a2741f 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>>    * write-locking cgroup_threadgroup_rwsem. This allows ->attach() 
>> to assume that
>>    * CPU hotplug is disabled on entry.
>>    */
>> -void cgroup_attach_lock(bool lock_threadgroup)
>> +void cgroup_attach_lock(struct task_struct *tsk,
>> +                   bool lock_threadgroup, bool global)
>>   {
>>       cpus_read_lock();
>> -    if (lock_threadgroup)
>> -        percpu_down_write(&cgroup_threadgroup_rwsem);
>> +    if (lock_threadgroup) {
>> +        if (global)
>> +            percpu_down_write(&cgroup_threadgroup_rwsem);
>> +        else
>> +            down_write(&tsk->signal->group_rwsem);
>> +    }
>>   }
>
> tsk is only used when global is false. So you can eliminate the 
> redundant global argument and use the presence of tsk to indicate 
> which lock to take.
>
> Cheers,
> Longman



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 13:14 ` Waiman Long
  2025-09-04  1:35   ` Chen Ridong
  2025-09-04  4:59   ` escape
@ 2025-09-04  5:02   ` escape
  2 siblings, 0 replies; 44+ messages in thread
From: escape @ 2025-09-04  5:02 UTC
  To: Waiman Long, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel




On 2025/9/3 21:14, Waiman Long wrote:
>
> On 9/3/25 7:11 AM, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
>>
>> To alleviate the contention between cgroup process migration and
>> operations like process fork, this patch takes the write lock on
>> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
>> instead of holding the global write lock.
>>
>> Cgroup process migration has historically relied on
>> signal_struct->group_rwsem to protect thread group integrity. Commit
>> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
>> a global percpu_rwsem") changed this to the global
>> cgroup_threadgroup_rwsem, whose advantage was simplified handling of
>> process group migrations. This patch retains the global lock for
>> protecting process group migration, while reducing contention by using
>> a per-thread-group lock during cgroup.procs/threads writes.
>>
>> The locking behavior is as follows:
>>
>> write cgroup.procs/threads  | process fork,exec,exit | process group migration
>> ------------------------------------------------------------------------------
>> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
>> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
>> critical section            | critical section       | critical section
>> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
>> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>>
>> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
>> signal_struct->group_rwsem.
>>
>> This patch eliminates contention between cgroup migration and fork
>> operations for threads that belong to different thread groups, thereby
>> reducing the long-tail latency of cgroup migrations and lowering system
>> load.
> Do you have any performance numbers to show how much performance
> benefit this change provides?
>> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
>> ---
>>   include/linux/cgroup-defs.h     |  2 ++
>>   include/linux/sched/signal.h    |  4 +++
>>   init/init_task.c                |  3 ++
>>   kernel/cgroup/cgroup-internal.h |  6 ++--
>>   kernel/cgroup/cgroup-v1.c       |  8 ++---
>>   kernel/cgroup/cgroup.c          | 56 ++++++++++++++++++++-------------
>>   kernel/fork.c                   |  4 +++
>>   7 files changed, 55 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index 6b93a64115fe..8e0fdad8a440 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -838,6 +838,7 @@ struct cgroup_of_peak {
>>   static inline void cgroup_threadgroup_change_begin(struct 
>> task_struct *tsk)
>>   {
>>       percpu_down_read(&cgroup_threadgroup_rwsem);
>> +    down_read(&tsk->signal->group_rwsem);
>>   }
>>     /**
>> @@ -848,6 +849,7 @@ static inline void 
>> cgroup_threadgroup_change_begin(struct task_struct *tsk)
>>    */
>>   static inline void cgroup_threadgroup_change_end(struct task_struct 
>> *tsk)
>>   {
>> +    up_read(&tsk->signal->group_rwsem);
>>       percpu_up_read(&cgroup_threadgroup_rwsem);
>>   }
>>   diff --git a/include/linux/sched/signal.h 
>> b/include/linux/sched/signal.h
>> index 1ef1edbaaf79..86fbc99a9174 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -226,6 +226,10 @@ struct signal_struct {
>>       struct tty_audit_buf *tty_audit_buf;
>>   #endif
>>   +#ifdef CONFIG_CGROUPS
>> +    struct rw_semaphore group_rwsem;
>> +#endif
>> +
>>       /*
>>        * Thread is the potential origin of an oom condition; kill 
>> first on
>>        * oom
>> diff --git a/init/init_task.c b/init/init_task.c
>> index e557f622bd90..0450093924a7 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
>>       },
>>       .multiprocess    = HLIST_HEAD_INIT,
>>       .rlim        = INIT_RLIMITS,
>> +#ifdef CONFIG_CGROUPS
>> +    .group_rwsem    = __RWSEM_INITIALIZER(init_signals.group_rwsem),
>> +#endif
>>       .cred_guard_mutex = 
>> __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>>       .exec_update_lock = 
>> __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>>   #ifdef CONFIG_POSIX_TIMERS
>> diff --git a/kernel/cgroup/cgroup-internal.h 
>> b/kernel/cgroup/cgroup-internal.h
>> index b14e61c64a34..572c24b7e947 100644
>> --- a/kernel/cgroup/cgroup-internal.h
>> +++ b/kernel/cgroup/cgroup-internal.h
>> @@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, 
>> bool threadgroup,
>>     int cgroup_attach_task(struct cgroup *dst_cgrp, struct 
>> task_struct *leader,
>>                  bool threadgroup);
>> -void cgroup_attach_lock(bool lock_threadgroup);
>> -void cgroup_attach_unlock(bool lock_threadgroup);
>> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup,
>> +            bool global);
>> +void cgroup_attach_unlock(struct task_struct *tsk, bool 
>> lock_threadgroup,
>> +              bool global);
>>   struct task_struct *cgroup_procs_write_start(char *buf, bool 
>> threadgroup,
>>                            bool *locked)
>>       __acquires(&cgroup_threadgroup_rwsem);
>> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
>> index 2a4a387f867a..3e5ead8c5bc5 100644
>> --- a/kernel/cgroup/cgroup-v1.c
>> +++ b/kernel/cgroup/cgroup-v1.c
>> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct 
>> *from, struct task_struct *tsk)
>>       int retval = 0;
>>         cgroup_lock();
>> -    cgroup_attach_lock(true);
>> +    cgroup_attach_lock(NULL, true, true);
>>       for_each_root(root) {
>>           struct cgroup *from_cgrp;
>>   @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct 
>> *from, struct task_struct *tsk)
>>           if (retval)
>>               break;
>>       }
>> -    cgroup_attach_unlock(true);
>> +    cgroup_attach_unlock(NULL, true, true);
>>       cgroup_unlock();
>>         return retval;
>> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, 
>> struct cgroup *from)
>>         cgroup_lock();
>>   -    cgroup_attach_lock(true);
>> +    cgroup_attach_lock(NULL, true, true);
>>         /* all tasks in @from are being moved, all csets are source */
>>       spin_lock_irq(&css_set_lock);
>> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, 
>> struct cgroup *from)
>>       } while (task && !ret);
>>   out_err:
>>       cgroup_migrate_finish(&mgctx);
>> -    cgroup_attach_unlock(true);
>> +    cgroup_attach_unlock(NULL, true, true);
>>       cgroup_unlock();
>>       return ret;
>>   }
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 312c6a8b55bb..4e1d80a2741f 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>>    * write-locking cgroup_threadgroup_rwsem. This allows ->attach() 
>> to assume that
>>    * CPU hotplug is disabled on entry.
>>    */
>> -void cgroup_attach_lock(bool lock_threadgroup)
>> +void cgroup_attach_lock(struct task_struct *tsk,
>> +                   bool lock_threadgroup, bool global)
>>   {
>>       cpus_read_lock();
>> -    if (lock_threadgroup)
>> -        percpu_down_write(&cgroup_threadgroup_rwsem);
>> +    if (lock_threadgroup) {
>> +        if (global)
>> +            percpu_down_write(&cgroup_threadgroup_rwsem);
>> +        else
>> +            down_write(&tsk->signal->group_rwsem);
>> +    }
>>   }
>
> tsk is only used when global is false. So you can eliminate the 
> redundant global argument and use the presence of tsk to indicate 
> which lock to take.
>
> Cheers,
> Longman
Thank you for your suggestion. As you pointed out, we can determine
which lock to acquire based on whether the task is NULL. I'll
incorporate this change in the next version of the patch.


Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  3:15   ` escape
@ 2025-09-04  6:38     ` escape
  2025-09-04  7:28     ` Tejun Heo
  1 sibling, 0 replies; 44+ messages in thread
From: escape @ 2025-09-04  6:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel



On 2025/9/4 11:15, escape wrote:
>
> On 2025/9/4 00:53, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote:
>>> As computer hardware advances, modern systems are typically equipped
>>> with many CPU cores and large amounts of memory, enabling the deployment
>>> of numerous applications. On such systems, container creation and
>>> deletion become frequent operations, making cgroup process migration no
>>> longer a cold path. This leads to noticeable contention with common
>>> process operations such as fork, exec, and exit.
>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold.
>> It disappears completely and CLONE_INTO_CGROUP doesn't need any global
>> locks from cgroup side. Are there reasons why you can't use
>> CLONE_INTO_CGROUP?
>>
>> Thanks.
>>
> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
> still requires holding the threadgroup_rwsem, so contention with fork
> operations persists.
Sorry, my understanding here was wrong: using CLONE_INTO_CGROUP can indeed
avoid the race condition with fork, but the limitations quoted below do
still exist.
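
For reference, a minimal sketch of seeding a cgroup with CLONE_INTO_CGROUP
via clone3() (hypothetical helper, error handling trimmed; needs kernel
headers from v5.7 or later for struct clone_args and CLONE_INTO_CGROUP):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */

/* fork a child directly into an existing cgroup2 directory */
static pid_t fork_into_cgroup(int cgrp_fd)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	args.flags       = CLONE_INTO_CGROUP;
	args.cgroup      = cgrp_fd;	/* fd of e.g. /sys/fs/cgroup/job1 */
	args.exit_signal = SIGCHLD;

	/* clone3() has no glibc wrapper; invoke the syscall directly */
	return syscall(SYS_clone3, &args, sizeof(args));
}

The child starts life already attached to the target cgroup, so no write
to cgroup.procs (and therefore no write-side locking) is involved.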

Thanks.
>
> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
> and deletion, but its usage comes with significant limitations:
>
> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
> adoption is gradually increasing, many applications have not yet been
> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
> gradual process.
>
>
> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
> time of process fork, effectively restricting cgroup migration to the
> fork stage. This differs significantly from the typical cgroup attach
> workflow. For example, in Kubernetes, systemd is the recommended cgroup
> driver; kubelet communicates with systemd via D-Bus, and systemd
> performs the actual cgroup attachment. In this case, the process being
> attached typically does not have systemd as its parent. Using
> CLONE_INTO_CGROUP in such a scenario is impractical and would require
> coordinated changes to both systemd and kubelet.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  1:40       ` Chen Ridong
@ 2025-09-04  6:43         ` escape
  2025-09-04  6:52         ` Tejun Heo
  1 sibling, 0 replies; 44+ messages in thread
From: escape @ 2025-09-04  6:43 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Michal Koutný; +Cc: hannes, cgroups, linux-kernel



On 2025/9/4 09:40, Chen Ridong wrote:
>
> On 2025/9/4 4:45, Tejun Heo wrote:
>> Hello, Michal.
>>
>> On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote:
>>> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
>>>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
>>>> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
>>>> from cgroup side.
>>> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
>>> migration, no? Its effect is atomicity wrt clone.
>>> Or, Tejun, what do you mean that it disappears? (I think we cannot give
>>> up cgroup_mutex as it ensures synchronization of possible parent's
>>> migration.)
>> Sorry, I was confused. We no longer need to write lock threadgroup rwsem
>> when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
>> cgroup_mutex.
>>
>>    671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")
>>
>> Thanks.
>>
> I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when
> updating csses on an empty subtree") only applies to CSS updates. However, cloning with
> CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem.
>
> cgroup_can_fork
>    cgroup_css_set_fork
>      	if (kargs->flags & CLONE_INTO_CGROUP)
> 		cgroup_lock();
> 	cgroup_threadgroup_change_begin(current);
>
When using CLONE_INTO_CGROUP, there is no need to write cgroup.procs. So
although a read lock is required here, there is no contention with the
write lock.

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  1:40       ` Chen Ridong
  2025-09-04  6:43         ` escape
@ 2025-09-04  6:52         ` Tejun Heo
  1 sibling, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-04  6:52 UTC (permalink / raw)
  To: Chen Ridong; +Cc: Michal Koutný, Yi Tao, hannes, cgroups, linux-kernel

Hello,

On Thu, Sep 04, 2025 at 09:40:12AM +0800, Chen Ridong wrote:
...
> > Sorry, I was confused. We no longer need to write lock threadgroup rwsem
> > when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
> > cgroup_mutex.
> > 
> >   671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")
> > 
> > Thanks.
> > 
> 
> I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when
> updating csses on an empty subtree") only applies to CSS updates. However, cloning with
> CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem.
> 
> cgroup_can_fork
>   cgroup_css_set_fork
>     	if (kargs->flags & CLONE_INTO_CGROUP)
> 		cgroup_lock();
> 	cgroup_threadgroup_change_begin(current);

Ah, yeah, I'm misremembering things, sorry. What got elided in that commit
is down_write of threadgroup_rwsem when enabling controllers on empty
cgroups, which was the only operation which still needed to down_write the
rwsem. Here's an excerpt from the commit message:

    After this optimization, the usage pattern of creating a cgroup, enabling
    the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and
    then removing the cgroup after it becomes empty doesn't need to write-lock
    threadgroup_rwsem at all.

It's true that cgroup_threadgroup_change_begin() down_reads the
threadgroup_rwsem but that is a percpu_rwsem whose read operations are
percpu inc/dec. This doesn't add any noticeable overhead, nor does it have
any scalability concerns.
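
(As a concrete illustration of the read-side bracket being discussed, a
simplified sketch of the fork path, not the exact kernel code:

	/* in copy_process(), simplified */
	cgroup_threadgroup_change_begin(current);	/* percpu_down_read() */
	/* ...the clone-sensitive middle section... */
	cgroup_threadgroup_change_end(current);		/* percpu_up_read() */

In the uncontended case each side boils down to a per-CPU counter update,
which is why the read side stays cheap.)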

So, if you follow the "recommended" workflow, the only remaining possible
scalability bottleneck is cgroup_mutex.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  3:15   ` escape
  2025-09-04  6:38     ` escape
@ 2025-09-04  7:28     ` Tejun Heo
  2025-09-04  8:10       ` escape
  1 sibling, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2025-09-04  7:28 UTC (permalink / raw)
  To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
> On 2025/9/4 00:53, Tejun Heo wrote:
> > Hello,
...
> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
> still requires holding the threadgroup_rwsem, so contention with fork
> operations persists.

Sorry for fumbling my explanations repeatedly, but this isn't true. On
cgroup2, if you create a cgroup, enable controllers and then seed it with
CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
remaining contention point is cgroup_mutex.

> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
> and deletion, but its usage comes with significant limitations:
> 
> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
> adoption is gradually increasing, many applications have not yet been
> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
> gradual process.
> 
> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
> time of process fork, effectively restricting cgroup migration to the
> fork stage. This differs significantly from the typical cgroup attach
> workflow. For example, in Kubernetes, systemd is the recommended cgroup
> driver; kubelet communicates with systemd via D-Bus, and systemd
> performs the actual cgroup attachment. In this case, the process being
> attached typically does not have systemd as its parent. Using
> CLONE_INTO_CGROUP in such a scenario is impractical and would require
> coordinated changes to both systemd and kubelet.

A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
locking to avoid adding overhead to hot paths - fork and exit - because
cgroup operations were expected to be a lot colder. Now, threadgroup rwsem
is *really* expensive for the writers, so the trade-off could be a bit too
extreme for some use cases.

However, now that the most common usage pattern doesn't involve
threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
overhead to work around usage patterns that we want to move away from. Note
that dynamic migrations have other more fundamental problems for stateful
resources and we generally want to move away from it. Sure, a single rwsem
operation in fork/exit isn't a lot of overhead but it isn't nothing either
and this will impact everybody.

Maybe we can make it a mount option so that use cases that still depend on
it can toggle it on? In fact, there's already favordynmods mount option
which seems like a good fit. Maybe put the extra locking behind that flag?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  7:28     ` Tejun Heo
@ 2025-09-04  8:10       ` escape
  0 siblings, 0 replies; 44+ messages in thread
From: escape @ 2025-09-04  8:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel



On 2025/9/4 15:28, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
>> On 2025/9/4 00:53, Tejun Heo wrote:
>>> Hello,
> ...
>> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
>> still requires holding the threadgroup_rwsem, so contention with fork
>> operations persists.
> Sorry for fumbling my explanations repeatedly, but this isn't true. On
> cgroup2, if you create a cgroup, enable controllers and then seed it with
> CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
> remaining contention point is cgroup_mutex.
>
>> CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
>> and deletion, but its usage comes with significant limitations:
>>
>> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
>> adoption is gradually increasing, many applications have not yet been
>> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
>> gradual process.
>>
>> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
>> time of process fork, effectively restricting cgroup migration to the
>> fork stage. This differs significantly from the typical cgroup attach
>> workflow. For example, in Kubernetes, systemd is the recommended cgroup
>> driver; kubelet communicates with systemd via D-Bus, and systemd
>> performs the actual cgroup attachment. In this case, the process being
>> attached typically does not have systemd as its parent. Using
>> CLONE_INTO_CGROUP in such a scenario is impractical and would require
>> coordinated changes to both systemd and kubelet.
> A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
> locking to avoid adding overhead to hot paths - fork and exit - because
> cgroup operations were expected to be a lot colder. Now, threadgroup rwsem
> is *really* expensive for the writers, so the trade-off could be a bit too
> extreme for some use cases.
>
> However, now that the most common usage pattern doesn't involve
> threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
> overhead to work around usage patterns that we want to move away from. Note
> that dynamic migrations have other more fundamental problems for stateful
> resources and we generally want to move away from it. Sure, a single rwsem
> operation in fork/exit isn't a lot of overhead but it isn't nothing either
> and this will impact everybody.
>
> Maybe we can make it a mount option so that use cases that still depend on
> it can toggle it on? In fact, there's already favordynmods mount option
> which seems like a good fit. Maybe put the extra locking behind that flag?
>
> Thanks.
>
Thank you for your reply.

I agree that mounting with cgroupv2, creating a cgroup, enabling
controllers, and then seeding it with CLONE_INTO_CGROUP is an excellent
solution. However, we've encountered significant obstacles when applying
this approach on some older systems. We are simultaneously working on
enabling CLONE_INTO_CGROUP support in runc, systemd, and other components,
but this will take some time. This patch aims to alleviate the issue to
some extent during this transitional period.

Regarding the impact of the extra rwsem operations on hot paths, I have
conducted performance testing. In cases where there is no contention on
down_write, the UnixBench spawn test scores remain unaffected.

The suggestion to use the favordynmods flag to control whether the extra
rwsem is used is excellent. I will incorporate this condition in the next
version of the patch.

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 0/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
  2025-09-03 13:14 ` Waiman Long
  2025-09-03 16:53 ` Tejun Heo
@ 2025-09-04 11:39 ` Yi Tao
  2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
  2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Yi Tao @ 2025-09-04 11:39 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Changes in v2:
- Use favordynmods as the enabling switch.
- Determine whether to use the per-thread-group rwsem based on whether
the task is NULL.
- Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside
rcu_read_lock.

Yi Tao (1):
  cgroup: replace global percpu_rwsem with signal_struct->group_rwsem
    when writing cgroup.procs/threads

 include/linux/cgroup-defs.h     |  6 +++
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 72 +++++++++++++++++++--------------
 kernel/fork.c                   |  4 ++
 7 files changed, 64 insertions(+), 37 deletions(-)

-- 
2.32.0.3.g01195cf9f


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
@ 2025-09-04 11:39   ` Yi Tao
  2025-09-04 16:31     ` Tejun Heo
                       ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-04 11:39 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking so that
writing a pid to cgroup.procs/threads takes the write lock on
signal_struct->group_rwsem instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

To avoid affecting other users, the per-thread-group rwsem is only used
when the `favordynmods` flag is enabled.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     |  6 +++
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 72 +++++++++++++++++++--------------
 kernel/fork.c                   |  4 ++
 7 files changed, 64 insertions(+), 37 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..0c068dc3e08d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -828,6 +828,8 @@ struct cgroup_of_peak {
 	struct list_head	list;
 };
 
+extern bool have_favordynmods;
+
 /**
  * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
  * @tsk: target task
@@ -838,6 +840,8 @@ struct cgroup_of_peak {
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (have_favordynmods)
+		down_read(&tsk->signal->group_rwsem);
 }
 
 /**
@@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (have_favordynmods)
+		up_read(&tsk->signal->group_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..86fbc99a9174 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore group_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..0450093924a7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.group_rwsem	= __RWSEM_INITIALIZER(init_signals.group_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..35005543f0c7 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup);
+void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..9f1a4d1fc741 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..22b1659b623c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -214,7 +214,7 @@ static u16 have_exit_callback __read_mostly;
 static u16 have_release_callback __read_mostly;
 static u16 have_canfork_callback __read_mostly;
 
-static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
+bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
@@ -2459,7 +2459,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @tsk: thread group to lock
+ * @lock_threadgroup: whether to down_write rwsem
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && favor_dynmods)
+			down_write(&tsk->signal->group_rwsem);
+		else
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @tsk: thread group to lock
+ * @lock_threadgroup: whether to up_write rwsem
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && favor_dynmods)
+			up_write(&tsk->signal->group_rwsem);
+		else
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+	}
 	cpus_read_unlock();
 }
 
@@ -2976,24 +2986,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
-
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
 		}
 	} else {
 		tsk = current;
@@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
-	goto out_unlock_rcu;
 
-out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+	*threadgroup_locked = pid || threadgroup;
+
+	cgroup_attach_lock(tsk, *threadgroup_locked);
+
+	return tsk;
+
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
@@ -3032,7 +3042,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(task, threadgroup_locked);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3119,7 +3129,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(NULL, has_tasks);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3150,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(NULL, has_tasks);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..5218f9b93c77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->group_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
@ 2025-09-04 16:31     ` Tejun Heo
  2025-09-05  2:16       ` escape
  2025-09-05  2:02     ` Chen Ridong
  2025-09-05 13:17     ` kernel test robot
  2 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2025-09-04 16:31 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Thu, Sep 04, 2025 at 07:39:32PM +0800, Yi Tao wrote:
...
> To avoid affecting other users, the per-thread-group rwsem is only used
> when the `favordynmods` flag is enabled.

Can you please:

- Note that this isn't necessary for cgroup2's recommended workflow and is
  thus gated behind favordynmods.

- Include performance numbers briefly.

> +extern bool have_favordynmods;
> +
>  /**
>   * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
>   * @tsk: target task
> @@ -838,6 +840,8 @@ struct cgroup_of_peak {
>  static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>  {
>  	percpu_down_read(&cgroup_threadgroup_rwsem);
> +	if (have_favordynmods)
> +		down_read(&tsk->signal->group_rwsem);
>  }
>  
>  /**
> @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>   */
>  static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>  {
> +	if (have_favordynmods)
> +		up_read(&tsk->signal->group_rwsem);
>  	percpu_up_read(&cgroup_threadgroup_rwsem);

Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents
have_favordynmods flipping while a task is between change_begin and end?

> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..86fbc99a9174 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
>  
> +#ifdef CONFIG_CGROUPS
> +	struct rw_semaphore group_rwsem;
> +#endif

Maybe name it more specifically - e.g. cgroup_threadgroup_rwsem?

>  /**
>   * cgroup_attach_lock - Lock for ->attach()
> - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
> + * @tsk: thread group to lock
> + * @lock_threadgroup: whether to down_write rwsem
>   *
>   * cgroup migration sometimes needs to stabilize threadgroups against forks and
>   * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
> @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>   * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>   * CPU hotplug is disabled on entry.
>   */

Please expand the function comment to explain what's going on and why and
maybe point to it from a comment on top of favor_dynmods.

> -void cgroup_attach_lock(bool lock_threadgroup)
> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)

As @tsk is an optional argument, it'd probably make more sense to put it at
the end.

> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	 */
>  	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>  		tsk = ERR_PTR(-EINVAL);
> -		goto out_unlock_threadgroup;
> +		goto out_unlock_rcu;
>  	}
> -
>  	get_task_struct(tsk);
> -	goto out_unlock_rcu;
>  
> -out_unlock_threadgroup:
> -	cgroup_attach_unlock(*threadgroup_locked);
> -	*threadgroup_locked = false;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If we migrate a single thread, we don't care about threadgroup
> +	 * stability. If the thread is `current`, it won't exit(2) under our
> +	 * hands or change PID through exec(2). We exclude
> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> +	 * callers by cgroup_mutex.
> +	 * Therefore, we can skip the global lock.
> +	 */
> +	lockdep_assert_held(&cgroup_mutex);
> +	*threadgroup_locked = pid || threadgroup;
> +
> +	cgroup_attach_lock(tsk, *threadgroup_locked);

I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
group leader or signal struct before lock is grabbed?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
  2025-09-04 16:31     ` Tejun Heo
@ 2025-09-05  2:02     ` Chen Ridong
  2025-09-05 13:17     ` kernel test robot
  2 siblings, 0 replies; 44+ messages in thread
From: Chen Ridong @ 2025-09-05  2:02 UTC (permalink / raw)
  To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel



On 2025/9/4 19:39, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.
> 
> To alleviate the contention between cgroup process migration and
> operations like process fork, this patch changes the locking so that
> writing a pid to cgroup.procs/threads takes the write lock on
> signal_struct->group_rwsem instead of holding the global write lock.
> 
> Cgroup process migration has historically relied on
> signal_struct->group_rwsem to protect thread group integrity. In commit
> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem"), this was changed to a global
> cgroup_threadgroup_rwsem. The advantage of using a global lock was
> simplified handling of process group migrations. This patch retains the
> use of the global lock for protecting process group migration, while
> reducing contention by using a per-thread-group lock during
> cgroup.procs/threads writes.
> 
> The locking behavior is as follows:
> 
> write cgroup.procs/threads  | process fork,exec,exit | process group migration
> ------------------------------------------------------------------------------
> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
> critical section            | critical section       | critical section
> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
> 
> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
> signal_struct->group_rwsem.
> 
> This patch eliminates contention between cgroup migration and fork
> operations for threads that belong to different thread groups, thereby
> reducing the long-tail latency of cgroup migrations and lowering system
> load.
> 
> To avoid affecting other users, the per-thread-group rwsem is only used
> when the `favordynmods` flag is enabled.
> 
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
> ---
>  include/linux/cgroup-defs.h     |  6 +++
>  include/linux/sched/signal.h    |  4 ++
>  init/init_task.c                |  3 ++
>  kernel/cgroup/cgroup-internal.h |  4 +-
>  kernel/cgroup/cgroup-v1.c       |  8 ++--
>  kernel/cgroup/cgroup.c          | 72 +++++++++++++++++++--------------
>  kernel/fork.c                   |  4 ++
>  7 files changed, 64 insertions(+), 37 deletions(-)
> 
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6b93a64115fe..0c068dc3e08d 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -828,6 +828,8 @@ struct cgroup_of_peak {
>  	struct list_head	list;
>  };
>  
> +extern bool have_favordynmods;
> +
>  /**
>   * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
>   * @tsk: target task
> @@ -838,6 +840,8 @@ struct cgroup_of_peak {
>  static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>  {
>  	percpu_down_read(&cgroup_threadgroup_rwsem);
> +	if (have_favordynmods)
> +		down_read(&tsk->signal->group_rwsem);
>  }
>  
>  /**
> @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>   */
>  static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>  {
> +	if (have_favordynmods)
> +		up_read(&tsk->signal->group_rwsem);
>  	percpu_up_read(&cgroup_threadgroup_rwsem);
>  }
>  
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..86fbc99a9174 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
>  
> +#ifdef CONFIG_CGROUPS
> +	struct rw_semaphore group_rwsem;
> +#endif
> +
>  	/*
>  	 * Thread is the potential origin of an oom condition; kill first on
>  	 * oom
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..0450093924a7 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
>  	},
>  	.multiprocess	= HLIST_HEAD_INIT,
>  	.rlim		= INIT_RLIMITS,
> +#ifdef CONFIG_CGROUPS
> +	.group_rwsem	= __RWSEM_INITIALIZER(init_signals.group_rwsem),
> +#endif
>  	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>  	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>  #ifdef CONFIG_POSIX_TIMERS
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index b14e61c64a34..35005543f0c7 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>  
>  int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
>  		       bool threadgroup);
> -void cgroup_attach_lock(bool lock_threadgroup);
> -void cgroup_attach_unlock(bool lock_threadgroup);
> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup);
> +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup);
>  struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  					     bool *locked)
>  	__acquires(&cgroup_threadgroup_rwsem);
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 2a4a387f867a..9f1a4d1fc741 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>  	int retval = 0;
>  
>  	cgroup_lock();
> -	cgroup_attach_lock(true);
> +	cgroup_attach_lock(NULL, true);
>  	for_each_root(root) {
>  		struct cgroup *from_cgrp;
>  
> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>  		if (retval)
>  			break;
>  	}
> -	cgroup_attach_unlock(true);
> +	cgroup_attach_unlock(NULL, true);
>  	cgroup_unlock();
>  
>  	return retval;
> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>  
>  	cgroup_lock();
>  
> -	cgroup_attach_lock(true);
> +	cgroup_attach_lock(NULL, true);
>  
>  	/* all tasks in @from are being moved, all csets are source */
>  	spin_lock_irq(&css_set_lock);
> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>  	} while (task && !ret);
>  out_err:
>  	cgroup_migrate_finish(&mgctx);
> -	cgroup_attach_unlock(true);
> +	cgroup_attach_unlock(NULL, true);
>  	cgroup_unlock();
>  	return ret;
>  }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..22b1659b623c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -214,7 +214,7 @@ static u16 have_exit_callback __read_mostly;
>  static u16 have_release_callback __read_mostly;
>  static u16 have_canfork_callback __read_mostly;
>  
> -static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
> +bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
>  

If have_favordynmods is gating the acquisition of tsk->signal->group_rwsem, would it be better to
replace it with a static key?
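
(Roughly along these lines, sketched against the jump-label API; the key
name and placement are illustrative:

/* cgroup.c; the header would carry a matching DECLARE_STATIC_KEY_FALSE() */
DEFINE_STATIC_KEY_FALSE(cgroup_favordynmods_key);

static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
	percpu_down_read(&cgroup_threadgroup_rwsem);
	/* patched to a no-op branch while the key stays disabled */
	if (static_branch_unlikely(&cgroup_favordynmods_key))
		down_read(&tsk->signal->group_rwsem);
}

with a one-time static_branch_enable(&cgroup_favordynmods_key) at boot
when favordynmods is requested.)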

>  /* cgroup namespace for init task */
>  struct cgroup_namespace init_cgroup_ns = {
> @@ -2459,7 +2459,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>  
>  /**
>   * cgroup_attach_lock - Lock for ->attach()
> - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
> + * @tsk: thread group to lock
> + * @lock_threadgroup: whether to down_write rwsem
>   *
>   * cgroup migration sometimes needs to stabilize threadgroups against forks and
>   * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
> @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>   * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>   * CPU hotplug is disabled on entry.
>   */
> -void cgroup_attach_lock(bool lock_threadgroup)
> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
>  {
>  	cpus_read_lock();
> -	if (lock_threadgroup)
> -		percpu_down_write(&cgroup_threadgroup_rwsem);
> +	if (lock_threadgroup) {
> +		if (tsk && favor_dynmods)
> +			down_write(&tsk->signal->group_rwsem);
> +		else
> +			percpu_down_write(&cgroup_threadgroup_rwsem);
> +	}
>  }
>  
>  /**
>   * cgroup_attach_unlock - Undo cgroup_attach_lock()
> - * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
> + * @tsk: thread group to lock
> + * @lock_threadgroup: whether to up_write rwsem
>   */
> -void cgroup_attach_unlock(bool lock_threadgroup)
> +void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup)
>  {
> -	if (lock_threadgroup)
> -		percpu_up_write(&cgroup_threadgroup_rwsem);
> +	if (lock_threadgroup) {
> +		if (tsk && favor_dynmods)
> +			up_write(&tsk->signal->group_rwsem);
> +		else
> +			percpu_up_write(&cgroup_threadgroup_rwsem);
> +	}
>  	cpus_read_unlock();
>  }
>  
> @@ -2976,24 +2986,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
>  		return ERR_PTR(-EINVAL);
>  
> -	/*
> -	 * If we migrate a single thread, we don't care about threadgroup
> -	 * stability. If the thread is `current`, it won't exit(2) under our
> -	 * hands or change PID through exec(2). We exclude
> -	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> -	 * callers by cgroup_mutex.
> -	 * Therefore, we can skip the global lock.
> -	 */
> -	lockdep_assert_held(&cgroup_mutex);
> -	*threadgroup_locked = pid || threadgroup;
> -	cgroup_attach_lock(*threadgroup_locked);
> -
>  	rcu_read_lock();
>  	if (pid) {
>  		tsk = find_task_by_vpid(pid);
>  		if (!tsk) {
>  			tsk = ERR_PTR(-ESRCH);
> -			goto out_unlock_threadgroup;
> +			goto out_unlock_rcu;
>  		}
>  	} else {
>  		tsk = current;
> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	 */
>  	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>  		tsk = ERR_PTR(-EINVAL);
> -		goto out_unlock_threadgroup;
> +		goto out_unlock_rcu;
>  	}
> -
>  	get_task_struct(tsk);
> -	goto out_unlock_rcu;
>  
> -out_unlock_threadgroup:
> -	cgroup_attach_unlock(*threadgroup_locked);
> -	*threadgroup_locked = false;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If we migrate a single thread, we don't care about threadgroup
> +	 * stability. If the thread is `current`, it won't exit(2) under our
> +	 * hands or change PID through exec(2). We exclude
> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> +	 * callers by cgroup_mutex.
> +	 * Therefore, we can skip the global lock.
> +	 */
> +	lockdep_assert_held(&cgroup_mutex);
> +	*threadgroup_locked = pid || threadgroup;
> +
> +	cgroup_attach_lock(tsk, *threadgroup_locked);
> +
> +	return tsk;
> +
>  out_unlock_rcu:
>  	rcu_read_unlock();
>  	return tsk;
> @@ -3032,7 +3042,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
>  	/* release reference from cgroup_procs_write_start() */
>  	put_task_struct(task);
>  
> -	cgroup_attach_unlock(threadgroup_locked);
> +	cgroup_attach_unlock(task, threadgroup_locked);
>  
>  	for_each_subsys(ss, ssid)
>  		if (ss->post_attach)
> @@ -3119,7 +3129,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
>  	 * write-locking can be skipped safely.
>  	 */
>  	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
> -	cgroup_attach_lock(has_tasks);
> +	cgroup_attach_lock(NULL, has_tasks);
>  
>  	/* NULL dst indicates self on default hierarchy */
>  	ret = cgroup_migrate_prepare_dst(&mgctx);
> @@ -3140,7 +3150,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
>  	ret = cgroup_migrate_execute(&mgctx);
>  out_finish:
>  	cgroup_migrate_finish(&mgctx);
> -	cgroup_attach_unlock(has_tasks);
> +	cgroup_attach_unlock(NULL, has_tasks);
>  	return ret;
>  }
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index af673856499d..5218f9b93c77 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>  	tty_audit_fork(sig);
>  	sched_autogroup_fork(sig);
>  
> +#ifdef CONFIG_CGROUPS
> +	init_rwsem(&sig->group_rwsem);
> +#endif
> +
>  	sig->oom_score_adj = current->signal->oom_score_adj;
>  	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>  

-- 
Best regards,
Ridong


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 16:31     ` Tejun Heo
@ 2025-09-05  2:16       ` escape
  2025-09-05  2:27         ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: escape @ 2025-09-05  2:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel

Thanks for your review.

On 2025/9/5 00:31, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 04, 2025 at 07:39:32PM +0800, Yi Tao wrote:
> ...
>> To avoid affecting other users, the per-thread-group rwsem is only used
>> when the `favordynmods` flag is enabled.
> Can you please:
>
> - Note that this isn't necessary for cgroup2's recommended workflow and is
>    thus gated behind favordynmods.
>
> - Include performance numbers briefly.
>
>> +extern bool have_favordynmods;
>> +
>>   /**
>>    * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
>>    * @tsk: target task
>> @@ -838,6 +840,8 @@ struct cgroup_of_peak {
>>   static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>>   {
>>   	percpu_down_read(&cgroup_threadgroup_rwsem);
>> +	if (have_favordynmods)
>> +		down_read(&tsk->signal->group_rwsem);
>>   }
>>   
>>   /**
>> @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>>    */
>>   static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>>   {
>> +	if (have_favordynmods)
>> +		up_read(&tsk->signal->group_rwsem);
>>   	percpu_up_read(&cgroup_threadgroup_rwsem);
> Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents
> have_favordynmods flipping while a task is between change_begin and end?
have_favordynmods is read-only after initialization and will not change 
during runtime.
>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index 1ef1edbaaf79..86fbc99a9174 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -226,6 +226,10 @@ struct signal_struct {
>>   	struct tty_audit_buf *tty_audit_buf;
>>   #endif
>>   
>> +#ifdef CONFIG_CGROUPS
>> +	struct rw_semaphore group_rwsem;
>> +#endif
> Maybe name it more specifically - e.g. cgroup_threadgroup_rwsem?
>
>>   /**
>>    * cgroup_attach_lock - Lock for ->attach()
>> - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
>> + * @tsk: thread group to lock
>> + * @lock_threadgroup: whether to down_write rwsem
>>    *
>>    * cgroup migration sometimes needs to stabilize threadgroups against forks and
>>    * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
>> @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>>    * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>>    * CPU hotplug is disabled on entry.
>>    */
> Please expand the function comment to explain what's going on and why and
> maybe point to it from a comment on top of favor_dynmods.
>
>> -void cgroup_attach_lock(bool lock_threadgroup)
>> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
> As @tsk is an optional argument, it'd probably make more sense to put it at
> the end.
>
>> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>>   	 */
>>   	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>>   		tsk = ERR_PTR(-EINVAL);
>> -		goto out_unlock_threadgroup;
>> +		goto out_unlock_rcu;
>>   	}
>> -
>>   	get_task_struct(tsk);
>> -	goto out_unlock_rcu;
>>   
>> -out_unlock_threadgroup:
>> -	cgroup_attach_unlock(*threadgroup_locked);
>> -	*threadgroup_locked = false;
>> +	rcu_read_unlock();
>> +
>> +	/*
>> +	 * If we migrate a single thread, we don't care about threadgroup
>> +	 * stability. If the thread is `current`, it won't exit(2) under our
>> +	 * hands or change PID through exec(2). We exclude
>> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
>> +	 * callers by cgroup_mutex.
>> +	 * Therefore, we can skip the global lock.
>> +	 */
>> +	lockdep_assert_held(&cgroup_mutex);
>> +	*threadgroup_locked = pid || threadgroup;
>> +
>> +	cgroup_attach_lock(tsk, *threadgroup_locked);
> I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
> group leader or signal struct before lock is grabbed?
When a non-leader thread in a thread group executes the exec system call,
the thread group leader is updated, but the signal_struct remains unchanged,
so this part is safe.
>
> Thanks.

I will accept the remaining comments and apply them in the next version of
the patch.

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-05  2:16       ` escape
@ 2025-09-05  2:27         ` Tejun Heo
  2025-09-05  3:44           ` escape
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2025-09-05  2:27 UTC (permalink / raw)
  To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Fri, Sep 05, 2025 at 10:16:30AM +0800, escape wrote:
> > > +	if (have_favordynmods)
> > > +		up_read(&tsk->signal->group_rwsem);
> > >   	percpu_up_read(&cgroup_threadgroup_rwsem);
> > Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents
> > have_favordynmods flipping while a task is between change_begin and end?
>
> have_favordynmods is read-only after initialization and will not change
> during runtime.

I don't think that's necessarily true. favordynmods can also be specified as
a mount option and mount can race against forks, execs and exits. Also,
IIRC, cgroup2 doesn't allow remounts but there's nothing preventing someone
from unmounting and mounting it again with different options.

> > > @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> > >   	 */
> > >   	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
> > >   		tsk = ERR_PTR(-EINVAL);
> > > -		goto out_unlock_threadgroup;
> > > +		goto out_unlock_rcu;
> > >   	}
> > > -
> > >   	get_task_struct(tsk);
> > > -	goto out_unlock_rcu;
> > > -out_unlock_threadgroup:
> > > -	cgroup_attach_unlock(*threadgroup_locked);
> > > -	*threadgroup_locked = false;
> > > +	rcu_read_unlock();
> > > +
> > > +	/*
> > > +	 * If we migrate a single thread, we don't care about threadgroup
> > > +	 * stability. If the thread is `current`, it won't exit(2) under our
> > > +	 * hands or change PID through exec(2). We exclude
> > > +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> > > +	 * callers by cgroup_mutex.
> > > +	 * Therefore, we can skip the global lock.
> > > +	 */
> > > +	lockdep_assert_held(&cgroup_mutex);
> > > +	*threadgroup_locked = pid || threadgroup;
> > > +
> > > +	cgroup_attach_lock(tsk, *threadgroup_locked);
> > I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
> > group leader or signal struct before lock is grabbed?
>
> When a non-leader thread in a thread group executes the exec system call,
> the thread group leader is updated, but the signal_struct remains unchanged,
> so this part is safe.

But the leader can change, right? So, we can end up in a situation where
threadgroup is set but the task is not the leader which I think can lead to
really subtle incorrect behaviors like write succeeding but nothing
happening when racing against exec.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-05  2:27         ` Tejun Heo
@ 2025-09-05  3:44           ` escape
  2025-09-05  3:48             ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: escape @ 2025-09-05  3:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On 2025/9/5 10:27, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 05, 2025 at 10:16:30AM +0800, escape wrote:
>>>> +	if (have_favordynmods)
>>>> +		up_read(&tsk->signal->group_rwsem);
>>>>    	percpu_up_read(&cgroup_threadgroup_rwsem);
>>> Hmm... I wonder whether turning on/off the flag is racy. ie. what prevents
>>> have_favordynmods flipping while a task is between change_begin and end?
>> have_favordynmods is read-only after initialization and will not change
>> during runtime.
> I don't think that's necessarily true. favordynmods can also be specified as
> a mount option and mount can race against forks, execs and exits. Also,
> IIRC, cgroup2 doesn't allow remounts but there's nothing preventing someone
> from unmounting and mounting it again with different options.
There are two ways to set favordynmods. The first is through the kernel
config or the kernel command line parameter cgroup_favordynmods, which
sets the value of the variable have_favordynmods and automatically sets
the CGRP_ROOT_FAVOR_DYNMODS flag on cgroup_root->flags at mount time. The
second is by passing a mount option, which only sets the
CGRP_ROOT_FAVOR_DYNMODS flag on cgroup_root->flags during mount.

In this patch, the decision whether to use the per-threadgroup rwsem is
based on have_favordynmods, not on cgroup_root->flags. An umount & mount
affects cgroup_root->flags, but does not change have_favordynmods.

Indeed, there is a minor flaw here: if favordynmods is enabled via a mount
option rather than the kernel command line, the per-threadgroup rwsem will
not take effect.

To avoid the complexity of runtime changes to favordynmods, perhaps a
better approach would be to introduce a separate control variable,
configured via a kernel command line parameter such as
cgroup_migration_lock=[global|thread_group], to explicitly govern this
behavior. What do you think about this approach?
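
A minimal sketch of such a parameter (the name and exact semantics here
are hypothetical):

static bool cgroup_global_migration_lock __ro_after_init = true;

static int __init cgroup_migration_lock_setup(char *str)
{
	if (!strcmp(str, "thread_group"))
		cgroup_global_migration_lock = false;
	else if (strcmp(str, "global"))
		pr_warn("cgroup_migration_lock: unknown value '%s'\n", str);
	return 1;
}
__setup("cgroup_migration_lock=", cgroup_migration_lock_setup);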
>>>> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>>>>    	 */
>>>>    	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>>>>    		tsk = ERR_PTR(-EINVAL);
>>>> -		goto out_unlock_threadgroup;
>>>> +		goto out_unlock_rcu;
>>>>    	}
>>>> -
>>>>    	get_task_struct(tsk);
>>>> -	goto out_unlock_rcu;
>>>> -out_unlock_threadgroup:
>>>> -	cgroup_attach_unlock(*threadgroup_locked);
>>>> -	*threadgroup_locked = false;
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	/*
>>>> +	 * If we migrate a single thread, we don't care about threadgroup
>>>> +	 * stability. If the thread is `current`, it won't exit(2) under our
>>>> +	 * hands or change PID through exec(2). We exclude
>>>> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
>>>> +	 * callers by cgroup_mutex.
>>>> +	 * Therefore, we can skip the global lock.
>>>> +	 */
>>>> +	lockdep_assert_held(&cgroup_mutex);
>>>> +	*threadgroup_locked = pid || threadgroup;
>>>> +
>>>> +	cgroup_attach_lock(tsk, *threadgroup_locked);
>>> I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
>>> group leader or signal struct before lock is grabbed?
>> When a non-leader thread in a thread group executes the exec system call,
>> the thread group leader is updated, but the signal_struct remains unchanged,
>> so this part is safe.
> But the leader can change, right? So, we can end up in a situation where
> threadgroup is set but the task is not the leader which I think can lead to
> really subtle incorrect behaviors like write succeeding but nothing
> happening when racing against exec.
>
> Thanks.
>
You are right. If the thread group leader changes, the task passed to
cgroup_attach_task may not be the leader, which could lead to errors when
iterating through all threads in while_each_thread.


Similar to commit b78949ebfb56 ("cgroup: simplify double-check locking in
cgroup_attach_proc"), we should check whether the leader has changed after
acquiring the lock, to determine whether a retry is needed. I will fix this
in the next version of the patch.
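
In other words, something shaped like the following (a rough sketch
against the v2 code, with error paths omitted):

retry_find_task:
	rcu_read_lock();
	tsk = find_task_by_vpid(pid);
	if (!tsk) {
		rcu_read_unlock();
		return ERR_PTR(-ESRCH);
	}
	get_task_struct(tsk);
	rcu_read_unlock();

	cgroup_attach_lock(tsk, true);
	if (threadgroup && !thread_group_leader(tsk)) {
		/* raced with de_thread(); drop everything and retry */
		cgroup_attach_unlock(tsk, true);
		put_task_struct(tsk);
		goto retry_find_task;
	}
	/* from here on, tsk is a stable threadgroup leader */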

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-05  3:44           ` escape
@ 2025-09-05  3:48             ` Tejun Heo
  2025-09-05  4:30               ` escape
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2025-09-05  3:48 UTC (permalink / raw)
  To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Fri, Sep 05, 2025 at 11:44:53AM +0800, escape wrote:
...
> To avoid the complexity of runtime changes to favordynmods, perhaps a
> better approach would be to introduce a separate control variable,
> configured via a kernel command line parameter such as
> cgroup_migration_lock=[global|thread_group] to explicitly govern this
> behavior. What do you think about this approach?

Can't you just down_write threadgroup_rwsem when flipping the flag? Wouldn't
that provide sufficient synchronization?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-05  3:48             ` Tejun Heo
@ 2025-09-05  4:30               ` escape
  0 siblings, 0 replies; 44+ messages in thread
From: escape @ 2025-09-05  4:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel



On 2025/9/5 11:48, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 05, 2025 at 11:44:53AM +0800, escape wrote:
> ...
>> To avoid the complexity of runtime changes to favordynmods, perhaps a
>> better approach would be to introduce a separate control variable,
>> configured via a kernel command line parameter such as
>> cgroup_migration_lock=[global|thread_group] to explicitly govern this
>> behavior. What do you think about this approach?
> Can't you just down_write threadgroup_rwsem when flipping the flag? Wouldn't
> that provide sufficient synchronization?
>
> Thanks.
>
You're right; it's clear and simple. I will make the changes in this way.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
  2025-09-04 16:31     ` Tejun Heo
  2025-09-05  2:02     ` Chen Ridong
@ 2025-09-05 13:17     ` kernel test robot
  2 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2025-09-05 13:17 UTC (permalink / raw)
  To: Yi Tao, tj, hannes, mkoutny; +Cc: oe-kbuild-all, cgroups, linux-kernel

Hi Yi,

kernel test robot noticed the following build errors:

[auto build test ERROR on tj-cgroup/for-next]
[also build test ERROR on kees/for-next/execve akpm-mm/mm-everything tip/sched/core linus/master v6.17-rc4]
[cannot apply to next-20250905]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yi-Tao/cgroup-replace-global-percpu_rwsem-with-signal_struct-group_rwsem-when-writing-cgroup-procs-threads/20250904-194505
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link:    https://lore.kernel.org/r/068d58f1f497bc4971c6ac0bae58bf53b98451fd.1756985260.git.escape%40linux.alibaba.com
patch subject: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
config: i386-buildonly-randconfig-004-20250905 (https://download.01.org/0day-ci/archive/20250905/202509052150.4GQ04PJn-lkp@intel.com/config)
compiler: gcc-13 (Debian 13.3.0-16) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250905/202509052150.4GQ04PJn-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509052150.4GQ04PJn-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/cgroup/cgroup.c: In function 'cgroup_attach_lock':
>> kernel/cgroup/cgroup.c:2511:28: error: 'favor_dynmods' undeclared (first use in this function); did you mean 'Opt_favordynmods'?
    2511 |                 if (tsk && favor_dynmods)
         |                            ^~~~~~~~~~~~~
         |                            Opt_favordynmods
   kernel/cgroup/cgroup.c:2511:28: note: each undeclared identifier is reported only once for each function it appears in
   kernel/cgroup/cgroup.c: In function 'cgroup_attach_unlock':
   kernel/cgroup/cgroup.c:2526:28: error: 'favor_dynmods' undeclared (first use in this function); did you mean 'Opt_favordynmods'?
    2526 |                 if (tsk && favor_dynmods)
         |                            ^~~~~~~~~~~~~
         |                            Opt_favordynmods


vim +2511 kernel/cgroup/cgroup.c

  2482	
  2483	/**
  2484	 * cgroup_attach_lock - Lock for ->attach()
  2485	 * @tsk: thread group to lock
  2486	 * @lock_threadgroup: whether to down_write rwsem
  2487	 *
  2488	 * cgroup migration sometimes needs to stabilize threadgroups against forks and
  2489	 * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
  2490	 * implementations (e.g. cpuset), also need to disable CPU hotplug.
  2491	 * Unfortunately, letting ->attach() operations acquire cpus_read_lock() can
  2492	 * lead to deadlocks.
  2493	 *
  2494	 * Bringing up a CPU may involve creating and destroying tasks which requires
  2495	 * read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
  2496	 * cpus_read_lock(). If we call an ->attach() which acquires the cpus lock while
  2497	 * write-locking threadgroup_rwsem, the locking order is reversed and we end up
  2498	 * waiting for an on-going CPU hotplug operation which in turn is waiting for
  2499	 * the threadgroup_rwsem to be released to create new tasks. For more details:
  2500	 *
  2501	 *   http://lkml.kernel.org/r/20220711174629.uehfmqegcwn2lqzu@wubuntu
  2502	 *
  2503	 * Resolve the situation by always acquiring cpus_read_lock() before optionally
  2504	 * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  2505	 * CPU hotplug is disabled on entry.
  2506	 */
  2507	void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
  2508	{
  2509		cpus_read_lock();
  2510		if (lock_threadgroup) {
> 2511			if (tsk && favor_dynmods)
  2512				down_write(&tsk->signal->group_rwsem);
  2513			else
  2514				percpu_down_write(&cgroup_threadgroup_rwsem);
  2515		}
  2516	}
  2517	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
                   ` (2 preceding siblings ...)
  2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
@ 2025-09-08 10:20 ` Yi Tao
  2025-09-08 10:20   ` [PATCH v3 1/1] " Yi Tao
  2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
  2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
  5 siblings, 1 reply; 44+ messages in thread
From: Yi Tao @ 2025-09-08 10:20 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Changes in v3:
- Expand commit log and comments.
- Put argument @tsk at the end in cgroup_attach_lock/unlock.
- down_write the global cgroup_threadgroup_rwsem when flipping favordynmods to
synchronize with tasks between cgroup_threadgroup_change_begin and end.
- Rename group_rwsem to cgroup_threadgroup_rwsem.
- Fix a bug causing abnormal cgroup migration due to threadgroup leader changes.

Changes in v2:
- Use favordynmods as the enabling switch.
- Determine whether to use the per-thread-group rwsem based on whether
the task is NULL.
- Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside
rcu_read_lock.

Yi Tao (1):
  cgroup: replace global percpu_rwsem with per threadgroup rwsem when
    writing to cgroup.procs

 include/linux/cgroup-defs.h     |  12 +++-
 include/linux/sched/signal.h    |   4 ++
 init/init_task.c                |   3 +
 kernel/cgroup/cgroup-internal.h |   4 +-
 kernel/cgroup/cgroup-v1.c       |   8 +--
 kernel/cgroup/cgroup.c          | 105 ++++++++++++++++++++++----------
 kernel/fork.c                   |   4 ++
 7 files changed, 101 insertions(+), 39 deletions(-)

-- 
2.32.0.3.g01195cf9f


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
@ 2025-09-08 10:20   ` Yi Tao
  2025-09-08 16:05     ` Tejun Heo
  2025-09-08 19:39     ` Waiman Long
  0 siblings, 2 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-08 10:20 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking to take the
write lock on signal_struct->group_rwsem when writing a pid to
cgroup.procs/threads instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.
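
For reference, the userspace side of that pattern looks roughly as
follows (a minimal sketch; the helper name and cgroup path are
illustrative and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */

static pid_t spawn_into_cgroup(const char *cgrp_path)
{
	int cgrp_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);
	struct clone_args args = {
		.flags		= CLONE_INTO_CGROUP,
		.exit_signal	= SIGCHLD,
		.cgroup		= (unsigned long long)cgrp_fd,
	};

	/* returns 0 in the child, the child's PID in the parent */
	return syscall(SYS_clone3, &args, sizeof(args));
}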

To avoid affecting other users, the per threadgroup rwsem is only used
when the `favordynmods` flag is enabled.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     |  12 +++-
 include/linux/sched/signal.h    |   4 ++
 init/init_task.c                |   3 +
 kernel/cgroup/cgroup-internal.h |   4 +-
 kernel/cgroup/cgroup-v1.c       |   8 +--
 kernel/cgroup/cgroup.c          | 105 ++++++++++++++++++++++----------
 kernel/fork.c                   |   4 ++
 7 files changed, 101 insertions(+), 39 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..5033e3bdac9e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -88,7 +88,8 @@ enum {
 	/*
 	 * Reduce latencies on dynamic cgroup modifications such as task
 	 * migrations and controller on/offs by disabling percpu operation on
-	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
+	 * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
+	 * writing to cgroup.procs. This makes hot path operations such as
 	 * forks and exits into the slow path and more expensive.
 	 *
 	 * The static usage pattern of creating a cgroup, enabling controllers,
@@ -828,16 +829,21 @@ struct cgroup_of_peak {
 	struct list_head	list;
 };
 
+extern int take_per_threadgroup_rwsem;
+
 /**
  * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
  * @tsk: target task
  *
  * Allows cgroup operations to synchronize against threadgroup changes
- * using a percpu_rw_semaphore.
+ * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
+ * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
  */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (take_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }
 
 /**
@@ -848,6 +854,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (take_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..318cc7f22e2c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
+void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..65e9b454780c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(true, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(true, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(true, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(true, NULL);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..8650ec394d0c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
 	return root_cgrp->root;
 }
 
+int take_per_threadgroup_rwsem;
+
 void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
 	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
 
-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while task is between
+	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
+	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
 	if (favor && !favoring) {
+		take_per_threadgroup_rwsem++;
 		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
 		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 	} else if (!favor && favoring) {
+		take_per_threadgroup_rwsem--;
 		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
 		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
 	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }
 
 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2459,7 +2470,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_threadgroup: whether to down_write rwsem
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2479,22 +2491,37 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take per threadgroup rwsem to reduce latencies
+ * on dynamic cgroup modifications. see the comment above
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && take_per_threadgroup_rwsem)
+			down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		else
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_threadgroup: whether to up_write rwsem
+ * @tsk: thread group to lock
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && take_per_threadgroup_rwsem)
+			up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		else
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+	}
 	cpus_read_unlock();
 }
 
@@ -2976,24 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
-
+retry_find_task:
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
 		}
 	} else {
 		tsk = current;
@@ -3010,15 +3026,42 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
-	goto out_unlock_rcu;
 
-out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+	*threadgroup_locked = pid || threadgroup;
+
+	cgroup_attach_lock(*threadgroup_locked, tsk);
+
+	if (threadgroup) {
+		if (!thread_group_leader(tsk)) {
+			/*
+			 * a race with de_thread from another thread's exec()
+			 * may strip us of our leadership, if this happens,
+			 * there is no choice but to throw this task away and
+			 * try again; this is
+			 * "double-double-toil-and-trouble-check locking".
+			 */
+			cgroup_attach_unlock(*threadgroup_locked, tsk);
+			put_task_struct(tsk);
+			goto retry_find_task;
+		}
+	}
+
+	return tsk;
+
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
@@ -3032,7 +3075,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(threadgroup_locked, task);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3119,7 +3162,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(has_tasks, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3183,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(has_tasks, NULL);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..ae7826815c2c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-08 10:20   ` [PATCH v3 1/1] " Yi Tao
@ 2025-09-08 16:05     ` Tejun Heo
  2025-09-08 19:39     ` Waiman Long
  1 sibling, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-08 16:05 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Mon, Sep 08, 2025 at 06:20:27PM +0800, Yi Tao wrote:
...
> The static usage pattern of creating a cgroup, enabling controllers,
> and then seeding it with CLONE_INTO_CGROUP doesn't require write
> locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
> patch.

Please bring this to the top, note that this is the default mode of
operation and the mechanism being introduced is thus an optional one.

> @@ -88,7 +88,8 @@ enum {
>  	/*
>  	 * Reduce latencies on dynamic cgroup modifications such as task
>  	 * migrations and controller on/offs by disabling percpu operation on
> -	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
> +	 * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
> +	 * writing to cgroup.procs. This makes hot path operations such as
>  	 * forks and exits into the slow path and more expensive.

This comment is pointed to from all other places. Please expand on why
per-threadgroup rwsem is beneficial for what use cases.

>  	 * The static usage pattern of creating a cgroup, enabling controllers,
> @@ -828,16 +829,21 @@ struct cgroup_of_peak {
>  	struct list_head	list;
>  };
>  
> +extern int take_per_threadgroup_rwsem;

I think it needs cgroup in its name. Maybe something like
cgroup_enable_per_threadgroup_rwsem?

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..8650ec394d0c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
>  	return root_cgrp->root;
>  }
>  
> +int take_per_threadgroup_rwsem;

Please put it where other global variables are and also note what it does
and that writes protected by cgroup_mutex and write-lock of
cgroup_threadgroup_rwsem and thus reads are protected by either.

>  void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
>  {
>  	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
>  
> -	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> +	/*
> +	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> +	 * favordynmods can flip while task is between
> +	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
> +	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.

Maybe: take_per_threadgroup_rwsem must not be flipped while threads are
between cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end.
down_write the global cgroup_threadgroup_rwsem to exclude them.

But does this actually work? It works for turning it on. I don't think it'd
work for turning it off, right? Maybe make it enable only and trigger a
warning message when people try to turn it off?

> +	 */
> +	percpu_down_write(&cgroup_threadgroup_rwsem);
>  	if (favor && !favoring) {
> +		take_per_threadgroup_rwsem++;

Given that favoring is gating the switch, this can be a bool, right?

>  		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
>  		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
>  	} else if (!favor && favoring) {
> +		take_per_threadgroup_rwsem--;

And here, you can trigger a warning that the per_threadgroup operation
can't be disabled once enabled, instead of actually turning it off.

Another alternative would be using a task flag to track whether %current is
holding per_threadgroup_rwsem and then using that to decide whether to
unlock. Maybe that's cleaner but I don't think it really matters here.
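
Something like the following is what I mean (just a sketch; the per-task
field is made up and would need to be added to task_struct):

static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
	percpu_down_read(&cgroup_threadgroup_rwsem);
	if (take_per_threadgroup_rwsem) {
		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
		tsk->cgroup_tg_rwsem_held = 1;	/* hypothetical field */
	}
}

static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
{
	/* unlock based on what this task actually took, so the global
	 * switch can flip in between without unbalancing the rwsem */
	if (tsk->cgroup_tg_rwsem_held) {
		tsk->cgroup_tg_rwsem_held = 0;
		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
	}
	percpu_up_read(&cgroup_threadgroup_rwsem);
}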

..
> + * When favordynmods is enabled, take per threadgroup rwsem to reduce latencies
> + * on dynamic cgroup modifications. see the comment above
> + * CGRP_ROOT_FAVOR_DYNMODS definition.

This is more about scalability, right? Maybe just say overhead?

> @@ -2976,24 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
>  		return ERR_PTR(-EINVAL);
>  
> -	/*
> -	 * If we migrate a single thread, we don't care about threadgroup
> -	 * stability. If the thread is `current`, it won't exit(2) under our
> -	 * hands or change PID through exec(2). We exclude
> -	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> -	 * callers by cgroup_mutex.
> -	 * Therefore, we can skip the global lock.
> -	 */
> -	lockdep_assert_held(&cgroup_mutex);
> -	*threadgroup_locked = pid || threadgroup;
> -	cgroup_attach_lock(*threadgroup_locked);
> -
> +retry_find_task:
>  	rcu_read_lock();
>  	if (pid) {
>  		tsk = find_task_by_vpid(pid);
>  		if (!tsk) {
>  			tsk = ERR_PTR(-ESRCH);
> -			goto out_unlock_threadgroup;
> +			goto out_unlock_rcu;
>  		}
>  	} else {
>  		tsk = current;
> @@ -3010,15 +3026,42 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	 */
>  	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>  		tsk = ERR_PTR(-EINVAL);
> -		goto out_unlock_threadgroup;
> +		goto out_unlock_rcu;
>  	}
> -
>  	get_task_struct(tsk);
> -	goto out_unlock_rcu;
>  
> -out_unlock_threadgroup:
> -	cgroup_attach_unlock(*threadgroup_locked);
> -	*threadgroup_locked = false;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If we migrate a single thread, we don't care about threadgroup
> +	 * stability. If the thread is `current`, it won't exit(2) under our
> +	 * hands or change PID through exec(2). We exclude
> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> +	 * callers by cgroup_mutex.
> +	 * Therefore, we can skip the global lock.
> +	 */
> +	lockdep_assert_held(&cgroup_mutex);
> +	*threadgroup_locked = pid || threadgroup;
> +
> +	cgroup_attach_lock(*threadgroup_locked, tsk);
> +
> +	if (threadgroup) {
> +		if (!thread_group_leader(tsk)) {
> +			/*
> +			 * a race with de_thread from another thread's exec()
> +			 * may strip us of our leadership, if this happens,
> +			 * there is no choice but to throw this task away and
> +			 * try again; this is
> +			 * "double-double-toil-and-trouble-check locking".
> +			 */
> +			cgroup_attach_unlock(*threadgroup_locked, tsk);
> +			put_task_struct(tsk);
> +			goto retry_find_task;

This is subtle. Can you please separate this out to a spearate patch?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-08 10:20   ` [PATCH v3 1/1] " Yi Tao
  2025-09-08 16:05     ` Tejun Heo
@ 2025-09-08 19:39     ` Waiman Long
  1 sibling, 0 replies; 44+ messages in thread
From: Waiman Long @ 2025-09-08 19:39 UTC (permalink / raw)
  To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

On 9/8/25 6:20 AM, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.
>
> To alleviate the contention between cgroup process migration and
> operations like process fork, this patch modifies lock to take the write
> lock on signal_struct->group_rwsem when writing pid to
> cgroup.procs/threads instead of holding a global write lock.
>
> Cgroup process migration has historically relied on
> signal_struct->group_rwsem to protect thread group integrity. In commit
> <1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem"), this was changed to a global
> cgroup_threadgroup_rwsem. The advantage of using a global lock was
> simplified handling of process group migrations. This patch retains the
> use of the global lock for protecting process group migration, while
> reducing contention by using per thread group lock during
> cgroup.procs/threads writes.
>
> The locking behavior is as follows:
>
> write cgroup.procs/threads  | process fork,exec,exit | process group migration
> ------------------------------------------------------------------------------
> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
> critical section            | critical section       | critical section
> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>
> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
> signal_struct->group_rwsem.
>
> This patch eliminates contention between cgroup migration and fork
> operations for threads that belong to different thread groups, thereby
> reducing the long-tail latency of cgroup migrations and lowering system
> load.
>
> With this patch, under heavy fork and exec interference, the long-tail
> latency of cgroup migration has been reduced from milliseconds to
> microseconds. Under heavy cgroup migration interference, the multi-CPU
> score of the spawn test case in UnixBench increased by 9%.
>
> The static usage pattern of creating a cgroup, enabling controllers,
> and then seeding it with CLONE_INTO_CGROUP doesn't require write
> locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
> patch.
>
> To avoid affecting other users, the per threadgroup rwsem is only used
> when the `favordynmods` flag is enabled.
>
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
> ---
>   include/linux/cgroup-defs.h     |  12 +++-
>   include/linux/sched/signal.h    |   4 ++
>   init/init_task.c                |   3 +
>   kernel/cgroup/cgroup-internal.h |   4 +-
>   kernel/cgroup/cgroup-v1.c       |   8 +--
>   kernel/cgroup/cgroup.c          | 105 ++++++++++++++++++++++----------
>   kernel/fork.c                   |   4 ++
>   7 files changed, 101 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6b93a64115fe..5033e3bdac9e 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -88,7 +88,8 @@ enum {
>   	/*
>   	 * Reduce latencies on dynamic cgroup modifications such as task
>   	 * migrations and controller on/offs by disabling percpu operation on
> -	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
> +	 * cgroup_threadgroup_rwsem and taking per threadgroup rwsem when
> +	 * writing to cgroup.procs. This makes hot path operations such as
>   	 * forks and exits into the slow path and more expensive.
>   	 *
>   	 * The static usage pattern of creating a cgroup, enabling controllers,
> @@ -828,16 +829,21 @@ struct cgroup_of_peak {
>   	struct list_head	list;
>   };
>   
> +extern int take_per_threadgroup_rwsem;
> +
>   /**
>    * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
>    * @tsk: target task
>    *
>    * Allows cgroup operations to synchronize against threadgroup changes
> - * using a percpu_rw_semaphore.
> + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
> + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
>    */
>   static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>   {
>   	percpu_down_read(&cgroup_threadgroup_rwsem);
> +	if (take_per_threadgroup_rwsem)
> +		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
>   }
>   
>   /**
> @@ -848,6 +854,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>    */
>   static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>   {
> +	if (take_per_threadgroup_rwsem)
> +		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
>   	percpu_up_read(&cgroup_threadgroup_rwsem);
>   }
>   
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..7d6449982822 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
>   	struct tty_audit_buf *tty_audit_buf;
>   #endif
>   
> +#ifdef CONFIG_CGROUPS
> +	struct rw_semaphore cgroup_threadgroup_rwsem;
> +#endif
> +
>   	/*
>   	 * Thread is the potential origin of an oom condition; kill first on
>   	 * oom
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..a55e2189206f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
>   	},
>   	.multiprocess	= HLIST_HEAD_INIT,
>   	.rlim		= INIT_RLIMITS,
> +#ifdef CONFIG_CGROUPS
> +	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
> +#endif
>   	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>   	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
>   #ifdef CONFIG_POSIX_TIMERS
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index b14e61c64a34..318cc7f22e2c 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>   
>   int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
>   		       bool threadgroup);
> -void cgroup_attach_lock(bool lock_threadgroup);
> -void cgroup_attach_unlock(bool lock_threadgroup);
> +void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
> +void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
>   struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>   					     bool *locked)
>   	__acquires(&cgroup_threadgroup_rwsem);
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 2a4a387f867a..65e9b454780c 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>   	int retval = 0;
>   
>   	cgroup_lock();
> -	cgroup_attach_lock(true);
> +	cgroup_attach_lock(true, NULL);
>   	for_each_root(root) {
>   		struct cgroup *from_cgrp;
>   
> @@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
>   		if (retval)
>   			break;
>   	}
> -	cgroup_attach_unlock(true);
> +	cgroup_attach_unlock(true, NULL);
>   	cgroup_unlock();
>   
>   	return retval;
> @@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>   
>   	cgroup_lock();
>   
> -	cgroup_attach_lock(true);
> +	cgroup_attach_lock(true, NULL);
>   
>   	/* all tasks in @from are being moved, all csets are source */
>   	spin_lock_irq(&css_set_lock);
> @@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
>   	} while (task && !ret);
>   out_err:
>   	cgroup_migrate_finish(&mgctx);
> -	cgroup_attach_unlock(true);
> +	cgroup_attach_unlock(true, NULL);
>   	cgroup_unlock();
>   	return ret;
>   }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..8650ec394d0c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1298,18 +1298,29 @@ struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
>   	return root_cgrp->root;
>   }
>   
> +int take_per_threadgroup_rwsem;
I would suggest adding the "__read_mostly" attribute to avoid the chance 
of false cacheline sharing.
> +
>   void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
>   {
>   	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
>   
> -	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
> +	/*
> +	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
> +	 * favordynmods can flip while task is between
> +	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
> +	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.
> +	 */
> +	percpu_down_write(&cgroup_threadgroup_rwsem);
>   	if (favor && !favoring) {
> +		take_per_threadgroup_rwsem++;
>   		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
>   		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
>   	} else if (!favor && favoring) {
> +		take_per_threadgroup_rwsem--;
>   		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
>   		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
>   	}
> +	percpu_up_write(&cgroup_threadgroup_rwsem);
>   }
>   

Changing take_per_threadgroup_rwsem inside the cgroup_threadgroup_rwsem 
critical section will ensure that the flag won't change in between 
cgroup_threadgroup_change_begin() and cgroup_threadgroup_change_end(). 
However, it may still change in between cgroup_attach_lock() and 
cgroup_attach_unlock().

Since cgroup_attach_[un]lock() already takes a threadgroup_locked
argument, we can extend it to a flag word that holds two flag bits: one
for threadgroup_locked and another for whether the threadgroup rwsem has
been taken. take_per_threadgroup_rwsem would then only be checked in
cgroup_attach_lock(). cgroup_attach_lock() can either return a value to
indicate the state, or the argument can be changed into a flag pointer
with the new flag bit added internally.

It should be a separate patch if you decide to do the extension.
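
As a sketch (the flag names and the return-value convention are made up):

/* hypothetical lock-state flags returned by cgroup_attach_lock() */
enum {
	CGRP_ATTACH_LOCK_GLOBAL		= 1 << 0, /* global rwsem write-locked */
	CGRP_ATTACH_LOCK_PER_TGRP	= 1 << 1, /* per threadgroup rwsem write-locked */
};

static unsigned int cgroup_attach_lock(bool lock_threadgroup,
				       struct task_struct *tsk)
{
	unsigned int flags = 0;

	cpus_read_lock();
	if (lock_threadgroup) {
		if (tsk && take_per_threadgroup_rwsem) {
			down_write(&tsk->signal->cgroup_threadgroup_rwsem);
			flags |= CGRP_ATTACH_LOCK_PER_TGRP;
		} else {
			percpu_down_write(&cgroup_threadgroup_rwsem);
			flags |= CGRP_ATTACH_LOCK_GLOBAL;
		}
	}
	return flags;
}

static void cgroup_attach_unlock(unsigned int flags, struct task_struct *tsk)
{
	/* unlock based on what was recorded, not on the global switch */
	if (flags & CGRP_ATTACH_LOCK_PER_TGRP)
		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
	else if (flags & CGRP_ATTACH_LOCK_GLOBAL)
		percpu_up_write(&cgroup_threadgroup_rwsem);
	cpus_read_unlock();
}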

Cheers,
Longman


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v4 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
                   ` (3 preceding siblings ...)
  2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
@ 2025-09-09  7:55 ` Yi Tao
  2025-09-09  7:55   ` [PATCH v4 1/3] " Yi Tao
                     ` (2 more replies)
  2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
  5 siblings, 3 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-09  7:55 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Changes in v4:
- Adjust commit log and comments.
- Rename take_per_threadgroup_rwsem to cgroup_enable_per_threadgroup_rwsem and
add the __read_mostly attribute.
- Trigger a warning that the per threadgroup operation can't be
disabled once enabled, instead of actually turning it off.
- Split the code for retrying when the threadgroup leader changes into a
separate patch.
- Refactor the cgroup_attach_lock code to make it clearer.

Changes in v3:
- Expand commit log and comments.
- Put argument @tsk at the end in cgroup_attach_lock/unlock.
- down_write the global cgroup_threadgroup_rwsem when flipping favordynmods to
synchronize with tasks between cgroup_threadgroup_change_begin and end.
- Rename group_rwsem to cgroup_threadgroup_rwsem.
- Fix a bug causing abnormal cgroup migration due to threadgroup leader changes.

Changes in v2:
- Use favordynmods as the enabling switch.
- Determine whether to use the per-thread-group rwsem based on whether
the task is NULL.
- Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside
rcu_read_lock.

Yi Tao (3):
  cgroup: replace global percpu_rwsem with per threadgroup rwsem when
    writing to cgroup.procs
  cgroup: retry find task if threadgroup leader changed
  cgroup: refactor the cgroup_attach_lock code to make it clearer

 include/linux/cgroup-defs.h     |  25 +++++-
 include/linux/sched/signal.h    |   4 +
 init/init_task.c                |   3 +
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c       |  14 +--
 kernel/cgroup/cgroup.c          | 149 ++++++++++++++++++++++++--------
 kernel/fork.c                   |   4 +
 7 files changed, 161 insertions(+), 46 deletions(-)

-- 
2.32.0.3.g01195cf9f


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v4 1/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
@ 2025-09-09  7:55   ` Yi Tao
  2025-09-09  7:55   ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao
  2025-09-09  7:55   ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
  2 siblings, 0 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-09  7:55 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when the favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking to take the
write lock on signal_struct->group_rwsem when writing a pid to
cgroup.procs/threads instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     | 14 ++++-
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       |  8 +--
 kernel/cgroup/cgroup.c          | 96 ++++++++++++++++++++++-----------
 kernel/fork.c                   |  4 ++
 7 files changed, 95 insertions(+), 38 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..552050bd2c02 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,12 @@ enum {
 	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
 	 * forks and exits into the slow path and more expensive.
 	 *
+	 * Alleviate the contention between fork, exec, exit operations and
+	 * writing to cgroup.procs by taking a per threadgroup rwsem instead of
+	 * the global cgroup_threadgroup_rwsem. Fork and other operations
+	 * from threads in different thread groups no longer contend with
+	 * writing to cgroup.procs.
+	 *
 	 * The static usage pattern of creating a cgroup, enabling controllers,
 	 * and then seeding it with CLONE_INTO_CGROUP doesn't require write
 	 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
@@ -822,6 +828,7 @@ struct cgroup_subsys {
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+extern bool cgroup_enable_per_threadgroup_rwsem;
 
 struct cgroup_of_peak {
 	unsigned long		value;
@@ -833,11 +840,14 @@ struct cgroup_of_peak {
  * @tsk: target task
  *
  * Allows cgroup operations to synchronize against threadgroup changes
- * using a percpu_rw_semaphore.
+ * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
+ * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
  */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (cgroup_enable_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }
 
 /**
@@ -848,6 +858,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (cgroup_enable_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..318cc7f22e2c 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
+void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..65e9b454780c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(true, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(true, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(true, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(true, NULL);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..b53ae8fd9681 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -216,6 +216,14 @@ static u16 have_canfork_callback __read_mostly;
 
 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
+/*
+ * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem,
+ * read protected by either.
+ *
+ * Can only be turned on, but not turned off.
+ */
+bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
+
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
 	.ns.count	= REFCOUNT_INIT(2),
@@ -1302,14 +1310,24 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
 	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
 
-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while task is between
+	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
+	 * so down_write global cgroup_threadgroup_rwsem to synchronize them.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
 	if (favor && !favoring) {
+		cgroup_enable_per_threadgroup_rwsem = true;
 		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
 		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 	} else if (!favor && favoring) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			WARN_ONCE(1, "cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n");
 		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
 		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
 	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }
 
 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2459,7 +2477,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_threadgroup: whether to down_write rwsem
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2479,22 +2498,37 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead
+ * on dynamic cgroup modifications. see the comment above
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && cgroup_enable_per_threadgroup_rwsem)
+			down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		else
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_threadgroup: whether to up_write rwsem
+ * @tsk: thread group to lock
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && cgroup_enable_per_threadgroup_rwsem)
+			up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		else
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+	}
 	cpus_read_unlock();
 }
 
@@ -2976,24 +3010,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
-
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
 		}
 	} else {
 		tsk = current;
@@ -3010,15 +3032,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
-	goto out_unlock_rcu;
 
-out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+	*threadgroup_locked = pid || threadgroup;
+
+	cgroup_attach_lock(*threadgroup_locked, tsk);
+
+	return tsk;
+
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
@@ -3032,7 +3066,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(threadgroup_locked, task);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3119,7 +3153,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(has_tasks, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3174,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(has_tasks, NULL);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..ae7826815c2c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed
  2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
  2025-09-09  7:55   ` [PATCH v4 1/3] " Yi Tao
@ 2025-09-09  7:55   ` Yi Tao
  2025-09-09 16:59     ` Tejun Heo
  2025-09-09  7:55   ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
  2 siblings, 1 reply; 44+ messages in thread
From: Yi Tao @ 2025-09-09  7:55 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Between obtaining the threadgroup leader via PID and acquiring the
cgroup attach lock, the threadgroup leader may change, which could lead
to incorrect cgroup migration. Therefore, after acquiring the cgroup
attach lock, we check whether the threadgroup leader has changed, and if
so, retry the operation.
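
As a rough userspace illustration of the pattern — look the task up,
take the lock, then re-check leadership under the lock and retry — here
is a pthread-based sketch. It is not the patch itself: lookup_task()
and is_group_leader() are hypothetical stand-ins for
find_task_by_vpid() and thread_group_leader(), and the kernel's
reference counting (get_task_struct()/put_task_struct()) is omitted.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct task;

struct threadgroup {
	pthread_rwlock_t rwsem;		/* models the threadgroup rwsem */
	struct task *leader;
};

struct task {
	struct threadgroup *group;
};

/* Hypothetical stand-in for find_task_by_vpid(); stubbed for the sketch. */
static struct task *lookup_task(int pid)
{
	(void)pid;
	return NULL;
}

/* Hypothetical stand-in for thread_group_leader(). */
static bool is_group_leader(struct task *t)
{
	return t->group->leader == t;
}

/* On success, returns the leader with its group rwsem write-held. */
static struct task *grab_stable_leader(int pid)
{
	struct task *t;

retry:
	t = lookup_task(pid);
	if (!t)
		return NULL;
	pthread_rwlock_wrlock(&t->group->rwsem);
	/*
	 * Another thread's exec() may have changed the leader between the
	 * lookup and the lock; re-check under the lock and start over,
	 * just as the thread_group_leader() test in the patch does.
	 */
	if (!is_group_leader(t)) {
		pthread_rwlock_unlock(&t->group->rwsem);
		goto retry;
	}
	return t;
}

int main(void)
{
	return grab_stable_leader(1) ? 0 : 0;
}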

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 kernel/cgroup/cgroup.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b53ae8fd9681..6e90b79e8fa3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3010,6 +3010,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
+retry_find_task:
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
@@ -3051,6 +3052,21 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 
 	cgroup_attach_lock(*threadgroup_locked, tsk);
 
+	if (threadgroup) {
+		if (!thread_group_leader(tsk)) {
+			/*
+			 * a race with de_thread from another thread's exec()
+			 * may strip us of our leadership, if this happens,
+			 * there is no choice but to throw this task away and
+			 * try again; this is
+			 * "double-double-toil-and-trouble-check locking".
+			 */
+			cgroup_attach_unlock(*threadgroup_locked, tsk);
+			put_task_struct(tsk);
+			goto retry_find_task;
+		}
+	}
+
 	return tsk;
 
 out_unlock_rcu:
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer
  2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
  2025-09-09  7:55   ` [PATCH v4 1/3] " Yi Tao
  2025-09-09  7:55   ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao
@ 2025-09-09  7:55   ` Yi Tao
  2025-09-09 17:00     ` Tejun Heo
  2 siblings, 1 reply; 44+ messages in thread
From: Yi Tao @ 2025-09-09  7:55 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Dynamic cgroup migration involving threadgroup locks can be in one of
three states: no lock held, holding the global lock, or holding a
per threadgroup lock. Explicitly declaring the different lock modes
makes the code easier to understand.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     | 11 ++++
 kernel/cgroup/cgroup-internal.h |  8 +--
 kernel/cgroup/cgroup-v1.c       | 14 ++---
 kernel/cgroup/cgroup.c          | 91 ++++++++++++++++++++++-----------
 4 files changed, 83 insertions(+), 41 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 552050bd2c02..d764eba041c6 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -146,6 +146,17 @@ enum {
 	__CFTYPE_ADDED		= (1 << 18),
 };
 
+enum {
+	/* Default */
+	CGRP_ATTACH_LOCK_GLOBAL,
+
+	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
+	CGRP_ATTACH_LOCK_NONE,
+
+	/* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */
+	CGRP_ATTACH_LOCK_PER_THREADGROUP,
+};
+
 /*
  * cgroup_file is the handle for a file instance created in a cgroup which
  * is used, for example, to generate file changed notifications.  This can
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 318cc7f22e2c..9de8ab47a335 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,12 +249,12 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk);
-void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk);
+void cgroup_attach_lock(int lock_mode, struct task_struct *tsk);
+void cgroup_attach_unlock(int lock_mode, struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *locked)
+					     int *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(struct task_struct *task, bool locked)
+void cgroup_procs_write_finish(struct task_struct *task, int lock_mode)
 	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 65e9b454780c..73d8dafe219f 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true, NULL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true, NULL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true, NULL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true, NULL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 	return ret;
 }
@@ -502,13 +502,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	struct task_struct *task;
 	const struct cred *cred, *tcred;
 	ssize_t ret;
-	bool locked;
+	int lock_mode;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -531,7 +531,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish(task, locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6e90b79e8fa3..a568629f7ed0 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2477,7 +2477,7 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write rwsem
+ * @lock_mode: whether to acquire a rwsem and, if so, which one
  * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
@@ -2499,35 +2499,46 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  *
- * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead
- * on dynamic cgroup modifications. see the comment above
- * CGRP_ROOT_FAVOR_DYNMODS definition.
- *
  * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(bool lock_threadgroup, struct task_struct *tsk)
+void cgroup_attach_lock(int lock_mode, struct task_struct *tsk)
 {
 	cpus_read_lock();
-	if (lock_threadgroup) {
-		if (tsk && cgroup_enable_per_threadgroup_rwsem)
-			down_write(&tsk->signal->cgroup_threadgroup_rwsem);
-		else
-			percpu_down_write(&cgroup_threadgroup_rwsem);
+
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
+		percpu_down_write(&cgroup_threadgroup_rwsem);
+		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.\n");
+		break;
 	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write rwsem
+ * @lock_mode: whether to release a rwsem and, if so, which one
  * @tsk: thread group to unlock
  */
-void cgroup_attach_unlock(bool lock_threadgroup, struct task_struct *tsk)
-{
-	if (lock_threadgroup) {
-		if (tsk && cgroup_enable_per_threadgroup_rwsem)
-			up_write(&tsk->signal->cgroup_threadgroup_rwsem);
-		else
-			percpu_up_write(&cgroup_threadgroup_rwsem);
+void cgroup_attach_unlock(int lock_mode, struct task_struct *tsk)
+{
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
+		percpu_up_write(&cgroup_threadgroup_rwsem);
+		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.\n");
+		break;
 	}
 	cpus_read_unlock();
 }
@@ -3002,7 +3013,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 }
 
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *threadgroup_locked)
+					     int *lock_mode)
 {
 	struct task_struct *tsk;
 	pid_t pid;
@@ -3046,11 +3057,24 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
 	 * callers by cgroup_mutex.
 	 * Therefore, we can skip the global lock.
+	 *
+	 * When favordynmods is enabled, take per threadgroup rwsem to reduce
+	 * overhead on dynamic cgroup modifications. see the comment above
+	 * CGRP_ROOT_FAVOR_DYNMODS definition.
 	 */
 	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
 
-	cgroup_attach_lock(*threadgroup_locked, tsk);
+	if (pid || threadgroup) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			*lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP;
+		else
+			*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+
+	} else {
+		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+	}
+
+	cgroup_attach_lock(*lock_mode, tsk);
 
 	if (threadgroup) {
 		if (!thread_group_leader(tsk)) {
@@ -3061,7 +3085,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 			 * try again; this is
 			 * "double-double-toil-and-trouble-check locking".
 			 */
-			cgroup_attach_unlock(*threadgroup_locked, tsk);
+			cgroup_attach_unlock(*lock_mode, tsk);
 			put_task_struct(tsk);
 			goto retry_find_task;
 		}
@@ -3074,7 +3098,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	return tsk;
 }
 
-void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
+void cgroup_procs_write_finish(struct task_struct *task, int lock_mode)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
@@ -3082,7 +3106,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked, task);
+	cgroup_attach_unlock(lock_mode, task);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3139,6 +3163,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	struct cgroup *dsct;
 	struct css_set *src_cset;
 	bool has_tasks;
+	int lock_mode;
 	int ret;
 
 	lockdep_assert_held(&cgroup_mutex);
@@ -3169,7 +3194,13 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks, NULL);
+
+	if (has_tasks)
+		lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(lock_mode, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3190,7 +3221,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks, NULL);
+	cgroup_attach_unlock(lock_mode, NULL);
 	return ret;
 }
 
@@ -5291,13 +5322,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	struct task_struct *task;
 	const struct cred *saved_cred;
 	ssize_t ret;
-	bool threadgroup_locked;
+	int lock_mode;
 
 	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!dst_cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -5323,7 +5354,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish(task, threadgroup_locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed
  2025-09-09  7:55   ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao
@ 2025-09-09 16:59     ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-09 16:59 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

On Tue, Sep 09, 2025 at 03:55:29PM +0800, Yi Tao wrote:
> Between obtaining the threadgroup leader via PID and acquiring the
> cgroup attach lock, the threadgroup leader may change, which could lead
> to incorrect cgroup migration. Therefore, after acquiring the cgroup
> attach lock, we check whether the threadgroup leader has changed, and if
> so, retry the operation.
> 
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>

This and thread group lock relocation should come before the patch which
requires it, not after.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer
  2025-09-09  7:55   ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
@ 2025-09-09 17:00     ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-09 17:00 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

On Tue, Sep 09, 2025 at 03:55:30PM +0800, Yi Tao wrote:
> +enum {

Please give it a type name. It makes a difference when it gets exported
through BTF for debugging and tracing.

> +	/* Default */
> +	CGRP_ATTACH_LOCK_GLOBAL,
> +
> +	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
> +	CGRP_ATTACH_LOCK_NONE,
> +
> +	/* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */
> +	CGRP_ATTACH_LOCK_PER_THREADGROUP,
> +};

And, I'd put this before the actual patch too. Please make it a prep patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
                   ` (4 preceding siblings ...)
  2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
@ 2025-09-10  6:59 ` Yi Tao
  2025-09-10  6:59   ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
                     ` (2 more replies)
  5 siblings, 3 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-10  6:59 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Changes in v5:
- Adjust the order of patches.
- Add a type name for the enum used by cgroup_attach_lock.

Changes in v4:
- Adjust commit log and comments.
- Rename take_per_threadgroup_rwsem to cgroup_enable_per_threadgroup_rwsem
and add the __read_mostly attribute.
- Trigger a warning that the per threadgroup rwsem mechanism can't be
disabled once enabled, instead of actually turning it off.
- Split the code for retrying when the threadgroup leader changes into a
separate patch.
- Refactor the cgroup_attach_lock code to make it clearer.

Changes in v3:
- Expand commit log and comments.
- Put argument @tsk at end in cgroup_attach_lock/unlock.
- down_write the global cgroup_threadgroup_rwsem when flipping favordynmods
to synchronize with tasks between cgroup_threadgroup_change_begin and end.
- Rename group_rwsem to cgroup_threadgroup_rwsem.
- Fix bug causing abnormal cgroup migration due to threadgroup leader
changes.

Changes in v2:
- Use favordynmods as the enabling switch.
- Determine whether to use the per-thread-group rwsem based on whether
the task is NULL.
- Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside
rcu_read_lock (see the ordering sketch below).
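
The hang noted in the last item came from acquiring a sleeping rwsem
inside an RCU read-side critical section, which must not block. As a
rough userspace model of the corrected ordering only (a spinlock stands
in for the non-sleepable RCU section, a plain counter for
get_task_struct(); all names here are hypothetical):

#include <pthread.h>

static pthread_spinlock_t lookup_lock;	/* models the rcu_read_lock() section */
static pthread_rwlock_t group_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static int task_refcount;		/* models get/put_task_struct() */

static void migrate_fixed(void)
{
	pthread_spin_lock(&lookup_lock);	/* non-sleepable section */
	task_refcount++;			/* pin the task first... */
	pthread_spin_unlock(&lookup_lock);	/* ...then leave the section */

	/* Block on the rwsem only after the non-sleepable section ends. */
	pthread_rwlock_wrlock(&group_rwsem);
	/* ... migration critical section ... */
	pthread_rwlock_unlock(&group_rwsem);

	task_refcount--;			/* drop the pin */
}

int main(void)
{
	pthread_spin_init(&lookup_lock, PTHREAD_PROCESS_PRIVATE);
	migrate_fixed();
	pthread_spin_destroy(&lookup_lock);
	return 0;
}

This mirrors what cgroup_procs_write_start() does after the fix:
get_task_struct() under rcu_read_lock(), then rcu_read_unlock(), and
only then cgroup_attach_lock().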

Yi Tao (3):
  cgroup: refactor the cgroup_attach_lock code to make it clearer
  cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start
  cgroup: replace global percpu_rwsem with per threadgroup rwsem when
    writing to cgroup.procs

 include/linux/cgroup-defs.h     |  25 +++++-
 include/linux/sched/signal.h    |   4 +
 init/init_task.c                |   3 +
 kernel/cgroup/cgroup-internal.h |  11 ++-
 kernel/cgroup/cgroup-v1.c       |  14 +--
 kernel/cgroup/cgroup.c          | 153 ++++++++++++++++++++++++--------
 kernel/fork.c                   |   4 +
 7 files changed, 167 insertions(+), 47 deletions(-)

-- 
2.32.0.3.g01195cf9f


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer
  2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
@ 2025-09-10  6:59   ` Yi Tao
  2025-09-10 17:26     ` Tejun Heo
  2025-09-10  6:59   ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao
  2025-09-10  6:59   ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
  2 siblings, 1 reply; 44+ messages in thread
From: Yi Tao @ 2025-09-10  6:59 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Dynamic cgroup migration involving threadgroup locks can be in one of
two states: no lock held, or holding the global lock. Explicitly
declaring the different lock modes makes the code easier to
understand and facilitates future extensions of the lock modes.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     |  8 ++++
 kernel/cgroup/cgroup-internal.h |  9 +++--
 kernel/cgroup/cgroup-v1.c       | 14 +++----
 kernel/cgroup/cgroup.c          | 67 ++++++++++++++++++++++++---------
 4 files changed, 69 insertions(+), 29 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..213b0d01464f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -140,6 +140,14 @@ enum {
 	__CFTYPE_ADDED		= (1 << 18),
 };
 
+enum cgroup_attach_lock_mode {
+	/* Default */
+	CGRP_ATTACH_LOCK_GLOBAL,
+
+	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
+	CGRP_ATTACH_LOCK_NONE,
+};
+
 /*
  * cgroup_file is the handle for a file instance created in a cgroup which
  * is used, for example, to generate file changed notifications.  This can
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..a6d6f30b6f65 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,12 +249,13 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode);
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *locked)
+					     enum cgroup_attach_lock_mode *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(struct task_struct *task, bool locked)
+void cgroup_procs_write_finish(struct task_struct *task,
+			       enum cgroup_attach_lock_mode lock_mode)
 	__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..f3838888ea6f 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
 	cgroup_unlock();
 	return ret;
 }
@@ -502,13 +502,13 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	struct task_struct *task;
 	const struct cred *cred, *tcred;
 	ssize_t ret;
-	bool locked;
+	enum cgroup_attach_lock_mode lock_mode;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -531,7 +531,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
 	ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish(task, locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..2b88c7abaa00 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2459,7 +2459,7 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to down_write cgroup_threadgroup_rwsem
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2480,21 +2480,39 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
+
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_down_write(&cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.\n");
+		break;
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to up_write cgroup_threadgroup_rwsem
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
 {
-	if (lock_threadgroup)
+	switch (lock_mode) {
+	case CGRP_ATTACH_LOCK_NONE:
+		break;
+	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_up_write(&cgroup_threadgroup_rwsem);
+		break;
+	default:
+		pr_warn("cgroup: Unexpected attach lock mode.\n");
+		break;
+	}
+
 	cpus_read_unlock();
 }
 
@@ -2968,7 +2986,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 }
 
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-					     bool *threadgroup_locked)
+					     enum cgroup_attach_lock_mode *lock_mode)
 {
 	struct task_struct *tsk;
 	pid_t pid;
@@ -2985,8 +3003,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 * Therefore, we can skip the global lock.
 	 */
 	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
+
+	if (pid || threadgroup)
+		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(*lock_mode);
 
 	rcu_read_lock();
 	if (pid) {
@@ -3017,14 +3040,15 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	goto out_unlock_rcu;
 
 out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	cgroup_attach_unlock(*lock_mode);
+	*lock_mode = CGRP_ATTACH_LOCK_NONE;
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
 }
 
-void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
+void cgroup_procs_write_finish(struct task_struct *task,
+			       enum cgroup_attach_lock_mode lock_mode)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
@@ -3032,7 +3056,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(lock_mode);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3088,6 +3112,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	struct cgroup_subsys_state *d_css;
 	struct cgroup *dsct;
 	struct css_set *src_cset;
+	enum cgroup_attach_lock_mode lock_mode;
 	bool has_tasks;
 	int ret;
 
@@ -3119,7 +3144,13 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+
+	if (has_tasks)
+		lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(lock_mode);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3171,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(lock_mode);
 	return ret;
 }
 
@@ -5241,13 +5272,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	struct task_struct *task;
 	const struct cred *saved_cred;
 	ssize_t ret;
-	bool threadgroup_locked;
+	enum cgroup_attach_lock_mode lock_mode;
 
 	dst_cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!dst_cgrp)
 		return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &lock_mode);
 	ret = PTR_ERR_OR_ZERO(task);
 	if (ret)
 		goto out_unlock;
@@ -5273,7 +5304,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
 
 out_finish:
-	cgroup_procs_write_finish(task, threadgroup_locked);
+	cgroup_procs_write_finish(task, lock_mode);
 out_unlock:
 	cgroup_kn_unlock(of->kn);
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start
  2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
  2025-09-10  6:59   ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
@ 2025-09-10  6:59   ` Yi Tao
  2025-09-10 17:31     ` Tejun Heo
  2025-09-11  3:22     ` Waiman Long
  2025-09-10  6:59   ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
  2 siblings, 2 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-10  6:59 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Later patches will introduce a new parameter `task` to
cgroup_attach_lock, so adjust the position of the cgroup_attach_lock
call within cgroup_procs_write_start accordingly.

Between obtaining the threadgroup leader via PID and acquiring the
cgroup attach lock, the threadgroup leader may change, which could lead
to incorrect cgroup migration. Therefore, after acquiring the cgroup
attach lock, we check whether the threadgroup leader has changed, and if
so, retry the operation.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2b88c7abaa00..756807164091 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2994,29 +2994,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-
-	if (pid || threadgroup)
-		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
-	else
-		*lock_mode = CGRP_ATTACH_LOCK_NONE;
-
-	cgroup_attach_lock(*lock_mode);
-
+retry_find_task:
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
 		}
 	} else {
 		tsk = current;
@@ -3033,15 +3017,46 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
 	}
 
 	get_task_struct(tsk);
-	goto out_unlock_rcu;
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+
+	if (pid || threadgroup)
+		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	else
+		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+
+	cgroup_attach_lock(*lock_mode);
+
+	if (threadgroup) {
+		if (!thread_group_leader(tsk)) {
+			/*
+			 * a race with de_thread from another thread's exec()
+			 * may strip us of our leadership, if this happens,
+			 * there is no choice but to throw this task away and
+			 * try again; this is
+			 * "double-double-toil-and-trouble-check locking".
+			 */
+			cgroup_attach_unlock(*lock_mode);
+			put_task_struct(tsk);
+			goto retry_find_task;
+		}
+	}
+
+	return tsk;
 
-out_unlock_threadgroup:
-	cgroup_attach_unlock(*lock_mode);
-	*lock_mode = CGRP_ATTACH_LOCK_NONE;
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
  2025-09-10  6:59   ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
  2025-09-10  6:59   ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao
@ 2025-09-10  6:59   ` Yi Tao
  2025-09-10 17:47     ` Tejun Heo
  2 siblings, 1 reply; 44+ messages in thread
From: Yi Tao @ 2025-09-10  6:59 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking to take the
write lock on signal_struct->cgroup_threadgroup_rwsem when writing a pid
to cgroup.procs/threads instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes the global cgroup_threadgroup_rwsem, p_rwsem denotes the
per threadgroup signal_struct->cgroup_threadgroup_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.
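
The table above can also be read as the following userspace model, with
pthread locks standing in for the kernel primitives (a mutex for
cgroup_lock(), rwlocks for g_rwsem and p_rwsem); this is a sketch of the
lock ordering only, not kernel code, and all names are illustrative:

#include <pthread.h>

static pthread_mutex_t cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t g_rwsem = PTHREAD_RWLOCK_INITIALIZER;

struct threadgroup {
	pthread_rwlock_t p_rwsem;
};

/* write to cgroup.procs/threads: only the target group is write-locked */
static void migrate_one_group(struct threadgroup *tg)
{
	pthread_mutex_lock(&cgroup_mutex);
	pthread_rwlock_wrlock(&tg->p_rwsem);
	/* ... critical section: migrate tg's threads ... */
	pthread_rwlock_unlock(&tg->p_rwsem);
	pthread_mutex_unlock(&cgroup_mutex);
}

/*
 * fork/exec/exit: read-lock both, so it only contends with a writer
 * targeting this same thread group or with a global write-locker.
 */
static void fork_path(struct threadgroup *tg)
{
	pthread_rwlock_rdlock(&g_rwsem);
	pthread_rwlock_rdlock(&tg->p_rwsem);
	/* ... critical section ... */
	pthread_rwlock_unlock(&tg->p_rwsem);
	pthread_rwlock_unlock(&g_rwsem);
}

/* whole process-group migration keeps the global write lock */
static void migrate_all(void)
{
	pthread_mutex_lock(&cgroup_mutex);
	pthread_rwlock_wrlock(&g_rwsem);
	/* ... critical section ... */
	pthread_rwlock_unlock(&g_rwsem);
	pthread_mutex_unlock(&cgroup_mutex);
}

int main(void)
{
	struct threadgroup tg;

	pthread_rwlock_init(&tg.p_rwsem, NULL);
	migrate_one_group(&tg);
	fork_path(&tg);
	migrate_all();
	pthread_rwlock_destroy(&tg.p_rwsem);
	return 0;
}

Because fork_path() only read-locks, migrate_one_group() on group A
never blocks fork_path() on group B, which is the contention the
numbers above measure.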

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     | 17 ++++++++-
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  6 ++-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 67 +++++++++++++++++++++++++--------
 kernel/fork.c                   |  4 ++
 7 files changed, 87 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 213b0d01464f..fe29152bceff 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,12 @@ enum {
 	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
 	 * forks and exits into the slow path and more expensive.
 	 *
+	 * Alleviate the contention between fork, exec, exit operations and
+	 * writing to cgroup.procs by taking a per threadgroup rwsem instead of
+	 * the global cgroup_threadgroup_rwsem. Fork and other operations
+	 * from threads in different thread groups no longer contend with
+	 * writing to cgroup.procs.
+	 *
 	 * The static usage pattern of creating a cgroup, enabling controllers,
 	 * and then seeding it with CLONE_INTO_CGROUP doesn't require write
 	 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
@@ -146,6 +152,9 @@ enum cgroup_attach_lock_mode {
 
 	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
 	CGRP_ATTACH_LOCK_NONE,
+
+	/* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */
+	CGRP_ATTACH_LOCK_PER_THREADGROUP,
 };
 
 /*
@@ -830,6 +839,7 @@ struct cgroup_subsys {
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+extern bool cgroup_enable_per_threadgroup_rwsem;
 
 struct cgroup_of_peak {
 	unsigned long		value;
@@ -841,11 +851,14 @@ struct cgroup_of_peak {
  * @tsk: target task
  *
  * Allows cgroup operations to synchronize against threadgroup changes
- * using a percpu_rw_semaphore.
+ * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
+ * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
  */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (cgroup_enable_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }
 
 /**
@@ -856,6 +869,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (cgroup_enable_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index a6d6f30b6f65..22051b4f1ccb 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode);
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode);
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk);
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     enum cgroup_attach_lock_mode *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index f3838888ea6f..db48550bc4cc 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 756807164091..344424dd365b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -216,6 +216,14 @@ static u16 have_canfork_callback __read_mostly;
 
 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
+/*
+ * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem,
+ * read protected by either.
+ *
+ * Can only be turned on, but not turned off.
+ */
+bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
+
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
 	.ns.count	= REFCOUNT_INIT(2),
@@ -1302,14 +1310,24 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
 	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
 
-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while a task is between
+	 * cgroup_threadgroup_change_begin and cgroup_threadgroup_change_end,
+	 * so down_write the global cgroup_threadgroup_rwsem to synchronize them.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
 	if (favor && !favoring) {
+		cgroup_enable_per_threadgroup_rwsem = true;
 		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
 		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 	} else if (!favor && favoring) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			WARN_ONCE(1, "cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n");
 		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
 		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
 	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }
 
 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2459,7 +2477,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_mode: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to acquire a rwsem and, if so, which one
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2479,8 +2498,15 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead
+ * on dynamic cgroup modifications. see the comment above
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk)
 {
 	cpus_read_lock();
 
@@ -2490,6 +2516,9 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_down_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.\n");
 		break;
@@ -2498,9 +2527,11 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_mode: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to release a rwsem and, if so, which one
+ * @tsk: thread group to unlock
  */
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk)
 {
 	switch (lock_mode) {
 	case CGRP_ATTACH_LOCK_NONE:
@@ -2508,6 +2539,9 @@ void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_up_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.\n");
 		break;
@@ -3019,7 +3053,6 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 		tsk = ERR_PTR(-EINVAL);
 		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
@@ -3033,12 +3066,16 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	lockdep_assert_held(&cgroup_mutex);
 
-	if (pid || threadgroup)
-		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
-	else
+	if (pid || threadgroup) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			*lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP;
+		else
+			*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	} else {
 		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+	}
 
-	cgroup_attach_lock(*lock_mode);
+	cgroup_attach_lock(*lock_mode, tsk);
 
 	if (threadgroup) {
 		if (!thread_group_leader(tsk)) {
@@ -3049,7 +3086,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 			 * try again; this is
 			 * "double-double-toil-and-trouble-check locking".
 			 */
-			cgroup_attach_unlock(*lock_mode);
+			cgroup_attach_unlock(*lock_mode, tsk);
 			put_task_struct(tsk);
 			goto retry_find_task;
 		}
@@ -3068,11 +3105,11 @@ void cgroup_procs_write_finish(struct task_struct *task,
 	struct cgroup_subsys *ss;
 	int ssid;
 
+	cgroup_attach_unlock(lock_mode, task);
+
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(lock_mode);
-
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
 			ss->post_attach();
@@ -3165,7 +3202,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	else
 		lock_mode = CGRP_ATTACH_LOCK_NONE;
 
-	cgroup_attach_lock(lock_mode);
+	cgroup_attach_lock(lock_mode, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3186,7 +3223,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(lock_mode);
+	cgroup_attach_unlock(lock_mode, NULL);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..9a039867ecfd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer
  2025-09-10  6:59   ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
@ 2025-09-10 17:26     ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-10 17:26 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

On Wed, Sep 10, 2025 at 02:59:33PM +0800, Yi Tao wrote:
> Dynamic cgroup migration involving threadgroup locks can be in one of
> two states: no lock held, or holding the global lock. Explicitly
> declaring the different lock modes makes the code easier to
> understand and facilitates future extensions of the lock modes.
> 
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>

Applied to cgroup/for-6.18. There was a conflict that I resolved. Please
check the tree to make sure that it looks alright.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start
  2025-09-10  6:59   ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao
@ 2025-09-10 17:31     ` Tejun Heo
  2025-09-11  3:22     ` Waiman Long
  1 sibling, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2025-09-10 17:31 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

On Wed, Sep 10, 2025 at 02:59:34PM +0800, Yi Tao wrote:
> Later patches will introduce a new parameter `task` to
> cgroup_attach_lock, so adjust the position of the cgroup_attach_lock
> call within cgroup_procs_write_start accordingly.
> 
> Between obtaining the threadgroup leader via PID and acquiring the
> cgroup attach lock, the threadgroup leader may change, which could lead
> to incorrect cgroup migration. Therefore, after acquiring the cgroup
> attach lock, we check whether the threadgroup leader has changed, and if
> so, retry the operation.
> 
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>

Applied to cgroup/for-6.18 with minor comment adjustments.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-10  6:59   ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
@ 2025-09-10 17:47     ` Tejun Heo
  2025-09-11  1:52       ` Yi Tao
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2025-09-10 17:47 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

Applied to cgroup/for-6.18 with a comment update and
s/WARN_ONCE()/pr_warn_once()/ as the backtrace isn't likely helpful here.
The applied version follows:

------ 8< ------
From 0568f89d4fb82d98001baeb870e92f43cd1f7317 Mon Sep 17 00:00:00 2001
From: Yi Tao <escape@linux.alibaba.com>
Date: Wed, 10 Sep 2025 14:59:35 +0800
Subject: [PATCH] cgroup: replace global percpu_rwsem with per threadgroup
 rwsem when writing to cgroup.procs

The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking to take the
write lock on signal_struct->cgroup_threadgroup_rwsem when writing a pid
to cgroup.procs/threads instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes the global cgroup_threadgroup_rwsem, p_rwsem denotes the
per threadgroup signal_struct->cgroup_threadgroup_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to
    pr_warn_once().

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup-defs.h     | 17 +++++++-
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  6 ++-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 73 ++++++++++++++++++++++++++-------
 kernel/fork.c                   |  4 ++
 7 files changed, 93 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ff3c7d0e3e01..93318fce31f3 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -91,6 +91,12 @@ enum {
 	 * cgroup_threadgroup_rwsem. This makes hot path operations such as
 	 * forks and exits into the slow path and more expensive.
 	 *
+	 * Alleviate the contention between fork, exec, exit operations and
+	 * writing to cgroup.procs by taking a per threadgroup rwsem instead of
+	 * the global cgroup_threadgroup_rwsem. Fork and other operations
+	 * from threads in different thread groups no longer contend with
+	 * writing to cgroup.procs.
+	 *
 	 * The static usage pattern of creating a cgroup, enabling controllers,
 	 * and then seeding it with CLONE_INTO_CGROUP doesn't require write
 	 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from
@@ -146,6 +152,9 @@ enum cgroup_attach_lock_mode {
 
 	/* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */
 	CGRP_ATTACH_LOCK_NONE,
+
+	/* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */
+	CGRP_ATTACH_LOCK_PER_THREADGROUP,
 };
 
 /*
@@ -846,6 +855,7 @@ struct cgroup_subsys {
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
+extern bool cgroup_enable_per_threadgroup_rwsem;
 
 struct cgroup_of_peak {
 	unsigned long		value;
@@ -857,11 +867,14 @@ struct cgroup_of_peak {
  * @tsk: target task
  *
  * Allows cgroup operations to synchronize against threadgroup changes
- * using a percpu_rw_semaphore.
+ * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when
+ * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition.
  */
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (cgroup_enable_per_threadgroup_rwsem)
+		down_read(&tsk->signal->cgroup_threadgroup_rwsem);
 }
 
 /**
@@ -872,6 +885,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (cgroup_enable_per_threadgroup_rwsem)
+		up_read(&tsk->signal->cgroup_threadgroup_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..7d6449982822 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore cgroup_threadgroup_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..a55e2189206f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.cgroup_threadgroup_rwsem	= __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index a6d6f30b6f65..22051b4f1ccb 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode);
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode);
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk);
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     enum cgroup_attach_lock_mode *lock_mode)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 852ebe7ca3a1..a9e029b570c8 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -69,7 +69,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -81,7 +81,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 
 	return retval;
@@ -118,7 +118,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -154,7 +154,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL);
+	cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a6b81b48bb70..fed701df1167 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -239,6 +239,14 @@ static u16 have_canfork_callback __read_mostly;
 
 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
+/*
+ * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem,
+ * read protected by either.
+ *
+ * Can only be turned on, but not turned off.
+ */
+bool cgroup_enable_per_threadgroup_rwsem __read_mostly;
+
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
 	.ns.count	= REFCOUNT_INIT(2),
@@ -1325,14 +1333,30 @@ void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
 {
 	bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
 
-	/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
+	/*
+	 * See the comment above the CGRP_ROOT_FAVOR_DYNMODS definition.
+	 * favordynmods can flip while a task is between
+	 * cgroup_threadgroup_change_begin() and end(), so down_write the
+	 * global cgroup_threadgroup_rwsem to synchronize them.
+	 *
+	 * Once cgroup_enable_per_threadgroup_rwsem is enabled, holding
+	 * cgroup_threadgroup_rwsem doesn't exclude tasks between
+	 * cgroup_threadgroup_change_begin() and end(), and thus it's
+	 * unsafe to turn off. As the scenario is unlikely, simply disallow
+	 * disabling once enabled and print out a warning.
+	 */
+	percpu_down_write(&cgroup_threadgroup_rwsem);
 	if (favor && !favoring) {
+		cgroup_enable_per_threadgroup_rwsem = true;
 		rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
 		root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
 	} else if (!favor && favoring) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			pr_warn_once("cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n");
 		rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
 		root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
 	}
+	percpu_up_write(&cgroup_threadgroup_rwsem);
 }
 
 static int cgroup_init_root_id(struct cgroup_root *root)
@@ -2482,7 +2506,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_mode: whether to down_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to take a rwsem and, if so, which one
+ * @tsk: thread group to lock
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2502,8 +2527,15 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * Resolve the situation by always acquiring cpus_read_lock() before optionally
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
+ *
+ * When favordynmods is enabled, take the per threadgroup rwsem to reduce
+ * overhead on dynamic cgroup modifications. See the comment above the
+ * CGRP_ROOT_FAVOR_DYNMODS definition.
+ *
+ * tsk is not NULL only when writing to cgroup.procs.
  */
-void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode,
+			struct task_struct *tsk)
 {
 	cpus_read_lock();
 
@@ -2513,6 +2545,9 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_down_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		down_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.");
 		break;
@@ -2521,9 +2556,11 @@ void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode)
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_mode: whether to up_write cgroup_threadgroup_rwsem
+ * @lock_mode: whether to release a rwsem and, if so, which one
+ * @tsk: thread group to unlock
  */
-void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
+void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode,
+			  struct task_struct *tsk)
 {
 	switch (lock_mode) {
 	case CGRP_ATTACH_LOCK_NONE:
@@ -2531,6 +2568,9 @@ void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode)
 	case CGRP_ATTACH_LOCK_GLOBAL:
 		percpu_up_write(&cgroup_threadgroup_rwsem);
 		break;
+	case CGRP_ATTACH_LOCK_PER_THREADGROUP:
+		up_write(&tsk->signal->cgroup_threadgroup_rwsem);
+		break;
 	default:
 		pr_warn("cgroup: Unexpected attach lock mode.");
 		break;
@@ -3042,7 +3082,6 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 		tsk = ERR_PTR(-EINVAL);
 		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
 	rcu_read_unlock();
 
@@ -3055,12 +3094,16 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	lockdep_assert_held(&cgroup_mutex);
 
-	if (pid || threadgroup)
-		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
-	else
+	if (pid || threadgroup) {
+		if (cgroup_enable_per_threadgroup_rwsem)
+			*lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP;
+		else
+			*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
+	} else {
 		*lock_mode = CGRP_ATTACH_LOCK_NONE;
+	}
 
-	cgroup_attach_lock(*lock_mode);
+	cgroup_attach_lock(*lock_mode, tsk);
 
 	if (threadgroup) {
 		if (!thread_group_leader(tsk)) {
@@ -3069,7 +3112,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 			 * may strip us of our leadership. If this happens,
 			 * throw this task away and try again.
 			 */
-			cgroup_attach_unlock(*lock_mode);
+			cgroup_attach_unlock(*lock_mode, tsk);
 			put_task_struct(tsk);
 			goto retry_find_task;
 		}
@@ -3085,10 +3128,10 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 void cgroup_procs_write_finish(struct task_struct *task,
 			       enum cgroup_attach_lock_mode lock_mode)
 {
+	cgroup_attach_unlock(lock_mode, task);
+
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
-
-	cgroup_attach_unlock(lock_mode);
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
@@ -3178,7 +3221,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	else
 		lock_mode = CGRP_ATTACH_LOCK_NONE;
 
-	cgroup_attach_lock(lock_mode);
+	cgroup_attach_lock(lock_mode, NULL);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3199,7 +3242,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(lock_mode);
+	cgroup_attach_unlock(lock_mode, NULL);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..9a039867ecfd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->cgroup_threadgroup_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.51.0
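
A note on why the flag is one-way: as the comment in
cgroup_favor_dynmods() above says, once cgroup_enable_per_threadgroup_rwsem
is set, tasks sitting between cgroup_threadgroup_change_begin() and
end() can no longer be excluded in a way that makes clearing it safe.
A sketch of the hazard if the flag could be cleared (hypothetical
sequence, not code from the patch):

	cgroup_threadgroup_change_begin(tsk);
		/* flag was true: took down_read() on
		 * tsk->signal->cgroup_threadgroup_rwsem */

	/* ... suppose a remount now cleared the flag ... */

	cgroup_threadgroup_change_end(tsk);
		/* flag reads false: the matching up_read() is skipped,
		 * the per threadgroup rwsem leaks, and a later
		 * down_write() on it would block forever */

Hence the pr_warn_once() and the refusal to disable once enabled.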


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs
  2025-09-10 17:47     ` Tejun Heo
@ 2025-09-11  1:52       ` Yi Tao
  0 siblings, 0 replies; 44+ messages in thread
From: Yi Tao @ 2025-09-11  1:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel

Thank you for the review and guidance.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start
  2025-09-10  6:59   ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao
  2025-09-10 17:31     ` Tejun Heo
@ 2025-09-11  3:22     ` Waiman Long
  1 sibling, 0 replies; 44+ messages in thread
From: Waiman Long @ 2025-09-11  3:22 UTC (permalink / raw)
  To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

On 9/10/25 2:59 AM, Yi Tao wrote:
> Later patches will introduce a new parameter `task` to
> cgroup_attach_lock(); to prepare for that, this patch adjusts where
> cgroup_attach_lock() is called within cgroup_procs_write_start().
>
> Between obtaining the threadgroup leader via PID and acquiring the
> cgroup attach lock, the threadgroup leader may change, which could lead
> to incorrect cgroup migration. Therefore, after acquiring the cgroup
> attach lock, we check whether the threadgroup leader has changed, and if
> so, retry the operation.
>
> Signed-off-by: Yi Tao <escape@linux.alibaba.com>
> ---
>   kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++----------------
>   1 file changed, 38 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 2b88c7abaa00..756807164091 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2994,29 +2994,13 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>   	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
>   		return ERR_PTR(-EINVAL);
>   
> -	/*
> -	 * If we migrate a single thread, we don't care about threadgroup
> -	 * stability. If the thread is `current`, it won't exit(2) under our
> -	 * hands or change PID through exec(2). We exclude
> -	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> -	 * callers by cgroup_mutex.
> -	 * Therefore, we can skip the global lock.
> -	 */
> -	lockdep_assert_held(&cgroup_mutex);
> -
> -	if (pid || threadgroup)
> -		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
> -	else
> -		*lock_mode = CGRP_ATTACH_LOCK_NONE;
> -
> -	cgroup_attach_lock(*lock_mode);
> -
> +retry_find_task:
>   	rcu_read_lock();
>   	if (pid) {
>   		tsk = find_task_by_vpid(pid);
>   		if (!tsk) {
>   			tsk = ERR_PTR(-ESRCH);
> -			goto out_unlock_threadgroup;
> +			goto out_unlock_rcu;
>   		}
>   	} else {
>   		tsk = current;
> @@ -3033,15 +3017,46 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>   	 */
>   	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>   		tsk = ERR_PTR(-EINVAL);
> -		goto out_unlock_threadgroup;
> +		goto out_unlock_rcu;
>   	}
>   
>   	get_task_struct(tsk);
> -	goto out_unlock_rcu;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If we migrate a single thread, we don't care about threadgroup
> +	 * stability. If the thread is `current`, it won't exit(2) under our
> +	 * hands or change PID through exec(2). We exclude
> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> +	 * callers by cgroup_mutex.
> +	 * Therefore, we can skip the global lock.
> +	 */
> +	lockdep_assert_held(&cgroup_mutex);
> +
> +	if (pid || threadgroup)
> +		*lock_mode = CGRP_ATTACH_LOCK_GLOBAL;
> +	else
> +		*lock_mode = CGRP_ATTACH_LOCK_NONE;
> +
> +	cgroup_attach_lock(*lock_mode);
> +
> +	if (threadgroup) {
> +		if (!thread_group_leader(tsk)) {
Nit: You can combine the two conditions to avoid excessive indentation.

  if (threadgroup && !thread_group_leader(tsk)) {

> +			/*
> +			 * a race with de_thread from another thread's exec()
Should be "de_thread()" to signal that it is a function.
> +			 * may strip us of our leadership, if this happens,
> +			 * there is no choice but to throw this task away and
> +			 * try again; this is
> +			 * "double-double-toil-and-trouble-check locking".

This "double-double-toil-and-trouble-check" is a new term in the kernel 
source tree. I will suggest to use something simpler to avoid confusion.

Cheers, Longman


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-09-11  3:22 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
2025-09-03 13:14 ` Waiman Long
2025-09-04  1:35   ` Chen Ridong
2025-09-04  4:59   ` escape
2025-09-04  5:02   ` escape
2025-09-03 16:53 ` Tejun Heo
2025-09-03 20:03   ` Michal Koutný
2025-09-03 20:45     ` Tejun Heo
2025-09-04  1:40       ` Chen Ridong
2025-09-04  6:43         ` escape
2025-09-04  6:52         ` Tejun Heo
2025-09-04  3:15   ` escape
2025-09-04  6:38     ` escape
2025-09-04  7:28     ` Tejun Heo
2025-09-04  8:10       ` escape
2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
2025-09-04 16:31     ` Tejun Heo
2025-09-05  2:16       ` escape
2025-09-05  2:27         ` Tejun Heo
2025-09-05  3:44           ` escape
2025-09-05  3:48             ` Tejun Heo
2025-09-05  4:30               ` escape
2025-09-05  2:02     ` Chen Ridong
2025-09-05 13:17     ` kernel test robot
2025-09-08 10:20 ` [PATCH v3 0/1] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
2025-09-08 10:20   ` [PATCH v3 1/1] " Yi Tao
2025-09-08 16:05     ` Tejun Heo
2025-09-08 19:39     ` Waiman Long
2025-09-09  7:55 ` [PATCH v4 0/3] " Yi Tao
2025-09-09  7:55   ` [PATCH v4 1/3] " Yi Tao
2025-09-09  7:55   ` [PATCH v4 2/3] cgroup: retry find task if threadgroup leader changed Yi Tao
2025-09-09 16:59     ` Tejun Heo
2025-09-09  7:55   ` [PATCH v4 3/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
2025-09-09 17:00     ` Tejun Heo
2025-09-10  6:59 ` [PATCH v5 0/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
2025-09-10  6:59   ` [PATCH v5 1/3] cgroup: refactor the cgroup_attach_lock code to make it clearer Yi Tao
2025-09-10 17:26     ` Tejun Heo
2025-09-10  6:59   ` [PATCH v5 2/3] cgroup: relocate cgroup_attach_lock within cgroup_procs_write_start Yi Tao
2025-09-10 17:31     ` Tejun Heo
2025-09-11  3:22     ` Waiman Long
2025-09-10  6:59   ` [PATCH v5 3/3] cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs Yi Tao
2025-09-10 17:47     ` Tejun Heo
2025-09-11  1:52       ` Yi Tao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).