* [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
@ 2025-09-03 11:11 Yi Tao
  2025-09-03 13:14 ` Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Yi Tao @ 2025-09-03 11:11 UTC
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch takes the write lock on
signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per-thread-group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     |  2 ++
 include/linux/sched/signal.h    |  4 +++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  6 ++--
 kernel/cgroup/cgroup-v1.c       |  8 ++---
 kernel/cgroup/cgroup.c          | 56 ++++++++++++++++++++-------------
 kernel/fork.c                   |  4 +++
 7 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..8e0fdad8a440 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -838,6 +838,7 @@ struct cgroup_of_peak {
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	down_read(&tsk->signal->group_rwsem);
 }
 
 /**
@@ -848,6 +849,7 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	up_read(&tsk->signal->group_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..86fbc99a9174 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore group_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..0450093924a7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.group_rwsem	= __RWSEM_INITIALIZER(init_signals.group_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..572c24b7e947 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,10 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup,
+			bool global);
+void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup,
+			  bool global);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..3e5ead8c5bc5 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true, true);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true, true);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true, true);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true, true);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..4e1d80a2741f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(struct task_struct *tsk,
+			       bool lock_threadgroup, bool global)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (global)
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+		else
+			down_write(&tsk->signal->group_rwsem);
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
  * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(struct task_struct *tsk,
+				 bool lock_threadgroup, bool global)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (global)
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+		else
+			up_write(&tsk->signal->group_rwsem);
+	}
 	cpus_read_unlock();
 }
 
@@ -2970,12 +2980,23 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *threadgroup_locked)
 {
-	struct task_struct *tsk;
+	struct task_struct *tsk, *tmp_tsk;
 	pid_t pid;
 
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
+	rcu_read_lock();
+	if (pid) {
+		tsk = find_task_by_vpid(pid);
+		if (!tsk) {
+			tsk = ERR_PTR(-ESRCH);
+			goto out_unlock_rcu;
+		}
+	} else {
+		tsk = current;
+	}
+
 	/*
 	 * If we migrate a single thread, we don't care about threadgroup
 	 * stability. If the thread is `current`, it won't exit(2) under our
@@ -2986,18 +3007,9 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	lockdep_assert_held(&cgroup_mutex);
 	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
 
-	rcu_read_lock();
-	if (pid) {
-		tsk = find_task_by_vpid(pid);
-		if (!tsk) {
-			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
-		}
-	} else {
-		tsk = current;
-	}
+	tmp_tsk = tsk;
+	cgroup_attach_lock(tmp_tsk, *threadgroup_locked, false);
 
 	if (threadgroup)
 		tsk = tsk->group_leader;
@@ -3017,7 +3029,7 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	goto out_unlock_rcu;
 
 out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
+	cgroup_attach_unlock(tmp_tsk, *threadgroup_locked, false);
 	*threadgroup_locked = false;
 out_unlock_rcu:
 	rcu_read_unlock();
@@ -3032,7 +3044,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(task, threadgroup_locked, false);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3119,7 +3131,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(NULL, has_tasks, true);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3152,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(NULL, has_tasks, true);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..5218f9b93c77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->group_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
@ 2025-09-03 13:14 ` Waiman Long
  2025-09-04  1:35   ` Chen Ridong
                     ` (2 more replies)
  2025-09-03 16:53 ` Tejun Heo
  2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
  2 siblings, 3 replies; 18+ messages in thread
From: Waiman Long @ 2025-09-03 13:14 UTC
  To: Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel


On 9/3/25 7:11 AM, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.
>
> To alleviate the contention between cgroup process migration and
> operations like process fork, this patch takes the write lock on
> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
> instead of holding the global write lock.
>
> Cgroup process migration has historically relied on
> signal_struct->group_rwsem to protect thread group integrity. In commit
> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem"), this was changed to a global
> cgroup_threadgroup_rwsem. The advantage of using a global lock was
> simplified handling of process group migrations. This patch retains the
> use of the global lock for protecting process group migration, while
> reducing contention by using a per-thread-group lock during
> cgroup.procs/threads writes.
>
> The locking behavior is as follows:
>
> write cgroup.procs/threads  | process fork,exec,exit | process group migration
> ------------------------------------------------------------------------------
> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
> critical section            | critical section       | critical section
> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>
> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
> signal_struct->group_rwsem.
>
> This patch eliminates contention between cgroup migration and fork
> operations for threads that belong to different thread groups, thereby
> reducing the long-tail latency of cgroup migrations and lowering system
> load.
Do you have any performance numbers to show how much benefit this
change provides?
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..4e1d80a2741f 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2480,21 +2480,31 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>    * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>    * CPU hotplug is disabled on entry.
>    */
> -void cgroup_attach_lock(bool lock_threadgroup)
> +void cgroup_attach_lock(struct task_struct *tsk,
> +			       bool lock_threadgroup, bool global)
>   {
>   	cpus_read_lock();
> -	if (lock_threadgroup)
> -		percpu_down_write(&cgroup_threadgroup_rwsem);
> +	if (lock_threadgroup) {
> +		if (global)
> +			percpu_down_write(&cgroup_threadgroup_rwsem);
> +		else
> +			down_write(&tsk->signal->group_rwsem);
> +	}
>   }

tsk is only used when global is false. So you can eliminate the 
redundant global argument and use the presence of tsk to indicate which 
lock to take.

Cheers,
Longman



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
  2025-09-03 13:14 ` Waiman Long
@ 2025-09-03 16:53 ` Tejun Heo
  2025-09-03 20:03   ` Michal Koutný
  2025-09-04  3:15   ` escape
  2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
  2 siblings, 2 replies; 18+ messages in thread
From: Tejun Heo @ 2025-09-03 16:53 UTC
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote:
> As computer hardware advances, modern systems are typically equipped
> with many CPU cores and large amounts of memory, enabling the deployment
> of numerous applications. On such systems, container creation and
> deletion become frequent operations, making cgroup process migration no
> longer a cold path. This leads to noticeable contention with common
> process operations such as fork, exec, and exit.

If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP?
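
For reference, a minimal sketch of what that looks like from userspace
(the cgroup path is hypothetical; clone3() with CLONE_INTO_CGROUP needs
Linux 5.7+ and reasonably recent headers):

```
#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args, CLONE_INTO_CGROUP */
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>

int main(void)
{
	/* fd of the destination cgroup directory (hypothetical path) */
	int cgfd = open("/sys/fs/cgroup/mygrp", O_RDONLY | O_DIRECTORY);
	if (cgfd < 0) {
		perror("open cgroup dir");
		return 1;
	}

	struct clone_args args = {
		.flags       = CLONE_INTO_CGROUP,
		.exit_signal = SIGCHLD,
		.cgroup      = (unsigned long)cgfd,
	};

	/* the child starts life directly in the target cgroup, so no
	 * later write to cgroup.procs is needed */
	long pid = syscall(SYS_clone3, &args, sizeof(args));
	if (pid < 0) {
		perror("clone3");
		return 1;
	}
	if (pid == 0)
		_exit(0);          /* child */

	waitpid(pid, NULL, 0);
	close(cgfd);
	return 0;
}
```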

Thanks.

-- 
tejun


* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 16:53 ` Tejun Heo
@ 2025-09-03 20:03   ` Michal Koutný
  2025-09-03 20:45     ` Tejun Heo
  2025-09-04  3:15   ` escape
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Koutný @ 2025-09-03 20:03 UTC
  To: Tejun Heo; +Cc: Yi Tao, hannes, cgroups, linux-kernel


On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
> from cgroup side.

CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
migration, no? Its effect is atomicity wrt clone.
Or, Tejun, what do you mean by "it disappears"? (I think we cannot give
up cgroup_mutex as it ensures synchronization of possible parent's
migration.)

Michal



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 20:03   ` Michal Koutný
@ 2025-09-03 20:45     ` Tejun Heo
  2025-09-04  1:40       ` Chen Ridong
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2025-09-03 20:45 UTC
  To: Michal Koutný; +Cc: Yi Tao, hannes, cgroups, linux-kernel

Hello, Michal.

On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote:
> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
> > disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
> > from cgroup side.
> 
> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
> migration, no? Its effect is atomicity wrt clone.
> Or, Tejun, what do you mean by "it disappears"? (I think we cannot give
> up cgroup_mutex as it ensures synchronization of possible parent's
> migration.)

Sorry, I was confused. We no longer need to write lock threadgroup rwsem
when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
cgroup_mutex.

  671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")

Thanks.

-- 
tejun


* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 13:14 ` Waiman Long
@ 2025-09-04  1:35   ` Chen Ridong
  2025-09-04  4:59   ` escape
  2025-09-04  5:02   ` escape
  2 siblings, 0 replies; 18+ messages in thread
From: Chen Ridong @ 2025-09-04  1:35 UTC
  To: Waiman Long, Yi Tao, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel



On 2025/9/3 21:14, Waiman Long wrote:
> 
> On 9/3/25 7:11 AM, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
>>

We are facing the same issue.

>> To alleviate the contention between cgroup process migration and
>> operations like process fork, this patch takes the write lock on
>> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
>> instead of holding the global write lock.
>>
>> Cgroup process migration has historically relied on
>> signal_struct->group_rwsem to protect thread group integrity. In commit
>> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
>> a global percpu_rwsem"), this was changed to a global
>> cgroup_threadgroup_rwsem. The advantage of using a global lock was
>> simplified handling of process group migrations. This patch retains the
>> use of the global lock for protecting process group migration, while
>> reducing contention by using a per-thread-group lock during
>> cgroup.procs/threads writes.
>>
>> The locking behavior is as follows:
>>
>> write cgroup.procs/threads  | process fork,exec,exit | process group migration
>> ------------------------------------------------------------------------------
>> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
>> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
>> critical section            | critical section       | critical section
>> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
>> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>>
>> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
>> signal_struct->group_rwsem.
>>
>> This patch eliminates contention between cgroup migration and fork
>> operations for threads that belong to different thread groups, thereby
>> reducing the long-tail latency of cgroup migrations and lowering system
>> load.
> Do you have any performance numbers to show how much benefit this change provides?

Yes, I would like this performance data as well.


-- 
Best regards,
Ridong



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 20:45     ` Tejun Heo
@ 2025-09-04  1:40       ` Chen Ridong
  2025-09-04  6:43         ` escape
  2025-09-04  6:52         ` Tejun Heo
  0 siblings, 2 replies; 18+ messages in thread
From: Chen Ridong @ 2025-09-04  1:40 UTC
  To: Tejun Heo, Michal Koutný; +Cc: Yi Tao, hannes, cgroups, linux-kernel



On 2025/9/4 4:45, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote:
>> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
>>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
>>> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
>>> from cgroup side.
>>
>> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
>> migration, no? Its effect is atomicity wrt clone.
>> Or, Tejun, what do you mean by "it disappears"? (I think we cannot give
>> up cgroup_mutex as it ensures synchronization of possible parent's
>> migration.)
> 
> Sorry, I was confused. We no longer need to write lock threadgroup rwsem
> when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
> cgroup_mutex.
> 
>   671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")
> 
> Thanks.
> 

I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when
updating csses on an empty subtree") only applies to CSS updates. However, cloning with
CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem.

cgroup_can_fork()
  cgroup_css_set_fork()
    if (kargs->flags & CLONE_INTO_CGROUP)
      cgroup_lock();
    cgroup_threadgroup_change_begin(current);

-- 
Best regards,
Ridong



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 16:53 ` Tejun Heo
  2025-09-03 20:03   ` Michal Koutný
@ 2025-09-04  3:15   ` escape
  2025-09-04  6:38     ` escape
  2025-09-04  7:28     ` Tejun Heo
  1 sibling, 2 replies; 18+ messages in thread
From: escape @ 2025-09-04  3:15 UTC
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel


On 2025/9/4 00:53, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
> from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP?
>
> Thanks.
>
As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
still requires holding the threadgroup_rwsem, so contention with fork
operations persists.

CLONE_INTO_CGROUP helps alleviate the contention between cgroup creation
and deletion, but its usage comes with significant limitations:

1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
adoption is gradually increasing, many applications have not yet been
adapted to cgroup v2, and phasing out cgroup v1 will be a long and
gradual process.


2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
time of process fork, effectively restricting cgroup migration to the
fork stage. This differs significantly from the typical cgroup attach
workflow. For example, in Kubernetes, systemd is the recommended cgroup
driver; kubelet communicates with systemd via D-Bus, and systemd
performs the actual cgroup attachment. In this case, the process being
attached typically does not have systemd as its parent. Using
CLONE_INTO_CGROUP in such a scenario is impractical and would require
coordinated changes to both systemd and kubelet.
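
For reference, the attach step such a manager performs ultimately boils
down to writing the target pid into the destination cgroup.procs,
roughly like this (a minimal sketch; the cgroup path is hypothetical):

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* rough sketch of a post-fork attach done by an external manager such
 * as systemd; the caller is not the parent of @pid */
static int attach_pid(pid_t pid)
{
	int fd = open("/sys/fs/cgroup/cpu/mygrp/cgroup.procs", O_WRONLY);

	if (fd < 0)
		return -1;
	dprintf(fd, "%d\n", pid);
	close(fd);
	return 0;
}
```

Unlike CLONE_INTO_CGROUP, this works on any already-running task, which
is what the systemd/kubelet flow above depends on.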

Thanks.



* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 13:14 ` Waiman Long
  2025-09-04  1:35   ` Chen Ridong
@ 2025-09-04  4:59   ` escape
  2025-09-04  5:02   ` escape
  2 siblings, 0 replies; 18+ messages in thread
From: escape @ 2025-09-04  4:59 UTC
  To: Waiman Long, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel



On 2025/9/3 21:14, Waiman Long wrote:
>
> On 9/3/25 7:11 AM, Yi Tao wrote:
>> As computer hardware advances, modern systems are typically equipped
>> with many CPU cores and large amounts of memory, enabling the deployment
>> of numerous applications. On such systems, container creation and
>> deletion become frequent operations, making cgroup process migration no
>> longer a cold path. This leads to noticeable contention with common
>> process operations such as fork, exec, and exit.
>>
>> To alleviate the contention between cgroup process migration and
>> operations like process fork, this patch takes the write lock on
>> signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
>> instead of holding the global write lock.
>>
>> Cgroup process migration has historically relied on
>> signal_struct->group_rwsem to protect thread group integrity. In commit
>> 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
>> a global percpu_rwsem"), this was changed to a global
>> cgroup_threadgroup_rwsem. The advantage of using a global lock was
>> simplified handling of process group migrations. This patch retains the
>> use of the global lock for protecting process group migration, while
>> reducing contention by using a per-thread-group lock during
>> cgroup.procs/threads writes.
>>
>> The locking behavior is as follows:
>>
>> write cgroup.procs/threads  | process fork,exec,exit | process group migration
>> ------------------------------------------------------------------------------
>> cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
>> down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
>> critical section            | critical section       | critical section
>> up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
>> cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()
>>
>> g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
>> signal_struct->group_rwsem.
>>
>> This patch eliminates contention between cgroup migration and fork
>> operations for threads that belong to different thread groups, thereby
>> reducing the long-tail latency of cgroup migrations and lowering system
>> load.
> Do you have any performance numbers to show how much benefit this
> change provides?
Yes, here is the simulation test setup and results.

Machine Configuration:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8475B
BIOS Model name: Intel(R) Xeon(R) Platinum 8475B
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191

Test Program 1: cgroup_attach, four threads each write to cgroup.procs
every 10ms, simulating frequent cgroup process migrations.
Test Program 2: stress-ng or UnixBench, used to generate CPU- and
fork-intensive workloads.

Terminal 1: taskset -ac 0-47 ./cgroup_attach
Terminal 2: taskset -ac 48-95 stress-ng --cpu 48 --forkheavy 4

Use bpftrace to monitor the latency of cgroup1_procs_write.
Without the patch: the tail latency is 30ms.
With the patch: the tail latency is 16μs.

Replace stress-ng with UnixBench to evaluate fork performance:

Terminal 2: taskset -ac 48-95 ./Run spawn

Without the patch: Multi-CPU score = 15753.9
With the patch: Multi-CPU score = 17111.2

We can observe that the latency of cgroup process migration has been
significantly reduced, and the performance of process forking has
improved.

In production environments running big data computing workloads, without
this patch we observed a large number of processes entering the
uninterruptible sleep (D) state due to contention on
`cgroup_threadgroup_rwsem`. With the patch applied, the system load
average has decreased.

cgroup_attach.c
```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>       /* mkdir() */
#include <sys/syscall.h>
#include <sys/wait.h>       /* waitpid() */

#define NUM_THREADS 4
#define SLEEP_US 10000  // 10ms
#define CGROUP_ROOT "/sys/fs/cgroup/cpu"

/* raw syscall; avoids clashing with the gettid() that glibc >= 2.30
 * already declares */
static pid_t thread_tid(void) {
    return syscall(SYS_gettid);
}

void* thread_func(void* arg) {
    char dirname[256];
    char procs_file[256];
    int fd;
    pid_t child_pid;

    (void)arg;
    pid_t tid = thread_tid();

    snprintf(dirname, sizeof(dirname), "%s/test-%d", CGROUP_ROOT, tid);
    snprintf(procs_file, sizeof(procs_file), "%s/cgroup.procs", dirname);

    /* one private cgroup per attacher thread (fails if it already exists) */
    if (mkdir(dirname, 0755) != 0) {
        perror(dirname);
        pthread_exit(NULL);
    }

    printf("thread %d: cgroup dir %s\n", tid, dirname);

    while (1) {
        child_pid = fork();

        if (child_pid < 0) {
            perror("fork");
            continue;
        }

        if (child_pid == 0) {
            /* child: stay alive long enough to be migrated */
            usleep(SLEEP_US);
            exit(0);
        }

        /* parent: migrate the child by writing its pid */
        fd = open(procs_file, O_WRONLY);
        if (fd < 0) {
            perror("open cgroup.procs");
        } else {
            char pid_str[16];
            int len = snprintf(pid_str, sizeof(pid_str), "%d\n", child_pid);
            if (write(fd, pid_str, len) < 0) {
                perror("write cgroup.procs");
            }
            close(fd);
        }
        waitpid(child_pid, NULL, 0);
        usleep(SLEEP_US);
    }

    pthread_exit(NULL);
}

int main() {
    pthread_t threads[NUM_THREADS];
    int i;

    for (i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, thread_func, NULL) != 0) {
            perror("pthread_create");
            break;
        }
    }

    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    return 0;
}
```




* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 13:14 ` Waiman Long
  2025-09-04  1:35   ` Chen Ridong
  2025-09-04  4:59   ` escape
@ 2025-09-04  5:02   ` escape
  2 siblings, 0 replies; 18+ messages in thread
From: escape @ 2025-09-04  5:02 UTC
  To: Waiman Long, tj, hannes, mkoutny; +Cc: cgroups, linux-kernel




On 2025/9/3 21:14, Waiman Long wrote:
>
> tsk is only used when global is false. So you can eliminate the 
> redundant global argument and use the presence of tsk to indicate 
> which lock to take.
>
> Cheers,
> Longman
Thank you for your suggestion. As you pointed out, we can determine
which lock to acquire based on whether the task is NULL. I'll
incorporate this change in the next version of the patch.




* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  3:15   ` escape
@ 2025-09-04  6:38     ` escape
  2025-09-04  7:28     ` Tejun Heo
  1 sibling, 0 replies; 18+ messages in thread
From: escape @ 2025-09-04  6:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel



On 2025/9/4 11:15, escape wrote:
>
> On 2025/9/4 00:53, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Sep 03, 2025 at 07:11:07PM +0800, Yi Tao wrote:
>>> As computer hardware advances, modern systems are typically equipped
>>> with many CPU cores and large amounts of memory, enabling the 
>>> deployment
>>> of numerous applications. On such systems, container creation and
>>> deletion become frequent operations, making cgroup process migration no
>>> longer a cold path. This leads to noticeable contention with common
>>> process operations such as fork, exec, and exit.
>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become 
>> cold. It
>> disappears completely and CLONE_INTO_CGROUP doesn't need any global 
>> locks
>> from cgroup side. Are there reasons why you can't use CLONE_INTO_CGROUP?
>>
>> Thanks.
>>
> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
> still requires holding the threadgroup_rwsem, so contention with fork
> operations persists.
Sorry, my understanding here was wrong: using CLONE_INTO_CGROUP can
indeed avoid contending with fork, but the limitations described below
still exist.

Thanks.
>
> CLONE_INTO_CGROUP helps alleviate the contention caused by cgroup creation
> and deletion, but its usage comes with significant limitations:
>
> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
> adoption is gradually increasing, many applications have not yet been
> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
> gradual process.
>
>
> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at 
> the
> time of process fork, effectively restricting cgroup migration to the
> fork stage. This differs significantly from the typical cgroup attach
> workflow. For example, in Kubernetes, systemd is the recommended cgroup
> driver; kubelet communicates with systemd via D-Bus, and systemd
> performs the actual cgroup attachment. In this case, the process being
> attached typically does not have systemd as its parent. Using
> CLONE_INTO_CGROUP in such a scenario is impractical and would require
> coordinated changes to both systemd and kubelet.
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  1:40       ` Chen Ridong
@ 2025-09-04  6:43         ` escape
  2025-09-04  6:52         ` Tejun Heo
  1 sibling, 0 replies; 18+ messages in thread
From: escape @ 2025-09-04  6:43 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Michal Koutný; +Cc: hannes, cgroups, linux-kernel



On 2025/9/4 09:40, Chen Ridong wrote:
>
> On 2025/9/4 4:45, Tejun Heo wrote:
>> Hello, Michal.
>>
>> On Wed, Sep 03, 2025 at 10:03:39PM +0200, Michal Koutný wrote:
>>> On Wed, Sep 03, 2025 at 06:53:36AM -1000, Tejun Heo <tj@kernel.org> wrote:
>>>> If you use CLONE_INTO_CGROUP, cgroup migration doesn't just become cold. It
>>>> disappears completely and CLONE_INTO_CGROUP doesn't need any global locks
>>>> from cgroup side.
>>> CLONE_INTO_CGROUP uses cgroup_mutex and threadgroup rwsem like regular
>>> migration, no? Its effect is atomicity wrt clone.
>>> Or, Tejun, what do you mean when you say it disappears? (I think we cannot
>>> give up cgroup_mutex, as it ensures synchronization of a possible parent's
>>> migration.)
>> Sorry, I was confused. We no longer need to write lock threadgroup rwsem
>> when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
>> cgroup_mutex.
>>
>>    671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")
>>
>> Thanks.
>>
> I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when
> updating csses on an empty subtree") only applies to CSS updates. However, cloning with
> CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem.
>
> cgroup_can_fork()
>   cgroup_css_set_fork()
>     if (kargs->flags & CLONE_INTO_CGROUP)
>       cgroup_lock();
>     cgroup_threadgroup_change_begin(current);
>
When using CLONE_INTO_CGROUP, there is no need to write cgroup.procs.
So although a read lock is still required here, there is no contention
with the write lock.

Thanks.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  1:40       ` Chen Ridong
  2025-09-04  6:43         ` escape
@ 2025-09-04  6:52         ` Tejun Heo
  1 sibling, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2025-09-04  6:52 UTC (permalink / raw)
  To: Chen Ridong; +Cc: Michal Koutný, Yi Tao, hannes, cgroups, linux-kernel

Hello,

On Thu, Sep 04, 2025 at 09:40:12AM +0800, Chen Ridong wrote:
...
> > Sorry, I was confused. We no longer need to write lock threadgroup rwsem
> > when CLONE_INTO_CGROUP'ing into an empty cgroup. We do still need
> > cgroup_mutex.
> > 
> >   671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree")
> > 
> > Thanks.
> > 
> 
> I'm still a bit confused. Commit 671c11f0619e ("cgroup: Elide write-locking threadgroup_rwsem when
> updating csses on an empty subtree") only applies to CSS updates. However, cloning with
> CLONE_INTO_CGROUP still requires acquiring the threadgroup_rwsem.
> 
> cgroup_can_fork()
>   cgroup_css_set_fork()
>     if (kargs->flags & CLONE_INTO_CGROUP)
>       cgroup_lock();
>     cgroup_threadgroup_change_begin(current);

Ah, yeah, I'm misremembering things, sorry. What got elided in that commit
is down_write of threadgroup_rwsem when enabling controllers on empty
cgroups, which was the only operation which still needed to down_write the
rwsem. Here's an excerpt from the commit message:

    After this optimization, the usage pattern of creating a cgroup, enabling
    the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and
    then removing the cgroup after it becomes empty doesn't need to write-lock
    threadgroup_rwsem at all.

It's true that cgroup_threadgroup_change_begin() down_reads the
threadgroup_rwsem, but that is a percpu_rwsem whose read operations are
percpu inc/dec. This doesn't add any noticeable overhead or raise any
scalability concerns.

So, if you follow the "recommended" workflow, the only remaining possible
scalability bottleneck is cgroup_mutex.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  3:15   ` escape
  2025-09-04  6:38     ` escape
@ 2025-09-04  7:28     ` Tejun Heo
  2025-09-04  8:10       ` escape
  1 sibling, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2025-09-04  7:28 UTC (permalink / raw)
  To: escape; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
> On 2025/9/4 00:53, Tejun Heo wrote:
> > Hello,
...
> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
> still requires holding the threadgroup_rwsem, so contention with fork
> operations persists.

Sorry for fumbling the explanation repeatedly, but this isn't true. On
cgroup2, if you create a cgroup, enable controllers and then seed it with
CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
remaining contention point is cgroup_mutex.
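
A minimal userspace sketch of that recommended flow (assuming Linux >=
5.7 for clone3()/CLONE_INTO_CGROUP, and a pre-created cgroup at the
hypothetical path /sys/fs/cgroup/mygrp with controllers already
enabled):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
	#include <signal.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		/* open the target cgroup directory; it must already exist */
		int cgfd = open("/sys/fs/cgroup/mygrp", O_RDONLY | O_DIRECTORY);
		if (cgfd < 0) {
			perror("open");
			return 1;
		}

		struct clone_args args = {
			.flags       = CLONE_INTO_CGROUP,
			.exit_signal = SIGCHLD,
			.cgroup      = cgfd,
		};

		/* clone3() has no glibc wrapper; use the raw syscall */
		long pid = syscall(SYS_clone3, &args, sizeof(args));
		if (pid < 0) {
			perror("clone3");
			return 1;
		}
		if (pid == 0) {
			/* the child starts life attached to the target cgroup,
			 * so nothing ever writes cgroup.procs for it */
			execlp("cat", "cat", "/proc/self/cgroup", (char *)NULL);
			_exit(127);
		}
		waitpid((pid_t)pid, NULL, 0);
		close(cgfd);
		return 0;
	}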

> CLONE_INTO_CGROUP helps alleviate the contention caused by cgroup creation
> and deletion, but its usage comes with significant limitations:
> 
> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
> adoption is gradually increasing, many applications have not yet been
> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
> gradual process.
> 
> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
> time of process fork, effectively restricting cgroup migration to the
> fork stage. This differs significantly from the typical cgroup attach
> workflow. For example, in Kubernetes, systemd is the recommended cgroup
> driver; kubelet communicates with systemd via D-Bus, and systemd
> performs the actual cgroup attachment. In this case, the process being
> attached typically does not have systemd as its parent. Using
> CLONE_INTO_CGROUP in such a scenario is impractical and would require
> coordinated changes to both systemd and kubelet.

A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
locking to avoid adding overhead to hot paths - fork and exit - because
cgroup operations were expected to be a lot colder. Now, threadgroup rwsem
is *really* expensive for the writers, so the trade-off could be a bit too
extreme for some use cases.

However, now that the most common usage pattern doesn't involve
threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
overhead to work around usage patterns that we want to move away from. Note
that dynamic migrations have other, more fundamental problems for stateful
resources, and we generally want to move away from them. Sure, a single rwsem
operation in fork/exit isn't a lot of overhead, but it isn't nothing either,
and it will impact everybody.

Maybe we can make it a mount option so that use cases that still depend on
it can toggle it on? In fact, there's already a favordynmods mount option
which seems like a good fit. Maybe put the extra locking behind that flag?
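
As a usage sketch (an illustration only, assuming cgroup2 is not yet
mounted at /sys/fs/cgroup and the caller is privileged), favordynmods
is requested at mount time like any other cgroup2 mount option:

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* mount cgroup2 with the favordynmods option enabled */
		if (mount("none", "/sys/fs/cgroup", "cgroup2", 0,
			  "favordynmods")) {
			perror("mount");
			return 1;
		}
		return 0;
	}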

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04  7:28     ` Tejun Heo
@ 2025-09-04  8:10       ` escape
  0 siblings, 0 replies; 18+ messages in thread
From: escape @ 2025-09-04  8:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hannes, mkoutny, cgroups, linux-kernel



On 2025/9/4 15:28, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 04, 2025 at 11:15:26AM +0800, escape wrote:
>> On 2025/9/4 00:53, Tejun Heo wrote:
>>> Hello,
> ...
>> As Ridong pointed out, in the current code, using CLONE_INTO_CGROUP
>> still requires holding the threadgroup_rwsem, so contention with fork
>> operations persists.
> Sorry for fumbling the explanation repeatedly, but this isn't true. On
> cgroup2, if you create a cgroup, enable controllers and then seed it with
> CLONE_INTO_CGROUP, threadgroup_rwsem is out of the picture. The only
> remaining contention point is cgroup_mutex.
>
>> CLONE_INTO_CGROUP helps alleviate the contention caused by cgroup creation
>> and deletion, but its usage comes with significant limitations:
>>
>> 1. CLONE_INTO_CGROUP is only available in cgroup v2. Although cgroup v2
>> adoption is gradually increasing, many applications have not yet been
>> adapted to cgroup v2, and phasing out cgroup v1 will be a long and
>> gradual process.
>>
>> 2. CLONE_INTO_CGROUP requires specifying the cgroup file descriptor at the
>> time of process fork, effectively restricting cgroup migration to the
>> fork stage. This differs significantly from the typical cgroup attach
>> workflow. For example, in Kubernetes, systemd is the recommended cgroup
>> driver; kubelet communicates with systemd via D-Bus, and systemd
>> performs the actual cgroup attachment. In this case, the process being
>> attached typically does not have systemd as its parent. Using
>> CLONE_INTO_CGROUP in such a scenario is impractical and would require
>> coordinated changes to both systemd and kubelet.
> A percpu rwsem (threadgroup_rwsem) was used instead of per-threadgroup
> locking to avoid adding overhead to hot paths - fork and exit - because
> cgroup operations were expected to be a lot colder. Now, threadgroup rwsem
> is *really* expensive for the writers, so the trade-off could be a bit too
> extreme for some use cases.
>
> However, now that the most common usage pattern doesn't involve
> threadgroup_rwsem, I don't feel too enthusiastic about adding hot path
> overhead to work around usage patterns that we want to move away from. Note
> that dynamic migrations have other, more fundamental problems for stateful
> resources, and we generally want to move away from them. Sure, a single rwsem
> operation in fork/exit isn't a lot of overhead, but it isn't nothing either,
> and it will impact everybody.
>
> Maybe we can make it a mount option so that use cases that still depend on
> it can toggle it on? In fact, there's already a favordynmods mount option
> which seems like a good fit. Maybe put the extra locking behind that flag?
>
> Thanks.
>
Thank you for your reply.

I agree that mounting with cgroup v2, creating a cgroup, enabling
controllers, and then seeding it with CLONE_INTO_CGROUP is an excellent
solution. However, we've encountered significant obstacles when applying
this approach on some older systems. We are simultaneously working on
enabling CLONE_INTO_CGROUP support in runc, systemd, and other
components, but this will take some time. This patch aims to alleviate
the issue to some extent during this transitional period.

Regarding the impact of the extra rwsem operations on hot paths, I have
conducted performance testing. In cases where there is no contention on
down_write, the UnixBench spawn test scores remain unaffected.

The suggestion to use the favordynmods flag to control whether the extra
rwsem is used is excellent. I will incorporate this condition in the
next version of the patch.

Thanks.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 0/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
  2025-09-03 13:14 ` Waiman Long
  2025-09-03 16:53 ` Tejun Heo
@ 2025-09-04 11:39 ` Yi Tao
  2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
  2 siblings, 1 reply; 18+ messages in thread
From: Yi Tao @ 2025-09-04 11:39 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

Changes in v2:
- Use favordynmods as the enabling switch.
- Determine whether to use the per-thread-group rwsem based on whether
the task is NULL.
- Fix system hang caused by acquiring cgroup_threadgroup_rwsem inside
rcu_read_lock (see the sketch below).
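
A condensed sketch of that last fix (illustrative only: the broken v1
ordering is inferred from this changelog entry, and the fixed ordering
paraphrases the diff to cgroup_procs_write_start() in the patch below):

	/* v1 (hang): once the lock choice depended on @tsk, the sleeping
	 * lock was taken inside the RCU read-side critical section.
	 */
	rcu_read_lock();
	tsk = find_task_by_vpid(pid);
	cgroup_attach_lock(...);	/* may sleep under rcu_read_lock() */
	/* ... */
	rcu_read_unlock();

	/* v2 (fixed): pin the task, leave the RCU section, then lock. */
	rcu_read_lock();
	tsk = find_task_by_vpid(pid);
	get_task_struct(tsk);
	rcu_read_unlock();
	cgroup_attach_lock(tsk, *threadgroup_locked);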

Yi Tao (1):
  cgroup: replace global percpu_rwsem with signal_struct->group_rwsem
    when writing cgroup.procs/threads

 include/linux/cgroup-defs.h     |  6 +++
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 72 +++++++++++++++++++--------------
 kernel/fork.c                   |  4 ++
 7 files changed, 64 insertions(+), 37 deletions(-)

-- 
2.32.0.3.g01195cf9f


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
@ 2025-09-04 11:39   ` Yi Tao
  2025-09-04 16:31     ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Yi Tao @ 2025-09-04 11:39 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch takes the write lock on
signal_struct->group_rwsem when writing a pid to cgroup.procs/threads,
instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
<1ed1328792ff> ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using a per-thread-group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads  | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock()               | down_read(&g_rwsem)    | cgroup_lock()
down_write(&p_rwsem)        | down_read(&p_rwsem)    | down_write(&g_rwsem)
critical section            | critical section       | critical section
up_write(&p_rwsem)          | up_read(&p_rwsem)      | up_write(&g_rwsem)
cgroup_unlock()             | up_read(&g_rwsem)      | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

To avoid affecting other users, the per-thread-group rwsem is only used
when the `favordynmods` flag is enabled.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
---
 include/linux/cgroup-defs.h     |  6 +++
 include/linux/sched/signal.h    |  4 ++
 init/init_task.c                |  3 ++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       |  8 ++--
 kernel/cgroup/cgroup.c          | 72 +++++++++++++++++++--------------
 kernel/fork.c                   |  4 ++
 7 files changed, 64 insertions(+), 37 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe..0c068dc3e08d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -828,6 +828,8 @@ struct cgroup_of_peak {
 	struct list_head	list;
 };
 
+extern bool have_favordynmods;
+
 /**
  * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
  * @tsk: target task
@@ -838,6 +840,8 @@ struct cgroup_of_peak {
 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
 {
 	percpu_down_read(&cgroup_threadgroup_rwsem);
+	if (have_favordynmods)
+		down_read(&tsk->signal->group_rwsem);
 }
 
 /**
@@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
  */
 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
 {
+	if (have_favordynmods)
+		up_read(&tsk->signal->group_rwsem);
 	percpu_up_read(&cgroup_threadgroup_rwsem);
 }
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1ef1edbaaf79..86fbc99a9174 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -226,6 +226,10 @@ struct signal_struct {
 	struct tty_audit_buf *tty_audit_buf;
 #endif
 
+#ifdef CONFIG_CGROUPS
+	struct rw_semaphore group_rwsem;
+#endif
+
 	/*
 	 * Thread is the potential origin of an oom condition; kill first on
 	 * oom
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..0450093924a7 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -27,6 +27,9 @@ static struct signal_struct init_signals = {
 	},
 	.multiprocess	= HLIST_HEAD_INIT,
 	.rlim		= INIT_RLIMITS,
+#ifdef CONFIG_CGROUPS
+	.group_rwsem	= __RWSEM_INITIALIZER(init_signals.group_rwsem),
+#endif
 	.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
 	.exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock),
 #ifdef CONFIG_POSIX_TIMERS
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index b14e61c64a34..35005543f0c7 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -249,8 +249,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
-void cgroup_attach_lock(bool lock_threadgroup);
-void cgroup_attach_unlock(bool lock_threadgroup);
+void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup);
+void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 2a4a387f867a..9f1a4d1fc741 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -68,7 +68,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 	int retval = 0;
 
 	cgroup_lock();
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;
 
@@ -80,7 +80,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
 		if (retval)
 			break;
 	}
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true);
 	cgroup_unlock();
 
 	return retval;
@@ -117,7 +117,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 
 	cgroup_lock();
 
-	cgroup_attach_lock(true);
+	cgroup_attach_lock(NULL, true);
 
 	/* all tasks in @from are being moved, all csets are source */
 	spin_lock_irq(&css_set_lock);
@@ -153,7 +153,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from)
 	} while (task && !ret);
 out_err:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(true);
+	cgroup_attach_unlock(NULL, true);
 	cgroup_unlock();
 	return ret;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..22b1659b623c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -214,7 +214,7 @@ static u16 have_exit_callback __read_mostly;
 static u16 have_release_callback __read_mostly;
 static u16 have_canfork_callback __read_mostly;
 
-static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
+bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
 
 /* cgroup namespace for init task */
 struct cgroup_namespace init_cgroup_ns = {
@@ -2459,7 +2459,8 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
 
 /**
  * cgroup_attach_lock - Lock for ->attach()
- * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
+ * @tsk: thread group to lock
+ * @lock_threadgroup: whether to down_write rwsem
  *
  * cgroup migration sometimes needs to stabilize threadgroups against forks and
  * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
@@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)
 {
 	cpus_read_lock();
-	if (lock_threadgroup)
-		percpu_down_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && have_favordynmods)
+			down_write(&tsk->signal->group_rwsem);
+		else
+			percpu_down_write(&cgroup_threadgroup_rwsem);
+	}
 }
 
 /**
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
- * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
+ * @tsk: thread group to unlock
+ * @lock_threadgroup: whether to up_write rwsem
  */
-void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(struct task_struct *tsk, bool lock_threadgroup)
 {
-	if (lock_threadgroup)
-		percpu_up_write(&cgroup_threadgroup_rwsem);
+	if (lock_threadgroup) {
+		if (tsk && have_favordynmods)
+			up_write(&tsk->signal->group_rwsem);
+		else
+			percpu_up_write(&cgroup_threadgroup_rwsem);
+	}
 	cpus_read_unlock();
 }
 
@@ -2976,24 +2986,12 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
 		return ERR_PTR(-EINVAL);
 
-	/*
-	 * If we migrate a single thread, we don't care about threadgroup
-	 * stability. If the thread is `current`, it won't exit(2) under our
-	 * hands or change PID through exec(2). We exclude
-	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
-	 * callers by cgroup_mutex.
-	 * Therefore, we can skip the global lock.
-	 */
-	lockdep_assert_held(&cgroup_mutex);
-	*threadgroup_locked = pid || threadgroup;
-	cgroup_attach_lock(*threadgroup_locked);
-
 	rcu_read_lock();
 	if (pid) {
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			tsk = ERR_PTR(-ESRCH);
-			goto out_unlock_threadgroup;
+			goto out_unlock_rcu;
 		}
 	} else {
 		tsk = current;
@@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 	 */
 	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		tsk = ERR_PTR(-EINVAL);
-		goto out_unlock_threadgroup;
+		goto out_unlock_rcu;
 	}
-
 	get_task_struct(tsk);
-	goto out_unlock_rcu;
 
-out_unlock_threadgroup:
-	cgroup_attach_unlock(*threadgroup_locked);
-	*threadgroup_locked = false;
+	rcu_read_unlock();
+
+	/*
+	 * If we migrate a single thread, we don't care about threadgroup
+	 * stability. If the thread is `current`, it won't exit(2) under our
+	 * hands or change PID through exec(2). We exclude
+	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
+	 * callers by cgroup_mutex.
+	 * Therefore, we can skip the global lock.
+	 */
+	lockdep_assert_held(&cgroup_mutex);
+	*threadgroup_locked = pid || threadgroup;
+
+	cgroup_attach_lock(tsk, *threadgroup_locked);
+
+	return tsk;
+
 out_unlock_rcu:
 	rcu_read_unlock();
 	return tsk;
@@ -3032,7 +3042,7 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	/* release reference from cgroup_procs_write_start() */
 	put_task_struct(task);
 
-	cgroup_attach_unlock(threadgroup_locked);
+	cgroup_attach_unlock(task, threadgroup_locked);
 
 	for_each_subsys(ss, ssid)
 		if (ss->post_attach)
@@ -3119,7 +3129,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	 * write-locking can be skipped safely.
 	 */
 	has_tasks = !list_empty(&mgctx.preloaded_src_csets);
-	cgroup_attach_lock(has_tasks);
+	cgroup_attach_lock(NULL, has_tasks);
 
 	/* NULL dst indicates self on default hierarchy */
 	ret = cgroup_migrate_prepare_dst(&mgctx);
@@ -3140,7 +3150,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 	ret = cgroup_migrate_execute(&mgctx);
 out_finish:
 	cgroup_migrate_finish(&mgctx);
-	cgroup_attach_unlock(has_tasks);
+	cgroup_attach_unlock(NULL, has_tasks);
 	return ret;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..5218f9b93c77 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,6 +1688,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->group_rwsem);
+#endif
+
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
-- 
2.32.0.3.g01195cf9f


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/1] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads
  2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
@ 2025-09-04 16:31     ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2025-09-04 16:31 UTC (permalink / raw)
  To: Yi Tao; +Cc: hannes, mkoutny, cgroups, linux-kernel

Hello,

On Thu, Sep 04, 2025 at 07:39:32PM +0800, Yi Tao wrote:
...
> To avoid affecting other users, the per-thread-group rwsem is only used
> when the `favordynmods` flag is enabled.

Can you please:

- Note that this isn't necessary for cgroup2's recommended workflow and is
  thus gated behind favordynmods.

- Include performance numbers briefly.

> +extern bool have_favordynmods;
> +
>  /**
>   * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
>   * @tsk: target task
> @@ -838,6 +840,8 @@ struct cgroup_of_peak {
>  static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>  {
>  	percpu_down_read(&cgroup_threadgroup_rwsem);
> +	if (have_favordynmods)
> +		down_read(&tsk->signal->group_rwsem);
>  }
>  
>  /**
> @@ -848,6 +852,8 @@ static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
>   */
>  static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
>  {
> +	if (have_favordynmods)
> +		up_read(&tsk->signal->group_rwsem);
>  	percpu_up_read(&cgroup_threadgroup_rwsem);

Hmm... I wonder whether turning the flag on/off is racy, i.e. what prevents
have_favordynmods from flipping while a task is between change_begin and end?
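
One sketch of how such an imbalance is commonly avoided (illustration
only; the tsk->cgroup_pgrp_locked field below is hypothetical and not
part of this patch): latch the decision in change_begin() so that
change_end() pairs with whatever was actually taken, regardless of the
flag flipping in between:

	static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
	{
		percpu_down_read(&cgroup_threadgroup_rwsem);
		/* record whether the per-threadgroup lock was taken */
		tsk->cgroup_pgrp_locked = READ_ONCE(have_favordynmods);
		if (tsk->cgroup_pgrp_locked)
			down_read(&tsk->signal->group_rwsem);
	}

	static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
	{
		if (tsk->cgroup_pgrp_locked)
			up_read(&tsk->signal->group_rwsem);
		percpu_up_read(&cgroup_threadgroup_rwsem);
	}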

> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 1ef1edbaaf79..86fbc99a9174 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -226,6 +226,10 @@ struct signal_struct {
>  	struct tty_audit_buf *tty_audit_buf;
>  #endif
>  
> +#ifdef CONFIG_CGROUPS
> +	struct rw_semaphore group_rwsem;
> +#endif

Maybe name it something more specific, e.g. cgroup_threadgroup_rwsem?

>  /**
>   * cgroup_attach_lock - Lock for ->attach()
> - * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
> + * @tsk: thread group to lock
> + * @lock_threadgroup: whether to down_write rwsem
>   *
>   * cgroup migration sometimes needs to stabilize threadgroups against forks and
>   * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
> @@ -2480,21 +2481,30 @@ EXPORT_SYMBOL_GPL(cgroup_path_ns);
>   * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
>   * CPU hotplug is disabled on entry.
>   */

Please expand the function comment to explain what's going on and why, and
maybe point to it from a comment on top of have_favordynmods.

> -void cgroup_attach_lock(bool lock_threadgroup)
> +void cgroup_attach_lock(struct task_struct *tsk, bool lock_threadgroup)

As @tsk is an optional argument, it'd probably make more sense to put it at
the end.

> @@ -3010,15 +3008,27 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>  	 */
>  	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
>  		tsk = ERR_PTR(-EINVAL);
> -		goto out_unlock_threadgroup;
> +		goto out_unlock_rcu;
>  	}
> -
>  	get_task_struct(tsk);
> -	goto out_unlock_rcu;
>  
> -out_unlock_threadgroup:
> -	cgroup_attach_unlock(*threadgroup_locked);
> -	*threadgroup_locked = false;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * If we migrate a single thread, we don't care about threadgroup
> +	 * stability. If the thread is `current`, it won't exit(2) under our
> +	 * hands or change PID through exec(2). We exclude
> +	 * cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
> +	 * callers by cgroup_mutex.
> +	 * Therefore, we can skip the global lock.
> +	 */
> +	lockdep_assert_held(&cgroup_mutex);
> +	*threadgroup_locked = pid || threadgroup;
> +
> +	cgroup_attach_lock(tsk, *threadgroup_locked);

I'm not sure this relocation is safe. What prevents e.g. @tsk changing its
group leader or signal struct before the lock is grabbed?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-09-04 16:31 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-03 11:11 [PATCH] cgroup: replace global percpu_rwsem with signal_struct->group_rwsem when writing cgroup.procs/threads Yi Tao
2025-09-03 13:14 ` Waiman Long
2025-09-04  1:35   ` Chen Ridong
2025-09-04  4:59   ` escape
2025-09-04  5:02   ` escape
2025-09-03 16:53 ` Tejun Heo
2025-09-03 20:03   ` Michal Koutný
2025-09-03 20:45     ` Tejun Heo
2025-09-04  1:40       ` Chen Ridong
2025-09-04  6:43         ` escape
2025-09-04  6:52         ` Tejun Heo
2025-09-04  3:15   ` escape
2025-09-04  6:38     ` escape
2025-09-04  7:28     ` Tejun Heo
2025-09-04  8:10       ` escape
2025-09-04 11:39 ` [PATCH v2 0/1] " Yi Tao
2025-09-04 11:39   ` [PATCH v2 1/1] " Yi Tao
2025-09-04 16:31     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).