public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
@ 2025-09-04  7:45 Chuyi Zhou
  2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Chuyi Zhou @ 2025-09-04  7:45 UTC (permalink / raw)
  To: tj, mkoutny, hannes, longman; +Cc: linux-kernel, Chuyi Zhou

Currently, the cpuset attach path (cpuset_attach() followed by
cpuset_post_attach()) has to wait synchronously for flush_workqueue() to
complete. How long flushing cpuset_migrate_mm_wq takes depends on how much
mm migration cpusets have initiated at that moment. When cpuset.mems is
modified for a cgroup that occupies a large amount of memory, it can
trigger extensive mm migration, causing the attach path to block in
flush_workqueue() for an extended period.

            cgroup attach operation  | someone changes cpuset.mems
                                     |
      -------------------------------+-------------------------------
       __cgroup_procs_write()        |  cpuset_write_resmask()
         cgroup_kn_lock_live()       |
         cpuset_attach()             |    cpuset_migrate_mm()
                                     |
         cpuset_post_attach()        |
           flush_workqueue(cpuset_migrate_mm_wq);

This is dangerous because the attach path runs inside the cgroup_mutex
critical section, so a long flush can end up blocking every cgroup-related
operation in the system. We hit this issue in production, and it can be
reproduced locally with the script at the end of this message. The
hung-task report below shows a cgroup.procs writer stuck in
cgroup_kn_lock_live().

[Thu Sep  4 14:51:39 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Sep  4 14:51:39 2025] task:tee             state:D stack:0     pid:13330 tgid:13330 ppid:13321  task_flags:0x400100 flags:0x00004000
[Thu Sep  4 14:51:39 2025] Call Trace:
[Thu Sep  4 14:51:39 2025]  <TASK>
[Thu Sep  4 14:51:39 2025]  __schedule+0xcc1/0x1c60
[Thu Sep  4 14:51:39 2025]  ? find_held_lock+0x2d/0xa0
[Thu Sep  4 14:51:39 2025]  schedule+0x3e/0xe0
[Thu Sep  4 14:51:39 2025]  schedule_preempt_disabled+0x15/0x30
[Thu Sep  4 14:51:39 2025]  __mutex_lock+0x928/0x1230
[Thu Sep  4 14:51:39 2025]  ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep  4 14:51:39 2025]  ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep  4 14:51:39 2025]  cgroup_kn_lock_live+0x4a/0x240
[Thu Sep  4 14:51:39 2025]  __cgroup_procs_write+0x38/0x210
[Thu Sep  4 14:51:39 2025]  cgroup_procs_write+0x17/0x30
[Thu Sep  4 14:51:39 2025]  cgroup_file_write+0xa5/0x260
[Thu Sep  4 14:51:39 2025]  kernfs_fop_write_iter+0x13d/0x1e0
[Thu Sep  4 14:51:39 2025]  vfs_write+0x310/0x530
[Thu Sep  4 14:51:39 2025]  ksys_write+0x6e/0xf0
[Thu Sep  4 14:51:39 2025]  do_syscall_64+0x77/0x390
[Thu Sep  4 14:51:39 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

This patchset defers the flush_workqueue() call until the task returns to
userspace, using task_work as originally proposed by Tejun [1], so that the
flush happens after cgroup_mutex is dropped. The operation stays
synchronous for the caller without holding up anyone else.

[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883
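
The ordering described above can be illustrated outside the kernel. What
follows is a minimal single-threaded userspace sketch, not the kernel
implementation: a boolean flag stands in for cgroup_mutex, and a one-slot
function pointer stands in for the task_work list that runs on return to
userspace. Every name in it (lock_held, slow_flush, attach_and_defer,
return_to_user) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for "cgroup_mutex is held" (illustrative only). */
static bool lock_held;
/* Set if the slow flush ever runs while the lock is held. */
static bool flushed_under_lock;
/* Number of times the modelled flush has run. */
static int flush_count;

/* Stand-in for flush_workqueue(): records whether it ran under the lock. */
static void slow_flush(void)
{
	if (lock_held)
		flushed_under_lock = true;
	flush_count++;
}

/* One-slot stand-in for the per-task deferred work list. */
static void (*pending_flush)(void);

/* Attach path: do the locked work, but only queue the flush. */
static void attach_and_defer(void)
{
	lock_held = true;
	/* ... migration work would be queued here ... */
	pending_flush = slow_flush;	/* defer instead of calling slow_flush() */
	lock_held = false;
}

/* Models the return-to-userspace point where deferred work runs. */
static void return_to_user(void)
{
	if (pending_flush) {
		void (*fn)(void) = pending_flush;

		pending_flush = NULL;
		fn();
	}
}
```

Calling attach_and_defer() and then return_to_user() runs the flush exactly
once, and only after the lock flag has been cleared. The caller still does
not get control back until the flush is done, which is the property the
cover letter calls synchronicity.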

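#!/bin/bash
# Reproducer. Needs NUMA nodes 0 and 1. Run as root: the plain
# redirections below are not covered by sudo.

sudo mkdir -p /sys/fs/cgroup/test
sudo mkdir -p /sys/fs/cgroup/test1
sudo mkdir -p /sys/fs/cgroup/test2

echo 0 > /sys/fs/cgroup/test1/cpuset.mems
echo 1 > /sys/fs/cgroup/test2/cpuset.mems

# Ten background subshells keep moving themselves between test1 and
# test2, so there is always a writer on cgroup.procs.
for i in {1..10}; do
    (
        pid=$BASHPID

        while true; do
            echo "Add $pid to test1"
            echo "$pid" | sudo tee /sys/fs/cgroup/test1/cgroup.procs >/dev/null

            sleep 5

            echo "Add $pid to test2"
            echo "$pid" | sudo tee /sys/fs/cgroup/test2/cgroup.procs >/dev/null
        done
    ) &
done

# Bind "test" to node 0, move this shell into it, and fill it with memory.
echo 0 > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs

stress --vm 100 --vm-bytes 2048M --vm-keep &

sleep 30

# Switching "test" to node 1 triggers the large mm migration.
echo "begin change cpuset.mems"
echo 1 > /sys/fs/cgroup/test/cpuset.mems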

Chuyi Zhou (3):
  cpuset: Don't always flush cpuset_migrate_mm_wq in
    cpuset_write_resmask
  cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
  cgroup: Remove unused cgroup_subsys::post_attach

 include/linux/cgroup-defs.h |  1 -
 kernel/cgroup/cgroup.c      |  4 ----
 kernel/cgroup/cpuset.c      | 30 +++++++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.20.1



* [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
  2025-09-04  7:45 [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
@ 2025-09-04  7:45 ` Chuyi Zhou
  2025-09-04 14:30   ` Michal Koutný
                     ` (2 more replies)
  2025-09-04  7:45 ` [PATCH 2/3] cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 11+ messages in thread
From: Chuyi Zhou @ 2025-09-04  7:45 UTC (permalink / raw)
  To: tj, mkoutny, hannes, longman; +Cc: linux-kernel, Chuyi Zhou

There is no need to always wait for cpuset_migrate_mm_wq to be flushed in
cpuset_write_resmask(), because modifying cpuset.cpus or
cpuset.cpus.exclusive does not trigger mm migration. Run flush_workqueue()
only when cpuset.mems is modified.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/cgroup/cpuset.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 27adb04df675d..3d8492581c8c4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3256,7 +3256,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 out_unlock:
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
-	flush_workqueue(cpuset_migrate_mm_wq);
+	if (of_cft(of)->private == FILE_MEMLIST)
+		flush_workqueue(cpuset_migrate_mm_wq);
 	return retval ?: nbytes;
 }
 
-- 
2.20.1



* [PATCH 2/3] cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
  2025-09-04  7:45 [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
  2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
@ 2025-09-04  7:45 ` Chuyi Zhou
  2025-09-04 15:14   ` Waiman Long
  2025-09-04  7:45 ` [PATCH 3/3] cgroup: Remove unused cgroup_subsys::post_attach Chuyi Zhou
  2025-09-04 17:27 ` [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Tejun Heo
  3 siblings, 1 reply; 11+ messages in thread
From: Chuyi Zhou @ 2025-09-04  7:45 UTC (permalink / raw)
  To: tj, mkoutny, hannes, longman; +Cc: linux-kernel, Chuyi Zhou

Currently, the cpuset attach path (cpuset_attach() followed by
cpuset_post_attach()) has to wait synchronously for flush_workqueue() to
complete. How long flushing cpuset_migrate_mm_wq takes depends on how much
mm migration cpusets have initiated at that moment. When cpuset.mems is
modified for a cgroup that occupies a large amount of memory, it can
trigger extensive mm migration, causing the attach path to block in
flush_workqueue() for an extended period. This is dangerous because the
attach path runs inside the cgroup_mutex critical section, so a long flush
can end up blocking every cgroup-related operation in the system.

This patch defers the flush_workqueue() call until the task returns to
userspace, using task_work as originally proposed by Tejun [1], so that the
flush happens after cgroup_mutex is dropped. The operation stays
synchronous for the caller without holding up anyone else.

[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883

Originally-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/cgroup/cpuset.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)
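
The lifetime rules for the heap-allocated callback in this patch can be
modelled in userspace. This is a hedged sketch only: cb_head,
model_task_work_add(), schedule_flush() and run_pending() are hypothetical
stand-ins for struct callback_head, task_work_add(),
schedule_flush_migrate_mm() and the return-to-userspace hook. Unlike the
kernel function, which returns void, the model returns a bool so the two
paths on which no flush gets queued are observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace stand-in for struct callback_head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *);
};

static struct cb_head *pending;	/* models the current task's work list */
static bool task_exiting;	/* when true, adding work is refused */
static int flushes;		/* how many times the modelled flush ran */

/* Models task_work_add(): fails once the task can no longer run work. */
static int model_task_work_add(struct cb_head *cb)
{
	if (task_exiting)
		return -1;
	cb->next = pending;
	pending = cb;
	return 0;
}

/* Once queued, the callback owns its head and frees it after the flush. */
static void flush_fn(struct cb_head *head)
{
	flushes++;	/* flush_workqueue() would go here */
	free(head);
}

/* Mirrors the shape of schedule_flush_migrate_mm(); false means not queued. */
static bool schedule_flush(bool alloc_fails)
{
	struct cb_head *cb = alloc_fails ? NULL : calloc(1, sizeof(*cb));

	if (!cb)
		return false;	/* allocation failed: the flush is skipped */
	cb->func = flush_fn;
	if (model_task_work_add(cb)) {
		free(cb);	/* never queued, so the caller still owns it */
		return false;
	}
	return true;
}

/* Models the return-to-userspace point: run and drain queued callbacks. */
static void run_pending(void)
{
	while (pending) {
		struct cb_head *cb = pending;

		pending = cb->next;
		cb->func(cb);
	}
}
```

In this model, as in the patch, an allocation failure or a refused add
means no flush is queued at all, so the write returns without waiting for
outstanding migrations.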

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3d8492581c8c4..ceb467079e41f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -40,6 +40,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/wait.h>
 #include <linux/workqueue.h>
+#include <linux/task_work.h>
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -2582,9 +2583,24 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 	}
 }
 
-static void cpuset_post_attach(void)
+static void flush_migrate_mm_task_workfn(struct callback_head *head)
 {
 	flush_workqueue(cpuset_migrate_mm_wq);
+	kfree(head);
+}
+
+static void schedule_flush_migrate_mm(void)
+{
+	struct callback_head *flush_cb;
+
+	flush_cb = kzalloc(sizeof(struct callback_head), GFP_KERNEL);
+	if (!flush_cb)
+		return;
+
+	init_task_work(flush_cb, flush_migrate_mm_task_workfn);
+
+	if (task_work_add(current, flush_cb, TWA_RESUME))
+		kfree(flush_cb);
 }
 
 /*
@@ -3141,6 +3157,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
 	bool cpus_updated, mems_updated;
+	bool queue_task_work = false;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
@@ -3191,15 +3208,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 			 * @old_mems_allowed is the right nodesets that we
 			 * migrate mm from.
 			 */
-			if (is_memory_migrate(cs))
+			if (is_memory_migrate(cs)) {
 				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
 						  &cpuset_attach_nodemask_to);
-			else
+				queue_task_work = true;
+			} else
 				mmput(mm);
 		}
 	}
 
 out:
+	if (queue_task_work)
+		schedule_flush_migrate_mm();
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
 
 	if (cs->nr_migrate_dl_tasks) {
@@ -3257,7 +3277,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 	if (of_cft(of)->private == FILE_MEMLIST)
-		flush_workqueue(cpuset_migrate_mm_wq);
+		schedule_flush_migrate_mm();
 	return retval ?: nbytes;
 }
 
@@ -3725,7 +3745,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.can_attach	= cpuset_can_attach,
 	.cancel_attach	= cpuset_cancel_attach,
 	.attach		= cpuset_attach,
-	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.can_fork	= cpuset_can_fork,
 	.cancel_fork	= cpuset_cancel_fork,
-- 
2.20.1



* [PATCH 3/3] cgroup: Remove unused cgroup_subsys::post_attach
  2025-09-04  7:45 [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
  2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
  2025-09-04  7:45 ` [PATCH 2/3] cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
@ 2025-09-04  7:45 ` Chuyi Zhou
  2025-09-04 15:17   ` Waiman Long
  2025-09-04 17:27 ` [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Tejun Heo
  3 siblings, 1 reply; 11+ messages in thread
From: Chuyi Zhou @ 2025-09-04  7:45 UTC (permalink / raw)
  To: tj, mkoutny, hannes, longman; +Cc: linux-kernel, Chuyi Zhou

The cgroup_subsys::post_attach callback was introduced in commit
5cf1cacb49ae ("cgroup, cpuset: replace cpuset_post_attach_flush() with
cgroup_subsys->post_attach callback"), and only cpuset used it, to wait
for mm migration to complete at the end of __cgroup_procs_write(). Since
the previous patch defers the flush until returning to userspace, nothing
uses this callback any more. Remove it from cgroup_subsys.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 include/linux/cgroup-defs.h | 1 -
 kernel/cgroup/cgroup.c      | 4 ----
 2 files changed, 5 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b93a64115fe9..432abdfdb2593 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -746,7 +746,6 @@ struct cgroup_subsys {
 	int (*can_attach)(struct cgroup_taskset *tset);
 	void (*cancel_attach)(struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_taskset *tset);
-	void (*post_attach)(void);
 	int (*can_fork)(struct task_struct *task,
 			struct css_set *cset);
 	void (*cancel_fork)(struct task_struct *task, struct css_set *cset);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb7..75819bb2f1148 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3033,10 +3033,6 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
 	put_task_struct(task);
 
 	cgroup_attach_unlock(threadgroup_locked);
-
-	for_each_subsys(ss, ssid)
-		if (ss->post_attach)
-			ss->post_attach();
 }
 
 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
-- 
2.20.1



* Re: [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
  2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
@ 2025-09-04 14:30   ` Michal Koutný
  2025-09-04 15:12   ` Waiman Long
  2025-09-04 17:15   ` Tejun Heo
  2 siblings, 0 replies; 11+ messages in thread
From: Michal Koutný @ 2025-09-04 14:30 UTC (permalink / raw)
  To: Chuyi Zhou; +Cc: tj, hannes, longman, linux-kernel


On Thu, Sep 04, 2025 at 03:45:03PM +0800, Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> It is unnecessary to always wait for the flush operation of
> cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying
> cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The
> flush_workqueue can be executed only when cpuset.mems is modified.
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/cgroup/cpuset.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reasonable and AFAICT correct optimization.

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
  2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
  2025-09-04 14:30   ` Michal Koutný
@ 2025-09-04 15:12   ` Waiman Long
  2025-09-04 17:15   ` Tejun Heo
  2 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2025-09-04 15:12 UTC (permalink / raw)
  To: Chuyi Zhou, tj, mkoutny, hannes; +Cc: linux-kernel

On 9/4/25 3:45 AM, Chuyi Zhou wrote:
> It is unnecessary to always wait for the flush operation of
> cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying
> cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The
> flush_workqueue can be executed only when cpuset.mems is modified.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>   kernel/cgroup/cpuset.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 27adb04df675d..3d8492581c8c4 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3256,7 +3256,8 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   out_unlock:
>   	mutex_unlock(&cpuset_mutex);
>   	cpus_read_unlock();
> -	flush_workqueue(cpuset_migrate_mm_wq);
> +	if (of_cft(of)->private == FILE_MEMLIST)
> +		flush_workqueue(cpuset_migrate_mm_wq);
>   	return retval ?: nbytes;
>   }
>   

LGTM

Reviewed-by:  Waiman Long <longman@redhat.com>



* Re: [PATCH 2/3] cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
  2025-09-04  7:45 ` [PATCH 2/3] cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
@ 2025-09-04 15:14   ` Waiman Long
  0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2025-09-04 15:14 UTC (permalink / raw)
  To: Chuyi Zhou, tj, mkoutny, hannes; +Cc: linux-kernel

On 9/4/25 3:45 AM, Chuyi Zhou wrote:
> Now in cpuset_attach(), we need to synchronously wait for
> flush_workqueue to complete. The execution time of flushing
> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
> amount of memory is modified, it may trigger extensive mm migration,
> causing cpuset_attach() to block on flush_workqueue for an extended period.
> This could be dangerous because cpuset_attach() is within the critical
> section of cgroup_mutex, which may ultimately cause all cgroup-related
> operations in the system to be blocked.
>
> This patch attempts to defer the flush_workqueue() operation until
> returning to userspace using the task_work which is originally proposed by
> tejun[1], so that flush happens after cgroup_mutex is dropped. That way we
> maintain the operation synchronicity while avoiding bothering anyone else.
>
> [1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883
>
> Originally-by: tejun heo <tj@kernel.org>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>   kernel/cgroup/cpuset.c | 29 ++++++++++++++++++++++++-----
>   1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3d8492581c8c4..ceb467079e41f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -40,6 +40,7 @@
>   #include <linux/sched/isolation.h>
>   #include <linux/wait.h>
>   #include <linux/workqueue.h>
> +#include <linux/task_work.h>
>   
>   DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
>   DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
> @@ -2582,9 +2583,24 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   	}
>   }
>   
> -static void cpuset_post_attach(void)
> +static void flush_migrate_mm_task_workfn(struct callback_head *head)
>   {
>   	flush_workqueue(cpuset_migrate_mm_wq);
> +	kfree(head);
> +}
> +
> +static void schedule_flush_migrate_mm(void)
> +{
> +	struct callback_head *flush_cb;
> +
> +	flush_cb = kzalloc(sizeof(struct callback_head), GFP_KERNEL);
> +	if (!flush_cb)
> +		return;
> +
> +	init_task_work(flush_cb, flush_migrate_mm_task_workfn);
> +
> +	if (task_work_add(current, flush_cb, TWA_RESUME))
> +		kfree(flush_cb);
>   }
>   
>   /*
> @@ -3141,6 +3157,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   	struct cpuset *cs;
>   	struct cpuset *oldcs = cpuset_attach_old_cs;
>   	bool cpus_updated, mems_updated;
> +	bool queue_task_work = false;
>   
>   	cgroup_taskset_first(tset, &css);
>   	cs = css_cs(css);
> @@ -3191,15 +3208,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
>   			 * @old_mems_allowed is the right nodesets that we
>   			 * migrate mm from.
>   			 */
> -			if (is_memory_migrate(cs))
> +			if (is_memory_migrate(cs)) {
>   				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
>   						  &cpuset_attach_nodemask_to);
> -			else
> +				queue_task_work = true;
> +			} else
>   				mmput(mm);
>   		}
>   	}
>   
>   out:
> +	if (queue_task_work)
> +		schedule_flush_migrate_mm();
>   	cs->old_mems_allowed = cpuset_attach_nodemask_to;
>   
>   	if (cs->nr_migrate_dl_tasks) {
> @@ -3257,7 +3277,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   	mutex_unlock(&cpuset_mutex);
>   	cpus_read_unlock();
>   	if (of_cft(of)->private == FILE_MEMLIST)
> -		flush_workqueue(cpuset_migrate_mm_wq);
> +		schedule_flush_migrate_mm();
>   	return retval ?: nbytes;
>   }
>   
> @@ -3725,7 +3745,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
>   	.can_attach	= cpuset_can_attach,
>   	.cancel_attach	= cpuset_cancel_attach,
>   	.attach		= cpuset_attach,
> -	.post_attach	= cpuset_post_attach,
>   	.bind		= cpuset_bind,
>   	.can_fork	= cpuset_can_fork,
>   	.cancel_fork	= cpuset_cancel_fork,
Reviewed-by:  Waiman Long <longman@redhat.com>



* Re: [PATCH 3/3] cgroup: Remove unused cgroup_subsys::post_attach
  2025-09-04  7:45 ` [PATCH 3/3] cgroup: Remove unused cgroup_subsys::post_attach Chuyi Zhou
@ 2025-09-04 15:17   ` Waiman Long
  0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2025-09-04 15:17 UTC (permalink / raw)
  To: Chuyi Zhou, tj, mkoutny, hannes; +Cc: linux-kernel

On 9/4/25 3:45 AM, Chuyi Zhou wrote:
> cgroup_subsys::post_attach callback was introduced in commit 5cf1cacb49ae
> ("cgroup, cpuset: replace cpuset_post_attach_flush() with
> cgroup_subsys->post_attach callback") and only cpuset would use this
> callback to wait for the mm migration to complete at the end of
> __cgroup_procs_write(). Since the previous patch defer the flush operation
> until returning to userspace, no one use this callback now. Remove this
> callback from cgroup_subsys.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>   include/linux/cgroup-defs.h | 1 -
>   kernel/cgroup/cgroup.c      | 4 ----
>   2 files changed, 5 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6b93a64115fe9..432abdfdb2593 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -746,7 +746,6 @@ struct cgroup_subsys {
>   	int (*can_attach)(struct cgroup_taskset *tset);
>   	void (*cancel_attach)(struct cgroup_taskset *tset);
>   	void (*attach)(struct cgroup_taskset *tset);
> -	void (*post_attach)(void);
>   	int (*can_fork)(struct task_struct *task,
>   			struct css_set *cset);
>   	void (*cancel_fork)(struct task_struct *task, struct css_set *cset);
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb7..75819bb2f1148 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3033,10 +3033,6 @@ void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked
>   	put_task_struct(task);
>   
>   	cgroup_attach_unlock(threadgroup_locked);
> -
> -	for_each_subsys(ss, ssid)
> -		if (ss->post_attach)
> -			ss->post_attach();
>   }
>   
>   static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)

Note that we may have to add it back in the future if a new use case 
comes up.

Acked-by:  Waiman Long <longman@redhat.com>



* Re: [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask
  2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
  2025-09-04 14:30   ` Michal Koutný
  2025-09-04 15:12   ` Waiman Long
@ 2025-09-04 17:15   ` Tejun Heo
  2 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2025-09-04 17:15 UTC (permalink / raw)
  To: Chuyi Zhou; +Cc: mkoutny, hannes, longman, linux-kernel

On Thu, Sep 04, 2025 at 03:45:03PM +0800, Chuyi Zhou wrote:
> It is unnecessary to always wait for the flush operation of
> cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying
> cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The
> flush_workqueue can be executed only when cpuset.mems is modified.
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>

Applied cgroup/for-6.18.

Thanks.

-- 
tejun


* Re: [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
  2025-09-04  7:45 [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
                   ` (2 preceding siblings ...)
  2025-09-04  7:45 ` [PATCH 3/3] cgroup: Remove unused cgroup_subsys::post_attach Chuyi Zhou
@ 2025-09-04 17:27 ` Tejun Heo
  2025-09-05  2:15   ` Chuyi Zhou
  3 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-09-04 17:27 UTC (permalink / raw)
  To: Chuyi Zhou; +Cc: mkoutny, hannes, longman, linux-kernel

On Thu, Sep 04, 2025 at 03:45:02PM +0800, Chuyi Zhou wrote:
> Now in cpuset_attach(), we need to synchronously wait for
> flush_workqueue to complete. The execution time of flushing
> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
> amount of memory is modified, it may trigger extensive mm migration,
> causing cpuset_attach() to block on flush_workqueue for an extended period.

Applied 1-3 to cgroup/for-6.18. There were a couple conflicts that I
resolved. It'd be great if you can take a look and make sure everything is
okay.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.18

Thanks.

-- 
tejun


* Re: [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
  2025-09-04 17:27 ` [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Tejun Heo
@ 2025-09-05  2:15   ` Chuyi Zhou
  0 siblings, 0 replies; 11+ messages in thread
From: Chuyi Zhou @ 2025-09-05  2:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mkoutny, hannes, longman, linux-kernel



On 2025/9/5 01:27, Tejun Heo wrote:
> On Thu, Sep 04, 2025 at 03:45:02PM +0800, Chuyi Zhou wrote:
>> Now in cpuset_attach(), we need to synchronously wait for
>> flush_workqueue to complete. The execution time of flushing
>> cpuset_migrate_mm_wq depends on the amount of mm migration initiated by
>> cpusets at that time. When the cpuset.mems of a cgroup occupying a large
>> amount of memory is modified, it may trigger extensive mm migration,
>> causing cpuset_attach() to block on flush_workqueue for an extended period.
> 
> Applied 1-3 to cgroup/for-6.18. There were a couple conflicts that I
> resolved. It'd be great if you can take a look and make sure everything is
> okay.
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.18
> 
> Thanks.
> 

Sorry, I forgot to rebase onto the latest cgroup branch before sending the
patchset. I have checked the result and everything is okay.

Thanks.



Thread overview: 11+ messages
2025-09-04  7:45 [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
2025-09-04  7:45 ` [PATCH 1/3] cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmask Chuyi Zhou
2025-09-04 14:30   ` Michal Koutný
2025-09-04 15:12   ` Waiman Long
2025-09-04 17:15   ` Tejun Heo
2025-09-04  7:45 ` [PATCH 2/3] cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work Chuyi Zhou
2025-09-04 15:14   ` Waiman Long
2025-09-04  7:45 ` [PATCH 3/3] cgroup: Remove unused cgroup_subsys::post_attach Chuyi Zhou
2025-09-04 15:17   ` Waiman Long
2025-09-04 17:27 ` [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work Tejun Heo
2025-09-05  2:15   ` Chuyi Zhou
