public inbox for cgroups@vger.kernel.org
From: Waiman Long <longman@redhat.com>
To: Tejun Heo <tj@kernel.org>, Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: cgroups@vger.kernel.org, hughd@google.com,
	wuyun.abel@bytedance.com, hezhongkun.hzk@bytedance.com,
	chenying.kernel@bytedance.com, zhanghaoyu.zhy@bytedance.com
Subject: Re: [problem] Hung task caused by memory migration when cpuset.mems changes
Date: Wed, 27 Mar 2024 13:14:49 -0400	[thread overview]
Message-ID: <d8e8b000-7d09-4747-82ec-bf99a73607ee@redhat.com> (raw)
In-Reply-To: <ZgMFPMjZRZCsq9Q-@slm.duckdns.org>

On 3/26/24 13:26, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 25, 2024 at 10:46:09PM +0800, Chuyi Zhou wrote:
>> In our production environment, we have observed several cases of hung tasks
>> blocked on cgroup_mutex. The underlying cause is that when a user modifies
>> cpuset.mems, the memory migration is performed from a workqueue, and the
>> duration of that work depends on the memory size of the workload, so it
>> can take a significant amount of time.
>>
>> In the __cgroup_procs_write path, there is a flush_workqueue call that
>> waits for the migration to complete while holding cgroup_mutex. As a
>> result, most cgroup-related operations can end up blocking for a long
>> time.
>>
>> We have noticed the commit "cgroup/cpuset: Enable memory migration for
>> cpuset v2"[1]. This commit enforces memory migration when modifying the
>> cpuset. Furthermore, in cgroup v2, there is no option available for
>> users to disable CS_MEMORY_MIGRATE.
>>
>> In our scenario, we do need to perform memory migration when cpuset.mems
>> changes, while ensuring that other tasks are not blocked on cgroup_mutex
>> for an extended period of time.
>>
>> One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
>> migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
>> trigger memory migration, and we can manually perform memory migration
>> using migrate_pages()/move_pages() syscalls.
>>
>> Another solution is to use a lazy approach for memory migration[2]. In
>> this way we only walk through all the pages and set them to protnone,
>> and the NUMA faults triggered by later accesses handle the actual
>> movement. That would significantly reduce the time spent in
>> cpuset_migrate_mm_workfn. But MPOL_MF_LAZY was disabled by commit
>> 2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code")
> One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
> task_work so that it happens after cpuset mutex is dropped. That way we
> maintain the operation synchronicity for the issuer while avoiding bothering
> anyone else.
I think it is a good idea to use task_work to wait for the mm migration 
to finish before returning to user space.
>
> Can you see whether the following patch fixes the issue for you? Thanks.
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ba36c073304a..8a8bd3f157ab 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -42,6 +42,7 @@
>   #include <linux/spinlock.h>
>   #include <linux/oom.h>
>   #include <linux/sched/isolation.h>
> +#include <linux/task_work.h>
>   #include <linux/cgroup.h>
>   #include <linux/wait.h>
>   #include <linux/workqueue.h>
> @@ -2696,6 +2697,26 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)
>   	kfree(mwork);
>   }
>   
> +static void flush_migrate_mm_task_workfn(struct callback_head *head)
> +{
> +	flush_workqueue(cpuset_migrate_mm_wq);
> +}
> +
> +static int schedule_flush_migrate_mm(void)
> +{
> +	struct callback_head *flush_cb;
> +
> +	flush_cb = kzalloc(sizeof(*flush_cb), GFP_KERNEL);
> +	if (!flush_cb)
> +		return -ENOMEM;
> +
> +	flush_cb->func = flush_migrate_mm_task_workfn;
> +	if (task_work_add(current, flush_cb, TWA_RESUME))
> +		kfree(flush_cb);
> +
> +	return 0;
> +}
> +
>   static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   							const nodemask_t *to)
>   {
> @@ -2718,11 +2739,6 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>   	}
>   }
>   
> -static void cpuset_post_attach(void)
> -{
> -	flush_workqueue(cpuset_migrate_mm_wq);
> -}
> -
>   /*
>    * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
>    * @tsk: the task to change
> @@ -3276,6 +3292,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
>   	bool cpus_updated, mems_updated;
>   	int ret;
>   
> +	ret = schedule_flush_migrate_mm();
> +	if (ret)
> +		return ret;
> +

It may be too early to schedule the task_work at cpuset_can_attach(), as 
no mm migration may happen at all. My suggestion is to do it in 
cpuset_attach(), once at least one cpuset_migrate_mm() call has been 
issued.

Cheers,
Longman

>   	/* used later by cpuset_attach() */
>   	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
>   	oldcs = cpuset_attach_old_cs;
> @@ -3584,7 +3604,11 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   {
>   	struct cpuset *cs = css_cs(of_css(of));
>   	struct cpuset *trialcs;
> -	int retval = -ENODEV;
> +	int retval;
> +
> +	retval = schedule_flush_migrate_mm();
> +	if (retval)
> +		return retval;
>   
>   	buf = strstrip(buf);
>   
> @@ -3613,8 +3637,10 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   
>   	cpus_read_lock();
>   	mutex_lock(&cpuset_mutex);
> -	if (!is_cpuset_online(cs))
> +	if (!is_cpuset_online(cs)) {
> +		retval = -ENODEV;
>   		goto out_unlock;
> +	}
>   
>   	trialcs = alloc_trial_cpuset(cs);
>   	if (!trialcs) {
> @@ -3643,7 +3669,6 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>   	cpus_read_unlock();
>   	kernfs_unbreak_active_protection(of->kn);
>   	css_put(&cs->css);
> -	flush_workqueue(cpuset_migrate_mm_wq);
>   	return retval ?: nbytes;
>   }
>   
> @@ -4283,7 +4308,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
>   	.can_attach	= cpuset_can_attach,
>   	.cancel_attach	= cpuset_cancel_attach,
>   	.attach		= cpuset_attach,
> -	.post_attach	= cpuset_post_attach,
>   	.bind		= cpuset_bind,
>   	.can_fork	= cpuset_can_fork,
>   	.cancel_fork	= cpuset_cancel_fork,
>


Thread overview: 8+ messages
2024-03-25 14:46 [problem] Hung task caused by memory migration when cpuset.mems changes Chuyi Zhou
2024-03-26 17:26 ` Tejun Heo
2024-03-27 14:07   ` Chuyi Zhou
2024-03-27 16:13     ` Tejun Heo
2024-03-27 17:14   ` Waiman Long [this message]
2024-03-27 21:43     ` Tejun Heo
2024-03-28  7:53   ` Abel Wu
2024-03-28 17:19     ` Tejun Heo
