From: Waiman Long <longman@redhat.com>
To: Tejun Heo <tj@kernel.org>, Chuyi Zhou <zhouchuyi@bytedance.com>
Cc: cgroups@vger.kernel.org, hughd@google.com,
wuyun.abel@bytedance.com, hezhongkun.hzk@bytedance.com,
chenying.kernel@bytedance.com, zhanghaoyu.zhy@bytedance.com
Subject: Re: [problem] Hung task caused by memory migration when cpuset.mems changes
Date: Wed, 27 Mar 2024 13:14:49 -0400
Message-ID: <d8e8b000-7d09-4747-82ec-bf99a73607ee@redhat.com>
In-Reply-To: <ZgMFPMjZRZCsq9Q-@slm.duckdns.org>
On 3/26/24 13:26, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 25, 2024 at 10:46:09PM +0800, Chuyi Zhou wrote:
>> In our production environment, we have observed several cases of hung tasks
>> blocked on cgroup_mutex. The underlying cause is that when a user modifies
>> cpuset.mems, the memory migration is performed in a workqueue. The duration
>> of that work depends on the memory size of the workloads and can consume a
>> significant amount of time.
>>
>> In the __cgroup_procs_write path, there is a flush_workqueue() call that
>> waits for the migration to complete while holding cgroup_mutex.
>> As a result, most cgroup-related operations can end up blocked behind it
>> for an extended period.
>>
>> We have noticed the commit "cgroup/cpuset: Enable memory migration for
>> cpuset v2"[1]. This commit enforces memory migration when modifying the
>> cpuset. Furthermore, in cgroup v2, there is no option available for
>> users to disable CS_MEMORY_MIGRATE.
>>
>> In our scenario, we do need to perform memory migration when cpuset.mems
>> changes, while ensuring that other tasks are not blocked on cgroup_mutex
>> for an extended period of time.
>>
>> One feasible approach is to revert the commit "cgroup/cpuset: Enable memory
>> migration for cpuset v2"[1]. This way, modifying cpuset.mems will not
>> trigger memory migration, and we can manually perform memory migration
>> using migrate_pages()/move_pages() syscalls.
>>
>> Another solution is to use a lazy approach for memory migration[2]. In
>> this way we would only walk through the pages and mark them PROT_NONE,
>> and the NUMA faults triggered by later accesses would handle the actual
>> movement. That would significantly reduce the time spent in
>> cpuset_migrate_mm_workfn. But MPOL_MF_LAZY was disabled by commit
>> 2cafb582173f ("mempolicy: remove confusing MPOL_MF_LAZY dead code").
> One approach we can take is pushing the cpuset_migrate_mm_wq flushing to
> task_work so that it happens after cpuset mutex is dropped. That way we
> maintain the operation synchronicity for the issuer while avoiding bothering
> anyone else.
I think it is a good idea to use a task_work callback to wait for the mm
migration to finish before returning to user space.
>
> Can you see whether the following patch fixes the issue for you? Thanks.
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ba36c073304a..8a8bd3f157ab 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -42,6 +42,7 @@
> #include <linux/spinlock.h>
> #include <linux/oom.h>
> #include <linux/sched/isolation.h>
> +#include <linux/task_work.h>
> #include <linux/cgroup.h>
> #include <linux/wait.h>
> #include <linux/workqueue.h>
> @@ -2696,6 +2697,26 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)
> kfree(mwork);
> }
>
> +static void flush_migrate_mm_task_workfn(struct callback_head *head)
> +{
> + flush_workqueue(cpuset_migrate_mm_wq);
> +}
> +
> +static int schedule_flush_migrate_mm(void)
> +{
> + struct callback_head *flush_cb;
> +
> + flush_cb = kzalloc(sizeof(*flush_cb), GFP_KERNEL);
> + if (!flush_cb)
> + return -ENOMEM;
> +
> + flush_cb->func = flush_migrate_mm_task_workfn;
> + if (task_work_add(current, flush_cb, TWA_RESUME))
> + kfree(flush_cb);
> +
> + return 0;
> +}
> +
> static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> const nodemask_t *to)
> {
> @@ -2718,11 +2739,6 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> }
> }
>
> -static void cpuset_post_attach(void)
> -{
> - flush_workqueue(cpuset_migrate_mm_wq);
> -}
> -
> /*
> * cpuset_change_task_nodemask - change task's mems_allowed and mempolicy
> * @tsk: the task to change
> @@ -3276,6 +3292,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> bool cpus_updated, mems_updated;
> int ret;
>
> + ret = schedule_flush_migrate_mm();
> + if (ret)
> + return ret;
> +
It may be too early to schedule the task_work in cpuset_can_attach(), as
no mm migration may happen at all. My suggestion is to do it in
cpuset_attach(), once at least one cpuset_migrate_mm() call has actually
been issued.
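
Relative to the patch above, that placement would look something like the
sketch below. The queued_migration flag is hypothetical, and note that
cpuset_attach() returns void, so the -ENOMEM from
schedule_flush_migrate_mm() could not simply be propagated from there:

```diff
 static void cpuset_attach(struct cgroup_taskset *tset)
 {
 	...
 	if (is_memory_migrate(cs))
 		cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
 				  &cpuset_attach_nodemask_to);
+	/*
+	 * Hypothetical: only arm the task_work once a migration work
+	 * item has actually been queued.  Since cpuset_attach() cannot
+	 * return an error, the allocation failure in
+	 * schedule_flush_migrate_mm() would need different handling.
+	 */
+	if (queued_migration)
+		schedule_flush_migrate_mm();
 	...
 }
```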
Cheers,
Longman
> /* used later by cpuset_attach() */
> cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
> oldcs = cpuset_attach_old_cs;
> @@ -3584,7 +3604,11 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> {
> struct cpuset *cs = css_cs(of_css(of));
> struct cpuset *trialcs;
> - int retval = -ENODEV;
> + int retval;
> +
> + retval = schedule_flush_migrate_mm();
> + if (retval)
> + return retval;
>
> buf = strstrip(buf);
>
> @@ -3613,8 +3637,10 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
>
> cpus_read_lock();
> mutex_lock(&cpuset_mutex);
> - if (!is_cpuset_online(cs))
> + if (!is_cpuset_online(cs)) {
> + retval = -ENODEV;
> goto out_unlock;
> + }
>
> trialcs = alloc_trial_cpuset(cs);
> if (!trialcs) {
> @@ -3643,7 +3669,6 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> cpus_read_unlock();
> kernfs_unbreak_active_protection(of->kn);
> css_put(&cs->css);
> - flush_workqueue(cpuset_migrate_mm_wq);
> return retval ?: nbytes;
> }
>
> @@ -4283,7 +4308,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
> .can_attach = cpuset_can_attach,
> .cancel_attach = cpuset_cancel_attach,
> .attach = cpuset_attach,
> - .post_attach = cpuset_post_attach,
> .bind = cpuset_bind,
> .can_fork = cpuset_can_fork,
> .cancel_fork = cpuset_cancel_fork,
>
Thread overview: 8 messages
2024-03-25 14:46 [problem] Hung task caused by memory migration when cpuset.mems changes Chuyi Zhou
2024-03-26 17:26 ` Tejun Heo
2024-03-27 14:07 ` Chuyi Zhou
2024-03-27 16:13 ` Tejun Heo
2024-03-27 17:14 ` Waiman Long [this message]
2024-03-27 21:43 ` Tejun Heo
2024-03-28 7:53 ` Abel Wu
2024-03-28 17:19 ` Tejun Heo