* [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues
@ 2023-04-11 13:35 Waiman Long
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
v4:
- Add missing rcu_read_lock/unlock to cpuset_cancel_fork() in patch 3.
- Add patch 5 to reduce performance impact for the
non-CLONE_INTO_CGROUP case.
v3:
- Update patches 2 & 3 to put task_cs() call under rcu_read_lock().
v2:
- Drop v1 patch 3
- Add a new patch to fix an issue in cpuset_cancel_attach() and
another patch to add cpuset_can_fork() and cpuset_cacnel_fork()
methods.
The first patch in this series fixes a problem in
cpuset_cancel_attach(). Patches 2 and 3 fixes the CLONE_INTO_CGROUP
problem in cpuset. Patch 4 is a minor fix. The last patch is a
performance optimization patch for the non-CLONE_INTO_CGROUP case.
Waiman Long (5):
cgroup/cpuset: Wake up cpuset_attach_wq tasks in
cpuset_cancel_attach()
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for
top_cpuset
cgroup/cpuset: Optimize out unneeded
cpuset_can_fork/cpuset_cancel_fork calls
include/linux/cgroup-defs.h | 6 ++
kernel/cgroup/cgroup.c | 23 +++--
kernel/cgroup/cpuset.c | 167 +++++++++++++++++++++++++++++-------
3 files changed, 159 insertions(+), 37 deletions(-)
--
2.31.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2023-04-11 13:35 ` Waiman Long
2023-04-11 13:35 ` [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly Waiman Long
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
After a successful cpuset_can_attach() call which increments the
attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach()
will be called later. In cpuset_attach(), tasks in cpuset_attach_wq,
if present, will be woken up at the end. That is not the case in
cpuset_cancel_attach(). So missed wakeup is possible if the attach
operation is somehow cancelled. Fix that by doing the wakeup in
cpuset_cancel_attach() as well.
Fixes: e44193d39e8d ("cpuset: let hotplug propagation work wait for task attaching")
Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Michal Koutn√Ω <mkoutny-IBi9RG/b67k@public.gmane.org>
---
kernel/cgroup/cpuset.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..066689a7dcc3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2507,11 +2507,15 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
+ struct cpuset *cs;
cgroup_taskset_first(tset, &css);
+ cs = css_cs(css);
percpu_down_write(&cpuset_rwsem);
- css_cs(css)->attach_in_progress--;
+ cs->attach_in_progress--;
+ if (!cs->attach_in_progress)
+ wake_up(&cpuset_attach_wq);
percpu_up_write(&cpuset_rwsem);
}
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-11 13:35 ` [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() Waiman Long
@ 2023-04-11 13:35 ` Waiman Long
2023-04-11 13:36 ` [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset Waiman Long
2023-04-12 18:26 ` [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Tejun Heo
3 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
By default, the clone(2) syscall spawn a child process into the same
cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag
introduced by commit ef2c41cf38a7 ("clone3: allow spawning processes
into cgroups"), the child will be spawned into a different cgroup which
is somewhat similar to writing the child's tid into "cgroup.threads".
The current cpuset_fork() method does not properly handle the
CLONE_INTO_CGROUP case where the cpuset of the child may be different
from that of its parent. Update the cpuset_fork() method to treat the
CLONE_INTO_CGROUP case similar to cpuset_attach().
Since the newly cloned task has not been running yet, its actual
memory usage isn't known. So it is not necessary to make change to mm
in cpuset_fork().
Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Reported-by: Giuseppe Scrivano <gscrivan-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
kernel/cgroup/cpuset.c | 62 ++++++++++++++++++++++++++++--------------
1 file changed, 42 insertions(+), 20 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 066689a7dcc3..e954d5abb784 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2520,16 +2520,33 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
}
/*
- * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach()
+ * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach_task()
* but we can't allocate it dynamically there. Define it global and
* allocate from cpuset_init().
*/
static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_to;
+
+static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
+{
+ percpu_rwsem_assert_held(&cpuset_rwsem);
+
+ if (cs != &top_cpuset)
+ guarantee_online_cpus(task, cpus_attach);
+ else
+ cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
+ /*
+ * can_attach beforehand should guarantee that this doesn't
+ * fail. TODO: have a better way to handle failure here
+ */
+ WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+
+ cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
+ cpuset_update_task_spread_flags(cs, task);
+}
static void cpuset_attach(struct cgroup_taskset *tset)
{
- /* static buf protected by cpuset_rwsem */
- static nodemask_t cpuset_attach_nodemask_to;
struct task_struct *task;
struct task_struct *leader;
struct cgroup_subsys_state *css;
@@ -2560,20 +2577,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
- cgroup_taskset_for_each(task, css, tset) {
- if (cs != &top_cpuset)
- guarantee_online_cpus(task, cpus_attach);
- else
- cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
- /*
- * can_attach beforehand should guarantee that this doesn't
- * fail. TODO: have a better way to handle failure here
- */
- WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
-
- cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
- cpuset_update_task_spread_flags(cs, task);
- }
+ cgroup_taskset_for_each(task, css, tset)
+ cpuset_attach_task(cs, task);
/*
* Change mm for all threadgroup leaders. This is expensive and may
@@ -3271,11 +3276,28 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
*/
static void cpuset_fork(struct task_struct *task)
{
- if (task_css_is_root(task, cpuset_cgrp_id))
+ struct cpuset *cs;
+ bool same_cs;
+
+ rcu_read_lock();
+ cs = task_cs(task);
+ same_cs = (cs == task_cs(current));
+ rcu_read_unlock();
+
+ if (same_cs) {
+ if (cs == &top_cpuset)
+ return;
+
+ set_cpus_allowed_ptr(task, current->cpus_ptr);
+ task->mems_allowed = current->mems_allowed;
return;
+ }
- set_cpus_allowed_ptr(task, current->cpus_ptr);
- task->mems_allowed = current->mems_allowed;
+ /* CLONE_INTO_CGROUP */
+ percpu_down_write(&cpuset_rwsem);
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ cpuset_attach_task(cs, task);
+ percpu_up_write(&cpuset_rwsem);
}
struct cgroup_subsys cpuset_cgrp_subsys = {
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
2023-04-11 13:35 [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Waiman Long
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2023-04-11 13:35 ` Waiman Long
2023-04-11 13:36 ` [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls Waiman Long
2 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups, linux-kernel, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept
new tasks. It is too late to check that in cpuset_fork(). So we need
to add the cpuset_can_fork() and cpuset_cancel_fork() methods to
pre-check it before we can allow attachment to a different cpuset.
We also need to set the attach_in_progress flag to alert other code
that a new task is going to be added to the cpuset.
Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
Suggested-by: Michal Koutn√Ω <mkoutny@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 97 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 86 insertions(+), 11 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e954d5abb784..132cdae4f03c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2458,6 +2458,20 @@ static int fmeter_getrate(struct fmeter *fmp)
static struct cpuset *cpuset_attach_old_cs;
+/*
+ * Check to see if a cpuset can accept a new task
+ * For v1, cpus_allowed and mems_allowed can't be empty.
+ * For v2, effective_cpus can't be empty.
+ * Note that in v1, effective_cpus = cpus_allowed.
+ */
+static int cpuset_can_attach_check(struct cpuset *cs)
+{
+ if (cpumask_empty(cs->effective_cpus) ||
+ (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
+ return -ENOSPC;
+ return 0;
+}
+
/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
static int cpuset_can_attach(struct cgroup_taskset *tset)
{
@@ -2472,16 +2486,9 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
percpu_down_write(&cpuset_rwsem);
- /* allow moving tasks into an empty cpuset if on default hierarchy */
- ret = -ENOSPC;
- if (!is_in_v2_mode() &&
- (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
- goto out_unlock;
-
- /*
- * Task cannot be moved to a cpuset with empty effective cpus.
- */
- if (cpumask_empty(cs->effective_cpus))
+ /* Check to see if task is allowed in the cpuset */
+ ret = cpuset_can_attach_check(cs);
+ if (ret)
goto out_unlock;
cgroup_taskset_for_each(task, css, tset) {
@@ -2498,7 +2505,6 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
* changes which zero cpus/mems_allowed.
*/
cs->attach_in_progress++;
- ret = 0;
out_unlock:
percpu_up_write(&cpuset_rwsem);
return ret;
@@ -3269,6 +3275,68 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
percpu_up_write(&cpuset_rwsem);
}
+/*
+ * In case the child is cloned into a cpuset different from its parent,
+ * additional checks are done to see if the move is allowed.
+ */
+static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
+{
+ struct cpuset *cs = css_cs(cset->subsys[cpuset_cgrp_id]);
+ bool same_cs;
+ int ret;
+
+ rcu_read_lock();
+ same_cs = (cs == task_cs(current));
+ rcu_read_unlock();
+
+ if (same_cs)
+ return 0;
+
+ lockdep_assert_held(&cgroup_mutex);
+ percpu_down_write(&cpuset_rwsem);
+
+ /* Check to see if task is allowed in the cpuset */
+ ret = cpuset_can_attach_check(cs);
+ if (ret)
+ goto out_unlock;
+
+ ret = task_can_attach(task, cs->effective_cpus);
+ if (ret)
+ goto out_unlock;
+
+ ret = security_task_setscheduler(task);
+ if (ret)
+ goto out_unlock;
+
+ /*
+ * Mark attach is in progress. This makes validate_change() fail
+ * changes which zero cpus/mems_allowed.
+ */
+ cs->attach_in_progress++;
+out_unlock:
+ percpu_up_write(&cpuset_rwsem);
+ return ret;
+}
+
+static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
+{
+ struct cpuset *cs = css_cs(cset->subsys[cpuset_cgrp_id]);
+ bool same_cs;
+
+ rcu_read_lock();
+ same_cs = (cs == task_cs(current));
+ rcu_read_unlock();
+
+ if (same_cs)
+ return;
+
+ percpu_down_write(&cpuset_rwsem);
+ cs->attach_in_progress--;
+ if (!cs->attach_in_progress)
+ wake_up(&cpuset_attach_wq);
+ percpu_up_write(&cpuset_rwsem);
+}
+
/*
* Make sure the new task conform to the current state of its parent,
* which could have been changed by cpuset just after it inherits the
@@ -3297,6 +3365,11 @@ static void cpuset_fork(struct task_struct *task)
percpu_down_write(&cpuset_rwsem);
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
cpuset_attach_task(cs, task);
+
+ cs->attach_in_progress--;
+ if (!cs->attach_in_progress)
+ wake_up(&cpuset_attach_wq);
+
percpu_up_write(&cpuset_rwsem);
}
@@ -3310,6 +3383,8 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.attach = cpuset_attach,
.post_attach = cpuset_post_attach,
.bind = cpuset_bind,
+ .can_fork = cpuset_can_fork,
+ .cancel_fork = cpuset_cancel_fork,
.fork = cpuset_fork,
.legacy_cftypes = legacy_files,
.dfl_cftypes = dfl_files,
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-11 13:35 ` [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() Waiman Long
2023-04-11 13:35 ` [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly Waiman Long
@ 2023-04-11 13:36 ` Waiman Long
2023-04-12 18:26 ` [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Tejun Heo
3 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:36 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
It is found that attaching a task to the top_cpuset does not currently
ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the
code is changed to fix that.
Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Michal Koutn√Ω <mkoutny-IBi9RG/b67k@public.gmane.org>
---
kernel/cgroup/cpuset.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 132cdae4f03c..e4ca2dd2b764 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2540,7 +2540,8 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
if (cs != &top_cpuset)
guarantee_online_cpus(task, cpus_attach);
else
- cpumask_copy(cpus_attach, task_cpu_possible_mask(task));
+ cpumask_andnot(cpus_attach, task_cpu_possible_mask(task),
+ cs->subparts_cpus);
/*
* can_attach beforehand should guarantee that this doesn't
* fail. TODO: have a better way to handle failure here
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls
2023-04-11 13:35 [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Waiman Long
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-11 13:35 ` [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods Waiman Long
@ 2023-04-11 13:36 ` Waiman Long
[not found] ` <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2 siblings, 1 reply; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:36 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups, linux-kernel, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls
are only needed when the CLONE_INTO_CGROUP flag is set which is not
likely. Adding an extra cpuset_can_fork() call does introduce a bit
of performance overhead in the fork/clone fastpath. To reduce this
performance overhead, introduce a new clone_into_cgroup_can_fork flag
into the cgroup_subsys structure. This flag, when set, will call the
can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag
is set.
The cpuset code is now modified to set this flag. The same cpuset
checking code in cpuset_can_fork() and cpuset_cancel_fork() will have
to stay as the cgroups can be different, but the cpusets may still be
the same. So the same check must be present in both cpuset_fork() and
cpuset_can_fork() to make sure that attach_in_progress is correctly set.
Signed-off-by: Waiman Long <longman@redhat.com>
---
include/linux/cgroup-defs.h | 6 ++++++
kernel/cgroup/cgroup.c | 23 ++++++++++++++++++-----
kernel/cgroup/cpuset.c | 1 +
3 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8a0d5466c7be..0087a47d80a2 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -701,6 +701,12 @@ struct cgroup_subsys {
*/
bool threaded:1;
+ /*
+ * If %true, the controller will call can_fork and cancel_fork
+ * methods only if CLONE_INTO_CGROUP flag is set.
+ */
+ bool clone_into_cgroup_can_fork:1;
+
/* the following two fields are initialized automatically during boot */
int id;
const char *name;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 83ea13f2ccb1..23701e959ef5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6517,6 +6517,10 @@ int cgroup_can_fork(struct task_struct *child, struct kernel_clone_args *kargs)
return ret;
do_each_subsys_mask(ss, i, have_canfork_callback) {
+ if (ss->clone_into_cgroup_can_fork &&
+ !(kargs->flags & CLONE_INTO_CGROUP))
+ continue;
+
ret = ss->can_fork(child, kargs->cset);
if (ret)
goto out_revert;
@@ -6528,8 +6532,12 @@ int cgroup_can_fork(struct task_struct *child, struct kernel_clone_args *kargs)
for_each_subsys(ss, j) {
if (j >= i)
break;
- if (ss->cancel_fork)
- ss->cancel_fork(child, kargs->cset);
+ if (!ss->cancel_fork ||
+ (ss->clone_into_cgroup_can_fork &&
+ !(kargs->flags & CLONE_INTO_CGROUP)))
+ continue;
+
+ ss->cancel_fork(child, kargs->cset);
}
cgroup_css_set_put_fork(kargs);
@@ -6552,9 +6560,14 @@ void cgroup_cancel_fork(struct task_struct *child,
struct cgroup_subsys *ss;
int i;
- for_each_subsys(ss, i)
- if (ss->cancel_fork)
- ss->cancel_fork(child, kargs->cset);
+ for_each_subsys(ss, i) {
+ if (!ss->cancel_fork ||
+ (ss->clone_into_cgroup_can_fork &&
+ !(kargs->flags & CLONE_INTO_CGROUP)))
+ continue;
+
+ ss->cancel_fork(child, kargs->cset);
+ }
cgroup_css_set_put_fork(kargs);
}
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e4ca2dd2b764..937ef4d60cd4 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3391,6 +3391,7 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.dfl_cftypes = dfl_files,
.early_init = true,
.threaded = true,
+ .clone_into_cgroup_can_fork = true,
};
/**
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (2 preceding siblings ...)
2023-04-11 13:36 ` [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset Waiman Long
@ 2023-04-12 18:26 ` Tejun Heo
[not found] ` <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
3 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2023-04-12 18:26 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Christian Brauner,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano
On Tue, Apr 11, 2023 at 09:35:56AM -0400, Waiman Long wrote:
> v4:
> - Add missing rcu_read_lock/unlock to cpuset_cancel_fork() in patch 3.
> - Add patch 5 to reduce performance impact for the
> non-CLONE_INTO_CGROUP case.
>
> v3:
> - Update patches 2 & 3 to put task_cs() call under rcu_read_lock().
>
> v2:
> - Drop v1 patch 3
> - Add a new patch to fix an issue in cpuset_cancel_attach() and
> another patch to add cpuset_can_fork() and cpuset_cacnel_fork()
> methods.
>
> The first patch in this series fixes a problem in
> cpuset_cancel_attach(). Patches 2 and 3 fixes the CLONE_INTO_CGROUP
> problem in cpuset. Patch 4 is a minor fix. The last patch is a
> performance optimization patch for the non-CLONE_INTO_CGROUP case.
Applied 1-4 to cgroup/for-6.3-fixes w/ stable cc'd. Given that the fixes are
a bit involved, the breakages have been there for quite a while and the
cpuset code has changed quite a bit, backporting might not be trivial tho.
Let's see how that goes.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls
[not found] ` <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2023-04-12 18:27 ` Tejun Heo
[not found] ` <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2023-04-12 18:27 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Christian Brauner,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano
On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote:
> The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls
> are only needed when the CLONE_INTO_CGROUP flag is set which is not
> likely. Adding an extra cpuset_can_fork() call does introduce a bit
> of performance overhead in the fork/clone fastpath. To reduce this
> performance overhead, introduce a new clone_into_cgroup_can_fork flag
> into the cgroup_subsys structure. This flag, when set, will call the
> can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag
> is set.
>
> The cpuset code is now modified to set this flag. The same cpuset
> checking code in cpuset_can_fork() and cpuset_cancel_fork() will have
> to stay as the cgroups can be different, but the cpusets may still be
> the same. So the same check must be present in both cpuset_fork() and
> cpuset_can_fork() to make sure that attach_in_progress is correctly set.
>
> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Waiman, I'm not necessarily against this optimization but can we at least
have some performance numbers to show that this is actually meaningful?
Given how heavy our fork path is, I'm not too sure this would show up in any
meaningful way.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues
[not found] ` <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
@ 2023-04-12 18:29 ` Waiman Long
0 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-12 18:29 UTC (permalink / raw)
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Christian Brauner,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano
On 4/12/23 14:26, Tejun Heo wrote:
> On Tue, Apr 11, 2023 at 09:35:56AM -0400, Waiman Long wrote:
>> v4:
>> - Add missing rcu_read_lock/unlock to cpuset_cancel_fork() in patch 3.
>> - Add patch 5 to reduce performance impact for the
>> non-CLONE_INTO_CGROUP case.
>>
>> v3:
>> - Update patches 2 & 3 to put task_cs() call under rcu_read_lock().
>>
>> v2:
>> - Drop v1 patch 3
>> - Add a new patch to fix an issue in cpuset_cancel_attach() and
>> another patch to add cpuset_can_fork() and cpuset_cacnel_fork()
>> methods.
>>
>> The first patch in this series fixes a problem in
>> cpuset_cancel_attach(). Patches 2 and 3 fixes the CLONE_INTO_CGROUP
>> problem in cpuset. Patch 4 is a minor fix. The last patch is a
>> performance optimization patch for the non-CLONE_INTO_CGROUP case.
> Applied 1-4 to cgroup/for-6.3-fixes w/ stable cc'd. Given that the fixes are
> a bit involved, the breakages have been there for quite a while and the
> cpuset code has changed quite a bit, backporting might not be trivial tho.
> Let's see how that goes.
I know stable backport won't be straight forward. I am planning to help
in the backporting effort. Thanks for taking these into your cgroup tree.
Cheers,
Longman
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls
[not found] ` <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
@ 2023-04-12 18:40 ` Waiman Long
[not found] ` <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Waiman Long @ 2023-04-12 18:40 UTC (permalink / raw)
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Christian Brauner,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano
On 4/12/23 14:27, Tejun Heo wrote:
> On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote:
>> The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls
>> are only needed when the CLONE_INTO_CGROUP flag is set which is not
>> likely. Adding an extra cpuset_can_fork() call does introduce a bit
>> of performance overhead in the fork/clone fastpath. To reduce this
>> performance overhead, introduce a new clone_into_cgroup_can_fork flag
>> into the cgroup_subsys structure. This flag, when set, will call the
>> can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag
>> is set.
>>
>> The cpuset code is now modified to set this flag. The same cpuset
>> checking code in cpuset_can_fork() and cpuset_cancel_fork() will have
>> to stay as the cgroups can be different, but the cpusets may still be
>> the same. So the same check must be present in both cpuset_fork() and
>> cpuset_can_fork() to make sure that attach_in_progress is correctly set.
>>
>> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Waiman, I'm not necessarily against this optimization but can we at least
> have some performance numbers to show that this is actually meaningful?
> Given how heavy our fork path is, I'm not too sure this would show up in any
> meaningful way.
That make sense to me. I am OK to leave it for now as it is an
optimization patch anyway.
BTW, another question that I have is about the cgroup_threadgroup_rwsem.
It is currently a percpu rwsem. Is it possible to change it into a
regular rwsem instead? It is causing quite a bit of latency for
workloads that require rather frequent changes to cgroups. I know we
have a "favordynmods" mount option to disable the percpu operation. This
will still be less performant than a normal rwsem. Of course the
downside is that the fork/exit fastpaths will be slowed down a bit.
Thanks,
Longman
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls
[not found] ` <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2023-04-12 19:17 ` Tejun Heo
[not found] ` <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2023-04-12 19:17 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Christian Brauner,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano
Hello,
On Wed, Apr 12, 2023 at 02:40:53PM -0400, Waiman Long wrote:
> On 4/12/23 14:27, Tejun Heo wrote:
> > On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote:
> > > The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls
> > > are only needed when the CLONE_INTO_CGROUP flag is set which is not
> > > likely. Adding an extra cpuset_can_fork() call does introduce a bit
> > > of performance overhead in the fork/clone fastpath. To reduce this
> > > performance overhead, introduce a new clone_into_cgroup_can_fork flag
> > > into the cgroup_subsys structure. This flag, when set, will call the
> > > can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag
> > > is set.
> > >
> > > The cpuset code is now modified to set this flag. The same cpuset
> > > checking code in cpuset_can_fork() and cpuset_cancel_fork() will have
> > > to stay as the cgroups can be different, but the cpusets may still be
> > > the same. So the same check must be present in both cpuset_fork() and
> > > cpuset_can_fork() to make sure that attach_in_progress is correctly set.
> > >
> > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Waiman, I'm not necessarily against this optimization but can we at least
> > have some performance numbers to show that this is actually meaningful?
> > Given how heavy our fork path is, I'm not too sure this would show up in any
> > meaningful way.
>
> That make sense to me. I am OK to leave it for now as it is an optimization
> patch anyway.
>
> BTW, another question that I have is about the cgroup_threadgroup_rwsem. It
> is currently a percpu rwsem. Is it possible to change it into a regular
> rwsem instead? It is causing quite a bit of latency for workloads that
> require rather frequent changes to cgroups. I know we have a "favordynmods"
> mount option to disable the percpu operation. This will still be less
> performant than a normal rwsem. Of course the downside is that the fork/exit
> fastpaths will be slowed down a bit.
I don't know. Maybe? A rwsem actually has a scalability factor in that the
more CPUs are forking, the more expensive the rwsem becomes, so it is a bit
more of a concern. Another factor is that in majority of use cases we're
almost completely bypassing write-locking percpu_rwsem, so it feel a bit sad
to convert it to a regular rwsem. So, if favordynmods is good enough, I'd
like to keep it that way.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls
[not found] ` <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
@ 2023-04-12 19:23 ` Waiman Long
0 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-12 19:23 UTC (permalink / raw)
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Christian Brauner,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano
On 4/12/23 15:17, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 12, 2023 at 02:40:53PM -0400, Waiman Long wrote:
>> On 4/12/23 14:27, Tejun Heo wrote:
>>> On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote:
>>>> The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls
>>>> are only needed when the CLONE_INTO_CGROUP flag is set which is not
>>>> likely. Adding an extra cpuset_can_fork() call does introduce a bit
>>>> of performance overhead in the fork/clone fastpath. To reduce this
>>>> performance overhead, introduce a new clone_into_cgroup_can_fork flag
>>>> into the cgroup_subsys structure. This flag, when set, will call the
>>>> can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag
>>>> is set.
>>>>
>>>> The cpuset code is now modified to set this flag. The same cpuset
>>>> checking code in cpuset_can_fork() and cpuset_cancel_fork() will have
>>>> to stay as the cgroups can be different, but the cpusets may still be
>>>> the same. So the same check must be present in both cpuset_fork() and
>>>> cpuset_can_fork() to make sure that attach_in_progress is correctly set.
>>>>
>>>> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> Waiman, I'm not necessarily against this optimization but can we at least
>>> have some performance numbers to show that this is actually meaningful?
>>> Given how heavy our fork path is, I'm not too sure this would show up in any
>>> meaningful way.
>> That make sense to me. I am OK to leave it for now as it is an optimization
>> patch anyway.
>>
>> BTW, another question that I have is about the cgroup_threadgroup_rwsem. It
>> is currently a percpu rwsem. Is it possible to change it into a regular
>> rwsem instead? It is causing quite a bit of latency for workloads that
>> require rather frequent changes to cgroups. I know we have a "favordynmods"
>> mount option to disable the percpu operation. This will still be less
>> performant than a normal rwsem. Of course the downside is that the fork/exit
>> fastpaths will be slowed down a bit.
> I don't know. Maybe? A rwsem actually has a scalability factor in that the
> more CPUs are forking, the more expensive the rwsem becomes, so it is a bit
> more of a concern. Another factor is that in majority of use cases we're
> almost completely bypassing write-locking percpu_rwsem, so it feel a bit sad
> to convert it to a regular rwsem. So, if favordynmods is good enough, I'd
> like to keep it that way.
It is just a thought that I have since Juri is in the process of
reverting the change of cpuset_mutex to cpuset_rwsem. Percpu rwsem can
be a bit problematic in PREEMPT_RT kernel since it does not support
proper priority inheritance though.
Cheers,
Longman
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-04-12 19:23 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-11 13:35 [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Waiman Long
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-11 13:35 ` [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() Waiman Long
2023-04-11 13:35 ` [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly Waiman Long
2023-04-11 13:36 ` [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset Waiman Long
2023-04-12 18:26 ` [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Tejun Heo
[not found] ` <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-04-12 18:29 ` Waiman Long
2023-04-11 13:35 ` [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods Waiman Long
2023-04-11 13:36 ` [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls Waiman Long
[not found] ` <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-12 18:27 ` Tejun Heo
[not found] ` <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-04-12 18:40 ` Waiman Long
[not found] ` <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-12 19:17 ` Tejun Heo
[not found] ` <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-04-12 19:23 ` Waiman Long
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox