* [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues
@ 2023-04-11 13:35 Waiman Long
2023-04-11 13:35 ` [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods Waiman Long
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann,
Michal Koutný, Giuseppe Scrivano, Waiman Long
v4:
- Add missing rcu_read_lock/unlock to cpuset_cancel_fork() in patch 3.
- Add patch 5 to reduce performance impact for the
non-CLONE_INTO_CGROUP case.
v3:
- Update patches 2 & 3 to put task_cs() call under rcu_read_lock().
v2:
- Drop v1 patch 3
- Add a new patch to fix an issue in cpuset_cancel_attach() and
another patch to add cpuset_can_fork() and cpuset_cacnel_fork()
methods.
The first patch in this series fixes a problem in
cpuset_cancel_attach(). Patches 2 and 3 fixes the CLONE_INTO_CGROUP
problem in cpuset. Patch 4 is a minor fix. The last patch is a
performance optimization patch for the non-CLONE_INTO_CGROUP case.
Waiman Long (5):
cgroup/cpuset: Wake up cpuset_attach_wq tasks in
cpuset_cancel_attach()
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for
top_cpuset
cgroup/cpuset: Optimize out unneeded
cpuset_can_fork/cpuset_cancel_fork calls
include/linux/cgroup-defs.h | 6 ++
kernel/cgroup/cgroup.c | 23 +++--
kernel/cgroup/cpuset.c | 167 +++++++++++++++++++++++++++++-------
3 files changed, 159 insertions(+), 37 deletions(-)
--
2.31.1
^ permalink raw reply [flat|nested] 12+ messages in thread* [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods 2023-04-11 13:35 [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Waiman Long @ 2023-04-11 13:35 ` Waiman Long [not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2023-04-11 13:36 ` [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls Waiman Long 2 siblings, 0 replies; 12+ messages in thread From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw) To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner Cc: cgroups, linux-kernel, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano, Waiman Long In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept new tasks. It is too late to check that in cpuset_fork(). So we need to add the cpuset_can_fork() and cpuset_cancel_fork() methods to pre-check it before we can allow attachment to a different cpuset. We also need to set the attach_in_progress flag to alert other code that a new task is going to be added to the cpuset. Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Suggested-by: Michal Koutn√Ω <mkoutny@suse.com> Signed-off-by: Waiman Long <longman@redhat.com> --- kernel/cgroup/cpuset.c | 97 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 86 insertions(+), 11 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index e954d5abb784..132cdae4f03c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2458,6 +2458,20 @@ static int fmeter_getrate(struct fmeter *fmp) static struct cpuset *cpuset_attach_old_cs; +/* + * Check to see if a cpuset can accept a new task + * For v1, cpus_allowed and mems_allowed can't be empty. + * For v2, effective_cpus can't be empty. + * Note that in v1, effective_cpus = cpus_allowed. + */ +static int cpuset_can_attach_check(struct cpuset *cs) +{ + if (cpumask_empty(cs->effective_cpus) || + (!is_in_v2_mode() && nodes_empty(cs->mems_allowed))) + return -ENOSPC; + return 0; +} + /* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */ static int cpuset_can_attach(struct cgroup_taskset *tset) { @@ -2472,16 +2486,9 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) percpu_down_write(&cpuset_rwsem); - /* allow moving tasks into an empty cpuset if on default hierarchy */ - ret = -ENOSPC; - if (!is_in_v2_mode() && - (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) - goto out_unlock; - - /* - * Task cannot be moved to a cpuset with empty effective cpus. - */ - if (cpumask_empty(cs->effective_cpus)) + /* Check to see if task is allowed in the cpuset */ + ret = cpuset_can_attach_check(cs); + if (ret) goto out_unlock; cgroup_taskset_for_each(task, css, tset) { @@ -2498,7 +2505,6 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) * changes which zero cpus/mems_allowed. */ cs->attach_in_progress++; - ret = 0; out_unlock: percpu_up_write(&cpuset_rwsem); return ret; @@ -3269,6 +3275,68 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css) percpu_up_write(&cpuset_rwsem); } +/* + * In case the child is cloned into a cpuset different from its parent, + * additional checks are done to see if the move is allowed. + */ +static int cpuset_can_fork(struct task_struct *task, struct css_set *cset) +{ + struct cpuset *cs = css_cs(cset->subsys[cpuset_cgrp_id]); + bool same_cs; + int ret; + + rcu_read_lock(); + same_cs = (cs == task_cs(current)); + rcu_read_unlock(); + + if (same_cs) + return 0; + + lockdep_assert_held(&cgroup_mutex); + percpu_down_write(&cpuset_rwsem); + + /* Check to see if task is allowed in the cpuset */ + ret = cpuset_can_attach_check(cs); + if (ret) + goto out_unlock; + + ret = task_can_attach(task, cs->effective_cpus); + if (ret) + goto out_unlock; + + ret = security_task_setscheduler(task); + if (ret) + goto out_unlock; + + /* + * Mark attach is in progress. This makes validate_change() fail + * changes which zero cpus/mems_allowed. + */ + cs->attach_in_progress++; +out_unlock: + percpu_up_write(&cpuset_rwsem); + return ret; +} + +static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset) +{ + struct cpuset *cs = css_cs(cset->subsys[cpuset_cgrp_id]); + bool same_cs; + + rcu_read_lock(); + same_cs = (cs == task_cs(current)); + rcu_read_unlock(); + + if (same_cs) + return; + + percpu_down_write(&cpuset_rwsem); + cs->attach_in_progress--; + if (!cs->attach_in_progress) + wake_up(&cpuset_attach_wq); + percpu_up_write(&cpuset_rwsem); +} + /* * Make sure the new task conform to the current state of its parent, * which could have been changed by cpuset just after it inherits the @@ -3297,6 +3365,11 @@ static void cpuset_fork(struct task_struct *task) percpu_down_write(&cpuset_rwsem); guarantee_online_mems(cs, &cpuset_attach_nodemask_to); cpuset_attach_task(cs, task); + + cs->attach_in_progress--; + if (!cs->attach_in_progress) + wake_up(&cpuset_attach_wq); + percpu_up_write(&cpuset_rwsem); } @@ -3310,6 +3383,8 @@ struct cgroup_subsys cpuset_cgrp_subsys = { .attach = cpuset_attach, .post_attach = cpuset_post_attach, .bind = cpuset_bind, + .can_fork = cpuset_can_fork, + .cancel_fork = cpuset_cancel_fork, .fork = cpuset_fork, .legacy_cftypes = legacy_files, .dfl_cftypes = dfl_files, -- 2.31.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
[parent not found: <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() [not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2023-04-11 13:35 ` Waiman Long 2023-04-11 13:35 ` [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly Waiman Long ` (2 subsequent siblings) 3 siblings, 0 replies; 12+ messages in thread From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw) To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano, Waiman Long After a successful cpuset_can_attach() call which increments the attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach() will be called later. In cpuset_attach(), tasks in cpuset_attach_wq, if present, will be woken up at the end. That is not the case in cpuset_cancel_attach(). So missed wakeup is possible if the attach operation is somehow cancelled. Fix that by doing the wakeup in cpuset_cancel_attach() as well. Fixes: e44193d39e8d ("cpuset: let hotplug propagation work wait for task attaching") Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Reviewed-by: Michal Koutn√Ω <mkoutny-IBi9RG/b67k@public.gmane.org> --- kernel/cgroup/cpuset.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index bc4dcfd7bee5..066689a7dcc3 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2507,11 +2507,15 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) static void cpuset_cancel_attach(struct cgroup_taskset *tset) { struct cgroup_subsys_state *css; + struct cpuset *cs; cgroup_taskset_first(tset, &css); + cs = css_cs(css); percpu_down_write(&cpuset_rwsem); - css_cs(css)->attach_in_progress--; + cs->attach_in_progress--; + if (!cs->attach_in_progress) + wake_up(&cpuset_attach_wq); percpu_up_write(&cpuset_rwsem); } -- 2.31.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly [not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2023-04-11 13:35 ` [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() Waiman Long @ 2023-04-11 13:35 ` Waiman Long 2023-04-11 13:36 ` [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset Waiman Long 2023-04-12 18:26 ` [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Tejun Heo 3 siblings, 0 replies; 12+ messages in thread From: Waiman Long @ 2023-04-11 13:35 UTC (permalink / raw) To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano, Waiman Long By default, the clone(2) syscall spawn a child process into the same cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag introduced by commit ef2c41cf38a7 ("clone3: allow spawning processes into cgroups"), the child will be spawned into a different cgroup which is somewhat similar to writing the child's tid into "cgroup.threads". The current cpuset_fork() method does not properly handle the CLONE_INTO_CGROUP case where the cpuset of the child may be different from that of its parent. Update the cpuset_fork() method to treat the CLONE_INTO_CGROUP case similar to cpuset_attach(). Since the newly cloned task has not been running yet, its actual memory usage isn't known. So it is not necessary to make change to mm in cpuset_fork(). Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Reported-by: Giuseppe Scrivano <gscrivan-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- kernel/cgroup/cpuset.c | 62 ++++++++++++++++++++++++++++-------------- 1 file changed, 42 insertions(+), 20 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 066689a7dcc3..e954d5abb784 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2520,16 +2520,33 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset) } /* - * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach() + * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach_task() * but we can't allocate it dynamically there. Define it global and * allocate from cpuset_init(). */ static cpumask_var_t cpus_attach; +static nodemask_t cpuset_attach_nodemask_to; + +static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task) +{ + percpu_rwsem_assert_held(&cpuset_rwsem); + + if (cs != &top_cpuset) + guarantee_online_cpus(task, cpus_attach); + else + cpumask_copy(cpus_attach, task_cpu_possible_mask(task)); + /* + * can_attach beforehand should guarantee that this doesn't + * fail. TODO: have a better way to handle failure here + */ + WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach)); + + cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to); + cpuset_update_task_spread_flags(cs, task); +} static void cpuset_attach(struct cgroup_taskset *tset) { - /* static buf protected by cpuset_rwsem */ - static nodemask_t cpuset_attach_nodemask_to; struct task_struct *task; struct task_struct *leader; struct cgroup_subsys_state *css; @@ -2560,20 +2577,8 @@ static void cpuset_attach(struct cgroup_taskset *tset) guarantee_online_mems(cs, &cpuset_attach_nodemask_to); - cgroup_taskset_for_each(task, css, tset) { - if (cs != &top_cpuset) - guarantee_online_cpus(task, cpus_attach); - else - cpumask_copy(cpus_attach, task_cpu_possible_mask(task)); - /* - * can_attach beforehand should guarantee that this doesn't - * fail. TODO: have a better way to handle failure here - */ - WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach)); - - cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to); - cpuset_update_task_spread_flags(cs, task); - } + cgroup_taskset_for_each(task, css, tset) + cpuset_attach_task(cs, task); /* * Change mm for all threadgroup leaders. This is expensive and may @@ -3271,11 +3276,28 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css) */ static void cpuset_fork(struct task_struct *task) { - if (task_css_is_root(task, cpuset_cgrp_id)) + struct cpuset *cs; + bool same_cs; + + rcu_read_lock(); + cs = task_cs(task); + same_cs = (cs == task_cs(current)); + rcu_read_unlock(); + + if (same_cs) { + if (cs == &top_cpuset) + return; + + set_cpus_allowed_ptr(task, current->cpus_ptr); + task->mems_allowed = current->mems_allowed; return; + } - set_cpus_allowed_ptr(task, current->cpus_ptr); - task->mems_allowed = current->mems_allowed; + /* CLONE_INTO_CGROUP */ + percpu_down_write(&cpuset_rwsem); + guarantee_online_mems(cs, &cpuset_attach_nodemask_to); + cpuset_attach_task(cs, task); + percpu_up_write(&cpuset_rwsem); } struct cgroup_subsys cpuset_cgrp_subsys = { -- 2.31.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset [not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2023-04-11 13:35 ` [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() Waiman Long 2023-04-11 13:35 ` [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly Waiman Long @ 2023-04-11 13:36 ` Waiman Long 2023-04-12 18:26 ` [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Tejun Heo 3 siblings, 0 replies; 12+ messages in thread From: Waiman Long @ 2023-04-11 13:36 UTC (permalink / raw) To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano, Waiman Long It is found that attaching a task to the top_cpuset does not currently ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the code is changed to fix that. Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Reviewed-by: Michal Koutn√Ω <mkoutny-IBi9RG/b67k@public.gmane.org> --- kernel/cgroup/cpuset.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 132cdae4f03c..e4ca2dd2b764 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2540,7 +2540,8 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task) if (cs != &top_cpuset) guarantee_online_cpus(task, cpus_attach); else - cpumask_copy(cpus_attach, task_cpu_possible_mask(task)); + cpumask_andnot(cpus_attach, task_cpu_possible_mask(task), + cs->subparts_cpus); /* * can_attach beforehand should guarantee that this doesn't * fail. TODO: have a better way to handle failure here -- 2.31.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues [not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2023-04-11 13:36 ` [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset Waiman Long @ 2023-04-12 18:26 ` Tejun Heo [not found] ` <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> 3 siblings, 1 reply; 12+ messages in thread From: Tejun Heo @ 2023-04-12 18:26 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Christian Brauner, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano On Tue, Apr 11, 2023 at 09:35:56AM -0400, Waiman Long wrote: > v4: > - Add missing rcu_read_lock/unlock to cpuset_cancel_fork() in patch 3. > - Add patch 5 to reduce performance impact for the > non-CLONE_INTO_CGROUP case. > > v3: > - Update patches 2 & 3 to put task_cs() call under rcu_read_lock(). > > v2: > - Drop v1 patch 3 > - Add a new patch to fix an issue in cpuset_cancel_attach() and > another patch to add cpuset_can_fork() and cpuset_cacnel_fork() > methods. > > The first patch in this series fixes a problem in > cpuset_cancel_attach(). Patches 2 and 3 fixes the CLONE_INTO_CGROUP > problem in cpuset. Patch 4 is a minor fix. The last patch is a > performance optimization patch for the non-CLONE_INTO_CGROUP case. Applied 1-4 to cgroup/for-6.3-fixes w/ stable cc'd. Given that the fixes are a bit involved, the breakages have been there for quite a while and the cpuset code has changed quite a bit, backporting might not be trivial tho. Let's see how that goes. Thanks. -- tejun ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues [not found] ` <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> @ 2023-04-12 18:29 ` Waiman Long 0 siblings, 0 replies; 12+ messages in thread From: Waiman Long @ 2023-04-12 18:29 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Christian Brauner, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano On 4/12/23 14:26, Tejun Heo wrote: > On Tue, Apr 11, 2023 at 09:35:56AM -0400, Waiman Long wrote: >> v4: >> - Add missing rcu_read_lock/unlock to cpuset_cancel_fork() in patch 3. >> - Add patch 5 to reduce performance impact for the >> non-CLONE_INTO_CGROUP case. >> >> v3: >> - Update patches 2 & 3 to put task_cs() call under rcu_read_lock(). >> >> v2: >> - Drop v1 patch 3 >> - Add a new patch to fix an issue in cpuset_cancel_attach() and >> another patch to add cpuset_can_fork() and cpuset_cacnel_fork() >> methods. >> >> The first patch in this series fixes a problem in >> cpuset_cancel_attach(). Patches 2 and 3 fixes the CLONE_INTO_CGROUP >> problem in cpuset. Patch 4 is a minor fix. The last patch is a >> performance optimization patch for the non-CLONE_INTO_CGROUP case. > Applied 1-4 to cgroup/for-6.3-fixes w/ stable cc'd. Given that the fixes are > a bit involved, the breakages have been there for quite a while and the > cpuset code has changed quite a bit, backporting might not be trivial tho. > Let's see how that goes. I know stable backport won't be straight forward. I am planning to help in the backporting effort. Thanks for taking these into your cgroup tree. Cheers, Longman ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls 2023-04-11 13:35 [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Waiman Long 2023-04-11 13:35 ` [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods Waiman Long [not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2023-04-11 13:36 ` Waiman Long [not found] ` <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2 siblings, 1 reply; 12+ messages in thread From: Waiman Long @ 2023-04-11 13:36 UTC (permalink / raw) To: Tejun Heo, Zefan Li, Johannes Weiner, Christian Brauner Cc: cgroups, linux-kernel, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano, Waiman Long The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls are only needed when the CLONE_INTO_CGROUP flag is set which is not likely. Adding an extra cpuset_can_fork() call does introduce a bit of performance overhead in the fork/clone fastpath. To reduce this performance overhead, introduce a new clone_into_cgroup_can_fork flag into the cgroup_subsys structure. This flag, when set, will call the can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag is set. The cpuset code is now modified to set this flag. The same cpuset checking code in cpuset_can_fork() and cpuset_cancel_fork() will have to stay as the cgroups can be different, but the cpusets may still be the same. So the same check must be present in both cpuset_fork() and cpuset_can_fork() to make sure that attach_in_progress is correctly set. Signed-off-by: Waiman Long <longman@redhat.com> --- include/linux/cgroup-defs.h | 6 ++++++ kernel/cgroup/cgroup.c | 23 ++++++++++++++++++----- kernel/cgroup/cpuset.c | 1 + 3 files changed, 25 insertions(+), 5 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 8a0d5466c7be..0087a47d80a2 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -701,6 +701,12 @@ struct cgroup_subsys { */ bool threaded:1; + /* + * If %true, the controller will call can_fork and cancel_fork + * methods only if CLONE_INTO_CGROUP flag is set. + */ + bool clone_into_cgroup_can_fork:1; + /* the following two fields are initialized automatically during boot */ int id; const char *name; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 83ea13f2ccb1..23701e959ef5 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -6517,6 +6517,10 @@ int cgroup_can_fork(struct task_struct *child, struct kernel_clone_args *kargs) return ret; do_each_subsys_mask(ss, i, have_canfork_callback) { + if (ss->clone_into_cgroup_can_fork && + !(kargs->flags & CLONE_INTO_CGROUP)) + continue; + ret = ss->can_fork(child, kargs->cset); if (ret) goto out_revert; @@ -6528,8 +6532,12 @@ int cgroup_can_fork(struct task_struct *child, struct kernel_clone_args *kargs) for_each_subsys(ss, j) { if (j >= i) break; - if (ss->cancel_fork) - ss->cancel_fork(child, kargs->cset); + if (!ss->cancel_fork || + (ss->clone_into_cgroup_can_fork && + !(kargs->flags & CLONE_INTO_CGROUP))) + continue; + + ss->cancel_fork(child, kargs->cset); } cgroup_css_set_put_fork(kargs); @@ -6552,9 +6560,14 @@ void cgroup_cancel_fork(struct task_struct *child, struct cgroup_subsys *ss; int i; - for_each_subsys(ss, i) - if (ss->cancel_fork) - ss->cancel_fork(child, kargs->cset); + for_each_subsys(ss, i) { + if (!ss->cancel_fork || + (ss->clone_into_cgroup_can_fork && + !(kargs->flags & CLONE_INTO_CGROUP))) + continue; + + ss->cancel_fork(child, kargs->cset); + } cgroup_css_set_put_fork(kargs); } diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index e4ca2dd2b764..937ef4d60cd4 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3391,6 +3391,7 @@ struct cgroup_subsys cpuset_cgrp_subsys = { .dfl_cftypes = dfl_files, .early_init = true, .threaded = true, + .clone_into_cgroup_can_fork = true, }; /** -- 2.31.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
[parent not found: <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls [not found] ` <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2023-04-12 18:27 ` Tejun Heo [not found] ` <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 12+ messages in thread From: Tejun Heo @ 2023-04-12 18:27 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Christian Brauner, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote: > The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls > are only needed when the CLONE_INTO_CGROUP flag is set which is not > likely. Adding an extra cpuset_can_fork() call does introduce a bit > of performance overhead in the fork/clone fastpath. To reduce this > performance overhead, introduce a new clone_into_cgroup_can_fork flag > into the cgroup_subsys structure. This flag, when set, will call the > can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag > is set. > > The cpuset code is now modified to set this flag. The same cpuset > checking code in cpuset_can_fork() and cpuset_cancel_fork() will have > to stay as the cgroups can be different, but the cpusets may still be > the same. So the same check must be present in both cpuset_fork() and > cpuset_can_fork() to make sure that attach_in_progress is correctly set. > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Waiman, I'm not necessarily against this optimization but can we at least have some performance numbers to show that this is actually meaningful? Given how heavy our fork path is, I'm not too sure this would show up in any meaningful way. Thanks. -- tejun ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls [not found] ` <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> @ 2023-04-12 18:40 ` Waiman Long [not found] ` <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 12+ messages in thread From: Waiman Long @ 2023-04-12 18:40 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Christian Brauner, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano On 4/12/23 14:27, Tejun Heo wrote: > On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote: >> The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls >> are only needed when the CLONE_INTO_CGROUP flag is set which is not >> likely. Adding an extra cpuset_can_fork() call does introduce a bit >> of performance overhead in the fork/clone fastpath. To reduce this >> performance overhead, introduce a new clone_into_cgroup_can_fork flag >> into the cgroup_subsys structure. This flag, when set, will call the >> can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag >> is set. >> >> The cpuset code is now modified to set this flag. The same cpuset >> checking code in cpuset_can_fork() and cpuset_cancel_fork() will have >> to stay as the cgroups can be different, but the cpusets may still be >> the same. So the same check must be present in both cpuset_fork() and >> cpuset_can_fork() to make sure that attach_in_progress is correctly set. >> >> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Waiman, I'm not necessarily against this optimization but can we at least > have some performance numbers to show that this is actually meaningful? > Given how heavy our fork path is, I'm not too sure this would show up in any > meaningful way. That make sense to me. I am OK to leave it for now as it is an optimization patch anyway. BTW, another question that I have is about the cgroup_threadgroup_rwsem. It is currently a percpu rwsem. Is it possible to change it into a regular rwsem instead? It is causing quite a bit of latency for workloads that require rather frequent changes to cgroups. I know we have a "favordynmods" mount option to disable the percpu operation. This will still be less performant than a normal rwsem. Of course the downside is that the fork/exit fastpaths will be slowed down a bit. Thanks, Longman ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls [not found] ` <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2023-04-12 19:17 ` Tejun Heo [not found] ` <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 12+ messages in thread From: Tejun Heo @ 2023-04-12 19:17 UTC (permalink / raw) To: Waiman Long Cc: Zefan Li, Johannes Weiner, Christian Brauner, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano Hello, On Wed, Apr 12, 2023 at 02:40:53PM -0400, Waiman Long wrote: > On 4/12/23 14:27, Tejun Heo wrote: > > On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote: > > > The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls > > > are only needed when the CLONE_INTO_CGROUP flag is set which is not > > > likely. Adding an extra cpuset_can_fork() call does introduce a bit > > > of performance overhead in the fork/clone fastpath. To reduce this > > > performance overhead, introduce a new clone_into_cgroup_can_fork flag > > > into the cgroup_subsys structure. This flag, when set, will call the > > > can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag > > > is set. > > > > > > The cpuset code is now modified to set this flag. The same cpuset > > > checking code in cpuset_can_fork() and cpuset_cancel_fork() will have > > > to stay as the cgroups can be different, but the cpusets may still be > > > the same. So the same check must be present in both cpuset_fork() and > > > cpuset_can_fork() to make sure that attach_in_progress is correctly set. > > > > > > Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Waiman, I'm not necessarily against this optimization but can we at least > > have some performance numbers to show that this is actually meaningful? > > Given how heavy our fork path is, I'm not too sure this would show up in any > > meaningful way. > > That make sense to me. I am OK to leave it for now as it is an optimization > patch anyway. > > BTW, another question that I have is about the cgroup_threadgroup_rwsem. It > is currently a percpu rwsem. Is it possible to change it into a regular > rwsem instead? It is causing quite a bit of latency for workloads that > require rather frequent changes to cgroups. I know we have a "favordynmods" > mount option to disable the percpu operation. This will still be less > performant than a normal rwsem. Of course the downside is that the fork/exit > fastpaths will be slowed down a bit. I don't know. Maybe? A rwsem actually has a scalability factor in that the more CPUs are forking, the more expensive the rwsem becomes, so it is a bit more of a concern. Another factor is that in majority of use cases we're almost completely bypassing write-locking percpu_rwsem, so it feel a bit sad to convert it to a regular rwsem. So, if favordynmods is good enough, I'd like to keep it that way. Thanks. -- tejun ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls [not found] ` <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org> @ 2023-04-12 19:23 ` Waiman Long 0 siblings, 0 replies; 12+ messages in thread From: Waiman Long @ 2023-04-12 19:23 UTC (permalink / raw) To: Tejun Heo Cc: Zefan Li, Johannes Weiner, Christian Brauner, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Juri Lelli, Dietmar Eggemann, Michal Koutný, Giuseppe Scrivano On 4/12/23 15:17, Tejun Heo wrote: > Hello, > > On Wed, Apr 12, 2023 at 02:40:53PM -0400, Waiman Long wrote: >> On 4/12/23 14:27, Tejun Heo wrote: >>> On Tue, Apr 11, 2023 at 09:36:01AM -0400, Waiman Long wrote: >>>> The newly introduced cpuset_can_fork() and cpuset_cancel_fork() calls >>>> are only needed when the CLONE_INTO_CGROUP flag is set which is not >>>> likely. Adding an extra cpuset_can_fork() call does introduce a bit >>>> of performance overhead in the fork/clone fastpath. To reduce this >>>> performance overhead, introduce a new clone_into_cgroup_can_fork flag >>>> into the cgroup_subsys structure. This flag, when set, will call the >>>> can_fork and cancel_fork methods only if the CLONE_INTO_CGROUP flag >>>> is set. >>>> >>>> The cpuset code is now modified to set this flag. The same cpuset >>>> checking code in cpuset_can_fork() and cpuset_cancel_fork() will have >>>> to stay as the cgroups can be different, but the cpusets may still be >>>> the same. So the same check must be present in both cpuset_fork() and >>>> cpuset_can_fork() to make sure that attach_in_progress is correctly set. >>>> >>>> Signed-off-by: Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>> Waiman, I'm not necessarily against this optimization but can we at least >>> have some performance numbers to show that this is actually meaningful? >>> Given how heavy our fork path is, I'm not too sure this would show up in any >>> meaningful way. >> That make sense to me. I am OK to leave it for now as it is an optimization >> patch anyway. >> >> BTW, another question that I have is about the cgroup_threadgroup_rwsem. It >> is currently a percpu rwsem. Is it possible to change it into a regular >> rwsem instead? It is causing quite a bit of latency for workloads that >> require rather frequent changes to cgroups. I know we have a "favordynmods" >> mount option to disable the percpu operation. This will still be less >> performant than a normal rwsem. Of course the downside is that the fork/exit >> fastpaths will be slowed down a bit. > I don't know. Maybe? A rwsem actually has a scalability factor in that the > more CPUs are forking, the more expensive the rwsem becomes, so it is a bit > more of a concern. Another factor is that in majority of use cases we're > almost completely bypassing write-locking percpu_rwsem, so it feel a bit sad > to convert it to a regular rwsem. So, if favordynmods is good enough, I'd > like to keep it that way. It is just a thought that I have since Juri is in the process of reverting the change of cpuset_mutex to cpuset_rwsem. Percpu rwsem can be a bit problematic in PREEMPT_RT kernel since it does not support proper priority inheritance though. Cheers, Longman ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-04-12 19:23 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-11 13:35 [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Waiman Long
2023-04-11 13:35 ` [PATCH v4 3/5] cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods Waiman Long
[not found] ` <20230411133601.2969636-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-11 13:35 ` [PATCH v4 1/5] cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() Waiman Long
2023-04-11 13:35 ` [PATCH v4 2/5] cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly Waiman Long
2023-04-11 13:36 ` [PATCH v4 4/5] cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset Waiman Long
2023-04-12 18:26 ` [PATCH v4 0/5] cgroup/cpuset: Fix CLONE_INTO_CGROUP problem & other issues Tejun Heo
[not found] ` <ZDb32V9xcWOi2_CL-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-04-12 18:29 ` Waiman Long
2023-04-11 13:36 ` [PATCH v4 5/5] cgroup/cpuset: Optimize out unneeded cpuset_can_fork/cpuset_cancel_fork calls Waiman Long
[not found] ` <20230411133601.2969636-6-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-12 18:27 ` Tejun Heo
[not found] ` <ZDb4G2jgQFK8h8Ys-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-04-12 18:40 ` Waiman Long
[not found] ` <90b7bc16-0673-02b7-dad1-f24bc956f1c5-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2023-04-12 19:17 ` Tejun Heo
[not found] ` <ZDcDxecF0Y5F0pV6-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-04-12 19:23 ` Waiman Long
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox